Networking Issues

Highlights

Customer Experience Impact: High, potentially hours of downtime

Frequency of Occurrence: Weekly for fleets with hundreds of nodes

Manual repair elapsed time: ~1-6 hours

Shoreline repair elapsed time: ~1-2 minutes

The Problem

Some network-related issues are very hard to diagnose because they do not occur consistently across the entire network. In many situations, basic checks make it look as though 99% of the fleet is performing normally, with only mild variability in network connectivity. In reality, a small number of nodes may no longer be able to connect to the network at all.

This can lead to a very bad experience for a small number of customers. These incidents are often hard to diagnose because finding the affected nodes is like searching for a needle in a haystack.

The larger the fleet, the more likely companies are to experience this type of incident.

The Solution

Typically, Shoreline does not trigger an automated repair for this type of incident. Instead, Shoreline provides a series of diagnostics that help on-call teams quickly pinpoint the specific network issue and the nodes it affects. These diagnostics eliminate the hours that operators would otherwise spend trying to uncover the issue manually.

Here are the diagnostics run by Shoreline:

  1. Curl an HTTP endpoint in parallel across the fleet and return the status code. This checks whether each instance of your application can connect and authenticate to the services your system depends on.
  2. DNS lookup. This checks whether each instance of your application resolves domains to IP addresses in the same way. Sometimes a portion of the fleet has stale entries, which can cause failed requests.
  3. Ping. This checks connectivity at the network layer: can the nodes in one region or availability zone connect to nodes in another? Sometimes there is no connectivity, sometimes there is high latency, and other times there is high packet loss.
  4. Measure the number of outbound requests to a specific port. This can help you detect connections to unexpected APIs or ports. Sometimes too many processes are running and generate an unexpected number of connections.
  5. Measure the number of inbound requests to a specific port. This can help you detect an unexpected number of connections or API calls from external sources.
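
The first two diagnostics above can be sketched in a few lines of Python. This is a minimal illustration of the general idea (run the same check on every node in parallel, then flag the nodes whose answer differs from the majority), not Shoreline's implementation or API. The function names `http_status`, `resolve`, `run_across_fleet`, and `find_outliers` are hypothetical, as are the node names in the example. In a real fleet the check would have to execute on each node, for example through an agent or SSH; that transport is assumed and left out here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import socket
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 401 or 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, or timeout.
        return None


def resolve(hostname):
    """Return the sorted tuple of IP addresses that hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return ()
    return tuple(sorted({info[4][0] for info in infos}))


def run_across_fleet(nodes, check, max_workers=32):
    """Run check(node) for every node in parallel and return {node: result}.

    `check` is an assumed callable that knows how to reach the node and
    run the diagnostic there; this sketch does not provide that transport.
    """
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(nodes, pool.map(check, nodes)))


def find_outliers(results):
    """Return the nodes whose result differs from the most common result."""
    if not results:
        return {}
    majority, _ = Counter(results.values()).most_common(1)[0]
    return {node: value for node, value in results.items() if value != majority}


if __name__ == "__main__":
    # Hypothetical per-node results standing in for a real remote check.
    fake = {"node-1": 200, "node-2": 200, "node-3": None, "node-4": 200}
    results = run_across_fleet(fake, fake.get)
    print(find_outliers(results))  # {'node-3': None}
```

Comparing each node against the majority result, rather than against a fixed expected value, is what surfaces the "needle in a haystack" case: a fleet that looks 99% healthy in aggregate still yields a short, explicit list of the nodes that disagree. The same `find_outliers` step works for DNS results, because `resolve` returns a hashable tuple of addresses.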
