Restart CoreDNS Service


Customer Experience Impact:

High. Can bring down an entire cluster

Frequency of Occurrence:

Depends on the number and size of clusters: single-digit clusters, roughly quarterly; double-digit clusters, roughly monthly

Manual SRE time spent on diagnosis:

1-4 hours

Shoreline repair elapsed time:

Shoreline fixes this with zero downtime

The Problem

Kubernetes networking is complex, and many of the most common problems and outages in a Kubernetes cluster stem from DNS issues. CoreDNS, the default Kubernetes DNS service, can degrade under heavy query load, causing severe latency. Once latency between a pod and CoreDNS reaches one second or more, customers feel the impact and SLAs are put at risk. Yet most organizations only monitor CoreDNS and address problems manually, which causes unacceptable delays and potentially system outages. The issue can be hard to diagnose because DNS problems have broad impact and the underlying cause is often unclear: services may be running fine yet be unable to communicate with each other.

The Solution

The Shoreline CoreDNS Op Pack monitors metrics and automatically triggers an Action that restarts the CoreDNS pods once latency exceeds a configurable threshold. Shoreline can gather CoreDNS metrics as frequently as once per second, and it uses multiple data points and different percentiles to decide whether CoreDNS is resolving slowly. Shoreline focuses on DNS latency rather than overall measured latency because doing so separates latency caused by the network from latency caused by DNS resolution. This practice prevents false positives when clusters experience high network latency. The Alarm threshold is configured in milliseconds so that you can tightly control the latency tolerance. Once an issue is identified, Shoreline triggers rolling restarts of the CoreDNS pods to prevent service outages.
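To make the idea of a percentile-based Alarm concrete, here is a minimal Python sketch of the decision logic. It is not Shoreline's implementation or Op Pack syntax: the function names, the rule set (median and 95th percentile), and the threshold values are all assumptions chosen for illustration. It only shows how evaluating several percentiles over a window of DNS latency samples can ignore a single slow lookup while still catching sustained slowness.

```python
import math

# Hypothetical rule set: percentile -> threshold in milliseconds.
# Both values are illustrative assumptions, not Shoreline defaults.
DEFAULT_RULES = {50: 500.0, 95: 1000.0}


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def dns_alarm_fires(latencies_ms, rules=DEFAULT_RULES, min_samples=5):
    """Decide whether a window of DNS latency samples should raise the alarm.

    The alarm fires only when every configured percentile meets or exceeds
    its threshold, so one slow lookup in an otherwise healthy window is
    ignored.
    """
    if len(latencies_ms) < min_samples:
        return False  # too little data to judge
    return all(
        percentile(latencies_ms, pct) >= threshold_ms
        for pct, threshold_ms in rules.items()
    )


# Example windows of DNS resolution latency, in milliseconds.
healthy = [12, 15, 11, 14, 13, 16, 12, 14, 13, 15]
single_spike = [10, 10, 10, 10, 10, 10, 10, 10, 10, 5000]
degraded = [1200, 1500, 1100, 1300, 1250, 1400, 20, 1350, 1280, 1450]

for name, window in [("healthy", healthy),
                     ("single_spike", single_spike),
                     ("degraded", degraded)]:
    print(name, dns_alarm_fires(window))
```

In this sketch only the `degraded` window fires the alarm. In a cluster where CoreDNS runs as a Deployment named `coredns` in the `kube-system` namespace (a common default, but an assumption here), the equivalent manual remediation would be a rolling restart with `kubectl rollout restart deployment/coredns -n kube-system`.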

In addition, the CoreDNS Op Pack can automatically create a PagerDuty incident or send a Slack message when an Alarm or Action is triggered. Automating the notification removes another manual step and gives the team a starting point for further root cause analysis.

Ready to get started?

Shoreline helps you eliminate repetitive tickets and increase your availability at the same time. Get started today with a free trial.