Solutions for Production Incidents

Pre-built automations that fix the most common issues impacting availability

Stop reinventing the wheel!

Shoreline’s Op Pack library offers open source blueprints for automating away your most common incidents.

  • Our community figures out what to monitor, what alarms to set, and what scripts to run to complete the repair.
  • All Op Packs are fully configurable. You can fully automate a repair or use an interactive Notebook to guide your team through it.
  • The library grows each month as new Op Packs are added by the community.

What’s in an Op Pack?

Shoreline’s Op Packs are Terraform modules that contain multiple elements:

  1. Metrics - time series data about fleet resources, such as CPU, disk, memory, and network utilization, or service latency and error rates
  2. Alarms - monitors that frequently check Metrics, often every second, against your custom threshold levels
  3. Actions - shell commands that are executed on specific resources
  4. Scripts - pre-written code to execute specific actions on your fleet
  5. Bots - connectors between Alarms and Actions. When an Alarm is raised, the Bot fires all associated Actions to automatically resolve the issue that caused it
  6. Notebooks - interactive interfaces pre-populated with diagnostics and step-by-step recipes for incident resolution
  7. Tests - actions designed to reproduce error conditions for the purpose of validating automations
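
To make the relationship between Metrics, Alarms, Actions, and Bots concrete, here is a minimal Python sketch of the idea. It is not Shoreline's API or the Terraform module format: every name in it (Alarm, Bot, disk_usage, clean_tmp, the host names, and the 90% threshold) is a hypothetical stand-in, and the fleet is simulated with a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; these are not Shoreline's API.
Metric = Callable[[str], float]   # resource name -> current metric value
Action = Callable[[str], str]     # resource name -> result message


@dataclass
class Alarm:
    """Compares a Metric against a custom threshold."""
    name: str
    metric: Metric
    threshold: float

    def is_raised(self, resource: str) -> bool:
        return self.metric(resource) > self.threshold


@dataclass
class Bot:
    """Connects one Alarm to the Actions that should repair it."""
    alarm: Alarm
    actions: List[Action] = field(default_factory=list)

    def evaluate(self, resource: str) -> List[str]:
        # Fire every associated Action only when the Alarm is raised.
        if not self.alarm.is_raised(resource):
            return []
        return [action(resource) for action in self.actions]


# Simulated fleet: percentage of disk used per host (made-up values).
disk_used_pct: Dict[str, float] = {"host-a": 93.0, "host-b": 41.0}


def disk_usage(resource: str) -> float:
    return disk_used_pct[resource]


def clean_tmp(resource: str) -> str:
    # Stand-in for a shell command that frees disk space.
    disk_used_pct[resource] -= 30.0
    return f"cleaned temporary files on {resource}"


disk_full = Alarm(name="disk_full", metric=disk_usage, threshold=90.0)
disk_bot = Bot(alarm=disk_full, actions=[clean_tmp])


def run_once() -> Dict[str, List[str]]:
    """Evaluate the Bot once against every host in the simulated fleet."""
    return {host: disk_bot.evaluate(host) for host in list(disk_used_pct)}
```

Calling run_once() repairs only the host whose metric crosses the threshold. A second call does nothing, because the Alarm is no longer raised once the Action has brought usage back under the limit.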

Quickly fix the most common issues.

Shoreline provides a powerful library of alarms, debugging commands, and remediation actions that address common problems.

Save time using solutions vetted by a community of experts.

Instead of starting from a blank slate, begin with solutions to problems that others in the community have already solved.

Rest easy knowing safety and controls are built in.

Every Op Pack comes with integrated tests and built-in safety features, including access controls, blast radius controls, and circuit breakers.
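
As a rough illustration of two of these safeguards, the Python sketch below limits how many resources a single run may touch (a blast radius control) and stops after repeated failures (a circuit breaker). It is an assumption about how such controls can work in general, not a description of Shoreline's implementation. GuardedRunner, max_targets, and max_failures are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GuardedRunner:
    """Hypothetical wrapper that applies safety limits to a repair action."""
    max_targets: int        # blast radius: most resources touched per run
    max_failures: int       # consecutive failures before the breaker trips
    failures: int = 0
    open: bool = False      # True once the circuit breaker has tripped

    def run(self, action: Callable[[str], bool], targets: List[str]) -> List[str]:
        repaired: List[str] = []
        # Never touch more than max_targets resources in one run.
        for target in targets[: self.max_targets]:
            if self.open:
                break  # breaker tripped: stop automated repairs
            if action(target):
                self.failures = 0
                repaired.append(target)
            else:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.open = True
        return repaired
```

For example, with max_targets=3 and max_failures=2, a run over five hosts touches at most the first three. It halts as soon as two repairs in a row fail, leaving the remaining hosts for a human to inspect.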

Shoreline Op Pack Library

See below for a listing of our most popular Op Packs, and follow the link on each one to learn more about the specific issue and solution.

The community is actively adding to the library every month, so please contact us if you don’t see the issue that’s been driving you nuts. If we haven’t already built it, we’d love to add it to our list.

Intermittent JVM Memory Issues

JVMs often face memory issues that can lead to hours of SSH-ing into box after box trying to catch the issue as it happens.

Networking Issues

Network-related issues are often hard to diagnose and can lead to a very bad experience for customers.

Kubernetes Node Retirement

When AWS Systems Manager marks a node for retirement, companies must gracefully terminate work on that node.

Disk Resize / Disk Clean

Disk-full incidents can lead to widespread outages and data loss that damage customer experiences and cost revenue.

Restart CoreDNS Service

CoreDNS, the default Kubernetes DNS service, can degrade under too many calls, causing massive latency.

Delete Old Argo Pods

Argo makes declaratively managing workflows easy, but it can leave behind many stale pods after workflow execution.

Pods Stuck in Terminating

When Kubernetes pods won’t leave the terminating state, they must be identified and safely drained.

Detect Cryptocurrency Mining Operations

Unauthorized cryptocurrency miners must be stopped from abusing free tiers of cloud service providers.


When a Kafka topic grows too long, applications may begin to break.

Process List

Server environments can often be challenging to run. Sometimes processes silently die. Other times old versions of processes are left running.

Log Processing at the Edge

Many production incidents are caused by issues that can be identified by analyzing log files. Unfortunately, centralized logging can be very expensive.

Pod Out of Memory (OOM)

Many different types of application errors can lead to out-of-memory errors (OOMs) in Kubernetes.

Certificate Rotation

Sooner or later every company gets bitten by expired certificates, and when that happens, the result can be a catastrophic outage.

