4 Tactics to Ensure Power & Safety in Production Ops

How can we build powerful production operations without giving SREs unrestricted SSH access to production environments? Here are four measures we implement to safeguard services.

Anurag Gupta, CEO of Shoreline.io on Safer Production Operations
https://www.youtube.com/watch?v=GvMeBLG8Veo

Introduction | Creating Safer Production Operations
Shoreline.io lets you run a command across 10,000 hosts (clouds, VMs, containers, etc.) in about the same time it takes to run against just 10. That's powerful but also a bit scary! So how do we make production operations safer than letting SREs log into production over SSH and do whatever they want? In this video, I share four tactics.

Tactic 1: Using Terraform Scripts

We run everything as Terraform scripts, so you can deploy to dev and staging before going to production. This ensures your scripts undergo the same rigorous testing as your code, including fault injection and other processes.

Tactic 2: Limiting the Blast Radius

You can limit the blast radius of an action. For example, a user can be restricted to executing across at most a fixed number of resources at a time.
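To make the idea concrete, here is a minimal Python sketch of a blast radius check. It is not Shoreline's implementation; the names `run_with_blast_radius`, `BlastRadiusExceeded`, and `max_targets` are hypothetical and exist only for illustration. The sketch refuses the whole request when it targets more resources than the caller is allowed to touch.

```python
from typing import Callable, Iterable, List


class BlastRadiusExceeded(Exception):
    """Raised when an action targets more resources than the limit allows."""


def run_with_blast_radius(
    action: Callable[[str], str],
    targets: Iterable[str],
    max_targets: int,
) -> List[str]:
    """Run `action` on every target, or refuse if the target set is too large.

    `max_targets` is a hypothetical per-user limit, not a real product setting.
    """
    target_list = list(targets)
    if len(target_list) > max_targets:
        # Reject before touching anything, so a typo in a query
        # cannot fan out across the whole fleet.
        raise BlastRadiusExceeded(
            f"{len(target_list)} targets requested, limit is {max_targets}"
        )
    return [action(target) for target in target_list]


def restart(host: str) -> str:
    """Stand-in for a real remediation; it only returns a description."""
    return f"restarted {host}"
```

Calling `run_with_blast_radius(restart, ["host-1", "host-2"], max_targets=5)` runs both restarts, while the same call with `max_targets=1` raises `BlastRadiusExceeded` before any host is touched.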

Tactic 3: Implementing Circuit Breakers

Whether for automated or human-guided actions, limits are placed on the number of times an action can occur within a certain timeframe. For example, you can limit an action to 100 at a time, pause, and allow the next 100. Or you can limit it to 100, and beyond that, human intervention is required.
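One way to picture this is a sliding-window counter, sketched below in Python. This is an illustrative assumption about how such a limit could work, not a description of Shoreline's internals; `CircuitBreaker`, `max_runs`, `window_seconds`, and the injectable `clock` are all hypothetical names. Once the limit is reached, `allow()` returns `False`, which is the point where a system could pause or hand off to a human.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Allow at most `max_runs` runs of an action per `window_seconds`."""

    def __init__(
        self,
        max_runs: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._runs: Deque[float] = deque()  # timestamps of recent runs

    def allow(self) -> bool:
        """Record a run and return True, or return False if the breaker is open."""
        now = self.clock()
        # Forget runs that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            # Limit reached: the caller should pause or escalate to a human.
            return False
        self._runs.append(now)
        return True
```

With `CircuitBreaker(max_runs=100, window_seconds=3600)`, the first 100 calls to `allow()` in an hour succeed and the 101st is refused until older runs age out of the window.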

Tactic 4: Fine-Grained Access Control

You can ensure that only authorized people in specific groups can modify certain environments, and only under specific conditions. For example: 'my Kafka environment can be modified only by the people in my Kafka group, only when they're on call, and only through these particular actions'.
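The Kafka example can be read as three conditions that must all hold: group membership, on-call status, and an allowed action. The Python sketch below encodes that reading. The `User` and `Policy` types, their fields, and `is_authorized` are hypothetical and are not taken from any real access-control API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class User:
    name: str
    groups: FrozenSet[str]
    on_call: bool


@dataclass(frozen=True)
class Policy:
    environment: str
    required_group: str
    require_on_call: bool
    allowed_actions: FrozenSet[str]


def is_authorized(user: User, policy: Policy, environment: str, action: str) -> bool:
    """Return True only if every condition in the policy is satisfied."""
    return (
        environment == policy.environment
        and policy.required_group in user.groups
        and (user.on_call or not policy.require_on_call)
        and action in policy.allowed_actions
    )


# Hypothetical policy mirroring the Kafka example in the text.
kafka_policy = Policy(
    environment="kafka",
    required_group="kafka-team",
    require_on_call=True,
    allowed_actions=frozenset({"restart_broker", "rotate_logs"}),
)
```

Under this policy, a member of `kafka-team` who is on call may run `restart_broker` against the `kafka` environment, but the same person is refused when off call or when asking for an action outside the allowed set.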

Conclusion

This structured approach means that many people can run diagnostic actions, but only a limited, high-judgment group can run the more critical tasks that are not inherently safe. The result is both power and safety in your operations.