Understanding and Mitigating System Failures

This week, Anurag spoke at the CTO Summit on Reliability to share his new talk “Why systems fail and what you can do about it.” The talk covers four categories of system failures and mitigation approaches for each, drawing on Anurag’s background running analytics and database services at AWS.

At AWS, operations leaders met weekly to review the prior week's issues: they would categorize each failure, discuss its cause, and propose how to mitigate it in the future.

Reason #1 that systems fail: we perturb them

Deployments are the most common source of outage minutes for most companies.

  1. The blast radius is large
  2. Changes are complex
  3. It’s difficult to get the failure rate below 0.5% of deployments
  4. Detecting, debugging, and addressing failures takes time

Since deployment challenges are a process problem, you can solve them with process.

We have a strong belief at Amazon about moving from good intentions towards mechanisms, because mechanisms can be iteratively improved and maintain a collective memory.

Via process improvements, AWS was able to reduce deployment failures by ~50x.

One artifact of this process was a deployment doc that would be reviewed by a skilled operator outside the service team.

The doc included

  1. What's being changed
  2. Downstream services that should be notified and may see impact
  3. Deployment schedule by availability zone and region to limit the blast radius of a change
  4. Which metrics were being monitored
  5. How rollback was going to be automated if a metric was out of band
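
One way to make such a doc both reviewable and machine-checkable is to capture it as structured data. Below is a minimal sketch of what that could look like; the field names and example values are illustrative, not the actual AWS template.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RollbackTrigger:
    metric: str                    # metric that gates the rollout
    max_value: float               # value considered "out of band"
    action: str = "auto_rollback"  # what happens when the gate trips

@dataclass
class DeploymentDoc:
    change_summary: str             # what's being changed
    downstream_services: List[str]  # teams to notify / possible impact
    rollout_order: List[str]        # AZ/region sequence to limit blast radius
    monitored_metrics: List[str]    # metrics watched during the rollout
    rollback_triggers: List[RollbackTrigger]

# Illustrative example only
doc = DeploymentDoc(
    change_summary="Upgrade query-planner library to v2.3",
    downstream_services=["billing-api", "reporting"],
    rollout_order=["us-east-1a", "us-east-1", "us-west-2", "eu-west-1"],
    monitored_metrics=["p99_latency_ms", "error_rate"],
    rollback_triggers=[RollbackTrigger(metric="error_rate", max_value=0.005)],
)
```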

How this helped

  1. Deploy to canaries first to validate performance, resource usage, and functionality against a known workload
  2. Limit the blast radius of a deployment by sequencing the rollout into smaller steps
  3. Automate rollback so that critical decisions are made up front, not during an on-call incident
  4. Aim for 5-5-5: 5 minutes to deploy, 5 to evaluate success, 5 to roll back
  5. Build a template “collective memory” of issues seen and to be avoided in the future
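
As a rough illustration of the 5-5-5 idea, an automated rollback can be a simple loop that watches a gating metric for a fixed window and reverts the moment it goes out of band. This is only a sketch: `deploy`, `fetch_metric`, and `rollback` are hypothetical hooks into your own deployment tooling, and the threshold is made up.

```python
import time

EVALUATION_WINDOW_S = 5 * 60    # 5 minutes to evaluate success
ERROR_RATE_THRESHOLD = 0.005    # "out of band" boundary, decided up front
POLL_INTERVAL_S = 10            # how often to re-check the metric

def gated_deploy(deploy, fetch_metric, rollback) -> bool:
    """Deploy, then watch the gating metric; roll back automatically if it trips.

    deploy / fetch_metric / rollback are hypothetical callables supplied by
    your own tooling; the rollback decision is encoded here, up front,
    rather than left to whoever is on call.
    """
    deploy()
    deadline = time.time() + EVALUATION_WINDOW_S
    while time.time() < deadline:
        if fetch_metric("error_rate") > ERROR_RATE_THRESHOLD:
            rollback()
            return False
        time.sleep(POLL_INTERVAL_S)
    return True   # metric stayed in band for the whole window
```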

This created a virtuous cycle. As deployments became more reliable, they were done more often, which made them smaller and therefore more reliable.

What if you can’t roll back automatically?

There’s a debate between the auto-rollback and roll-forward-only camps. Anurag’s take: if you can roll back automatically, why wouldn’t you? Many companies make rollbacks work by splitting a change into multiple deployments:

  1. Make a database schema change
  2. Start writing to that new table from the app tier
  3. Start reading from the new table
  4. Clean up old artifacts no longer being called

This is a general application of the pattern where you first make an additive interface change in the provider, then update the consumer, then remove the stale interface from the provider. Distributed systems require this type of thinking in any case: you can’t update everything simultaneously, so you need to support old and new interfaces side by side and transition gradually to new versions.
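
Here is a minimal sketch of that sequence, using sqlite3 from the standard library as a stand-in for a real database (table and column names are made up). In practice each numbered step ships as its own deployment, so any one of them can be rolled back independently.

```python
import sqlite3

# Stand-in database with an existing table the old code already uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_emails (user_id INTEGER, email TEXT)")

# Step 1: additive schema change only; nothing reads or writes it yet.
conn.execute("CREATE TABLE user_emails_v2 (user_id INTEGER, email TEXT)")

# Step 2: the app tier dual-writes so the new table fills in over time.
def save_email(user_id: int, email: str) -> None:
    conn.execute("INSERT INTO user_emails (user_id, email) VALUES (?, ?)", (user_id, email))
    conn.execute("INSERT INTO user_emails_v2 (user_id, email) VALUES (?, ?)", (user_id, email))

# Step 3: reads switch to the new table once it is trusted.
def get_email(user_id: int):
    row = conn.execute(
        "SELECT email FROM user_emails_v2 WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

# Step 4: only after the above are stable, stop the old writes and drop the table.
# conn.execute("DROP TABLE user_emails")
```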

Reason #2 that systems fail: operator error

The largest outages Anurag has seen were either operator errors or cascading failures with bad remediations.

Examples

  1. Taking 10,000 load balancers out of service rather than 100
  2. Incorrect WHERE clause in a manual DELETE operation in a control plane database
  3. Replication storm causing disks to fall over (requiring further replication)

How do you limit the blast radius of human errors?

Humans intrinsically have a 1% error rate, particularly when doing repetitive mundane tasks.

Here’s how to limit errors

  • Manual changes should be rare, but if you make them they need to be “pair programmed” with multiple eyes evaluating each command before it gets issued
  • Tool-based changes should have a limited blast radius
  • Per-resource automated changes should be heuristically rate-limited, with escalation to an operator when the limit is hit
  • For example, RDS Multi-AZ will stop failing over after X instances in Y period and raise a ticket instead (a sketch of this pattern follows below)

Ops orchestration tools should handle the above by default.
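
Here is a sketch of the rate-limit-and-escalate pattern described above. The thresholds and the `remediate`/`raise_ticket` hooks are placeholders for whatever your own tooling provides.

```python
import time
from collections import deque

class RateLimitedRemediation:
    """Run an automated remediation, but stop and escalate to a human once it
    has fired max_actions times within window_s seconds (the "X in Y" rule)."""

    def __init__(self, remediate, raise_ticket, max_actions=3, window_s=3600):
        self.remediate = remediate        # e.g. fail over a single instance
        self.raise_ticket = raise_ticket  # escalate to an operator instead
        self.max_actions = max_actions
        self.window_s = window_s
        self._history = deque()           # timestamps of recent actions

    def run(self, resource_id: str) -> bool:
        now = time.time()
        # Forget actions that fell outside the window.
        while self._history and now - self._history[0] > self.window_s:
            self._history.popleft()
        if len(self._history) >= self.max_actions:
            # Too many automated actions recently: a human should decide.
            self.raise_ticket(f"Rate limit hit; skipped remediation of {resource_id}")
            return False
        self._history.append(now)
        self.remediate(resource_id)
        return True
```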

Reason #3 systems fail: black box components

25% of the large-scale events Anurag saw involved databases.

Databases:

  1. Have a large blast radius
  2. Take a long time to recover
  3. Change query plans unexpectedly based on inaccurate statistics
  4. Are easily under-administered (e.g. PostgreSQL vacuum and transaction ID wraparound)
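
For the PostgreSQL example in point 4, one concrete guard is to watch transaction ID age so vacuum problems surface long before wraparound. Here is a sketch, assuming the psycopg2 driver is available; the DSN and alert threshold are placeholders.

```python
import psycopg2  # assumes the psycopg2 driver is installed

XID_AGE_ALERT = 1_000_000_000   # alert well before the ~2.1B wraparound limit

def databases_near_wraparound(dsn: str):
    """Return (database, xid_age) pairs whose transaction ID age is getting
    uncomfortably close to wraparound."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT datname, age(datfrozenxid) FROM pg_database")
            return [(name, age) for name, age in cur.fetchall() if age > XID_AGE_ALERT]

# Example usage (DSN is a placeholder):
# for db, age in databases_near_wraparound("dbname=postgres user=monitor"):
#     print(f"WARNING: {db} has xid age {age}; vacuum before wraparound")
```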

The same is true of many components other than databases, such as edge routers and cloud services, so look around your environment for systems with similar characteristics and take appropriate precautions.

AWS likes to avoid relational databases in its control planes, opting for DynamoDB or other home-grown technology. This is not because NoSQL is intrinsically more reliable, but because it fails in pieces, one table at a time rather than as a whole. It's less functional and less expressive, but that means the remaining functionality is expressed in your own code, which gives you control.

Try to build “escalators”, not “elevators”. Escalators perform at a lower rate, but when they fail they degrade to a still-lower rate rather than stopping outright. Elevators perform better in the normal case, but they fail absolutely and degrade badly under load.

Examples

  • pDNS used to fail a lot; by caching IP addresses, most control plane APIs continued to work (a sketch of this pattern follows the list)
  • You can use a “warm pool” to buffer EC2 control plane failures
  • You can keep enough local disk to buffer a 1-2 hour S3 outage
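
As a sketch of the DNS-caching escalator above: resolve normally, but fall back to the last known-good answer if resolution fails. A real implementation would also handle TTLs and staleness limits; this only shows the degrade-rather-than-fail shape.

```python
import socket

_last_good: dict[str, str] = {}   # hostname -> last successfully resolved IP

def resolve_with_fallback(hostname: str) -> str:
    """Resolve a hostname, falling back to the last known-good IP on failure.

    The result may be stale, but callers keep working instead of failing
    outright -- the "escalator" behavior."""
    try:
        ip = socket.gethostbyname(hostname)
        _last_good[hostname] = ip
        return ip
    except socket.gaierror:
        if hostname in _last_good:
            return _last_good[hostname]
        raise   # no cached answer: fail as we would have anyway
```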

Reason #4 systems fail: everything eventually fails

In 2020, Google reported 200 minutes of downtime across 150 large-scale events spanning Google and GCP. That's a lot of downtime for a well-run, SRE-driven organization.

Everything eventually fails, and this is where we separate failures into commonplace failures and first-time failures.

Runbooks reduce human error during commonplace failure scenarios. But runbooks still leave humans in the loop, where the time to respond is an hour or two even for a well-understood issue like a full disk.

For well-understood problems, Anurag found the only way around long remediation periods and downtime is to automate remediation. Every week they chose a problem or set of problems that would yield the greatest gains in productivity or availability and automated them.
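
For a well-understood issue like the full disk mentioned above, the automated remediation can be as simple as a script that runs when usage crosses a threshold. A minimal sketch follows; the path, threshold, and retention period are illustrative.

```python
import os
import shutil
import time

LOG_DIR = "/var/log/myservice"   # illustrative path
USAGE_LIMIT = 0.90               # remediate above 90% disk usage
MAX_LOG_AGE_S = 7 * 24 * 3600    # delete rotated logs older than a week

def disk_usage_fraction(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def remediate_full_disk() -> None:
    """If the disk is nearly full, delete old rotated logs per the runbook."""
    if disk_usage_fraction(LOG_DIR) < USAGE_LIMIT:
        return   # nothing to do
    now = time.time()
    for name in os.listdir(LOG_DIR):
        path = os.path.join(LOG_DIR, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > MAX_LOG_AGE_S:
            os.remove(path)   # old rotated log; safe to delete per the runbook
```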

Solving first-time failures is challenging because observability tools have lag, and dashboards and logs often lack data for a new event. To debug, operators often end up opening a blizzard of SSH windows.

Differences

Production ops is a real-time distributed systems problem, and requires a platform that:

  • Supports real-time views into resources and metrics
  • Supports per-second metrics
  • Integrates resources, metrics, and Linux commands together so you can view and modify the environment
  • Controls the blast radius, partial failures, retries, etc

Automating Remediation

The challenge with automating remediation of common failures is that each remediation tends to be a custom, multi-month project. Shoreline’s platform makes it easy to build automated remediations with only shell scripting skills. In the same time it takes to debug the system, you can create an automation that handles the problem forever.

Shoreline is the product Anurag wishes he’d had when managing large fleets at AWS.