We can learn something from every outage, regardless of whether it comes from a start-up or a hyperscaler like Amazon or Google. In this webinar, we hear from two reliability experts, Niall Murphy (former head of SRE at Microsoft and Google) and Anurag Gupta (former VP of AWS database and analytic services). This session covers technical lessons learned from large outages at Amazon and Google, including:
- The importance of automation and how to build circuit breakers to mitigate risk
- Scaling automation: what works at 1,000 customers may not work at 1 million customers
- Botched rollouts: don’t forget to check the failure rate of distributed jobs
- The anti-80/20 rule: avoid your biggest potential Achilles' heel by building redundancy into the systems you use the least
- How to conduct blameless post-mortems that maximize the lessons learned from any outage