Routine problems arise daily in IT operations—and solutions aren't always easy to implement.
While experienced subject matter experts (SMEs) may know how to diagnose and solve these issues quickly, they're not always available to help as they're working on new product features and higher-value projects. Without their documented fixes, other developers and staff can waste countless hours searching logs and/or Google for answers.
That's where runbooks come into play.
With runbooks, anyone on your IT operations team can promptly remediate recurring problems using a comprehensive, step-by-step guide. A well-written runbook can even serve as the basis for automating those remediations, saving you and your SMEs valuable time.
In this article, you'll discover the basics of runbooks and how to build one that sets you up to resolve persistent problems through automation.
A runbook is a collection of processes and remediation steps that IT operations staff and other employees use to solve frequent technical problems. The goal of a runbook is to share SMEs' knowledge so that other members can more quickly and consistently resolve issues on their own without escalating to SMEs.
A well-written runbook empowers DevOps engineers, site reliability engineers (SREs), and support staff to execute routine fixes consistently and efficiently. Runbooks not only improve solution quality and productivity but also provide a roadmap for building automated repairs in the future, at which point the runbook itself is no longer necessary.
There are three main types of runbooks a company may implement depending on their level of technical expertise. These types include:
Regardless of which type you use, you'll be far better off than companies not using runbooks. With runbooks, you'll be able to operate more efficiently and spend more time creating innovative products that outperform others in the market. Runbooks will also help reduce the time it takes to repair issues, creating a better customer experience.
Naturally, SMEs will be the ones to support, diagnose, and resolve new or unique technical issues. When this happens, they'll want to document the steps they take to fix a problem in a repeatable guide (the runbook) so that other team members can resolve the same issue when it recurs. This act of recording ensures that incident response doesn't create bottlenecks or pull SMEs away from higher-value projects.
While specific runbook usage may vary from company to company, there are a few typical situations for which IT operations should consider creating and using a runbook:
An SME doesn't necessarily need to write every runbook for these situations, but their knowledge and expertise will be invaluable resources to fill in the blanks and ensure accuracy.
People often use the terms runbooks and playbooks interchangeably, but they can be quite different. A playbook is the broader concept: it focuses on strategic action rather than specific tactical methods, and it often contains multiple runbooks.
For example, an IT operations team may have a playbook for deploying a security patch to a fleet of servers. Within this playbook, there may be individual runbooks for how to test the patch, deploy it, update the server configurations, and safely restart the applications.
You can think of a playbook as a novel and the runbooks as the chapters. The book and chapters have a narrative and flow, but the former is broader and tells an interconnected story.
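To make the relationship concrete, here is a minimal sketch of a playbook as an ordered collection of runbooks. The class names, fields, and the individual steps are hypothetical and chosen only for illustration; the four runbook titles mirror the security-patch example above.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One tactical procedure: an ordered list of manual steps."""
    title: str
    steps: list[str]


@dataclass
class Playbook:
    """A broader plan that strings several runbooks together in order."""
    title: str
    runbooks: list[Runbook] = field(default_factory=list)

    def outline(self) -> list[str]:
        """Return a flat, numbered outline: runbook titles with their steps."""
        lines = []
        for i, runbook in enumerate(self.runbooks, start=1):
            lines.append(f"{i}. {runbook.title}")
            for j, step in enumerate(runbook.steps, start=1):
                lines.append(f"   {i}.{j} {step}")
        return lines


# Hypothetical steps; only the four runbook titles come from the example above.
patch_playbook = Playbook(
    title="Deploy a security patch to a fleet of servers",
    runbooks=[
        Runbook("Test the patch", ["Apply patch in staging", "Run smoke tests"]),
        Runbook("Deploy the patch", ["Roll out to one server", "Roll out to the rest"]),
        Runbook("Update the server configurations", ["Apply config changes"]),
        Runbook("Safely restart the applications", ["Drain traffic", "Restart", "Re-enable traffic"]),
    ],
)

for line in patch_playbook.outline():
    print(line)
```

Keeping each runbook as its own object means it can be reused in more than one playbook, which is one plausible reason to separate the two concepts.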
Constructing a runbook is the first step toward solving routine issues through automation—helping you operate more efficiently and accurately.
Here's a look at four concrete steps you can take to write a runbook:
As you might recall, ideal candidates for runbooks are processes that IT operations team members execute frequently, that have high error rates, or that carry significant risk. Consider whether any of your team's processes fall into one of these categories.
Typical examples of processes that benefit from runbooks include resizing, archiving, or deleting files from full disks, restarting Java virtual machines (JVMs) that have CPUs maxed out due to runaway garbage collection, and terminating stuck Kubernetes pods. Having a clear runbook—and ideally automation—can minimize the risk of developers, SREs, and other support staff missing a step.
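As an illustration of the full-disk case, the diagnostic half of such a runbook might look like the sketch below. It only detects the condition and leaves the cleanup itself to the runbook. The function names and the 90% threshold are assumptions made for this example, not recommended values.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def disk_needs_attention(path: str, threshold_percent: float = 90.0) -> bool:
    """Check whether the filesystem holding `path` is at or above the threshold.

    The 90% default is an arbitrary illustrative value.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold_percent


if __name__ == "__main__":
    if disk_needs_attention("."):
        print("Disk above threshold: follow the full-disk runbook.")
    else:
        print("Disk usage is within the threshold.")
```

Separating the pure calculation (`percent_used`) from the system call makes the check easy to test without a real full disk.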
Incident reports and postmortems can also help you identify good candidate processes for runbooks. These documents include detailed analyses of what happened during an incident and any recommended follow-ups. IT operations teams can use this data to determine root causes and to prevent the issue from recurring through better documentation in a runbook.
Once you've identified an ideal task, you'll need to determine each step required to fix a problem manually. Answer questions like, “Will a ticket need to be created and/or closed?” and “Will the user have all the right security credentials and permissions to repair this issue?”
After you've determined the fix, you'll want to document it in a runbook and share it with the relevant engineering team (along with any past debug data). They'll be able to see if they need to implement any changes to solve the root cause of the issue. While an SRE or support staff member may fix the problem during the incident, engineering may need additional data to fix it permanently.
In addition to providing a solution to the problem, you'll want to include any relevant diagnostic steps to help readers identify the issue quickly in the first place. Proper diagnosis can help prevent support delays and enable readers to find the correct runbook more quickly.
With the ideal task, solution, and diagnostic steps in hand, it's time to write the runbook! This may seem like a simple step once the research is done, but the document must be to the point, easy to understand, and accurate. It should contain as much information as is necessary to diagnose and resolve the problem, and nothing more.
A typical runbook will contain the following key sections:
Some runbooks can be quite text-heavy, which can be challenging to go through. Consider including screenshots, diagrams, and flow charts to make your runbook easier to understand.
After you've written your runbook, think through how on-call teams will find the right runbook for the right incident. It's not uncommon to have multiple, similar runbooks, so having a strategy to deal with this is key.
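One possible strategy, sketched here under assumptions, is to tag each runbook and rank candidates by how many tags match the labels on an incoming alert. The index contents, tag names, and function name are hypothetical; a real setup would more likely link runbooks directly from the alerting tool.

```python
def rank_runbooks(alert_labels: set[str], index: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank runbooks by how many of their tags appear in the alert's labels.

    Runbooks with no overlap are dropped. Ties are broken alphabetically
    by title so the ordering is deterministic.
    """
    scored = [(title, len(tags & alert_labels)) for title, tags in index.items()]
    matches = [(title, score) for title, score in scored if score > 0]
    return sorted(matches, key=lambda item: (-item[1], item[0]))


# Hypothetical index: runbook title -> tags.
RUNBOOK_INDEX = {
    "Free space on a full disk": {"disk", "storage", "linux"},
    "Restart a JVM stuck in garbage collection": {"jvm", "cpu", "gc"},
    "Terminate a stuck Kubernetes pod": {"kubernetes", "pod"},
}

for title, score in rank_runbooks({"cpu", "jvm", "linux"}, RUNBOOK_INDEX):
    print(f"{score} matching tag(s): {title}")
```

Showing the match count alongside each title lets the on-call engineer see why a runbook was suggested when several similar ones exist.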
Lastly, you need a plan for keeping runbooks up to date. Products and services evolve constantly, which can change how on-call teams need to maintain them in ways you didn't foresee. Frequent maintenance is therefore key, and it gets harder as the number of runbooks grows.
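A simple way to support that plan is to record when each runbook was last reviewed and flag the ones that are overdue. The sketch below assumes a hypothetical mapping of runbook titles to review dates and an arbitrary 90-day review interval.

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,
) -> list[str]:
    """Return titles whose last review is older than max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    overdue = [
        (reviewed, title)
        for title, reviewed in last_reviewed.items()
        if reviewed < cutoff
    ]
    return [title for _, title in sorted(overdue)]


# Hypothetical review log.
reviews = {
    "Free space on a full disk": date(2024, 5, 1),
    "Restart a JVM stuck in garbage collection": date(2023, 12, 1),
    "Terminate a stuck Kubernetes pod": date(2024, 1, 15),
}

for title in stale_runbooks(reviews, today=date(2024, 6, 1)):
    print(f"Needs review: {title}")
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the check reproducible and easy to run on a schedule.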
A runbook can be an essential tool for sharing knowledge about incident remediation, helping you avoid costly escalation chains. However, runbooks are just the first step toward efficient workflows. Runbook automation can prevent the need for human intervention altogether, saving you valuable time and resources.
Shoreline's RBA Platform supports semi-automatic and fully automatic runbooks with precise alarms, so you can easily automate remediations in seconds, not weeks.