DevOps automation tools are more critical than ever, given the migration of everything to the cloud, the emphasis on Agile methodologies, and the explosion of microservices. But the DevOps tools space is crowded, and the tech stack is convoluted. So how might we pinpoint the value of a particular tool and prioritize its use?
This Guide to DevOps Automation Tools in 2021 offers an authoritative overview of many of the tools in the DevOps Automation world. First, we'll classify existing DevOps automation tools based on the problems they're attempting to solve. Then, we'll explore the domain for each group and consider a few examples of tools targeting that problem.
After reading this guide, you'll have a solid understanding of the tools that fit into each of these categories, their use cases, and which tool can best help you to solve your problems.
Before we dive into the DevOps automation tools themselves, it helps to understand the different phases of the software development life cycle and their challenges so that we can more easily categorize those tools. So we'll break down the life cycle into three phases, or "days":
Day 0 is the planning phase, which involves gathering requirements, defining the architecture, and testing early assumptions. DevOps tools are typically not needed in this phase.
Day 1 is the development phase. This phase builds on the requirements and designs of Day 0. It involves developing, building, testing, and deploying the software -- an entire application or just a single feature -- to a suitable environment. These operations often employ several well-established DevOps tools.
Day 2 is the maintenance phase, ensuring that the software is available and performant. This phase benefits greatly from DevOps automation tools.
Before the adoption of agile methodologies, when a waterfall approach to software development was in vogue, the software development life cycle progressed strictly through these phases in order. With agile development, however, the boundaries between phases are blurred, so some processes and tools straddle multiple phases.
Our guide considers the main aspects of Day 1 and Day 2 operations, along with notable tools in each category.
Now that we have a way to categorize the tools, let's dive in!
In the build and launch phase of the software development life cycle, a development team's main concerns are code management, testing, and deployment. We've broken these concerns into six major categories:
Keeping track of code as it evolves is critical to a development team's success. Automated building and testing are essential, too. Thankfully, there are tools dedicated to source control, CI/CD, and testing.
A decade ago, "code" referred only to application code. Today, configuration and infrastructure also involve code. The ability to define infrastructure-as-code and configuration-as-code allows engineers to apply the same scrutiny and auditing to their application setup as they do to application code.
The players in the Day 1 DevOps automation tools space are numerous. These tools give DevOps teams many options, but determining the best solution is complicated. Below, we'll explore each category in more detail to simplify this process.
Whether it's application code, configuration-as-code, or infrastructure-as-code, proper versioning and tracking of all code is essential to Day 1 operations. Choosing the right source control tool is often one of the first decisions in your process. Git is the most popular version control system today.
Several services host and manage Git repositories; each one varies when it comes to user and repository management, code review features, and third-party tool integrations like notifications and security scanning.
GitHub (SaaS) is not only the most widely used Git service available, but it also integrates many other tools, making it more than just a source control offering. For example, GitHub includes a lightweight issue tracker and wiki for each repository, and it offers its own CI/CD tool, GitHub Actions, as a premium add-on. GitHub also has a collection of security scanning tools to help with security vulnerabilities in third-party libraries.
On the other hand, GitHub Enterprise is self-hosted. Because of that, integrations with third-party tools are limited.
Bitbucket is a Git offering that has native integration with the Atlassian suite of tools. The cloud version of Bitbucket offers a cloud CI/CD as an add-on. Repositories can be grouped by project, which helps with team organization. Bitbucket has a cloud offering as well as self-managed and dedicated server offerings.
GitLab advertises itself as a "complete DevOps lifecycle tool." Along with providing source control through Git repository management, GitLab includes an issue tracker, project wiki, an embedded CI/CD tool, and security testing and scanning.
Though always joined together in a single acronym, continuous integration (CI) and continuous delivery (CD) are two separate but related steps in Day 1 operations. Continuous integration is an automated process that focuses on testing and successfully integrating new code. Continuous delivery (or sometimes, continuous deployment) is the automated building and deployment of newly integrated code.
A CI/CD server listens for specific events -- for example, a Git commit or a scheduled job -- and responds by triggering pipelines (or "builds") on CI/CD executor nodes. A pipeline is an automated sequence of commands defined by DevOps engineers on a per-repository basis. These commands may include the running of a test suite, the building of an application or assets, and the delivery/deployment of that application.
After the pipeline finishes, the CI/CD server stores the execution result for review by the DevOps team.
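To make the trigger, pipeline, and stored-result cycle concrete, here is a minimal Python sketch. It is not how any particular CI/CD server works: the names `run_pipeline` and `PipelineResult` are hypothetical, and each step is an ordinary Python function standing in for a shell command run on an executor node.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PipelineResult:
    """Execution record a CI/CD server might keep for later review."""
    trigger: str
    passed: List[str] = field(default_factory=list)
    failed_step: str = ""
    error: str = ""

    @property
    def succeeded(self) -> bool:
        return self.failed_step == ""


def run_pipeline(trigger: str, steps: List[Tuple[str, Callable[[], None]]]) -> PipelineResult:
    """Run the steps in order and stop at the first step that raises."""
    result = PipelineResult(trigger=trigger)
    for name, command in steps:
        try:
            command()
        except Exception as exc:
            result.failed_step = name
            result.error = str(exc)
            break  # later steps, such as deploy, never run after a failure
        result.passed.append(name)
    return result


# Hypothetical steps; real ones would run a test suite, a build, and a deployment.
def run_tests() -> None:
    assert 1 + 1 == 2


def build() -> None:
    pass


def deploy() -> None:
    pass


def failing_tests() -> None:
    raise RuntimeError("2 tests failed")


green = run_pipeline("git-commit", [("test", run_tests), ("build", build), ("deploy", deploy)])
red = run_pipeline("git-commit", [("test", failing_tests), ("build", build), ("deploy", deploy)])
print(green.succeeded, green.passed)
print(red.succeeded, red.failed_step, red.error)
```

The important design choice is the early `break`: a failed test step prevents the build and deploy steps from running, which is what keeps broken code out of production.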
CI/CD tools typically differ in the following areas:
There are several other notable CI/CD tools, in addition to those that provide CI/CD as an add-on.
While Jenkins is very flexible, it lacks deployment helpers, so even basic operations tend to require boilerplate scripts, many of which are readily available. Jenkins also has basic third-party integrations and a large plugin contributor community.
CircleCI is a standalone CI/CD tool that integrates tightly with both GitHub and Bitbucket Cloud. It provides deployment helpers for all major cloud providers and services, allowing the near-effortless creation of complex integration and deployment pipelines.
Spinnaker is an open-source tool that focuses solely on continuous delivery. Spinnaker has an extensive library of deployment helpers for complex cloud-native and container-orchestrated deployments.
Automated code testing at the CI stage can involve all kinds of tests -- unit tests, regression tests, smoke tests, and end-to-end tests (also known as integration tests). However, when it comes to end-to-end tests, a specific need unique to web application development is the end-to-end browser test. End-to-end browser tests mimic actual user interaction within a browser to ensure that code meets end-user acceptance criteria. With the right tools, browser testing automation runs as part of the CI pipeline.
Selenium is an object-oriented API to interact with headless browsers (such as Firefox or Chrome). Selenium supports several languages and browsers, making it an extremely comprehensive tool. However, the downside for development teams is a fairly steep learning curve and high initial effort to integrate with existing processes.
Puppeteer is a Node.js API for interacting with headless Chrome. While Puppeteer is not explicitly a dedicated testing tool, developers can use Puppeteer to build an in-browser testing suite, then run that suite within a CI/CD pipeline. Puppeteer conforms to the DevTools Protocol but only supports Chrome.
A decade ago, your options for deploying a web application were limited. You might've paid for a shared host virtual machine, pointed a domain name to the host, and spun up Apache. There was little need to capture this process to make it reproducible or automated.
The rise of microservices has changed all that.
Today's application deployments involve a complicated web of infrastructure components, including bare-metal servers, virtual machines, containers, API gateways, load balancers, and more. Therefore, code must define how to provision infrastructure components and establish communication between them.
Infrastructure-as-code (IaC) improves the consistency and auditability of infrastructure deployments while increasing reliability. Mature DevOps teams use IaC in conjunction with tools for provisioning infrastructure.
Most cloud providers have their own vendor-specific infrastructure provisioning tools. For example, AWS has CloudFormation, while Google has Cloud Deployment Manager. Provider-based tools might be enough to meet your business requirements, but you also run the risk of vendor lock-in.
Some infrastructure provisioning tools support several cloud providers.
Terraform uses a domain-specific language (known as HCL) to describe infrastructure. It has dozens of plugins that cover several infrastructure providers from AWS to Heroku and cloud services like Datadog or Bitbucket. Terraform plugins interact directly with the provider APIs to create, update, and destroy resources.
Terraform defines infrastructure objects as resources and groups them using modules. There's a public registry for providers and modules, and you can even create your own. The Terraform CLI keeps a local record of the state, then modifies the remote infrastructure to match what the HCL templates define.
While very similar to Terraform, Pulumi works as an SDK and is available in multiple languages. Pulumi supports dozens of providers and cloud services. It has native packages and an API for working with all major cloud providers, infrastructure resource providers, and database and monitoring services.
Much like Terraform, Pulumi also has the concept of state files.
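The role of a state file is easier to see with a small example. The Python sketch below is only an illustration of the general idea, not Terraform's or Pulumi's actual algorithm: the `plan` function, the resource names, and the dictionary layout are all hypothetical. It compares the recorded state with the desired definition and derives the create, update, and destroy actions a provider plugin would then carry out.

```python
import copy


def plan(state: dict, desired: dict) -> list:
    """Compare recorded state with the desired definition and list the actions needed."""
    actions = []
    for name in sorted(desired.keys() - state.keys()):
        actions.append(("create", name))
    for name in sorted(desired.keys() & state.keys()):
        if desired[name] != state[name]:
            actions.append(("update", name))
    for name in sorted(state.keys() - desired.keys()):
        actions.append(("destroy", name))
    return actions


def apply(desired: dict) -> dict:
    """Pretend to apply the plan; a real tool would call provider APIs here.

    Returns the new state to record, which now mirrors the desired definition.
    """
    return copy.deepcopy(desired)


# Hypothetical recorded state and desired infrastructure definition.
state = {"web_server": {"size": "small"}, "old_bucket": {"region": "eu"}}
desired = {"web_server": {"size": "large"}, "load_balancer": {"port": 443}}

actions = plan(state, desired)
print(actions)

state = apply(desired)
print(plan(state, desired))  # nothing left to do once state matches the definition
```

Because the plan is derived from a diff, running it a second time with no changes yields no actions, which is what makes this style of provisioning repeatable.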
The Serverless Framework provides an advanced helper to build, debug, and deploy serverless functions. It supports function-as-a-service (FaaS) offerings from multiple cloud providers (such as AWS Lambda, GCP Cloud Functions) and multiple languages.
Serverless works as a wrapper for vendor-specific infrastructure provisioning tools. For example, in AWS, Serverless generates and deploys a CloudFormation stack. Serverless abstracts away most of the complexity involved in deploying cloud functions.
Alongside infrastructure provisioning, Day 1 operations also include configuration management. This step defines, captures, and manages the software requirements for each component of the infrastructure. For example, it's not enough simply to provision a Linux virtual machine and then expect a Rails application to run on that machine without further configuration. Instead, you must first install software dependencies, define environment variables, and start services.
Depending on your infrastructure, you may need one or more of the following tools to ensure all requirements are adequately met across all of your servers and components.
Docker is a tool to define and run containers. Containers are lightweight, standalone packages containing the application and dependencies (such as runtime, libraries, configuration files, and binaries).
Unlike virtual machines, containers share the host's kernel, which makes them faster to start and less resource-intensive. Docker uses native kernel features to isolate a container from other processes and resources.
Docker containers are created from Docker images, and you can find countless public and pre-built images in Docker Registries (for example, Docker Hub). Additionally, DevOps teams can build their own Docker images by using a Dockerfile.
Running containers from Docker images usually requires additional runtime configuration to set environment variables, mount volumes, or apply network configurations. Docker Compose is a tool for such configuration, suitable for local setups or single-server deployments.
For production workloads and to ensure high reliability, it's advisable to use either managed container services specific to each cloud vendor or to use container orchestration, which we will cover later in this article.
Packer allows virtual machine image creation for several platforms from a single configuration file.
For example, in AWS, Packer allows DevOps teams to create new Amazon Machine Images programmatically. Packer will create a new virtual machine from a pre-defined AMI, run the defined instructions and commands, and save the result as a new, deployable AMI. Then, operators can execute those exact instructions with other providers like Azure or GCP.
You can use Packer to generate a "golden image," a shared machine image deployed in multiple hosts and adjusted at startup with a configuration management tool. Alternatively, the image can generate immutable infrastructure, where a new machine image defines each new application version.
Ansible is an open-source tool that advertises itself as the configuration management tool with the shortest learning curve and the fewest requirements. All data and configuration are written as YAML files.
Ansible can apply a configuration to multiple hosts, which are called nodes. A playbook defines multiple installation instructions, and each instruction is called a task. Running Ansible playbooks merely requires Python. Ansible connects to each node it manages via SSH and then runs through its playbook of tasks as if it were a user manually connecting to that host and running commands.
Ansible's manner of connecting via SSH removes the need for any additional client to be installed on each node machine.
Puppet is an open-source tool for configuration management and automation. While it could be used as a standalone tool for a single machine, Puppet shines when a DevOps team runs a central Puppet server and installs a Puppet agent on each managed machine. Code changes are pushed to the Puppet server; each managed machine runs the Puppet agent periodically to check in with the Puppet server to see the latest expected state. The Puppet agent subsequently adjusts its machine's current state to match the expected state.
Because of these continuous adjustments, Puppet helps to prevent configuration drift. Puppet also allows DevOps teams to generate reports on changes and retrieve advanced information for each managed server. In addition, Puppet Enterprise, a premium service, offers enterprise-grade security, automation, and compliance features.
Puppet instructions (called resources) are written in Puppet's domain-specific language, which resembles Ruby. Resources can be grouped together to make modules. Puppet Forge hosts several open-source modules for various requirements.
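The check-in cycle described above, in which an agent compares its machine's current state with the expected state and corrects any difference, can be sketched in a few lines of Python. This is a simplified illustration and not Puppet code: the `agent_run` function and the dictionary-based machine model are hypothetical stand-ins for real resources such as packages and services.

```python
def agent_run(machine: dict, expected: dict) -> list:
    """One check-in: bring `machine` in line with `expected` and report the changes made."""
    changes = []
    for key, value in expected.items():
        current = machine.get(key)
        if current != value:
            changes.append(f"{key}: {current!r} -> {value!r}")
            machine[key] = value  # a real agent would start a service, install a package, etc.
    return changes


# Hypothetical expected state served centrally, and one machine's current state.
expected = {"nginx": "running", "ntp": "running"}
machine = {"nginx": "stopped", "ntp": "running"}

first = agent_run(machine, expected)   # corrects nginx
second = agent_run(machine, expected)  # nothing to do: the run is idempotent

machine["ntp"] = "stopped"             # someone changes the machine by hand (drift)
third = agent_run(machine, expected)   # the next check-in reverts the drift

print(first, second, third)
```

Each run only touches what differs from the expected state, so repeated runs are harmless, and manual changes are undone at the next check-in.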
In some instances, due to security trade-offs, secrets and sensitive data cannot be stored on disk by configuration management tools; instead, dedicated secret management tools retrieve secrets at runtime.
One such tool is Vault, a cloud-agnostic and open-source solution (self-hosted or SaaS) for password and secrets management. Vault also provides granular control and automatic password rotation in some use cases. Moreover, depending on the storage backend used, Vault is capable of high availability.
DevOps teams need to ensure extremely high availability when using password and secrets management, as all other components will depend on this information. Many cloud providers have their own similar service for secret management. It's crucial to examine whether these cloud offerings' feature sets and availability guarantees are sufficient for your business requirements.
When deploying multiple containers (for example, Docker containers) in production, DevOps teams need to use container orchestration to ensure the desired number of healthy containers -- spread across a minimum amount of infrastructure -- are running at any given time.
All major cloud providers have their own managed container orchestrators, but there are self-hosted options as well.
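To illustrate what "a minimum amount of infrastructure" means in practice, the Python sketch below packs containers onto as few nodes as it can using a first-fit-decreasing heuristic. This is a deliberately simple stand-in: real orchestrators use far more sophisticated schedulers, and the `place` function, the container names, and the memory figures are all assumptions made for the example.

```python
def place(containers: dict, node_capacity: int) -> list:
    """Assign containers (name -> memory in MB) to nodes using first-fit decreasing.

    Returns one list of container names per node that was needed.
    """
    nodes = []  # each entry: {"free": remaining MB, "containers": [names]}
    # Largest containers first; ties broken by name so the result is deterministic.
    for name, memory in sorted(containers.items(), key=lambda kv: (-kv[1], kv[0])):
        if memory > node_capacity:
            raise ValueError(f"{name} needs {memory} MB but a node only has {node_capacity} MB")
        for node in nodes:
            if node["free"] >= memory:
                node["free"] -= memory
                node["containers"].append(name)
                break
        else:
            # No existing node has room, so add another node.
            nodes.append({"free": node_capacity - memory, "containers": [name]})
    return [node["containers"] for node in nodes]


# Hypothetical workload: memory requirements in MB, on nodes with 1024 MB each.
workload = {"api-1": 512, "api-2": 512, "worker": 768, "cache": 256}
placement = place(workload, node_capacity=1024)
print(placement)  # [['worker', 'cache'], ['api-1', 'api-2']]
```

Here four containers fit on two nodes. An orchestrator repeats a decision like this continuously, re-placing containers whenever one becomes unhealthy or a node disappears.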
Nomad is an open-source tool that focuses on container scheduling and cluster management. Nomad was developed intentionally with simplicity in mind, making it easy to deploy and maintain. Nomad covers all sorts of container technologies, and it supports multi-region and multi-cluster orchestration across private and public clouds.
Kubernetes is a feature-rich, open-source tool that provides comprehensive support (including scheduling and cluster management) for Linux containers. Unfortunately, first-time Kubernetes users may find it daunting -- particularly if Kubernetes is self-hosted -- as the learning curve is steep due to the flexibility and multitude of features supported.
On the other hand, Kubernetes has a huge operator community, and several major DevOps tools officially support it. Kubernetes enjoys a vast ecosystem with the accompaniment of hundreds of helper tools for cluster configuration, deployment definition files, observability, service mesh, and more.
While DevOps automation tools for Day 1 operations are focused on getting work from a developer's machine out to production, Day 2 tools ensure that production is stable and, if there are problems, that those problems can be found and fixed quickly.
Feature flagging, or feature toggling, is a software deployment technique that allows teams to continuously deliver and deploy new application functionality while mitigating deployment risks.
With feature flagging, code releases to a production environment in a hidden or disabled state. The changed or new functionality can then be selectively enabled (or disabled) based on a "flag." This provides a low-risk approach to deploying code without service interruption or time-consuming rollbacks.
Feature flagging is also used in A/B testing or canary deployments, in which a new feature is enabled for some users but disabled for others. Then, based on user feedback or monitoring, a DevOps or product team can decide whether or not to roll out the feature to the remaining users.
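A percentage-based rollout of this kind can be sketched as follows. This is a generic illustration in Python, not the implementation used by any feature flag service: the `bucket` and `is_enabled` functions and the flag name are hypothetical. Hashing the flag name together with the user ID gives each user a stable bucket from 0 to 99, so the same user always sees the same variant.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for users whose bucket falls below the rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


# Hypothetical flag rolled out to 20% of users.
users = [f"user-{n}" for n in range(1000)]
enabled = [u for u in users if is_enabled("new-checkout", u, 20)]
print(f"{len(enabled)} of {len(users)} users see the new feature")
```

Because a user's bucket never changes, raising the percentage from 20 to 50 only adds users; nobody who already had the feature loses it. Setting the percentage to 0 acts as a kill switch without a redeploy.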
Among the feature flag services available, two are worthy of note.
LaunchDarkly is a SaaS-based platform for engineers to integrate feature flags in their applications.
Customers set feature flags and user segments in LaunchDarkly. When a flag rule changes, LaunchDarkly sends a one-way real-time message to the target application to reflect the change. This improves performance and reliability, as your application doesn't need to reach LaunchDarkly servers on every request.
LaunchDarkly comes with several SDKs to integrate with different types of applications and languages.
The main benefit of using LaunchDarkly is that DevOps and Product teams can find and set all feature flags from a clean, easy-to-understand web dashboard. LaunchDarkly also supports an approval workflow for changing flags.
LaunchDarkly provides in-depth auditing and broad integration with the following tools:
Split offers features similar to LaunchDarkly's, and it also supports several languages. It allows creating and setting up feature flags based on custom targeting rules (for example, users of an ecommerce site who are returning customers). As features are made available to target users, their usage data is also measured and captured.
Other integrations for Split include issue tracking systems (Jira, ServiceNow DevOps, Azure DevOps), observability (Dynatrace, Datadog, New Relic), deployment (Jenkins), and messaging (Slack).
Observability and telemetry applications collect metrics, logs, events, and other critical information from a target system, and provide insights into that system's current status.
These tools collect real-time streaming data from both infrastructure components and applications. Data can include logs, application traces, performance metrics from application performance monitoring (APM), and synthetics (for example, by regularly pinging target systems). Plus, some observability tools can aggregate different metrics to provide a better picture of a system's health.
There are many observability and telemetry tools on the market. Open source tools are specialized and have been around for many years. More recent SaaS tools in the space typically take one of two approaches.
First, there are all-in-one aggregation tools. These tools store sampled and aggregated data (metrics) and issue alerts based on pre-defined expectations of how system components are expected to behave.
Second, there are high-cardinality event correlation tools. These tools don't use aggregators but instead store the entire data set of events, working more like a business intelligence tool for application and infrastructure events.
Most observability and telemetry tools allow users to establish performance thresholds and interface with alert management systems. This enables the sending of automatic alerts to relevant teams when a performance threshold is breached.
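As a rough illustration of threshold-based alerting over aggregated data, the Python sketch below averages the most recent samples of a metric and produces an alert message when the average crosses a limit. The `check_threshold` function, the latency figures, and the 300 ms threshold are assumptions chosen for the example; real tools offer much richer rules and hand the alert to an alert management system.

```python
from typing import List, Optional


def check_threshold(samples: List[float], threshold: float, window: int) -> Optional[str]:
    """Return an alert message if the mean of the last `window` samples exceeds `threshold`."""
    if len(samples) < window:
        return None  # not enough data yet to judge
    recent = samples[-window:]
    average = sum(recent) / len(recent)
    if average > threshold:
        return (f"ALERT: mean latency {average:.1f} ms over the last "
                f"{window} samples exceeds {threshold} ms")
    return None


# Hypothetical request latencies in milliseconds, oldest first.
latencies = [120, 130, 125, 480, 510, 495]

print(check_threshold(latencies[:3], threshold=300, window=3))  # None: system is healthy
print(check_threshold(latencies, threshold=300, window=3))      # alert: recent mean is 495.0 ms
```

Averaging over a window rather than alerting on a single sample is a common way to avoid paging someone for a momentary spike.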
The purpose of using observability applications in Day 2 operations is to proactively identify any issues before customers have problems. The better insight a tool can provide, the easier it becomes for engineers to apply remediation steps.
Prometheus is a prevalent open-source metrics storage tool. It comes with a few components for metric collection, processing, and alerting. Data is pulled into Prometheus via dedicated exporters, then stored as time-series data.
Data can be queried using a flexible query language called PromQL and used by API consumers to create meaningful visualizations. Prometheus is typically used with Grafana, an open-source analytics and visualization tool.
The ELK stack -- which contains Elasticsearch, Logstash, and Kibana -- is a popular open-source solution to store logs in a centralized fashion. It allows for quick search across all logs with user-generated dashboards and managed search filters.
Please note that keeping an Elasticsearch cluster (particularly self-hosted) might require specialized knowledge to balance cost, scalability, and resilience.
Splunk is an all-in-one SaaS solution for observability. It provides metrics, logs, traces (via APM), infrastructure, synthetics, and alarms. Pricing grows proportionally with data storage.
Splunk provides some integrations to receive data from other systems (like Nagios) or send notifications to other systems (such as on-call systems). Splunk also has its own on-call system.
Datadog is an all-in-one SaaS solution for observability, with multiple integrations with third-party systems. It provides advanced support for several DevOps tools, and it presents a consistent, user-friendly experience across its entire platform. In addition, Datadog provides a core interface to observe service level objectives, incidents, and some security features.
Pricing grows proportionally with data storage.
Contrary to the all-in-one tools covered above, Lightstep assumes that observability is best achieved through the ingestion and storage of all events rather than through arbitrarily aggregated metrics and sampled data. Therefore, Lightstep evaluates all telemetry data available and stores them in a cost-efficient time-series database.
Pricing scales with the number of deployed microservices.
Even with proactive monitoring, systems will sometimes still encounter failure, outage, or degraded performance. When this happens, it's called an incident. The only course of action is to find the cause of the incident and resolve it to bring the affected system back online. But proper action can only occur when:
The incident management tools employed by DevOps teams in Day 2 operations can vary in their capabilities. Most incident management systems work alongside (or are a part of) an observability tool and receive an initial alert when an incident occurs. When an alert is sent, the incident management tool might send a message to a designated "on-call" person.
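The hand-off from an alert to a designated on-call person can be sketched as below. This Python example assumes a simple weekly rotation; the `on_call_for` and `route_alert` functions, the responder names, and the rotation start date are all hypothetical, and real incident management tools support far more elaborate schedules and escalation policies.

```python
from datetime import date
from typing import List


def on_call_for(day: date, rotation: List[str], start: date) -> str:
    """Pick the responder for `day`, assuming the rotation advances every 7 days from `start`."""
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


def route_alert(alert: str, day: date, rotation: List[str], start: date) -> str:
    """Build the notification an incident tool might send to the on-call person."""
    responder = on_call_for(day, rotation, start)
    return f"Paging {responder}: {alert}"


# Hypothetical three-person rotation starting on Monday, 4 January 2021.
rotation = ["alice", "bob", "carol"]
start = date(2021, 1, 4)

print(route_alert("checkout latency above threshold", date(2021, 1, 20), rotation, start))
# Paging carol: checkout latency above threshold
```

In this sketch the schedule is pure arithmetic on dates, so anyone can work out in advance who will be paged on a given day.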
Here are some examples of incident management tools, all of which are SaaS offerings.
PagerDuty offers advanced features for on-call scheduling, as well as several ways to configure and notify people on-call (for example, phone calls, SMS, or third-party integrations). Additionally, it offers dozens of integrations with monitoring and observability tools.
Kintaba offers on-call scheduling, supports complex incident coordination workflows, and has bi-directional integration with Slack.
Blameless is a tool optimized to detect incidents (via SLOs) and minimize the need to manually run the incident process. Data from on-call and observability tools are centralized in Blameless, minimizing time spent in other tools during incidents.
Blameless interacts with multiple tools (messaging systems, observability systems) to ensure the incident process runs smoothly without the manual overhead.
FireHydrant aims to minimize manual steps for incident processes by following a pre-defined workflow. It integrates with several tools (such as Jira or Slack), allowing incident responders to focus on the incident at hand. It also provides a public status page, which can be automatically updated as part of the incident workflow.
Transposit aims to minimize manual steps for incident processes driven by pre-defined runbooks. Runbooks are coded either as checklists or as actions invoked directly from Transposit.
Unlike all the aforementioned tools, Shoreline prevents incidents by automating toil to reduce manual intervention. When Shoreline detects a known bad state, it automatically acts and applies the defined mitigation process.
Many DevOps tools are available, which can make it hard to start your DevOps journey or understand how to prioritize work. As a rule of thumb, consider:
As mentioned above, for each category of DevOps automation tools, several competitors weren't mentioned in this guide. As you proceed to develop your kit of tools, competitor offerings might warrant additional research. It's also worth noting that our classification did not cover all types of automation and infrastructure tools.
The DevOps automation tools space is littered with options. Evaluating which tools fit your workflow can feel like drifting blindly in a sea of TLAs (three-letter acronyms) and buzzwords. However, with a proper framework for evaluation in place, settling on a good set of tools might not be so daunting after all. For each tool, determine:
Our goal in this guide was to walk you through these concerns as we cover a broad range of DevOps automation tools available today -- across each aspect of both Day 1 and Day 2 operations. From here, you're well-equipped to navigate the DevOps automation space with confidence.