Runbook

Airflow server resource exhaustion during peak workflow executions.

Overview

This incident occurs when the Airflow server, which schedules and manages workflows, runs out of CPU, memory, or disk space during periods of heavy workflow execution. The result is delayed workflow runs or outright failures. Monitor the server's resource usage and provision enough capacity to keep workflows running without interruption at peak load.

Parameters

Debug

Check if there are any zombie processes
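
The original command for this step is not included here. As a minimal sketch, assuming a Linux or other Unix host where `ps` is available, zombie processes can be found by looking for a process state that starts with `Z`. The function names `find_zombies` and `list_zombies` are illustrative, not part of Airflow.

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for each process whose state starts with Z."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, command = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), command))
    return zombies


def list_zombies():
    """Run ps on this host and return any zombie processes (assumes a Unix ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)
```

A zombie holds no memory or CPU itself, but a growing count usually means a parent process (identified by the returned PPID) is not reaping its children.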

Check CPU usage
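
One way to approximate this check, sketched here with only the Python standard library, is to compare the system load average against the number of cores. The helper names are hypothetical, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load_average, cores):
    """Normalise a load average by core count; values at or above 1.0 suggest saturation."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def cpu_report():
    """Return the 1, 5 and 15 minute load averages per core (Unix only)."""
    cores = os.cpu_count() or 1
    return tuple(load_per_core(value, cores) for value in os.getloadavg())
```

On Linux the load average also counts processes waiting on disk I/O, so a high value is a prompt to look closer rather than proof of CPU exhaustion.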

Check memory usage
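
A minimal sketch of this check, assuming a Linux host that exposes `/proc/meminfo`: it computes the share of memory in use from the `MemTotal` and `MemAvailable` fields. The function names are illustrative only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_used_percent(info):
    """Percentage of memory that is not available for new allocations."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def read_memory_used_percent(path="/proc/meminfo"):
    """Read the live value from the kernel (Linux only)."""
    with open(path) as handle:
        return memory_used_percent(parse_meminfo(handle.read()))
```

`MemAvailable` is used rather than `MemFree` because it accounts for page cache that the kernel can reclaim under pressure.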

Check disk space usage
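
As a sketch of this step, the standard library's `shutil.disk_usage` reports total, used, and free bytes for the filesystem that holds a given path. The 85 percent threshold and the helper names below are assumptions for illustration, not values from this runbook.

```python
import shutil


def disk_used_percent(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def paths_over_threshold(paths, threshold_percent=85.0):
    """Return {path: used_percent} for each path at or above the threshold."""
    over = {}
    for path in paths:
        percent = disk_used_percent(path)
        if percent >= threshold_percent:
            over[path] = percent
    return over
```

Useful paths to pass in are the root filesystem and the directory where Airflow writes its logs, since task logs are a common cause of disk growth.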

Check Airflow logs for errors or warnings
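
The following sketch counts warning and error lines across a log directory. It assumes log files end in `.log` and that the level name appears as a whole word in each line; the directory to scan is deployment-specific (often a `logs` folder under the Airflow home directory), and the function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Match the level name as a whole word anywhere in the line.
LEVEL_PATTERN = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")


def count_levels(lines):
    """Count lines by the first WARNING, ERROR or CRITICAL marker they contain."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log_dir(log_dir):
    """Aggregate level counts over every *.log file below log_dir."""
    counts = Counter()
    for log_file in Path(log_dir).rglob("*.log"):
        with open(log_file, errors="replace") as handle:
            counts.update(count_levels(handle))
    return counts
```

A sudden rise in counts that lines up with the peak execution window is a good starting point for reading the individual messages.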

Check Airflow configuration for resource limits
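
As an illustration of this check, the sketch below reads the concurrency-related options from the text of an `airflow.cfg` file. The option names are the Airflow 2.x ones (older releases used `dag_concurrency` in place of `max_active_tasks_per_dag`), the list is not exhaustive, and `read_limits` is a hypothetical helper.

```python
import configparser

# (section, option) pairs that cap how much work Airflow runs at once.
LIMIT_OPTIONS = [
    ("core", "parallelism"),
    ("core", "max_active_tasks_per_dag"),
    ("core", "max_active_runs_per_dag"),
    ("celery", "worker_concurrency"),
]


def read_limits(cfg_text):
    """Return {"section.option": value} for each limit present in the config text."""
    # Interpolation is disabled so that % characters in other options are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    limits = {}
    for section, option in LIMIT_OPTIONS:
        if parser.has_option(section, option):
            limits[f"{section}.{option}"] = parser.get(section, option)
    return limits
```

Environment variables of the form `AIRFLOW__CORE__PARALLELISM` override the file, so the effective value is best confirmed with `airflow config get-value core parallelism` on the server itself.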

Check if Airflow workers are consuming too many resources
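
A rough sketch of this step, assuming a Unix `ps` and that Airflow processes have the word `airflow` in their command line: it lists the Airflow processes that use the most memory. The function names and the default limit of five are illustrative.

```python
import subprocess


def top_airflow_processes(ps_output, limit=5):
    """Return (pid, cpu_percent, mem_percent, command) rows, highest memory first."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        if len(fields) < 4 or "airflow" not in fields[3]:
            continue
        pid, cpu, mem, args = fields
        rows.append((int(pid), float(cpu), float(mem), args))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


def current_airflow_processes(limit=5):
    """Run ps on this host and summarise Airflow processes (assumes a Unix ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,args"],
        capture_output=True, text=True, check=True,
    )
    return top_airflow_processes(result.stdout, limit)
```

Matching on the command line is a heuristic: task subprocesses launched under a different name will not appear in the result.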

Check if there are any blocked I/O operations
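
One possible way to check for blocked I/O on Linux, sketched below, is to read `/proc/stat`: the `procs_blocked` line gives the number of processes currently waiting on I/O, and the fifth counter on the aggregate `cpu` line is time spent in iowait. The function names are hypothetical.

```python
def parse_proc_stat(text):
    """Return (procs_blocked, iowait_share) from /proc/stat content.

    iowait_share is the fraction of CPU time spent waiting on I/O since boot,
    so it is a long-run average rather than a current reading.
    """
    blocked = None
    iowait_share = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu" and len(fields) >= 6:
            # Columns after "cpu": user nice system idle iowait irq softirq steal
            ticks = [int(value) for value in fields[1:9]]
            total = sum(ticks)
            iowait_share = ticks[4] / total if total else 0.0
        elif fields[0] == "procs_blocked" and len(fields) >= 2:
            blocked = int(fields[1])
    return blocked, iowait_share


def read_proc_stat(path="/proc/stat"):
    """Read the live counters from the kernel (Linux only)."""
    with open(path) as handle:
        return parse_proc_stat(handle.read())
```

A `procs_blocked` value that stays above zero across repeated readings points to a storage bottleneck rather than a CPU or memory shortage.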

Likely root cause: the Airflow server is running on a machine with too little memory or CPU capacity to handle peak workflow loads.

Repair

Scaling up server resources: increase the CPU, memory, or disk space available to the Airflow server during peak execution periods, either by resizing the existing server or by moving to a more powerful one.
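
To decide how far to scale, a simple proportional estimate can help. The sketch below is not a method prescribed by this runbook: it assumes resource demand grows linearly with load, and the 70 percent target utilization and the name `required_capacity` are illustrative choices.

```python
import math


def required_capacity(current_capacity, peak_utilization, target_utilization=0.7):
    """Estimate the capacity needed so that peak load lands at the target utilization.

    current_capacity: units provisioned today (for example vCPUs or GB of RAM)
    peak_utilization: fraction of that capacity used at peak (0.95 means 95 percent)
    target_utilization: fraction the peak should occupy after scaling (assumed 0.7)
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if peak_utilization < 0:
        raise ValueError("peak_utilization must not be negative")
    return math.ceil(current_capacity * peak_utilization / target_utilization)
```

For example, a server with 8 vCPUs that peaks at 95 percent would need 11 vCPUs to bring the same peak down to roughly 70 percent. Apply the estimate separately to CPU, memory, and disk, since only one of them may be the constraint.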