Runbook
Bottleneck in Airflow DAG Scheduler Causing Task Execution Delays
Overview
This incident type covers a bottleneck in the Apache Airflow DAG (Directed Acyclic Graph) Scheduler that delays task execution. Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. The DAG Scheduler schedules tasks based on their dependencies and on the availability of resources. When it becomes a bottleneck, tasks start later than planned, which slows the overall workflow and can cause failures. The debug steps in this runbook assume Airflow runs on Kubernetes with the DAG Scheduler in its own Pod.
Parameters
Debug
List all Apache Airflow Pods running in the cluster
Check the logs of the DAG Scheduler Pod
Check the resource usage of the DAG Scheduler Pod
Check the status of the Kubernetes Nodes
Check the resource usage of the Kubernetes Nodes
Check the CPU and memory limits set for the DAG Scheduler Pod
Check the CPU and memory usage of the DAG Scheduler Pod over time
Check the network connectivity between the DAG Scheduler Pod and other Pods
Check the Kubernetes events related to the DAG Scheduler Pod
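The debug steps above could be scripted roughly as follows. This is a minimal sketch, not the runbook's own commands: the namespace `airflow` and the Pod name `airflow-scheduler-0` are placeholder assumptions to replace with your own values, and `kubectl top` only works if the cluster has a metrics server installed. Usage over time and network connectivity are left out because they depend on the monitoring stack and on which tools the scheduler image ships.

```python
import shlex
import shutil
import subprocess

# Assumed placeholders, not taken from the runbook: adjust to your cluster.
NAMESPACE = "airflow"
SCHEDULER_POD = "airflow-scheduler-0"


def debug_commands(namespace=NAMESPACE, scheduler_pod=SCHEDULER_POD):
    """Return one kubectl argv list per debug step, keyed by step name."""
    return {
        # List all Airflow Pods in the namespace.
        "list_pods": ["kubectl", "get", "pods", "-n", namespace],
        # Last 200 log lines of the scheduler Pod.
        "scheduler_logs": ["kubectl", "logs", "-n", namespace,
                           scheduler_pod, "--tail=200"],
        # Point-in-time CPU/memory usage of the scheduler Pod.
        "pod_usage": ["kubectl", "top", "pod", scheduler_pod, "-n", namespace],
        # Node status (Ready / NotReady).
        "node_status": ["kubectl", "get", "nodes"],
        # Point-in-time CPU/memory usage of the nodes.
        "node_usage": ["kubectl", "top", "nodes"],
        # Requests and limits configured on the scheduler containers.
        "pod_limits": ["kubectl", "get", "pod", scheduler_pod, "-n", namespace,
                       "-o", "jsonpath={.spec.containers[*].resources}"],
        # Events that involve the scheduler Pod.
        "events": ["kubectl", "get", "events", "-n", namespace,
                   "--field-selector",
                   f"involvedObject.name={scheduler_pod}"],
    }


def run_all(commands):
    """Run each command and print its output; skip if kubectl is missing."""
    if shutil.which("kubectl") is None:
        print("kubectl not found on PATH; printing commands only")
        for argv in commands.values():
            print(shlex.join(argv))
        return
    for name, argv in commands.items():
        print(f"== {name}: {shlex.join(argv)}")
        result = subprocess.run(argv, capture_output=True, text=True)
        print(result.stdout or result.stderr)


if __name__ == "__main__":
    run_all(debug_commands())
```

Building the commands as argument lists rather than shell strings avoids quoting problems with the `jsonpath` expression and makes it easy to print the exact commands for copy-pasting when the script cannot reach the cluster.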
Repair