This incident occurs when Apache Airflow workers exhaust their CPU, memory, or disk resources. Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. Its workers execute tasks, so resource exhaustion on a worker degrades platform performance and can cause workflow failures. Respond immediately: confirm that the workers have sufficient resources to execute tasks and that the platform is functioning correctly.
Parameters
Debug
Check the CPU usage of the affected worker node
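As a rough first look, the sketch below reads the Unix load average with Python's standard library and normalizes it by core count. Load average is only a proxy for CPU usage, and the 1.0 threshold mentioned in the comments is an assumed rule of thumb, not an Airflow setting.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by core count.

    Sustained values near or above 1.0 suggest the CPUs are saturated
    (an assumed rule of thumb).
    """
    return load_avg / max(cores, 1)


def cpu_snapshot() -> dict:
    """Return the 1-, 5-, and 15-minute load per core (Unix only)."""
    cores = os.cpu_count() or 1
    one, five, fifteen = os.getloadavg()
    return {
        "cores": cores,
        "load_1m": load_per_core(one, cores),
        "load_5m": load_per_core(five, cores),
        "load_15m": load_per_core(fifteen, cores),
    }


if __name__ == "__main__":
    print(cpu_snapshot())
```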
Check the disk usage of the affected worker node
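One way to check this from Python is `shutil.disk_usage`. The sketch below assumes the root filesystem is the one of interest; on a real worker the volume holding the Airflow logs or scratch data may matter more, and that path depends on the deployment.

```python
import shutil


def disk_percent_used(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # "/" is an assumption; pass the mount point that backs the worker's data.
    print(f"/ is {disk_percent_used('/'):.1f}% full")
```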
Check the memory usage of the affected worker node
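On a Linux worker, memory pressure can be estimated from `/proc/meminfo`. This is a minimal sketch that assumes a Linux kernel exposing the `MemTotal` and `MemAvailable` fields; other platforms need a different source.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_percent_used(info: dict) -> float:
    """Percentage of memory that is not available for new allocations."""
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    with open("/proc/meminfo") as f:  # Linux only
        used = memory_percent_used(parse_meminfo(f.read()))
    print(f"{used:.1f}% of memory in use")
```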
Check the currently running processes on the affected worker node
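The following sketch lists the processes using the most CPU by parsing the output of `ps -A -o pid,pcpu,pmem,comm`. It assumes a `ps` that supports those options (as procps on Linux does); sorting is done in Python rather than relying on platform-specific sort flags.

```python
import subprocess


def top_processes(ps_output: str, n: int = 5) -> list:
    """Return the n rows with the highest %CPU as (pid, pcpu, pmem, command)."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.strip().split(None, 3)
        if len(fields) < 4:
            continue
        pid, pcpu, pmem, command = fields
        try:
            rows.append((int(pid), float(pcpu), float(pmem), command))
        except ValueError:
            continue  # skip rows that do not parse as numbers
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


if __name__ == "__main__":
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,pcpu,pmem,comm"],
        capture_output=True, text=True, check=True,
    )
    for row in top_processes(result.stdout):
        print(row)
```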
Identify any specific Airflow tasks that may be causing the resource exhaustion
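One possible approach is to map heavy processes back to Airflow tasks through their command lines. The sketch below assumes Airflow 2.x, where a task is launched with `airflow tasks run <dag_id> <task_id> ...`; process titles can be rewritten at runtime and differ between executors and versions, so treat the pattern as an assumption to verify on the affected node. The helper name `extract_airflow_task` is hypothetical.

```python
import re
import subprocess

# Split on whitespace and on list punctuation, so both plain command lines and
# titles shaped like "['airflow', 'tasks', 'run', ...]" yield clean tokens.
_TOKEN = re.compile(r"[^\s,'\"\[\]]+")


def extract_airflow_task(cmdline: str):
    """Return (dag_id, task_id) if cmdline looks like `airflow tasks run`, else None."""
    tokens = _TOKEN.findall(cmdline)
    for i in range(len(tokens) - 4):
        is_airflow = tokens[i].rsplit("/", 1)[-1] == "airflow"
        if is_airflow and tokens[i + 1:i + 3] == ["tasks", "run"]:
            return tokens[i + 3], tokens[i + 4]
    return None


if __name__ == "__main__":
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,args"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.splitlines()[1:]:
        task = extract_airflow_task(line)
        if task:
            print(line.split(None, 1)[0], task)
```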
Check the Airflow logs for any errors or warnings that may be related to the resource exhaustion
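A simple scan for warning and error lines can narrow the search. In this sketch the log directory (`~/airflow/logs`) and the list of resource-related phrases are assumptions; adjust both to the deployment's actual log location and failure messages.

```python
import re
from pathlib import Path

LEVELS = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")
# Assumed examples of messages that often accompany resource exhaustion.
RESOURCE_HINTS = re.compile(
    r"MemoryError|out of memory|No space left on device|SIGKILL",
    re.IGNORECASE,
)


def find_problem_lines(lines) -> list:
    """Return lines that carry a warning/error level or a resource hint."""
    hits = []
    for line in lines:
        if LEVELS.search(line) or RESOURCE_HINTS.search(line):
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    log_dir = Path.home() / "airflow" / "logs"  # assumed location
    for log_file in sorted(log_dir.rglob("*.log")):
        with open(log_file, errors="replace") as f:
            for hit in find_problem_lines(f):
                print(f"{log_file}: {hit}")
```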
Restart the Airflow worker process on the affected node
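How the restart is performed depends on how the worker is deployed. The sketch below assumes the worker runs as a systemd service under the hypothetical unit name `airflow-worker`; container or Kubernetes deployments need a different mechanism. It defaults to a dry run so the command can be reviewed before it is executed.

```python
import subprocess


def restart_worker(unit: str = "airflow-worker", dry_run: bool = True) -> list:
    """Build, and optionally run, the systemd restart command for the worker.

    `unit` is a hypothetical service name; replace it with the real one.
    """
    command = ["systemctl", "restart", unit]
    if not dry_run:
        subprocess.run(command, check=True)  # raises if the restart fails
    return command


if __name__ == "__main__":
    print("Would run:", " ".join(restart_worker()))
```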
One possible cause: the worker was not provisioned with enough CPU, memory, or disk to handle the workload assigned to it.
Repair
Increase the number of Airflow workers so that enough resources are available to handle the workload.
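To estimate how many workers to run, a back-of-the-envelope calculation can help. This sketch assumes the CeleryExecutor, where each worker runs up to `worker_concurrency` tasks at once (16 by default), and it considers task slots only; CPU and memory per task still need to be checked separately. The function name and the sizing heuristic are illustrative, not part of Airflow.

```python
def workers_needed(concurrent_tasks: int, worker_concurrency: int = 16) -> int:
    """Smallest worker count whose combined task slots cover the peak load."""
    if worker_concurrency < 1:
        raise ValueError("worker_concurrency must be at least 1")
    if concurrent_tasks <= 0:
        return 0
    # Ceiling division using integers only.
    return -(-concurrent_tasks // worker_concurrency)


if __name__ == "__main__":
    peak = 40  # assumed peak number of simultaneously running tasks
    print(f"{peak} concurrent tasks need {workers_needed(peak)} workers")
```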