Runbook

YARN ResourceManager Failure Impacting Spark Jobs

Overview

This incident type covers a failure of the YARN ResourceManager that disrupts Spark jobs. The ResourceManager allocates resources and schedules applications in a Hadoop cluster; while it is down, new Spark applications cannot be submitted and running ones cannot obtain additional containers. Resolving the incident means identifying the root cause of the failure, restoring the ResourceManager, and taking steps to prevent a recurrence.

Debug

Check the status of the YARN ResourceManager service
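One way to script this check is to query the ResourceManager REST endpoint /ws/v1/cluster/info, which reports the service state and, when high availability is enabled, the HA state. The sketch below assumes the web interface listens on the default port 8088; the host name in RM_URL is a placeholder, not a value from this runbook.

```python
import json
import urllib.error
import urllib.request

# Placeholder address (assumption): replace with your ResourceManager web address.
RM_URL = "http://rm-host.example.com:8088"


def summarize_cluster_info(payload):
    """Return (state, ha_state) from a /ws/v1/cluster/info response body."""
    info = payload.get("clusterInfo") or {}
    return info.get("state", "UNKNOWN"), info.get("haState", "UNKNOWN")


def check_resourcemanager(rm_url=RM_URL, timeout=5):
    """Query the ResourceManager; a connection error usually means it is down."""
    url = f"{rm_url}/ws/v1/cluster/info"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return summarize_cluster_info(json.load(resp))
    except (urllib.error.URLError, OSError) as exc:
        # Could not connect or read: report the error text instead of a state.
        return "UNREACHABLE", str(exc)
```

Calling check_resourcemanager() against a healthy active node would be expected to return a pair such as ("STARTED", "ACTIVE"). On clusters with HA enabled, the command yarn rmadmin -getServiceState <rm-id> gives the same active/standby answer from the command line.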

Check the logs of the YARN ResourceManager service
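A small filter can pull the ERROR and FATAL entries out of the ResourceManager log so the failure is easier to spot. This is a sketch that assumes the common log4j layout (timestamp, level, class, message); the log location varies by distribution, so the path must be supplied by the operator.

```python
import re

# Assumed layout: "2024-01-01 12:00:00,123 ERROR some.Class: message".
# Adjust the pattern if your log4j configuration uses a different format.
LEVEL_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (ERROR|FATAL)\b"
)


def find_error_lines(lines):
    """Return the lines logged at ERROR or FATAL level, in their original order."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.match(line)]


def scan_log_file(path, last=20):
    """Return up to `last` of the most recent ERROR/FATAL lines in the file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_error_lines(fh)[-last:]
```

Stack-trace continuation lines do not start with a timestamp and are skipped here; once a suspicious entry is found, open the log around that timestamp to read the full trace.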

Check the resource usage of the YARN ResourceManager service
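Heap pressure in the ResourceManager JVM is a common thing to look at here. The sketch below reads the JvmMetrics bean from the /jmx endpoint of the ResourceManager web interface; the bean name and the MemHeapUsedM / MemHeapMaxM field names are assumptions based on the standard Hadoop metrics and should be confirmed against the /jmx output of your version. The host name is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Placeholder address (assumption): replace with your ResourceManager web address.
RM_URL = "http://rm-host.example.com:8088"
# Assumed bean name for the ResourceManager JVM metrics.
JVM_BEAN = "Hadoop:service=ResourceManager,name=JvmMetrics"


def heap_usage_percent(jmx_payload):
    """Return heap usage as a percentage, or None if the fields are unavailable."""
    beans = jmx_payload.get("beans") or []
    if not beans:
        return None
    used = beans[0].get("MemHeapUsedM")
    maximum = beans[0].get("MemHeapMaxM")
    # A missing, zero, or negative maximum means no ratio can be computed.
    if used is None or not maximum or maximum <= 0:
        return None
    return 100.0 * used / maximum


def fetch_heap_usage(rm_url=RM_URL, timeout=5):
    """Fetch the JvmMetrics bean over HTTP and compute the heap usage percentage."""
    url = f"{rm_url}/jmx?qry={urllib.parse.quote(JVM_BEAN, safe=':=,')}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return heap_usage_percent(json.load(resp))
```

A value that stays close to 100 suggests the ResourceManager heap is undersized for the current load. CPU and memory of the process itself can also be inspected on the host with standard tools such as top or ps.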

Check if the Spark jobs are running
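Running Spark applications can be listed through the ResourceManager REST endpoint /ws/v1/cluster/apps, filtered by state and application type. This is a minimal sketch; the host name is a placeholder, and the function names are illustrative only.

```python
import json
import urllib.request

# Placeholder address (assumption): replace with your ResourceManager web address.
RM_URL = "http://rm-host.example.com:8088"


def running_spark_apps(payload):
    """Return (id, name, state) tuples from a /ws/v1/cluster/apps response body."""
    # The API returns {"apps": null} when nothing matches, so guard both levels.
    apps = (payload.get("apps") or {}).get("app") or []
    return [(a.get("id"), a.get("name"), a.get("state")) for a in apps]


def fetch_running_spark_apps(rm_url=RM_URL, timeout=5):
    """Ask the ResourceManager for applications of type SPARK in state RUNNING."""
    url = f"{rm_url}/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return running_spark_apps(json.load(resp))
```

The command-line equivalent is yarn application -list -appStates RUNNING -appTypes SPARK. An empty result while jobs are expected to be running is consistent with the ResourceManager having lost or rejected them.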

Check the logs of the Spark jobs
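Aggregated logs for a Spark application can be retrieved with yarn logs -applicationId <id>, provided log aggregation is enabled on the cluster. The sketch below wraps that command and validates the application ID first; the helper names are illustrative, and the yarn client is assumed to be on the PATH.

```python
import re
import subprocess

# YARN application IDs have the form application_<cluster timestamp>_<sequence>.
APP_ID_RE = re.compile(r"^application_\d+_\d+$")


def yarn_logs_command(app_id):
    """Build the argument list for `yarn logs`, rejecting malformed IDs."""
    if not APP_ID_RE.match(app_id):
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return ["yarn", "logs", "-applicationId", app_id]


def fetch_spark_logs(app_id):
    """Run `yarn logs` and return (exit code, stdout, stderr) as text."""
    result = subprocess.run(
        yarn_logs_command(app_id), capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout, result.stderr
```

In the driver and executor output, look for errors about lost connections to the ResourceManager or failed container requests around the time of the incident.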

Check the resource usage of the Spark jobs
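The same /ws/v1/cluster/apps response carries per-application allocation fields (allocatedMB, allocatedVCores, runningContainers), so total Spark usage can be summed from it. The sketch below works on an already decoded response body; the function name is illustrative.

```python
def summarize_allocation(payload):
    """Sum memory, vcores, and containers over the apps in a cluster/apps response."""
    apps = (payload.get("apps") or {}).get("app") or []

    def total(field):
        # Applications that are not running may report -1; count those as zero.
        return sum(max(a.get(field, 0), 0) for a in apps)

    return {
        "apps": len(apps),
        "allocated_mb": total("allocatedMB"),
        "allocated_vcores": total("allocatedVCores"),
        "running_containers": total("runningContainers"),
    }
```

Comparing these totals with the cluster capacity reported by /ws/v1/cluster/metrics (for example totalMB and availableMB) shows whether the Spark jobs are starved of resources or are themselves consuming most of the cluster.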

Possible root cause: the YARN ResourceManager was overloaded with requests and failed as a result.

Repair

Add one or more standby ResourceManager nodes to the cluster (YARN ResourceManager high availability) so that a standby can take over when the active node fails, reducing the impact of a single node failure.
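After adding ResourceManager nodes, it can help to verify that yarn-site.xml actually enables high availability. The sketch below checks a few of the standard HA properties (yarn.resourcemanager.ha.enabled, yarn.resourcemanager.ha.rm-ids, and a per-node hostname or address); it is not a complete validation, and the ZooKeeper settings are left out because their property names differ between Hadoop versions.

```python
import xml.etree.ElementTree as ET


def properties_from_root(root):
    """Collect <property><name>/<value> pairs from a Hadoop configuration root."""
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_ha(props):
    """Return a list of problems found in the ResourceManager HA settings."""
    problems = []
    if props.get("yarn.resourcemanager.ha.enabled", "false").lower() != "true":
        problems.append("yarn.resourcemanager.ha.enabled is not true")
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
    if len(rm_ids) < 2:
        problems.append("yarn.resourcemanager.ha.rm-ids lists fewer than two nodes")
    for rm_id in rm_ids:
        # Either a hostname or an explicit address identifies each node.
        if not (props.get(f"yarn.resourcemanager.hostname.{rm_id}")
                or props.get(f"yarn.resourcemanager.address.{rm_id}")):
            problems.append(f"no hostname or address configured for {rm_id}")
    return problems
```

For example, check_ha(properties_from_root(ET.parse(path).getroot())) returns an empty list when these settings look consistent, where path points at the yarn-site.xml of the cluster (its location depends on the installation).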