This incident type covers a failure of the YARN ResourceManager, which disrupts Spark jobs running on the cluster. The ResourceManager allocates and schedules resources across a Hadoop cluster; when it fails, new Spark jobs cannot be submitted or granted containers, and running jobs may stall. The incident calls for investigating the root cause of the failure, restoring the ResourceManager, and taking steps to prevent a recurrence.
Parameters
Debug
Check the status of the YARN ResourceManager service
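One way to check the service state is the `yarn rmadmin -getServiceState` command. This is a minimal sketch that assumes ResourceManager HA is enabled; the id `rm1` is a placeholder for an id listed in `yarn.resourcemanager.ha.rm-ids`, and the function names are illustrative only.

```shell
# Assumption: ResourceManager HA is enabled. On a non-HA cluster this command
# is not available; check the service manager or the RM web UI instead.
rm_state() {
    # Prints the HA state reported for the given RM id (for example "active").
    yarn rmadmin -getServiceState "$1"
}

rm_is_healthy() {
    # Succeeds only if the RM answers with a recognised HA state.
    state=$(rm_state "$1" 2>/dev/null) || return 1
    case "$state" in
        active|standby) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage ("rm1" is a placeholder id):
#   rm_is_healthy rm1 && echo "rm1 is up" || echo "rm1 is down or unreachable"
```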
Check the logs of the YARN ResourceManager service
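The ResourceManager log can be scanned for recent errors along these lines. The log directory and file name pattern below are assumptions that vary by distribution, and `rm_recent_errors` is a hypothetical helper name.

```shell
# Assumption: RM logs live under $HADOOP_LOG_DIR, falling back to a common
# packaged default. Adjust the path for your installation.
RM_LOG_DIR="${HADOOP_LOG_DIR:-/var/log/hadoop-yarn}"

rm_recent_errors() {
    # $1 = log file, $2 = maximum number of matching lines to print (default 50).
    # Matches ERROR/FATAL log levels and JVM out-of-memory messages.
    grep -E ' (ERROR|FATAL) |OutOfMemoryError' "$1" | tail -n "${2:-50}"
}

# Usage:
#   for f in "$RM_LOG_DIR"/*resourcemanager*.log; do rm_recent_errors "$f"; done
```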
Check the resource usage of the YARN ResourceManager service
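A rough way to see how much CPU and memory the ResourceManager process is using, run on the ResourceManager host. This sketch assumes a Linux host with `pgrep` and `ps` available; the helper names are illustrative.

```shell
rm_pid() {
    # Finds the RM JVM by its main class name; prints the first matching PID.
    pgrep -f 'org.apache.hadoop.yarn.server.resourcemanager.ResourceManager' | head -n 1
}

rm_usage() {
    # Prints: PID, CPU %, memory %, resident set size in KiB.
    pid=$(rm_pid)
    [ -n "$pid" ] || { echo "ResourceManager process not found" >&2; return 1; }
    ps -o pid=,pcpu=,pmem=,rss= -p "$pid"
}

# Usage:
#   rm_usage
```

If memory looks high, `jstat -gcutil <pid>` from the JDK can show whether the heap is under garbage-collection pressure.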
Check if the Spark jobs are running
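Running Spark applications can be listed with `yarn application -list`. The sketch below extracts just the application ids; `running_spark_apps` is a hypothetical helper name. If the ResourceManager is down, the command typically fails or hangs on connection retries and the list comes back empty.

```shell
running_spark_apps() {
    # Lists ids of Spark applications that YARN reports as RUNNING.
    # Header lines are skipped by keeping only rows that start with an app id.
    yarn application -list -appTypes SPARK -appStates RUNNING 2>/dev/null |
        awk '$1 ~ /^application_/ { print $1 }'
}

# Usage:
#   running_spark_apps
```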
Check the logs of the Spark jobs
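Container logs for a Spark application can be fetched with `yarn logs -applicationId`. This sketch assumes YARN log aggregation is enabled and that the application id is known (for example from `yarn application -list`); `spark_app_errors` is an illustrative helper name.

```shell
spark_app_errors() {
    # $1 = application id, $2 = maximum number of matching lines (default 50).
    # Prints the most recent log lines that mention an error or exception.
    yarn logs -applicationId "$1" 2>/dev/null |
        grep -E 'ERROR|Exception' | tail -n "${2:-50}"
}

# Usage (placeholder application id):
#   spark_app_errors application_1700000000000_0001
```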
Check the resource usage of the Spark jobs
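For a single Spark application, `yarn application -status` reports its state, progress, and aggregate resource allocation. The sketch below filters the report down to those fields; the helper name is hypothetical, and the exact field labels may differ between Hadoop versions.

```shell
spark_app_resources() {
    # $1 = application id. Keeps only the state, progress, and aggregate
    # resource allocation lines of the application report.
    yarn application -status "$1" 2>/dev/null |
        grep -E '^[[:space:]]*(State|Progress|Aggregate Resource Allocation) :'
}

# Usage (placeholder application id):
#   spark_app_resources application_1700000000000_0001
```

For a live, cluster-wide view, `yarn top` shows per-application memory and vcore usage interactively.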
The YARN ResourceManager may have been overloaded with requests, causing it to fail.
Repair
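Run the ResourceManager in high-availability mode by adding one or more standby ResourceManager nodes, so that a standby can take over when the active node fails. Note that only one ResourceManager is active at a time, so standbys add redundancy rather than spreading request load.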
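After standby nodes have been added and HA has been configured (`yarn.resourcemanager.ha.enabled` and `yarn.resourcemanager.ha.rm-ids` in `yarn-site.xml`), the setup can be verified roughly as follows. The ids `rm1` and `rm2` are placeholders and `ha_summary` is a hypothetical helper; this is a sketch, not a complete HA rollout procedure.

```shell
ha_summary() {
    # Takes the configured RM ids as arguments, prints how many are active
    # and standby, and succeeds only with exactly one active RM and at
    # least one standby.
    active=0
    standby=0
    for id in "$@"; do
        case "$(yarn rmadmin -getServiceState "$id" 2>/dev/null)" in
            active) active=$((active + 1)) ;;
            standby) standby=$((standby + 1)) ;;
        esac
    done
    echo "active=$active standby=$standby"
    [ "$active" -eq 1 ] && [ "$standby" -ge 1 ]
}

# Usage (placeholder ids):
#   ha_summary rm1 rm2 || echo "HA is not in the expected state"
```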