---
id: ab00d9fa-b10f-4642-8fb3-d3b8b66764aa
---

# YARN ResourceManager Failure Impacting Spark Jobs.
---

This incident type involves the failure of the YARN ResourceManager, which disrupts Spark jobs running on YARN. The ResourceManager allocates cluster resources (containers) to applications in a Hadoop cluster. While it is down, new Spark jobs cannot be submitted and running jobs cannot obtain additional containers. Resolving the incident requires investigating the root cause of the failure, restoring the ResourceManager, and taking steps to prevent similar failures in the future.

### Parameters
```shell
export YARN_RESOURCEMANAGER="PLACEHOLDER"

export SPARK="PLACEHOLDER"

export PATH_TO_SPARK_LOGS="PLACEHOLDER"

export RESOURCE_MANAGER_HOST="PLACEHOLDER"

export NUMBER_OF_NODES_TO_INCREASE_TO="PLACEHOLDER"
```
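
Before running the commands below, it can help to confirm that every parameter has been replaced with a real value. The following is a minimal sketch, assuming the parameters are exported as shown above; `check_params` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: report any exported parameter that is unset, empty,
# or still holds the literal value "PLACEHOLDER".
check_params() {
  local name value missing=0
  for name in "$@"; do
    # printenv prints the value of an exported variable (nothing if unset)
    value=$(printenv "$name" || true)
    if [ -z "$value" ] || [ "$value" = "PLACEHOLDER" ]; then
      echo "Parameter ${name} is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_params YARN_RESOURCEMANAGER SPARK PATH_TO_SPARK_LOGS \
#     RESOURCE_MANAGER_HOST NUMBER_OF_NODES_TO_INCREASE_TO || exit 1
```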

## Debug

### Check the status of the YARN ResourceManager service
```shell
systemctl status ${YARN_RESOURCEMANAGER}
```

### Check the logs of the YARN ResourceManager service
```shell
journalctl -u ${YARN_RESOURCEMANAGER}
```
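
The full journal can be long, so filtering for error-level entries is often a quicker first pass. This sketch assumes the ResourceManager writes Log4j-style lines in which the level appears as a separate word (for example `... ERROR ...`); `filter_rm_errors` is a hypothetical helper name, and the pattern may need adjusting for a custom log layout.

```shell
# Hypothetical helper: keep only lines that look like ERROR/FATAL log entries
# or that mention a Java exception or OutOfMemoryError.
filter_rm_errors() {
  grep -E ' (ERROR|FATAL) |Exception|OutOfMemoryError'
}

# Usage:
#   journalctl -u "${YARN_RESOURCEMANAGER}" --since "1 hour ago" --no-pager | filter_rm_errors
```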

### Check the resource usage of the YARN ResourceManager service
```shell
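# The ResourceManager runs as a Java process, so look up its PID through systemd
top -p "$(systemctl show -p MainPID --value "${YARN_RESOURCEMANAGER}")"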
```

### Check if the Spark jobs are running
```shell
# pgrep -f matches the full command line and does not list itself, unlike ps | grep
pgrep -af "${SPARK}"
```
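
Spark jobs submitted to YARN can also be listed through the YARN CLI, which covers applications whose drivers run on other hosts. This is a minimal sketch, assuming the `yarn` client is on the PATH and that each row of the listing begins with an application ID; `count_yarn_apps` is a hypothetical helper name.

```shell
# Hypothetical helper: count rows on stdin that start with a YARN application ID.
count_yarn_apps() {
  grep -cE '^[[:space:]]*application_[0-9]+_[0-9]+'
}

# Usage (needs a reachable ResourceManager):
#   yarn application -list -appTypes SPARK -appStates RUNNING 2>/dev/null | count_yarn_apps
```

If the ResourceManager is down, the `yarn application -list` command itself fails or keeps retrying, which is a useful signal on its own.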

### Check the logs of the Spark jobs
```shell
# Show the most recent entries; the full log can be very large
tail -n 200 "${PATH_TO_SPARK_LOGS}"
```

### Check the resource usage of the Spark jobs
```shell
# top -p expects a comma-separated PID list; pgrep -d, produces one
top -p "$(pgrep -d, -f "${SPARK}")"
```

### The YARN ResourceManager may have been overloaded with requests, causing it to fail.
```shell
#!/bin/bash
# Check whether the YARN ResourceManager is overloaded with running applications.

# Threshold for the number of running applications
THRESHOLD=500

# Get the number of running applications from the cluster metrics endpoint
NUM_APPS=$(curl -s "http://${RESOURCE_MANAGER_HOST}:8088/ws/v1/cluster/metrics" | grep -oP '(?<="appsRunning":)\d+')

# An empty result means the ResourceManager did not answer or returned unexpected output
if [ -z "$NUM_APPS" ]; then
  echo "Could not read appsRunning from the ResourceManager at ${RESOURCE_MANAGER_HOST}." >&2
  exit 1
fi

# Check if the number of running applications exceeds the threshold
if [ "$NUM_APPS" -gt "$THRESHOLD" ]; then
  echo "YARN ResourceManager may be overloaded with requests."
else
  echo "YARN ResourceManager is not overloaded with requests."
fi
```
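
The running-application count is easier to interpret next to pending applications and node health. The sketch below pulls a few more fields from the same cluster metrics response. It assumes the ResourceManager returns compact JSON (no spaces around colons); `rm_metrics` is a hypothetical helper name, and a JSON parser such as `jq` would be more robust where available.

```shell
# Hypothetical helper: print one "name=value" line for each selected numeric
# field found in the cluster metrics JSON on stdin.
rm_metrics() {
  grep -oE '"(appsPending|appsRunning|activeNodes|unhealthyNodes|lostNodes)":[0-9]+' \
    | tr -d '"' | tr ':' '='
}

# Usage:
#   curl -s "http://${RESOURCE_MANAGER_HOST}:8088/ws/v1/cluster/metrics" | rm_metrics
```

A large `appsPending` value alongside few active nodes points to capacity pressure on the cluster rather than to a fault in the ResourceManager itself.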

## Repair

### Increase the number of YARN ResourceManager nodes in the cluster to provide redundancy and reduce the impact of a single node failure.
```shell
#!/bin/bash
# Compare the number of reachable ResourceManagers with the desired number.
# ResourceManagers cannot be added with a single command: each additional one
# must be listed in yarn.resourcemanager.ha.rm-ids in yarn-site.xml (with
# yarn.resourcemanager.ha.enabled set to true) and started on its own host.
# Note that "yarn rmadmin -refreshNodes" only reloads the NodeManager
# include/exclude lists; it does not add ResourceManagers.

# Desired number of ResourceManager nodes
new_nodes=${NUMBER_OF_NODES_TO_INCREASE_TO}

# Count the ResourceManagers that report an HA state (requires HA to be enabled)
current_nodes=$(yarn rmadmin -getAllServiceState 2>/dev/null | grep -cE '(active|standby)[[:space:]]*$')

# Check if the current number of nodes is less than the desired number of nodes
if [ "$current_nodes" -lt "$new_nodes" ]; then
  # Calculate the number of nodes still to be configured and started
  nodes_to_add=$((new_nodes - current_nodes))
  echo "Error: $current_nodes ResourceManager(s) reachable, $new_nodes desired; configure and start $nodes_to_add more."
  exit 1
else
  echo "The current number of ResourceManager nodes is already equal to or greater than the desired number of nodes."
  exit 0
fi
```
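
Redundancy only helps if exactly one ResourceManager is active and at least one is on standby. This is a minimal sketch for summarizing HA state, assuming the output of `yarn rmadmin -getAllServiceState` consists of lines ending in `active` or `standby`; `summarize_rm_states` is a hypothetical helper name.

```shell
# Hypothetical helper: count lines on stdin whose last field is "active" or "standby".
summarize_rm_states() {
  awk '$NF == "active"  { a++ }
       $NF == "standby" { s++ }
       END { printf "active=%d standby=%d\n", a, s }'
}

# Usage:
#   yarn rmadmin -getAllServiceState 2>/dev/null | summarize_rm_states
```

A result other than one active ResourceManager and one or more standbys suggests that failover is not configured or that a standby is unreachable.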

## Related Runbooks

### Spark tasks failing due to out of memory errors.
This incident type refers to a situation where Spark tasks are failing due to out of memory errors. Spark is a distributed computing system used for big data processing. When the data volume exceeds the allocated memory, the Spark tasks fail, and the system generates an out of memory error. This type of incident can cause data processing delays or even system downtime, which can impact the overall performance of the application.

### Spark tasks experiencing shuffle spills and high disk I/O.
This incident type typically occurs in distributed computing systems, where Spark tasks are experiencing high disk I/O and shuffle spills. Spark is a popular distributed computing engine that uses shuffle operations to move data between nodes in a cluster, which can sometimes result in performance issues due to spills. The spills occur when the data being shuffled exceeds the memory capacity allocated for the shuffle operations. This incident requires optimization of the shuffle operations to reduce spills and improve overall performance.

### Spark job failures due to cluster resource contentions.
This incident type refers to a situation where Spark jobs are failing due to resource contentions in the cluster. When multiple Spark jobs are trying to access the same resources or data at the same time, it can cause a bottleneck that leads to job failures. This can happen when the resources in the cluster are not properly allocated or when the number of jobs running simultaneously exceeds the cluster's capacity to handle them. The result is that Spark jobs fail, leading to disruptions in data processing and analysis.

### Spark executor failure during job execution.
This incident type refers to a failure in one or more Spark executors during the execution of a job. Spark executors are worker processes that run computations and store data in memory or on disk. When an executor fails, it can cause the entire job to fail or result in degraded performance. This type of incident can occur for a variety of reasons, such as hardware or network issues, memory errors, or software bugs.

### Spark driver program crash during job runtime.
This incident type refers to an unexpected termination of the Spark driver program during the runtime of a job. The driver program is responsible for coordinating the execution of a Spark job and if it crashes, the entire job is affected. This can result in data loss and downtime, and requires investigation and troubleshooting to identify the root cause of the issue.
