Runbook

Data Skew Slowing Down Spark Job


Overview

Data skew is a common cause of slow Spark jobs: when data is unevenly distributed across partitions, a few partitions end up holding far more records than the rest. The tasks assigned to those oversized partitions become stragglers, and because a stage cannot complete until its slowest task finishes, the whole job's runtime ends up dominated by a handful of slow tasks.

Parameters

Debug

Check the Spark job logs for errors, warnings, and informational messages

Check for data skew in Spark job logs
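One quick skew signal is the ratio of the largest partition's record count to the median. The sketch below uses plain Python so the heuristic is clear; the `partition_counts` values and the threshold of 5 are illustrative assumptions (in PySpark, per-partition counts could come from something like `df.rdd.glom().map(len).collect()`).

```python
# Heuristic skew check: flag when the largest partition holds far more
# records than the median one. Counts below are illustrative stand-ins.
from statistics import median

def skew_ratio(partition_counts):
    """Return max/median partition size; ratios much greater than 1 suggest skew."""
    med = median(partition_counts)
    return max(partition_counts) / med if med else float("inf")

partition_counts = [1000, 1100, 950, 1050, 25000]  # one hot partition
ratio = skew_ratio(partition_counts)
print(f"skew ratio = {ratio:.1f}")
if ratio > 5:  # threshold is an assumption, tune for your workload
    print("likely data skew: inspect the keys feeding the hot partition")
```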

Check the Spark job configuration for settings that affect partitioning and shuffle behavior
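When reviewing the configuration, these Spark SQL settings are worth inspecting because they govern shuffle partitioning and adaptive skew handling. The values shown are illustrative, not recommendations:

```properties
# spark-defaults.conf — skew-relevant settings to review
spark.sql.shuffle.partitions                        200
spark.sql.adaptive.enabled                          true
spark.sql.adaptive.skewJoin.enabled                 true
spark.sql.adaptive.skewJoin.skewedPartitionFactor   5
```

With adaptive query execution enabled, Spark 3.x can split skewed shuffle partitions during joins automatically, so confirming these flags are not disabled is a cheap first check.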

Check the Spark job DAG for wide (shuffle) stages, where skewed keys tend to concentrate

Check the distribution of task execution times within each stage; a long tail of slow tasks points to skew
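Stragglers show up as a long tail in the task-duration distribution. This sketch compares the slowest task to a typical (75th percentile) task; the durations and the 3x threshold are illustrative assumptions, and in practice the numbers would come from the Spark UI or its monitoring REST API.

```python
# Spot straggler tasks: compare the slowest task to a typical (p75) task.
# Durations (seconds) are illustrative stand-ins for values read from the
# Spark UI's stage detail page.
def p75(values):
    """Rough 75th percentile via nearest-rank on the sorted values."""
    ordered = sorted(values)
    return ordered[int(0.75 * (len(ordered) - 1))]

task_durations = [12, 14, 11, 13, 15, 12, 240]  # one straggler
typical = p75(task_durations)
slowest = max(task_durations)
if slowest > 3 * typical:  # threshold is an assumption
    print(f"straggler detected: slowest task {slowest}s vs p75 {typical}s")
```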

Check for task failures, paying particular attention to failures concentrated on a small number of tasks

Check per-task metrics such as shuffle read size and records processed
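A classic skew signature in task metrics is a single task reading most of the shuffle data. The byte counts below are illustrative stand-ins for values read off a stage's task table in the Spark UI, and the 50% threshold is an assumption:

```python
# Compare per-task shuffle read volume; one task accounting for most of
# the shuffle data is a strong skew signal. Values are illustrative.
shuffle_read_bytes = [64_000_000, 61_000_000, 66_000_000, 1_900_000_000]

total = sum(shuffle_read_bytes)
top_share = max(shuffle_read_bytes) / total
print(f"largest task read {top_share:.0%} of the shuffle data")
if top_share > 0.5:  # threshold is an assumption
    print("one task dominates the shuffle read: check the key distribution")
```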

Imbalanced data distribution: if records are not evenly spread across the partitions handled by the cluster's nodes, the executors holding the oversized partitions must process far more data than the rest, producing skew and slower overall processing times.

Repair

Increase the number of executors or worker nodes in the Spark cluster to distribute the workload evenly.
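One way to apply this is through the resource flags of `spark-submit`. The flag values and the script name `my_job.py` below are illustrative assumptions, not tuned recommendations; note that while more executors spread partitions over more slots, they do not shrink a single oversized partition:

```shell
# Scale out the cluster side of the job: more executors and cores give
# the scheduler more slots to spread the existing partitions across.
spark-submit \
  --num-executors 20 \
  --executor-cores 4 \
  --executor-memory 8g \
  my_job.py
```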