Runbook

Spark job failures due to cluster resource contention.

Overview

This incident type covers Spark jobs that fail because of resource contention in the cluster. When multiple Spark jobs compete for the same resources or data at the same time, the resulting bottleneck can cause jobs to fail. This typically happens when cluster resources are not properly allocated, or when the number of jobs running simultaneously exceeds what the cluster can handle. The failed jobs in turn disrupt data processing and analysis.

Parameters

Debug

Get the status of the Spark cluster
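One way to get a quick status snapshot, sketched here under the assumption that the cluster runs on YARN and that the ResourceManager REST API is reachable. The runbook does not say which cluster manager is in use, so treat this as illustrative. The helper names `fetch_json` and `summarize_cluster` are hypothetical; the field names follow the `clusterMetrics` object returned by `/ws/v1/cluster/metrics`.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Fetch a URL and decode its JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_cluster(payload):
    """Reduce a /ws/v1/cluster/metrics response to the numbers that matter for contention."""
    m = payload["clusterMetrics"]
    total_mb = m["totalMB"]
    total_vcores = m["totalVirtualCores"]
    return {
        # Share of cluster memory and vcores already handed out to containers.
        "memory_used_pct": round(100 * m["allocatedMB"] / total_mb, 1) if total_mb else 0.0,
        "vcores_used_pct": round(100 * m["allocatedVirtualCores"] / total_vcores, 1) if total_vcores else 0.0,
        # Pending apps are waiting for resources, a direct sign of contention.
        "apps_running": m["appsRunning"],
        "apps_pending": m["appsPending"],
        "unhealthy_nodes": m["unhealthyNodes"],
    }
```

Usage would look like `summarize_cluster(fetch_json("http://<resourcemanager-host>:8088/ws/v1/cluster/metrics"))`, where the host is a placeholder and 8088 is the default ResourceManager web port. High utilization together with a non-zero `apps_pending` suggests the cluster is saturated.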

Check the resource usage of the Spark jobs
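A minimal sketch for ranking running Spark applications by the resources they hold, again assuming a YARN cluster. It expects the decoded JSON from the ResourceManager endpoint `/ws/v1/cluster/apps?states=RUNNING`; the function name `top_consumers` is hypothetical.

```python
def top_consumers(payload, limit=5):
    """Return running Spark apps sorted by allocated memory, largest first."""
    # YARN returns "apps": null when nothing matches, so guard against None.
    apps = (payload.get("apps") or {}).get("app", [])
    spark_apps = [a for a in apps if a.get("applicationType") == "SPARK"]
    spark_apps.sort(key=lambda a: a.get("allocatedMB", 0), reverse=True)
    return [
        {
            "id": a["id"],
            "name": a.get("name"),
            "queue": a.get("queue"),
            "allocated_mb": a.get("allocatedMB", 0),
            "allocated_vcores": a.get("allocatedVCores", 0),
        }
        for a in spark_apps[:limit]
    ]
```

The applications at the top of the list are the first candidates to inspect when other jobs are starved. Grouping the result by `queue` can also show whether one queue is crowding out the rest.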

Check if any nodes in the cluster are overloaded
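The check above could be approximated as follows, assuming YARN and the decoded JSON from `/ws/v1/cluster/nodes`. Note that this measures how much of each node YARN has allocated, not OS-level load; the 0.9 threshold and the name `overloaded_nodes` are illustrative assumptions.

```python
def overloaded_nodes(payload, threshold=0.9):
    """Return nodes whose allocated memory or vcores reach the threshold fraction."""
    nodes = (payload.get("nodes") or {}).get("node", [])
    flagged = []
    for n in nodes:
        mem_total = n["usedMemoryMB"] + n["availMemoryMB"]
        cpu_total = n["usedVirtualCores"] + n["availableVirtualCores"]
        mem_frac = n["usedMemoryMB"] / mem_total if mem_total else 0.0
        cpu_frac = n["usedVirtualCores"] / cpu_total if cpu_total else 0.0
        if max(mem_frac, cpu_frac) >= threshold:
            flagged.append({
                "host": n["nodeHostName"],
                "state": n.get("state"),
                "memory_fraction": round(mem_frac, 2),
                "vcore_fraction": round(cpu_frac, 2),
            })
    return flagged
```

If only a few nodes are flagged while others sit idle, the problem may be data or task skew rather than a shortage of total capacity.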

Analyze the Spark logs for any errors or exceptions
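A rough sketch of a log scan that counts lines matching messages commonly associated with resource pressure. The pattern list is an assumption and is not exhaustive; exact wording varies between Spark versions and cluster managers, so adjust it to what the logs actually contain.

```python
import re
from collections import Counter

# Substrings that often accompany resource-related failures (illustrative, not complete).
PATTERNS = {
    "out_of_memory": re.compile(r"java\.lang\.OutOfMemoryError"),
    "executor_lost": re.compile(r"ExecutorLostFailure|Lost executor \d+"),
    "container_killed": re.compile(r"Container killed by YARN for exceeding"),
    "fetch_failed": re.compile(r"FetchFailedException"),
    "no_resources": re.compile(r"Initial job has not accepted any resources"),
}


def scan_log(lines):
    """Count how many lines match each pattern; a line is counted once per label."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

The function accepts any iterable of strings, so an open file works directly: `with open(path, errors="replace") as f: print(scan_log(f))`, where `path` points at a driver or executor log.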

Check the network usage of the cluster nodes
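On Linux nodes, per-interface byte counters are available in `/proc/net/dev`. The sketch below, which assumes Linux and uses the hypothetical helper names `parse_proc_net_dev` and `throughput`, parses two snapshots of that file and derives bytes per second.

```python
def parse_proc_net_dev(text):
    """Map interface name -> (rx_bytes, tx_bytes) from the content of /proc/net/dev."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput(before, after, seconds):
    """Bytes per second (received, sent) per interface between two snapshots."""
    return {
        iface: ((after[iface][0] - rx) / seconds, (after[iface][1] - tx) / seconds)
        for iface, (rx, tx) in before.items()
        if iface in after
    }
```

To use it, read `/proc/net/dev` on a node, wait a known interval, read it again, and pass both parsed results to `throughput`. Sustained rates close to the link speed during shuffle-heavy stages point to network contention.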

Possible cause: too few resources are allocated to the Spark job, so it competes with other jobs running on the cluster.

Repair

Increase the resources allocated to the cluster, such as memory and CPU, to reduce contention.
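How capacity is added to the cluster itself depends on the environment, so it is not sketched here. Once capacity exists, a job still has to request it. The helper below is a hypothetical convenience that assembles a `spark-submit` command line with explicit executor sizing for a YARN cluster; the default values are placeholders, not recommendations.

```python
def spark_submit_args(app, executor_memory="8g", executor_cores=4,
                      num_executors=10, queue=None):
    """Build a spark-submit argument list with explicit resource requests (YARN)."""
    args = [
        "spark-submit",
        "--master", "yarn",
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--num-executors", str(num_executors),
    ]
    if queue:
        # Submitting to a dedicated YARN queue can isolate the job from others.
        args += ["--queue", queue]
    args.append(app)
    return args
```

The returned list can be passed to `subprocess.run(...)` or joined with spaces for review before running. Keep the total request (executors times memory and cores) below what the status check reports as available, otherwise the job simply joins the contention.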