Runbook

Spark driver program crash during job runtime.

Overview

This incident type covers an unexpected termination of the Spark driver program while a job is running. The driver coordinates the execution of the job: it schedules tasks on the executors and tracks their progress, so when it crashes the entire job fails. The crash can cause data loss and downtime, and the root cause has to be investigated before the job is rerun.

Debug

Check if Spark is running
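
One way to carry out this check is to look for Spark JVMs in the process table. The following is a minimal Python sketch, assuming a Linux host with the ps command and a standalone deployment whose daemons run the Master and Worker classes listed in the code. The helper names find_spark_processes and list_processes are made up for illustration. On YARN or Kubernetes, the cluster manager's own tooling is the better place to ask.

```python
import subprocess

# Main classes of the Spark JVMs to look for. Master and Worker are the
# standalone-mode daemons; adjust the list for your deployment (assumption).
SPARK_CLASSES = (
    "org.apache.spark.deploy.master.Master",
    "org.apache.spark.deploy.worker.Worker",
)


def find_spark_processes(ps_output, markers=SPARK_CLASSES):
    """Return (pid, command line) pairs whose command line mentions a marker."""
    matches = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Skip blank lines, the header row, and rows without a command.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pid, args = int(parts[0]), parts[1]
        if any(marker in args for marker in markers):
            matches.append((pid, args))
    return matches


def list_processes():
    """Return the output of `ps -eo pid,args` (PID and full command line)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling find_spark_processes(list_processes()) on a cluster node returns an empty list when none of the listed Spark daemons is running there.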

Check if the driver program is running
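
A running driver normally serves the Spark web UI, which exposes a REST endpoint at /api/v1/applications. The sketch below probes that endpoint. It assumes the UI is enabled and reachable at localhost on the default port 4040; both the address and the helper names are illustrative and will differ per deployment.

```python
import json
import urllib.error
import urllib.request


def parse_applications(body):
    """Extract (id, name) pairs from the /api/v1/applications JSON response."""
    return [(app.get("id"), app.get("name")) for app in json.loads(body)]


def running_applications(base_url="http://localhost:4040", timeout=5.0):
    """Return the applications served by the driver UI, or None if unreachable.

    The host and port are assumptions: 4040 is Spark's default UI port, but
    the driver may run on another host or bind to a different port.
    """
    url = base_url.rstrip("/") + "/api/v1/applications"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_applications(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not up".
        return None
```

A None result only shows that the UI did not answer at that address. It does not prove the driver is gone, so confirm against the process table or the cluster manager before concluding that it crashed.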

Check for any system-level errors or warnings
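
System logs and the driver's own log can be scanned for messages that usually accompany a killed or crashed JVM. The sketch below is one way to do that in Python. The pattern list is an illustrative assumption rather than a complete catalogue, and log file locations vary by distribution and deployment, so the path passed to scan_file is left to the operator.

```python
import re

# Patterns that commonly accompany a killed or crashed JVM. The list is an
# illustrative assumption, not an exhaustive catalogue.
ERROR_PATTERNS = [
    re.compile(r"out of memory", re.IGNORECASE),  # kernel OOM killer reports
    re.compile(r"oom-killer", re.IGNORECASE),
    re.compile(r"java\.lang\.OutOfMemoryError"),  # JVM heap exhaustion
    re.compile(r"\b(ERROR|FATAL|WARN)\b"),        # log4j-style severity levels
]


def scan_log(lines, patterns=ERROR_PATTERNS):
    """Return (line number, text) for every line matching any pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(p.search(line) for p in patterns):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path):
    """Scan a log file; undecodable bytes are replaced instead of raising."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_log(handle)
```

Matching on severity words is deliberately broad. Expect some noise from routine warnings, and narrow the patterns once the failure signature is known.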

Check for any resource issues, such as memory usage or CPU utilization
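
For a quick view of memory and CPU pressure on the driver host, the sketch below reads /proc/meminfo and the system load average. It assumes a Linux host with a kernel that reports MemAvailable; the function names are hypothetical, and this document does not define what level of usage counts as a problem.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer.

    Memory sizes are reported in kiB; a few fields are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Percentage of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def snapshot():
    """Return memory use and 1-minute load per CPU for the local Linux host."""
    with open("/proc/meminfo") as handle:
        meminfo = parse_meminfo(handle.read())
    load_1m, _, _ = os.getloadavg()
    return {
        "memory_used_percent": memory_used_percent(meminfo),
        "load_per_cpu": load_1m / (os.cpu_count() or 1),
    }
```

These figures describe the host as a whole. Memory use inside the driver JVM needs JVM-level tools or the Spark UI.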

One possible cause to confirm or rule out with the checks above: insufficient memory allocated to the Spark driver program.

Repair

Increase the resources allocated to the Spark driver program, for example its memory (spark.driver.memory) or its number of CPU cores (spark.driver.cores), and then resubmit the job.
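
When the job is launched with spark-submit, the driver's resources can be raised with the --driver-memory and --driver-cores options. The sketch below builds such a command line in Python. The default values of 4g and 2 cores, the helper name, and the application file used in the usage note are placeholders for illustration, not recommendations. Note that --driver-cores only takes effect in cluster deploy mode.

```python
import re

# Memory strings with a single size suffix, such as "512m" or "4g".
_MEMORY_RE = re.compile(r"^\d+[kmgt]$", re.IGNORECASE)


def build_submit_command(app, driver_memory="4g", driver_cores=2, extra_args=()):
    """Build a spark-submit argument list with explicit driver resources.

    The default memory and core values are illustrative placeholders.
    """
    if not _MEMORY_RE.match(driver_memory):
        raise ValueError(f"unexpected memory string: {driver_memory!r}")
    if driver_cores < 1:
        raise ValueError("driver_cores must be at least 1")
    return [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--driver-cores", str(driver_cores),
        *extra_args,
        app,
    ]
```

For example, build_submit_command("etl_job.py", driver_memory="8g", driver_cores=4) yields an argument list that can be passed to subprocess.run. Returning a list instead of a single string avoids shell quoting problems.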