Runbook

Kafka Broker Failure Causing Partition Unavailability

Overview

This incident type covers the failure of a Kafka broker, one of the servers that store and serve messages in a Kafka cluster. When a broker fails, any partition whose leader was on that broker and that has no in-sync replica on another broker becomes unavailable. Producers and consumers cannot write to or read from those partitions until a leader is available again, which disrupts the systems that depend on them. The incident requires immediate attention: restore the broker and confirm that messages are being processed as expected.

Debug

Check if Kafka broker is running
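
One way to script this check is to look for a process whose command line contains the broker's main class, `kafka.Kafka`. The sketch below is a minimal example, assuming a Linux or macOS host where `ps` is available and does not truncate the command line; the function names are illustrative, not part of any Kafka tooling.

```python
import subprocess
from typing import List


def find_kafka_pids(ps_output: str) -> List[int]:
    """Return PIDs whose command line contains the broker main class."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # "<pid> <full command line>"
        if (
            len(parts) == 2
            and parts[0].isdigit()
            and "kafka.Kafka" in parts[1].split()  # exact token match
        ):
            pids.append(int(parts[0]))
    return pids


def running_kafka_pids() -> List[int]:
    """List broker PIDs on this host (assumes ps prints untruncated args)."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "args="],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_kafka_pids(result.stdout)
```

An empty list from `running_kafka_pids()` suggests the broker process is not running on that host. If the broker is managed by a service manager, its status command is an equally valid check.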

Check if all Kafka partitions are available
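
`kafka-topics.sh --describe` prints one line per partition, including its current leader, and the `--unavailable-partitions` flag restricts the output to partitions without a leader. The sketch below parses saved `--describe` output and lists partitions with no leader. It assumes the usual `Topic: ... Partition: ... Leader: ...` line layout and that a missing leader is printed as `none` or `-1`, which can vary between Kafka versions; `offline_partitions` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Matches per-partition lines such as:
#   Topic: orders  Partition: 1  Leader: none  Replicas: 3,2  Isr: 2
_PARTITION_LINE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)"
)


def offline_partitions(describe_output: str) -> List[Tuple[str, int]]:
    """Return (topic, partition) pairs that currently have no leader."""
    offline = []
    for match in _PARTITION_LINE.finditer(describe_output):
        topic, partition, leader = match.groups()
        # Assumption: a missing leader is shown as "none" or "-1".
        if leader in ("none", "-1"):
            offline.append((topic, int(partition)))
    return offline
```

Any pair returned here is a partition that producers and consumers cannot use until a new leader is elected.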

Check if there are any disk space issues
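
A full data volume is a common reason for a broker to stop. The following sketch reports how full the filesystem behind a given directory is, using only the Python standard library. The path to check is whatever the broker's `log.dirs` setting points to; the 85% threshold is an arbitrary assumption, not a Kafka recommendation.

```python
import shutil


def disk_used_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path: str, threshold_percent: float = 85.0) -> bool:
    """True when usage is at or above the (assumed) alert threshold."""
    return disk_used_percent(path) >= threshold_percent
```

For example, `is_disk_nearly_full("/var/lib/kafka")` would check a hypothetical data directory; substitute the directory your broker actually uses.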

Check if there are any issues with Zookeeper
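
For clusters that use ZooKeeper (not KRaft mode), a quick liveness probe is the `ruok` four-letter command, to which a healthy server replies `imok`. The sketch below sends that command over a plain TCP socket. It assumes the default client port 2181 and that `ruok` is enabled through ZooKeeper's `4lw.commands.whitelist` setting; on newer ZooKeeper versions it may be disabled by default, in which case this check returns `False` even for a healthy server.

```python
import socket


def zk_four_letter(host: str, port: int = 2181, cmd: str = "ruok",
                   timeout: float = 3.0) -> str:
    """Send a ZooKeeper four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def zk_is_ok(host: str, port: int = 2181) -> bool:
    """True if the server answers 'imok'; False on any error or other reply."""
    try:
        return zk_four_letter(host, port, "ruok").strip() == "imok"
    except OSError:  # refused, unreachable, or timed out
        return False
```

A `False` result only indicates that the probe failed; the ZooKeeper logs are still needed to tell a down server from a disabled command.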

Check if there are any errors in Kafka logs
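
To pull the relevant entries out of a large broker log, a small filter for `ERROR` and `FATAL` lines can help. This sketch assumes the log uses a `[timestamp] LEVEL message` layout, which is how Kafka's stock log4j configuration formats entries; a customised logging configuration would need a different pattern. Stack-trace lines that follow an entry are not captured.

```python
import re
from typing import Iterable, List

# Assumed layout: "[2024-01-01 12:00:00,000] ERROR message (logger)"
_ERROR_LINE = re.compile(r"^\[[^\]]+\]\s+(ERROR|FATAL)\b")


def find_error_lines(lines: Iterable[str]) -> List[str]:
    """Return log entries whose level is ERROR or FATAL."""
    return [line.rstrip("\n") for line in lines if _ERROR_LINE.match(line)]
```

Because the function accepts any iterable of lines, it can be fed an open file directly, for example `find_error_lines(open(path))`, where `path` is the location of your broker's `server.log`.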

Check if there are any network issues between Kafka brokers
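
A basic connectivity test is to open a TCP connection from one broker host to the listener port of each of its peers. The sketch below does this with the standard library and returns the peers that could not be reached. The host names and port in the usage note are placeholders, and a successful TCP connect does not prove that the Kafka protocol itself is healthy.

```python
import socket
from typing import Iterable, List, Tuple


def unreachable_brokers(brokers: Iterable[Tuple[str, int]],
                        timeout: float = 3.0) -> List[Tuple[str, int]]:
    """Return the (host, port) pairs that refuse or time out on connect."""
    failed = []
    for host, port in brokers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection opened, so the listener is reachable
        except OSError:  # refused, unreachable, DNS failure, or timeout
            failed.append((host, port))
    return failed
```

Run it from each broker host in turn, for example `unreachable_brokers([("broker-2", 9092), ("broker-3", 9092)])` with your own host names, to find one-directional network problems.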

Check for resource exhaustion due to high traffic or large message sizes
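
As a rough first signal of CPU pressure on a broker host, the one-minute load average can be compared with the number of CPU cores. This is only a sketch for Unix-like systems (`os.getloadavg` is not available on Windows), and the threshold of 1.0 per core is an assumption. It says nothing about memory, file descriptors, or network saturation, which need their own checks.

```python
import os


def load_per_core() -> float:
    """One-minute load average divided by the number of CPU cores."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def is_overloaded(threshold: float = 1.0) -> bool:
    """True when load per core exceeds the (assumed) threshold."""
    return load_per_core() > threshold
```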

Repair

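Increase the number of Kafka brokers in the cluster so that every partition has enough replicas to survive the loss of a broker and the cluster can handle the required load. New brokers do not take over existing partitions automatically, so reassign partitions to them after they join the cluster.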
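
Partitions are moved onto other brokers with `kafka-reassign-partitions.sh`, which reads a JSON reassignment plan. The sketch below builds such a plan with a simple round-robin layout. The `build_reassignment` helper, the topic name, and the broker IDs are illustrative assumptions, and the tool's own `--generate` option is an alternative way to produce a plan.

```python
import json
from typing import Dict, List


def build_reassignment(topic: str, partition_count: int,
                       broker_ids: List[int],
                       replication_factor: int) -> Dict:
    """Build a round-robin reassignment plan in the tool's JSON layout."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed broker count")
    partitions = []
    for p in range(partition_count):
        # Start each partition on the next broker and wrap around.
        replicas = [
            broker_ids[(p + i) % len(broker_ids)]
            for i in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


def to_json(plan: Dict) -> str:
    """Serialize the plan for use with --reassignment-json-file."""
    return json.dumps(plan, indent=2)
```

Write the output of `to_json(...)` to a file, review it, and pass it to `kafka-reassign-partitions.sh` with `--reassignment-json-file` and `--execute`. Reassignment copies data between brokers, so it is best run once the cluster is stable again.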
