Runbook

High Partition Lag on Kafka Cluster

Overview

This incident type covers a Kafka cluster in which one or more partitions have high consumer lag. Lag is the difference between the offset of the latest message produced to a partition (the log end offset) and the offset the consumer group has most recently committed for that partition. A lagging partition means messages are being produced faster than they are consumed. The backlog then grows, and messages can be lost if they reach the topic's retention limit before they are consumed. The incident description suggests checking for hot partitions, which are partitions that receive a disproportionate share of traffic compared to others. Identifying and resolving high partition lag is critical to the stability and reliability of the cluster.
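The two ideas in the overview, lag and hot partitions, can be sketched in a few lines of Python. This is only an illustration: the function names, the `factor` threshold of 3x the median, and the idea of comparing two snapshots of end offsets are assumptions made for the example, not part of the runbook.

```python
from statistics import median


def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag = log end offset minus the group's committed offset, floored at zero."""
    return max(log_end_offset - committed_offset, 0)


def hot_partitions(end_offsets_then: dict, end_offsets_now: dict, factor: float = 3.0) -> list:
    """Return partitions whose growth between two snapshots exceeds factor x the median growth.

    Both arguments map partition number -> log end offset. The threshold is a
    hypothetical heuristic; tune it to the traffic pattern of the topic.
    """
    produced = {
        p: end_offsets_now[p] - end_offsets_then[p]
        for p in end_offsets_now
        if p in end_offsets_then
    }
    if not produced:
        return []
    typical = median(produced.values())
    return sorted(p for p, n in produced.items() if n > factor * typical)
```

For example, if three partitions each grew by about 10 messages in the sampling window and a fourth grew by 800, only the fourth is reported as hot.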

Debug

List all topics and their partition count
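One way to do this is to run `kafka-topics.sh --bootstrap-server <host:port> --describe` and read the `PartitionCount` field of each topic's summary line. The sketch below parses that text; it assumes the summary lines contain `Topic:` and `PartitionCount:` fields, which may vary between Kafka versions, and the function name is hypothetical.

```python
import re

# Matches summary lines such as "Topic: orders  PartitionCount: 6  ReplicationFactor: 3".
# Per-partition detail lines have no PartitionCount field and are skipped.
SUMMARY = re.compile(r"Topic:\s*(\S+).*?PartitionCount:\s*(\d+)")


def partition_counts(describe_output: str) -> dict[str, int]:
    """Map topic name -> partition count from `kafka-topics.sh --describe` text."""
    counts = {}
    for line in describe_output.splitlines():
        match = SUMMARY.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts
```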

List the partition lag for a specific consumer group
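The `kafka-consumer-groups.sh --bootstrap-server <host:port> --describe --group <group>` command prints one row per partition with `CURRENT-OFFSET`, `LOG-END-OFFSET` and `LAG` columns. As a rough sketch, the parser below turns that table into a dictionary so the worst partitions can be sorted to the top. It assumes whitespace-separated columns under a header row; the function name is hypothetical.

```python
def parse_group_lag(describe_output: str) -> dict[tuple[str, int], int]:
    """Map (topic, partition) -> lag from `kafka-consumer-groups.sh --describe` text."""
    lags = {}
    header = None
    for line in describe_output.splitlines():
        fields = line.split()
        if "LAG" in fields and "PARTITION" in fields:
            header = fields  # take the column order from the output itself
            continue
        if header is None or len(fields) < len(header):
            continue  # skip blank lines and anything before the header
        row = dict(zip(header, fields))
        if row["LAG"].isdigit():  # "-" means the group has no committed offset yet
            lags[(row["TOPIC"], int(row["PARTITION"]))] = int(row["LAG"])
    return lags
```

Sorting the result with `sorted(lags.items(), key=lambda kv: kv[1], reverse=True)` lists the most lagged partitions first, which is a quick way to see whether lag is concentrated in a few partitions.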

List the current offset for a specific partition

List the end offset for a specific partition

List the number of messages in a specific partition
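The message count of a partition can be approximated as its end offset minus its beginning offset. The sketch below assumes the offsets were collected with Kafka's GetOffsetShell tool, once with `--time -2` (earliest) and once with `--time -1` (latest), and that each output line has the form `topic:partition:offset`; the exact invocation differs between Kafka versions. On compacted topics, or topics with transaction markers, the difference overestimates the real count. Function names are hypothetical.

```python
def parse_offsets(output: str) -> dict[tuple[str, int], int]:
    """Parse lines of the form topic:partition:offset into a dictionary."""
    offsets = {}
    for line in output.splitlines():
        parts = line.strip().rsplit(":", 2)
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            offsets[(parts[0], int(parts[1]))] = int(parts[2])
    return offsets


def message_counts(earliest_output: str, latest_output: str) -> dict[tuple[str, int], int]:
    """Approximate messages per partition as latest offset minus earliest offset."""
    start = parse_offsets(earliest_output)
    end = parse_offsets(latest_output)
    return {tp: end[tp] - start[tp] for tp in end if tp in start}
```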

List the number of messages consumed by a specific consumer group for a specific topic

Repair

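Increase the number of consumer instances in the lagging consumer group so that its partitions are spread across more consumers. Within a group, each partition is assigned to at most one consumer, so adding consumers helps only up to the topic's partition count; any consumers beyond that sit idle. Creating a new group does not reduce the existing group's lag, because each group consumes the full topic independently. If a single hot partition remains the bottleneck, consider adding partitions or changing the partition key so that traffic spreads more evenly.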
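Before scaling out, it can help to check how many extra consumers would actually receive partitions and roughly how long the backlog would take to clear. The helpers below are a back-of-the-envelope sketch: the function names are hypothetical, rates are in messages per second, and the estimate assumes steady produce and consume rates.

```python
def consumers_worth_adding(partition_count: int, current_consumers: int) -> int:
    """Extra consumers that would get at least one partition (one consumer per partition at most)."""
    return max(partition_count - current_consumers, 0)


def seconds_to_drain(total_lag: int, consume_rate: float, produce_rate: float):
    """Estimated seconds until lag reaches zero, or None if consumers are not gaining on producers."""
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None
    return total_lag / net_rate
```

For instance, a group of 4 consumers on a 6-partition topic can usefully grow by 2, and a lag of 3000 messages drains in about 60 seconds when consumption outpaces production by 50 messages per second.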
