Runbook

Kafka High Watermark Lag Incident


Overview

A Kafka High Watermark Lag Incident occurs when the high watermark of a partition, which is the highest offset that has been replicated to all in-sync replicas, lags behind the log end offset, which is the offset of the last message written to the partition leader. The steps in this runbook call that second value the low watermark, although in Kafka's own terminology the low watermark is the earliest offset still retained in the log. The lag appears when follower replicas do not fetch messages from the leader at the same rate as producers write them, leaving a backlog of messages that are not yet fully replicated. Consumers can read only up to the high watermark, so the same backlog also delays downstream processing. The lag can lead to data loss, because messages that have not yet been replicated to all in-sync replicas may be lost if the leader fails before the followers have copied them. Monitor the high watermark lag for each partition and take corrective action when it exceeds an agreed threshold.


Debug

List all topics in Kafka
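A minimal Python sketch of this step, assuming the stock kafka-topics.sh script is on the PATH and a broker is reachable at the hypothetical address localhost:9092. The helper names are illustrative and not part of Kafka.

```python
import subprocess


def list_topics_command(bootstrap_server):
    """Build the kafka-topics.sh invocation that lists every topic."""
    return ["kafka-topics.sh", "--bootstrap-server", bootstrap_server, "--list"]


def list_topics(bootstrap_server):
    """Run the command and return the topic names, one per output line."""
    result = subprocess.run(
        list_topics_command(bootstrap_server),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits non-zero
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Calling list_topics("localhost:9092") against a live cluster returns the topic names as a list. Older Kafka releases may expect a --zookeeper argument instead of --bootstrap-server, so adjust the command to match your version.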

Describe a topic to get its configuration, including replication factor and partition count
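The describe step could be sketched as follows, again assuming kafka-topics.sh is available. The topic name orders and the sample output line are made up for illustration, and the exact layout of the describe output varies between Kafka versions, so the parser only looks for the two fields this step needs.

```python
import re


def describe_topic_command(bootstrap_server, topic):
    """Build the kafka-topics.sh invocation that describes one topic."""
    return ["kafka-topics.sh", "--bootstrap-server", bootstrap_server,
            "--describe", "--topic", topic]


def parse_topic_summary(describe_output):
    """Extract the partition count and replication factor from describe output."""
    partitions = re.search(r"PartitionCount:\s*(\d+)", describe_output)
    replication = re.search(r"ReplicationFactor:\s*(\d+)", describe_output)
    if partitions is None or replication is None:
        raise ValueError("summary line not found in describe output")
    return {
        "partition_count": int(partitions.group(1)),
        "replication_factor": int(replication.group(1)),
    }


# Illustrative output only; not captured from a real cluster.
sample = "Topic: orders\tPartitionCount: 3\tReplicationFactor: 2\tConfigs: \n"
summary = parse_topic_summary(sample)
```

A replication factor of 1 means there are no followers, so the high watermark cannot lag behind the leader for that topic.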

Get the current high watermark for a partition

Get the current low watermark (the log end offset) for a partition
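Offset tools such as Kafka's GetOffsetShell print one topic:partition:offset line per partition. The sketch below parses that format; the sample text is made up. Which value the numbers represent depends on how they were collected. Requesting the latest offset (--time -1) is assumed here to return the high watermark, since consumers cannot read past it, while the offset of the last message written (the log end offset, which this runbook calls the low watermark) is assumed to come from a broker-side source such as the LogEndOffset JMX metric. Verify both assumptions on your Kafka version.

```python
def parse_offsets(tool_output):
    """Parse 'topic:partition:offset' lines into {(topic, partition): offset}."""
    offsets = {}
    for line in tool_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so the last two fields are always
        # the partition number and the offset.
        topic, partition, offset = line.rsplit(":", 2)
        # Defensive assumption: an empty offset field means "unknown".
        offsets[(topic, int(partition))] = int(offset) if offset else None
    return offsets


# Illustrative output only; not captured from a real cluster.
sample = "orders:0:1480\norders:1:970\n"
high_watermarks = parse_offsets(sample)
```

The same parser can be reused for any source that emits the three-field format, so both offsets for a partition can end up in dictionaries keyed by (topic, partition).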

Get the lag for a partition by subtracting the high watermark from the low watermark
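As a worked example of the subtraction, here is a small sketch with made-up offsets. The function name is hypothetical, and log_end_offset stands for the offset of the last message written to the partition, the value this runbook calls the low watermark.

```python
def high_watermark_lag(log_end_offset, high_watermark):
    """Return how many offsets the high watermark trails the last written offset."""
    if high_watermark > log_end_offset:
        # The high watermark can never be ahead of what was written,
        # so this indicates the two values were swapped or stale.
        raise ValueError("high watermark is ahead of the log end offset")
    return log_end_offset - high_watermark


lag = high_watermark_lag(log_end_offset=1500, high_watermark=1480)
print(lag)  # 20
```

A lag of 0 means every written message has been replicated to all in-sync replicas.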

Check if the lag for a partition exceeds a certain threshold
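The threshold check might look like the sketch below. The threshold of 1000 offsets and the per-partition lag values are assumptions chosen for illustration; pick a limit that matches your own traffic and durability requirements.

```python
def partitions_over_threshold(lags, threshold):
    """Return (partition, lag) pairs whose lag exceeds threshold, worst first."""
    over = {tp: lag for tp, lag in lags.items() if lag > threshold}
    return sorted(over.items(), key=lambda item: item[1], reverse=True)


# Hypothetical lag per (topic, partition).
lags = {("orders", 0): 20, ("orders", 1): 1200, ("orders", 2): 0}
alerts = partitions_over_threshold(lags, threshold=1000)

for (topic, partition), lag in alerts:
    print(f"ALERT {topic}-{partition}: high watermark lag {lag}")
```

An empty result means no partition is over the limit and no repair step is needed.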

Repair

Increase the replication factor to ensure that messages are replicated to more in-sync replicas.
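The replication factor of an existing topic is changed by reassigning each partition to a longer list of replica brokers. The sketch below builds the JSON document that kafka-reassign-partitions.sh accepts through its --reassignment-json-file option (run with --bootstrap-server and --execute). The topic name, broker ids 1 to 3, and the helper name are assumptions for illustration.

```python
import json


def reassignment_plan(topic, replicas_by_partition):
    """Build a reassignment document listing the desired replicas per partition."""
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
            for partition, replicas in sorted(replicas_by_partition.items())
        ],
    }


# Hypothetical target: every partition of "orders" replicated to three brokers.
plan = reassignment_plan("orders", {0: [1, 2, 3], 1: [2, 3, 1]})
plan_json = json.dumps(plan, indent=2)
print(plan_json)
```

Save the printed JSON to a file and pass it to the reassignment tool. The first broker in each replicas list is the preferred leader, so rotating the order across partitions spreads leadership over the brokers.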
