Runbook

Kafka In-Sync Replica Count Drops Incident

Overview

This incident occurs when the in-sync replica (ISR) count for one or more partitions in a Kafka cluster drops below the expected value. An in-sync replica is a replica that is caught up with the partition leader. When the ISR count drops, one or more followers have fallen behind the leader or become unavailable, and the affected partitions are under-replicated. A leader failure is then more likely to cause data loss, and producers using acks=all can be rejected if the ISR shrinks below min.insync.replicas. Investigate promptly to identify the root cause and prevent further impact.

Parameters

Debug

Check the current ISR count for the given topic

Check the current leader of each partition in the given topic
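
The topic-level checks above can be sketched as a small Python helper. This is a minimal sketch, not the runbook's original command: it assumes kafka-topics.sh is on the PATH (some distributions ship it as kafka-topics), that the broker address and topic name are supplied by the caller, and that the describe output uses the usual "Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ..." line layout. The function names are hypothetical.

```python
import re
import subprocess

# Matches the per-partition lines printed by `kafka-topics.sh --describe`, e.g.
#   Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
# The topic summary line (PartitionCount, ReplicationFactor) does not match.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def describe_topic(bootstrap, topic):
    """Run kafka-topics.sh for one topic and return its raw text output.

    Assumption: the script name and PATH lookup match your installation.
    """
    cmd = ["kafka-topics.sh", "--bootstrap-server", bootstrap,
           "--describe", "--topic", topic]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_describe(output):
    """Return one dict per partition line: leader, replica count, ISR count."""
    rows = []
    for line in output.splitlines():
        m = PARTITION_LINE.search(line)
        if m is None:
            continue
        isr = [int(b) for b in m["isr"].split(",") if b]
        replicas = [int(b) for b in m["replicas"].split(",") if b]
        # A leaderless partition is printed with a non-numeric or negative leader.
        leader = int(m["leader"]) if m["leader"].isdigit() else None
        rows.append({
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": leader,
            "replica_count": len(replicas),
            "isr_count": len(isr),
        })
    return rows
```

For example, parse_describe(describe_topic("localhost:9092", "orders")) would return one entry per partition of a hypothetical topic named orders; any entry whose isr_count is lower than its replica_count is under-replicated.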

Check the current ISR count for all topics in the cluster

Check the current leader of each partition for all topics in the cluster
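
The cluster-wide checks above could look like the following sketch. Running kafka-topics.sh --describe without --topic lists every topic, and the --under-replicated-partitions flag restricts the listing to partitions whose ISR is smaller than the replica set. As before, the script name, the broker address, and the helper names are assumptions for illustration only.

```python
import re
import subprocess
from collections import Counter

# Per-partition line layout of `kafka-topics.sh --describe`.
ROW = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(\S+)\s+"
    r"Replicas:\s*([\d,]*)\s+Isr:\s*([\d,]*)"
)


def describe_all(bootstrap, only_under_replicated=False):
    """Describe every topic in the cluster and return the raw text output."""
    cmd = ["kafka-topics.sh", "--bootstrap-server", bootstrap, "--describe"]
    if only_under_replicated:
        cmd.append("--under-replicated-partitions")
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_rows(output):
    """Return (topic, partition, leader, replica_ids, isr_ids) tuples."""
    rows = []
    for line in output.splitlines():
        m = ROW.search(line)
        if m:
            topic, partition, leader, replicas, isr = m.groups()
            rows.append((
                topic,
                int(partition),
                leader,  # kept as text; a leaderless partition is not numeric
                [b for b in replicas.split(",") if b],
                [b for b in isr.split(",") if b],
            ))
    return rows


def shrunken_isr(rows):
    """List (topic, partition, isr_count, replica_count) where ISR < replicas."""
    return [(t, p, len(isr), len(reps))
            for t, p, _, reps, isr in rows if len(isr) < len(reps)]


def leaders_by_broker(rows):
    """Count how many partitions each broker currently leads."""
    return Counter(leader for _, _, leader, _, _ in rows)
```

A heavily skewed result from leaders_by_broker, or a single broker missing from every ISR returned by shrunken_isr, usually narrows the investigation to that broker.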

Repair

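Adjust the replica fetch maximum wait time: If the ISR count drops due to high replica lag, review replica.fetch.wait.max.ms together with replica.lag.time.max.ms. The first parameter is the maximum time the leader holds a follower's fetch request while waiting for new data; it does not involve consumers. The second is how long a follower may go without catching up to the leader's log end offset before the leader removes it from the ISR. Keep replica.fetch.wait.max.ms below replica.lag.time.max.ms; otherwise followers of low-throughput topics can be dropped from the ISR repeatedly. If followers are slow but healthy, raising replica.lag.time.max.ms gives them more time to catch up, at the cost of detecting a stalled replica later.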
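
Before changing the fetch wait time, it can help to check it against replica.lag.time.max.ms, since the Kafka broker documentation states that replica.fetch.wait.max.ms should always be less than that value. The sketch below parses the text of a broker properties file and compares the two settings. The fallback values (500 ms and 30000 ms) are assumed from the defaults of recent Kafka releases; older releases used a shorter lag default, so confirm them for your version. The function names are hypothetical.

```python
# Assumed defaults for recent Kafka releases; verify against your broker version.
DEFAULTS = {
    "replica.fetch.wait.max.ms": 500,
    "replica.lag.time.max.ms": 30000,
}


def parse_properties(text):
    """Parse key=value lines from a Java-style properties file, skipping comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def check_replica_timeouts(text):
    """Return (fetch_wait_ms, lag_max_ms, ok); ok is True when fetch wait < lag max."""
    props = parse_properties(text)
    fetch_wait = int(props.get("replica.fetch.wait.max.ms",
                               DEFAULTS["replica.fetch.wait.max.ms"]))
    lag_max = int(props.get("replica.lag.time.max.ms",
                            DEFAULTS["replica.lag.time.max.ms"]))
    return fetch_wait, lag_max, fetch_wait < lag_max
```

Passing the contents of the broker's server.properties to check_replica_timeouts gives a quick sanity check; a False third value means the fetch wait time is not below the lag limit and should be corrected before any other tuning.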
