---
id: abab014a-3f9b-4eb8-b874-918cb4861875
---

# Kubernetes Statefulset Replicas Monitoring Incident
---

This incident type covers monitoring the replicas of a Kubernetes Statefulset, the workload type used for stateful applications. The incident is triggered when the pods of more than one replica are down, which makes manual operations on the Statefulset unsafe. Treat it as critical: restore the missing replicas promptly so that the stateful application keeps functioning.

### Parameters
```shell
# Environment Variables
export KUBE_STATEFUL_SET="PLACEHOLDER"
export KUBE_NAMESPACE="PLACEHOLDER"
export POD_NAME="PLACEHOLDER"
export CPU_THRESHOLD_IN_MILLICORES="PLACEHOLDER"
export DISK_THRESHOLD_IN_PERCENT="PLACEHOLDER"
export MEMORY_THRESHOLD_IN_BYTES="PLACEHOLDER"
export SERVICE="PLACEHOLDER"

```

## Debug

### Get the desired number of replicas for the specified Statefulset
```shell
kubectl get statefulset ${KUBE_STATEFUL_SET} -n ${KUBE_NAMESPACE} -o=jsonpath='{.spec.replicas}'
```

### Get the number of ready replicas for the specified Statefulset
```shell
kubectl get statefulset ${KUBE_STATEFUL_SET} -n ${KUBE_NAMESPACE} -o=jsonpath='{.status.readyReplicas}'
```

### Get the number of replicas created from the current revision of the specified Statefulset
```shell
kubectl get statefulset ${KUBE_STATEFUL_SET} -n ${KUBE_NAMESPACE} -o=jsonpath='{.status.currentReplicas}'
```

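### Get the number of available replicas for the specified Statefulset (unavailable replicas = desired replicas minus this value)
```shell
# StatefulSet status has no unavailableReplicas field (that field belongs to Deployments), so read availableReplicas instead
kubectl get statefulset ${KUBE_STATEFUL_SET} -n ${KUBE_NAMESPACE} -o=jsonpath='{.status.availableReplicas}'
```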
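
The desired and ready counts returned by the commands above can be combined into a single check for the trigger condition (more than one replica down). The following is a minimal sketch; the function names `replicas_down` and `is_unsafe` are hypothetical helpers, not part of kubectl. It assumes that an empty ready count means zero, since the field may be omitted from the output when no pods are ready.

```shell
#!/usr/bin/env bash

# Print how many replicas are not ready, given the desired and ready counts.
# An empty or missing argument is treated as 0.
replicas_down() {
  local desired="${1:-0}" ready="${2:-0}"
  echo $(( desired - ready ))
}

# Succeed (exit status 0) when more than one replica is down.
is_unsafe() {
  [ "$(replicas_down "$1" "$2")" -gt 1 ]
}

# Usage (requires cluster access):
#   desired=$(kubectl get statefulset "$KUBE_STATEFUL_SET" -n "$KUBE_NAMESPACE" -o=jsonpath='{.spec.replicas}')
#   ready=$(kubectl get statefulset "$KUBE_STATEFUL_SET" -n "$KUBE_NAMESPACE" -o=jsonpath='{.status.readyReplicas}')
#   is_unsafe "$desired" "$ready" && echo "More than one replica is down."
```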

### Get the status of all the pods belonging to the specified Statefulset
```shell
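# Assumes the pods carry the label app=<Statefulset name>; adjust the selector to match the labels in your pod template
kubectl get pods -n ${KUBE_NAMESPACE} -l app=${KUBE_STATEFUL_SET} -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase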
```
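
To narrow the pod list down to the replicas that need attention, the two-column output above can be filtered. This is a small sketch; `not_running_pods` is a hypothetical helper name, and it only looks at the pod phase, so a pod that is `Running` but not ready will not be listed.

```shell
#!/usr/bin/env bash

# Read "NAME STATUS" rows (header line included) on stdin and
# print the names of pods whose phase is anything other than Running.
not_running_pods() {
  awk 'NR > 1 && $2 != "Running" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n "$KUBE_NAMESPACE" -l app="$KUBE_STATEFUL_SET" \
#     -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase | not_running_pods
```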

### Get the logs of the specified pod
```shell
kubectl logs ${POD_NAME} -n ${KUBE_NAMESPACE}
```

### Resource constraints: CPU, memory, or disk space pressure can cause the Kubernetes Statefulset replicas to stop functioning and trigger this incident.
```shell
#!/bin/bash

# Kubernetes Statefulset name
statefulset_name=${KUBE_STATEFUL_SET}

# Namespace where the Statefulset is deployed
namespace=${KUBE_NAMESPACE}

# Only match pods named <statefulset>-<ordinal>
pod_pattern="^${statefulset_name}-[0-9]+[[:space:]]"

# Check CPU usage of Statefulset replicas.
# kubectl top prints values such as "250m"; awk keeps the leading number when summing.
cpu=$(kubectl top pods -n "$namespace" --no-headers | grep -E "$pod_pattern" | awk '{total += $2} END {printf "%.0f\n", total}')
cpu_threshold=${CPU_THRESHOLD_IN_MILLICORES}

if (( cpu > cpu_threshold )); then
  echo "CPU usage of $cpu millicores is higher than threshold of $cpu_threshold millicores."
  echo "Resource constraints may be causing the Statefulset replicas to stop functioning."
fi

# Check memory usage of Statefulset replicas.
# Assumes kubectl top reports memory in Mi (e.g. "128Mi"); the sum is converted to bytes.
memory=$(kubectl top pods -n "$namespace" --no-headers | grep -E "$pod_pattern" | awk '{total += $3} END {printf "%.0f\n", total * 1024 * 1024}')
memory_threshold=${MEMORY_THRESHOLD_IN_BYTES}

if (( memory > memory_threshold )); then
  echo "Memory usage of $memory bytes is higher than threshold of $memory_threshold bytes."
  echo "Resource constraints may be causing the Statefulset replicas to stop functioning."
fi

# Check disk usage of the first replica.
# Assumes the data volume is mounted at /data and the container image ships GNU df.
disk=$(kubectl exec "${statefulset_name}-0" -n "$namespace" -- df --output=pcent /data | tail -1 | tr -dc '0-9')
disk=${disk:-0}
disk_threshold=${DISK_THRESHOLD_IN_PERCENT}

if (( disk > disk_threshold )); then
  echo "Disk usage of $disk% is higher than threshold of $disk_threshold%."
  echo "Resource constraints may be causing the Statefulset replicas to stop functioning."
fi

```
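
The thresholds above are expressed in millicores and bytes, while `kubectl top` prints quantities with unit suffixes. The helpers below sketch one way to normalize those values before comparing them. The names `mem_to_bytes` and `cpu_to_millicores` are hypothetical, and the sketch only handles whole-number quantities with the `Ki`, `Mi`, `Gi`, and `m` suffixes.

```shell
#!/usr/bin/env bash

# Convert a memory quantity such as 512Ki, 128Mi, or 2Gi to bytes.
# A value without a recognized suffix is assumed to already be in bytes.
mem_to_bytes() {
  local q="$1" n
  n="${q%%[!0-9]*}"   # keep only the leading digits
  case "$q" in
    *Ki) echo $(( n * 1024 )) ;;
    *Mi) echo $(( n * 1024 * 1024 )) ;;
    *Gi) echo $(( n * 1024 * 1024 * 1024 )) ;;
    *)   echo "$n" ;;
  esac
}

# Convert a CPU quantity such as 250m (millicores) or 2 (whole cores) to millicores.
cpu_to_millicores() {
  local q="$1"
  case "$q" in
    *m) echo "${q%m}" ;;
    *)  echo $(( q * 1000 )) ;;
  esac
}
```

For example, `mem_to_bytes 128Mi` can be compared directly against `MEMORY_THRESHOLD_IN_BYTES`.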

### Network issues: DNS resolution failures, network connectivity problems, or firewall configuration errors can cause the Kubernetes Statefulset replicas to stop functioning and trigger this incident.
```shell
#!/bin/bash

# Set variables (SERVICE is taken from the environment as-is)
NAMESPACE=${KUBE_NAMESPACE}
STATEFULSET=${KUBE_STATEFUL_SET}

# Check that the service exists (a Service has no "Running" status, so test for its presence)
if ! kubectl get service "$SERVICE" -n "$NAMESPACE" > /dev/null 2>&1; then
  echo "Service $SERVICE was not found in namespace $NAMESPACE. Exiting..."
  exit 1
fi

# Check if all the replicas are ready (readyReplicas may be empty when no pod is ready)
EXPECTED_REPLICAS=$(kubectl get statefulset "$STATEFULSET" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
READY_REPLICAS=$(kubectl get statefulset "$STATEFULSET" -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}')
READY_REPLICAS=${READY_REPLICAS:-0}
if [[ "${EXPECTED_REPLICAS:-0}" -ne "$READY_REPLICAS" ]]; then
  echo "Expected $EXPECTED_REPLICAS replicas of statefulset $STATEFULSET in namespace $NAMESPACE. But only $READY_REPLICAS are ready. Exiting..."
  exit 1
fi

# Check network connectivity.
# Assumes each pod answers HTTP on port 80 and that this host can reach pod IPs directly.
POD_NAMES=$(kubectl get pods -l app="$STATEFULSET" -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}')
for POD_NAME in $POD_NAMES; do
  POD_IP=$(kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o jsonpath='{.status.podIP}')
  if [[ $(curl --write-out '%{http_code}' --silent --output /dev/null --max-time 5 "$POD_IP") != 200 ]]; then
    echo "Network connectivity issue with pod $POD_NAME in namespace $NAMESPACE. Exiting..."
    exit 1
  fi
done

echo "No network issues found with Kubernetes Statefulset $STATEFULSET in namespace $NAMESPACE."

```
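
Because DNS resolution failure is one of the causes listed above, it can help to test the per-replica DNS names as well as the pod IPs. The sketch below builds those names; `pod_dns_names` is a hypothetical helper. It assumes that `SERVICE` is the headless Service governing the Statefulset, that ordinals start at 0, and that the cluster domain is `cluster.local`.

```shell
#!/usr/bin/env bash

# Print the stable DNS name of each replica, one per line:
#   <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster-domain>
pod_dns_names() {
  local sts="$1" svc="$2" ns="$3" replicas="$4" domain="${5:-cluster.local}"
  local i
  for (( i = 0; i < replicas; i++ )); do
    echo "${sts}-${i}.${svc}.${ns}.svc.${domain}"
  done
}

# Usage (requires cluster access and a resolver tool such as nslookup in the container image):
#   for name in $(pod_dns_names "$KUBE_STATEFUL_SET" "$SERVICE" "$KUBE_NAMESPACE" 3); do
#     kubectl exec "${KUBE_STATEFUL_SET}-0" -n "$KUBE_NAMESPACE" -- nslookup "$name"
#   done
```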
## Repair
---
### Scale up the number of replicas to ensure that the desired state is achieved and the workload is available.
```shell

#!/bin/bash

# Define variables
STATEFULSET_NAME=${KUBE_STATEFUL_SET}
NAMESPACE=${KUBE_NAMESPACE}
NEW_REPLICAS="PLACEHOLDER"

# Scale the Statefulset to the desired number of replicas
kubectl scale sts "$STATEFULSET_NAME" -n "$NAMESPACE" --replicas="$NEW_REPLICAS"

# Read the number of ready replicas.
# New pods can take a while to become ready, so re-run this check if it fails at first.
READY_REPLICAS=$(kubectl get sts "$STATEFULSET_NAME" -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}')

# Check if all requested replicas are ready
if [ "${READY_REPLICAS:-0}" = "$NEW_REPLICAS" ]
then
  echo "$STATEFULSET_NAME is available."
else
  echo "$STATEFULSET_NAME is not available."
fi

```
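
Replicas rarely become ready the moment the scale command returns, so a single status check right after scaling can report a false failure. The sketch below polls until the ready count matches the target. `wait_for_value` is a hypothetical helper, and the attempt count and delay shown in the usage comment are illustrative values only.

```shell
#!/usr/bin/env bash

# Run a command repeatedly until its output equals the expected value.
# Arguments: expected value, maximum attempts, delay in seconds, then the command to run.
# Returns 0 on a match and 1 when the attempts are exhausted.
wait_for_value() {
  local expected="$1" attempts="$2" delay="$3"
  shift 3
  local i actual
  for (( i = 0; i < attempts; i++ )); do
    actual="$("$@" 2>/dev/null)"
    [ "$actual" = "$expected" ] && return 0
    sleep "$delay"
  done
  return 1
}

# Usage (requires cluster access):
#   wait_for_value "$NEW_REPLICAS" 30 10 \
#     kubectl get sts "$KUBE_STATEFUL_SET" -n "$KUBE_NAMESPACE" -o jsonpath='{.status.readyReplicas}' \
#     && echo "All replicas are ready." || echo "Timed out waiting for replicas."
```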
