---
id: 8d1af136-9a46-439c-b6c1-903acb539291
---

# Kubernetes Node Status Not OK
---

This incident fires when a Kubernetes node reports a status other than OK, typically a `Ready` condition that is `False` or `Unknown`. The scheduler cannot place pods on a node in that state, which points to an underlying problem with the node's health. Applications running on the cluster can lose availability or performance as a result, so investigate promptly to restore normal operation.

### Parameters
```shell
# Environment Variables
export NODE_NAME="PLACEHOLDER"
export CONTROL_PLANE_IP="PLACEHOLDER"
export POD_NAMESPACE="PLACEHOLDER"
export POD_NAME="PLACEHOLDER"
export CPU_THRESHOLD="PLACEHOLDER"
export MEMORY_THRESHOLD="PLACEHOLDER"
```

## Debug

### List all nodes in the Kubernetes cluster
```shell
kubectl get nodes
```
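To narrow the list to the nodes that need attention, the output can be filtered on the STATUS column. The following is a minimal sketch; `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Cordoned but healthy nodes ("Ready,SchedulingDisabled") are not reported.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

The helper only parses text, so it can be tried against saved command output before pointing it at a live cluster.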

### Check the status of a specific node <node-name>
```shell
kubectl describe node ${NODE_NAME}
```
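The `Conditions` table in the describe output is usually the quickest indicator of what is wrong. As a rough sketch, the conditions can also be read from the node object and reduced to the unhealthy ones. `unhealthy_conditions` is a hypothetical helper, and it assumes the usual convention that `Ready` is healthy when `True` while every other condition (for example `MemoryPressure`, `DiskPressure`, `PIDPressure`, `NetworkUnavailable`) is healthy when `False`.

```shell
# Print every node condition whose status is not the healthy value.
# Reads "<type> <status>" pairs on stdin, one pair per line.
# Assumption: Ready is healthy when True, all other conditions when False.
unhealthy_conditions() {
  awk '
    NF < 2 { next }                                   # skip blank or partial lines
    $1 == "Ready" && $2 != "True"  { print $1 "=" $2 }
    $1 != "Ready" && $2 != "False" { print $1 "=" $2 }
  '
}

# Usage (needs cluster access):
#   kubectl get node "${NODE_NAME}" \
#     -o jsonpath='{range .status.conditions[*]}{.type}{" "}{.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

An empty result means every reported condition has its healthy value.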

### Check the events associated with a specific node <node-name>
```shell
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=${NODE_NAME}
```

### Check the health status of the kubelet service on the node <node-name>
```shell
# systemctl has no --node flag; run it on the node itself (SSH access is assumed here)
ssh "${NODE_NAME}" systemctl status kubelet.service
```

### Check the logs for the kubelet service on the node <node-name>
```shell
# journalctl has no --node flag; run it on the node itself (SSH access is assumed here)
ssh "${NODE_NAME}" journalctl -u kubelet.service --no-pager
```
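Kubelet logs are verbose, so it can help to keep only error and fatal entries once the log has been captured. A minimal sketch, assuming the kubelet writes the default klog text format, where the severity letter (`I`, `W`, `E`, `F`) precedes the month and day. `kubelet_errors` is a hypothetical helper and will not match a kubelet configured for JSON logging.

```shell
# Keep only error (E) and fatal (F) kubelet entries from journalctl output.
# Reads journalctl lines on stdin; assumes klog text format, for example:
#   Jan 02 15:04:06 node-1 kubelet[1234]: E0102 15:04:06.000002 ...
kubelet_errors() {
  # "|| true" keeps the exit status 0 when nothing matches
  grep -E 'kubelet\[[0-9]+\]: [EF][0-9]{4} ' || true
}

# Usage, on the node:
#   journalctl -u kubelet.service --no-pager | kubelet_errors
```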

### Check the status of the Docker service on the node <node-name>
```shell
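# Run on the node itself (SSH access is assumed here).
# On nodes that use containerd as the container runtime, check containerd.service instead.
ssh "${NODE_NAME}" systemctl status docker.service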
```

### Check the logs for the Docker service on the node <node-name>
```shell
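# Run on the node itself (SSH access is assumed here); use containerd.service on containerd nodes
ssh "${NODE_NAME}" journalctl -u docker.service --no-pager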
```

### Network or connectivity issues between the Kubernetes nodes and the control plane.
```shell
#!/bin/bash

# The ping runs inside ${POD_NAME}, so it only exercises the network path of the
# node that hosts that pod. Pick a pod scheduled on the node under investigation.
POD_NODE=$(kubectl get pod "${POD_NAME}" -n "${POD_NAMESPACE}" -o jsonpath='{.spec.nodeName}')

# Check the connectivity to the control plane (the container image must include ping)
echo "Checking connectivity to control plane ${CONTROL_PLANE_IP} from pod ${POD_NAME} on node ${POD_NODE}..."
kubectl exec "${POD_NAME}" -n "${POD_NAMESPACE}" -- ping -c 3 "${CONTROL_PLANE_IP}"

```
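ICMP can be filtered even when the API server is reachable, and the kubelet talks to the API server over TCP, so a TCP probe from the node is a useful complement to ping. The following is a sketch under stated assumptions: bash and the coreutils `timeout` command are available on the node, `tcp_reachable` is a hypothetical helper name, and port 6443 is only a common API server default that this runbook does not specify.

```shell
# Return 0 if a TCP connection to host:port can be opened within the timeout.
# Usage: tcp_reachable <host> <port> [timeout_seconds]
tcp_reachable() {
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage, run on the affected node (port 6443 is an assumption):
#   tcp_reachable "${CONTROL_PLANE_IP}" 6443 && echo reachable || echo unreachable
```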

### Resource constraints on the node due to excessive resource utilization by the applications running on it.
```shell
#!/bin/bash

# NODE_NAME, CPU_THRESHOLD (percent) and MEMORY_THRESHOLD (Mi) come from the Parameters section
: "${NODE_NAME:?}" "${CPU_THRESHOLD:?}" "${MEMORY_THRESHOLD:?}"

# kubectl top requires metrics-server. Its columns are:
# NAME  CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%

# Get the CPU utilization for the node (third column, percent)
CPU_UTILIZATION=$(kubectl top node "$NODE_NAME" | awk 'NR==2{print $3}' | sed 's/%//')

# Get the memory utilization for the node (fourth column, Mi)
MEMORY_UTILIZATION=$(kubectl top node "$NODE_NAME" | awk 'NR==2{print $4}' | sed 's/Mi//')

# Check if the CPU utilization is above the threshold
if [ "$CPU_UTILIZATION" -gt "$CPU_THRESHOLD" ]; then
  echo "CPU utilization for node $NODE_NAME is above the threshold of $CPU_THRESHOLD%"
fi

# Check if the memory utilization is above the threshold
if [ "$MEMORY_UTILIZATION" -gt "$MEMORY_THRESHOLD" ]; then
  echo "Memory utilization for node $NODE_NAME is above the threshold of $MEMORY_THRESHOLD Mi"
fi

```
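When metrics are missing for a node, `kubectl top` can print a non-numeric placeholder such as `<unknown>`, and an integer comparison on that value fails with a shell error rather than a clear result. A minimal sketch of a guarded comparison follows; `exceeds_threshold` is a hypothetical helper and accepts whole numbers only.

```shell
# Compare a usage value with a threshold.
# Returns 0 if usage > threshold, 1 if not, 2 if either value is not a whole number.
# Usage: exceeds_threshold <usage> <threshold>
exceeds_threshold() {
  case "$1" in ''|*[!0-9]*) return 2 ;; esac
  case "$2" in ''|*[!0-9]*) return 2 ;; esac
  [ "$1" -gt "$2" ]
}

# Usage:
#   exceeds_threshold "$CPU_UTILIZATION" "$CPU_THRESHOLD" && echo "CPU above threshold"
```

Treating exit status 2 separately makes it possible to report "metrics unavailable" instead of silently skipping the check.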
## Repair
---

### Check the health of the affected Kubernetes node. Identify and fix any underlying issues with the node, such as hardware failure or resource exhaustion.
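```shell
#!/bin/bash

# Get the name of the affected Kubernetes node
node_name=${NODE_NAME}

# Check the health of the node (print the rows under the Conditions header)
kubectl describe node "$node_name" | grep -i -A 8 '^conditions'

# Check the hardware resources of the node (print the rows under the Capacity header)
kubectl describe node "$node_name" | grep -i -A 8 '^capacity'

# Check the resource utilization of the node
kubectl top node "$node_name"

# Identify and fix any underlying issues
# Depending on the issue, additional steps may be required here

# Last resort: evict the workloads, then remove the node object from the cluster.
# Deleting the node does NOT reboot the machine; a kubelet configured for
# self-registration recreates the node object when it restarts.
kubectl drain "$node_name" --ignore-daemonsets --delete-emptydir-data
kubectl delete node "$node_name"

```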
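Deleting the node object is disruptive, so a less invasive sequence is often worth trying first: cordon the node, drain it, restart the kubelet, then uncordon. The sketch below only prints that sequence so it can be reviewed before anything runs. `node_recovery_plan` is a hypothetical helper, and the SSH step assumes shell access to the node with sudo rights.

```shell
# Print, without executing, a conservative recovery sequence for a node.
# Usage: node_recovery_plan <node-name>
node_recovery_plan() {
  node="${1:?node name required}"
  cat <<EOF
kubectl cordon ${node}
kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data
ssh ${node} sudo systemctl restart kubelet.service
kubectl uncordon ${node}
EOF
}

# Usage:
#   node_recovery_plan "${NODE_NAME}"
```

Running the printed commands one at a time, and checking `kubectl get nodes` between steps, keeps each action reversible until the node reports `Ready` again.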






