---
id: 6796032e-52c1-466c-a270-9afb1095c444
---

# Host systemd service crashed (instance) incident.
---

This incident occurs when a systemd service fails on a specific host instance. Possible causes include a software bug, a hardware failure, or resource overload on the host. The failure can cause downtime or service disruption on the affected host instance and may need immediate attention to restore normal operation.

### Parameters
```shell
export SERVICE_NAME="PLACEHOLDER"

export IP_ADDRESS="PLACEHOLDER"

export SYSTEMD_SERVICE_NAME="PLACEHOLDER"

export HOST_INSTANCE="PLACEHOLDER"
```

## Debug

### Check the status of the systemd service on the affected host instance
```shell
systemctl status "${SERVICE_NAME}" --no-pager
```
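For a scripted check, the same state is available as machine-readable properties through `systemctl show`. The following is a minimal sketch, assuming a systemd version that supports `--value` (230 or later); the function names `summarize_unit_state` and `check_unit` are hypothetical helpers, not part of systemd.

```shell
#!/bin/bash
# Hypothetical helper: turn ActiveState and Result (two properties reported by
# `systemctl show`) into a one-line summary.
summarize_unit_state() {
    local active_state="$1" result="$2"
    case "$active_state" in
        active) echo "running" ;;
        failed) echo "failed (result: ${result})" ;;
        *)      echo "not running (state: ${active_state}, result: ${result})" ;;
    esac
}

# Hypothetical helper: read both properties for a unit and summarize them.
check_unit() {
    local unit="$1" active_state result
    active_state=$(systemctl show -p ActiveState --value "$unit")
    result=$(systemctl show -p Result --value "$unit")
    summarize_unit_state "$active_state" "$result"
}

# Usage: check_unit "${SERVICE_NAME}"
```

A `Result` value such as `oom-kill`, `signal`, or `exit-code` gives a first hint about why the service stopped before reading the full logs.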

### Check the systemd journal for logs related to the service crash
```shell
journalctl -u "${SERVICE_NAME}" -b --no-pager
```
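On a busy service the full journal for the current boot can be long. One way to narrow it down, sketched here under the assumption that the crash logged at priority `err` or higher, is to filter by priority and limit the number of entries. The function name `recent_unit_errors` is hypothetical.

```shell
#!/bin/bash
# Show the most recent error-level (and more severe) journal entries for a unit
# from the current boot, without opening a pager.
recent_unit_errors() {
    local unit="$1" count="${2:-50}"
    journalctl -u "$unit" -b -p err -n "$count" --no-pager
}

# Usage: recent_unit_errors "${SERVICE_NAME}" 100
```

If nothing shows up at this priority, drop `-p err` and read the unfiltered output, since some services log their failures at a lower priority.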

### Check the kernel log for any relevant error messages
```shell
dmesg | grep -i "${SERVICE_NAME}"
```
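Kernel messages often name the process rather than the unit, so a search for the service name alone may miss the relevant lines. A common complementary check is to look for OOM killer activity. The sketch below assumes the usual wording of those kernel messages; the function name `filter_oom_lines` is hypothetical.

```shell
#!/bin/bash
# Keep only kernel log lines that typically accompany an OOM kill.
# Reads log text on stdin, so it can be fed from `dmesg` or from a saved log file.
filter_oom_lines() {
    # `|| true` keeps the exit status at 0 when no line matches.
    grep -iE 'out of memory|oom-kill|oom_reaper|killed process' || true
}

# Usage (may need root, depending on kernel.dmesg_restrict):
#   dmesg -T | filter_oom_lines
```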

### Check the CPU and memory usage on the affected host instance
```shell
top
```

### Check the disk usage and available space on the affected host instance
```shell
df -h
```
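To flag nearly full filesystems without reading the table by eye, the output of `df -P` can be filtered by usage. This is a minimal sketch, assuming mount points without spaces; the function name `full_filesystems` and the 90% default are illustrative choices.

```shell
#!/bin/bash
# Print mount points whose usage is at or above a threshold (percent).
# Expects `df -P` (POSIX format) output on stdin:
# column 5 is the capacity, column 6 is the mount point.
full_filesystems() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Usage: df -P | full_filesystems 90
```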

### Check the network connectivity on the affected host instance
```shell
ping -c 4 "${IP_ADDRESS}"
```

### Check the firewall rules on the affected host instance
```shell
iptables -L
```

### Check the hardware status of the affected host instance
```shell
sensors
```

### Check whether the host system's resources were overloaded due to high usage or traffic, causing the systemd service to fail
```shell
#!/bin/bash

# Get the current CPU usage (user + system) from a single batch-mode run of top
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2 + $4}')

# Get the current memory usage as a percentage of total memory
MEMORY_USAGE=$(free | awk '/^Mem:/ {printf "%.2f", $3/$2*100}')

# Check if CPU or memory usage is over 90%
if (( $(echo "$CPU_USAGE > 90" | bc -l) )) || (( $(echo "$MEMORY_USAGE > 90" | bc -l) )); then
    # If usage is high, print a warning and restart the systemd service
    echo "High CPU or memory usage detected. Restarting systemd service..."
    systemctl restart "${SYSTEMD_SERVICE_NAME}"
else
    # If usage is normal, print a success message
    echo "CPU and memory usage is normal."
fi
```
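The "used" column of `free` does not account for memory the kernel can reclaim, so a ratio based on it can differ from the real memory pressure. An alternative, sketched here under the assumption that the kernel exposes `MemAvailable` in `/proc/meminfo` (Linux 3.14 or later), is to compute usage from `MemTotal` and `MemAvailable`. The function name `mem_used_percent` is hypothetical.

```shell
#!/bin/bash
# Compute the percentage of memory in use as
# (MemTotal - MemAvailable) / MemTotal * 100.
# Reads /proc/meminfo-formatted text on stdin.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.1f\n", (total - avail) / total * 100 }'
}

# Usage: mem_used_percent < /proc/meminfo
```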

## Repair

### Restart the systemd service on the affected host instance

Manually restarting the systemd service on the affected host instance can resolve the issue. If the failure was due to a temporary problem, the service should resume normal operation after the restart.
```shell
#!/bin/bash

# HOST_INSTANCE must be set to the name of the affected host instance (see Parameters).

# Restart the systemd service on the affected host instance.
# Note: the name@instance.service form addresses an instance of a template unit;
# for a plain unit, use "${SYSTEMD_SERVICE_NAME}.service" instead.
systemctl restart "${SYSTEMD_SERVICE_NAME}@${HOST_INSTANCE}.service"

# Check the status of the systemd service to verify that it has resumed normal operation.
systemctl status "${SYSTEMD_SERVICE_NAME}@${HOST_INSTANCE}.service"
```
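A service can report a successful restart and still crash a few seconds later. The sketch below restarts a unit and then polls `systemctl is-active` a few times before declaring success. The function name `restart_and_verify`, the default of 5 checks, and the 2-second delay are assumptions chosen for illustration.

```shell
#!/bin/bash
# Restart a unit, then poll `systemctl is-active` until it reports active
# or the checks run out. Returns 0 on success, 1 otherwise.
restart_and_verify() {
    local unit="$1" attempts="${2:-5}" delay="${3:-2}" i
    systemctl restart "$unit" || return 1
    for (( i = 1; i <= attempts; i++ )); do
        if systemctl is-active --quiet "$unit"; then
            echo "${unit} is active after ${i} check(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "${unit} did not become active after ${attempts} check(s)" >&2
    return 1
}

# Usage: restart_and_verify "${SYSTEMD_SERVICE_NAME}@${HOST_INSTANCE}.service" 5 2
```

If the function returns non-zero, the journal for the unit is the next place to look, since a restart only helps when the underlying cause was temporary.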
