This issue occurs when the /data/pg_wal directory in the PostgreSQL PVC consumes all the available disk space.
As a result, the PostgreSQL pods remain in a running state (despite the disk being full), but applications fail to connect to the PostgreSQL server.
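To confirm the symptom, you can check the disk usage of the PVC and of the pg_wal directory from inside any of the PostgreSQL pods. The pod name below is one of the example pod names used later in this article; adjust the pod name and namespace for your environment:
kubectl exec -it postgres-bmc-pg-ha-0 -n <namespace> -- df -h /data
kubectl exec -it postgres-bmc-pg-ha-0 -n <namespace> -- du -sh /data/pg_wal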
Handling this scenario was identified as Product Improvement DRRE3-3560, and a more automated approach has been implemented in BMC Helix IT Operations Management (ITOM) Version 24.3.00.
24.3.00+ Solution:
A new pod introduced in 24.3.00, postgres-replication-monitor, monitors replication for the PostgreSQL pods.
If the postgres-replication-monitor pod detects issues with any one of the replicas, it attempts to remediate the issue using the same commands that were used to resolve it manually prior to 24.3.00 (see the manual steps below).
On rare occasions, such as if both replicas have issues, the postgres-replication-monitor pod may be unable to resolve the issue and will, by design, go into the CrashLoopBackOff state so that the problem condition is easier to detect via monitoring.
If the postgres-replication-monitor pod is detected to be in the CrashLoopBackOff status, please contact BMC Customer Support for further assistance.
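To check the status of this pod, you can list it directly, for example:
kubectl get pods -n <namespace> | grep -i postgres-replication-monitor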
Solution for versions prior to 24.3.00:
This issue occurs when replication is failing on either of the follower PostgreSQL pods, which can cause the PostgreSQL WAL files to keep growing.
To fix the replication issue on the secondary PostgreSQL pods, reinitialize PostgreSQL on the problematic pods by following the steps below:
1. Determine the master PostgreSQL pod:
kubectl get pods -n <namespace> --show-labels | grep -i postgres | grep -i master
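Depending on the labels applied by your deployment, the leader pod may also be selectable directly with a label selector. The label name and value below are an assumption and may differ in your environment; the grep approach above works regardless:
kubectl get pods -n <namespace> -l role=master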
2. Exec into the master PostgreSQL pod, e.g., postgres-bmc-pg-ha-2:
kubectl exec -it postgres-bmc-pg-ha-2 -n <namespace> -- bash
Note: The following commands use the patronictl command-line tool. See here for more information.
3. Get the list of postgres members, their status, and role:
postgres@postgres-bmc-pg-ha-2:/$ patronictl list
| Member               | Host       | Role    | State        | TL | Lag in MB |
+----------------------+------------+---------+--------------+----+-----------+
| postgres-bmc-pg-ha-0 | 10.xx.x.xx | Replica | stopped      |    | unknown   |
| postgres-bmc-pg-ha-1 | 10.xx.x.xx | Replica | start failed |    | unknown   |
| postgres-bmc-pg-ha-2 | 10.xx.x.xx | Leader  | running      | 15 |           |
+----------------------+------------+---------+--------------+----+-----------+
From this output, note the members which are not in the "running" or "streaming" state.
Note: In some scenarios, the value of the "State" column for all members is reported as non-problematic, i.e., running or streaming; however, the Lag and the disk space used by the pg_wal directory keep growing. In this scenario, please contact BMC Customer Support.
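As a rough way to confirm the growth described in this note, you can check the size of the pg_wal directory and the replication slot status from the leader pod. The path is the one mentioned at the top of this article, and the query uses the standard pg_replication_slots view; psql may require additional connection options depending on your configuration:
postgres@postgres-bmc-pg-ha-2:/$ du -sh /data/pg_wal
postgres@postgres-bmc-pg-ha-2:/$ psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"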
4. Re-initialize the failed members noted in step 3. (If prompted for confirmation, approve it.)
Note: The commands below are examples that use the specific members from the output above; the actual member names will vary from case to case, so make sure to use the members identified in step 3. The patronictl reinit command forces reinitialization of the specified member (PostgreSQL pod) in a Patroni HA cluster.
postgres@postgres-bmc-pg-ha-2:/$ patronictl reinit postgres-bmc-pg-ha postgres-bmc-pg-ha-0
Are you sure you want to reinitialize members postgres-bmc-pg-ha-0? [y/N]: y
postgres@postgres-bmc-pg-ha-2:/$ patronictl reinit postgres-bmc-pg-ha postgres-bmc-pg-ha-1
Are you sure you want to reinitialize members postgres-bmc-pg-ha-1? [y/N]: y
After this, the state of the member will move through the following values:
failed/stopped --> creating replica --> starting --> "running" or "streaming" state.
Note: In this process, the leader does not change.
| Member               | Host        | Role    | State   | TL | Lag in MB |
+----------------------+-------------+---------+---------+----+-----------+
| postgres-bmc-pg-ha-0 | 10.xx.x.xxx | Replica | running | 15 | 0         |
| postgres-bmc-pg-ha-1 | 10.xx.x.xxx | Replica | running | 15 | 0         |
| postgres-bmc-pg-ha-2 | 10.xx.x.xxx | Leader  | running | 15 |           |
+----------------------+-------------+---------+---------+----+-----------+
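To monitor progress while the members are being reinitialized, you can simply re-run patronictl list from the leader pod at intervals, for example:
postgres@postgres-bmc-pg-ha-2:/$ while true; do patronictl list; sleep 30; done
Press Ctrl+C to stop once all members report the "running" or "streaming" state.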
This replication issue can occur when something is wrong either with the Kubernetes (K8s) cluster or with the network/K8s API pods, which prevents the PostgreSQL replication from electing another node as leader. To investigate the root cause, gather the following logs:
1. All the postgres pod logs
2. Cluster events
kubectl get events -A
3. K8s API pod logs
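For illustration, the commands below show one way to capture these items. The PostgreSQL pod name is the example used earlier in this article (repeat for each PostgreSQL pod), and the kube-apiserver label selector assumes a kubeadm-style cluster where the API server runs as static pods in the kube-system namespace; on managed clusters those logs may not be accessible this way:
kubectl logs postgres-bmc-pg-ha-0 -n <namespace> > postgres-bmc-pg-ha-0.log
kubectl get events -A > cluster-events.txt
kubectl logs -n kube-system -l component=kube-apiserver > kube-apiserver.log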
For detailed instructions on how to collect the logs, please refer to this link.
Please open a support case if there are any follow-up queries.
See also: BEST FAQ on resolving PostgreSQL Pod/Deployment issues in an OnPrem BMC Helix IT Operations Management (ITOM) deployment