This issue occurs when the /data/pg_wal directory in the PostgreSQL PVC consumes all the available disk space.
As a result, the PostgreSQL pods remain in a running state (despite the disk being full), but applications fail to connect to the PostgreSQL server.
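To confirm the symptom, you can check the disk usage of the PVC and of the pg_wal directory from inside any of the PostgreSQL pods. The pod name below is one of the example pod names used later in this article; adjust the pod name and namespace for your environment:
kubectl exec -it postgres-bmc-pg-ha-0 -n <namespace> -- df -h /data
kubectl exec -it postgres-bmc-pg-ha-0 -n <namespace> -- du -sh /data/pg_wal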
Handling this scenario was identified as Product Improvement DRRE3-3560, and a more automated approach has been implemented in BMC Helix IT Operations Management (ITOM) Version 24.3.00.
24.3.00+ Solution:
A new pod introduced in 24.3.00, postgres-replication-monitor, monitors replication for the PostgreSQL pods.
If the postgres-replication-monitor pod detects issues with any one of the replicas, it attempts to remediate the issue using the same commands that were used to resolve it manually prior to 24.3.00 (see the manual steps below).
On rare occasions, such as if both replicas have issues, the postgres-replication-monitor pod may be unable to resolve the issue and will, by design, go into the CrashLoopBackOff state so that the problem condition is easier to detect via monitoring.
If the postgres-replication-monitor pod is detected to be in the CrashLoopBackOff status, please contact BMC Customer Support for further assistance.
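To check the status of this pod, you can list it directly, for example:
kubectl get pods -n <namespace> | grep -i postgres-replication-monitor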
Solution for versions prior to 24.3.00:
This issue occurs when replication is failing on either of the follower PostgreSQL pods, which can cause the PostgreSQL WAL files to keep growing.
To fix the replication issue on the secondary PostgreSQL pods, reinitialize PostgreSQL on the problematic pods by following the steps below:
1. Determine the master PostgreSQL pod:
kubectl get pods -n <namespace> --show-labels | grep -i postgres | grep -i master
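Depending on the labels applied by your deployment, the leader pod may also be selectable directly with a label selector. The label name and value below are an assumption and may differ in your environment; the grep approach above works regardless:
kubectl get pods -n <namespace> -l role=master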
2. Exec into the master PostgreSQL pod, e.g., postgres-bmc-pg-ha-2:
kubectl exec -it postgres-bmc-pg-ha-2 -n <namespace> -- bash
Note: The following commands use the patronictl command-line tool. See here for more information.
3. Get the list of postgres members, their status, and role:
postgres@postgres-bmc-pg-ha-2:/$ patronictl list
| Member               | Host       | Role    | State        | TL | Lag in MB |
+----------------------+------------+---------+--------------+----+-----------+
| postgres-bmc-pg-ha-0 | 10.xx.x.xx | Replica | stopped      |    | unknown   |
| postgres-bmc-pg-ha-1 | 10.xx.x.xx | Replica | start failed |    | unknown   |
| postgres-bmc-pg-ha-2 | 10.xx.x.xx | Leader  | running      | 15 |           |
+----------------------+------------+---------+--------------+----+-----------+
From this output, note the members which are not in the "running" or "streaming" state.
Note: In some scenarios, the value of the "State" column for all members is reported as non-problematic, i.e., running or streaming; however, the Lag and the disk space used by the pg_wal directory keep growing. In this scenario, please contact BMC Customer Support.
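As a rough way to confirm the growth described in this note, you can check the size of the pg_wal directory and the replication slot status from the leader pod. The path is the one mentioned at the top of this article, and the query uses the standard pg_replication_slots view; psql may require additional connection options depending on your configuration:
postgres@postgres-bmc-pg-ha-2:/$ du -sh /data/pg_wal
postgres@postgres-bmc-pg-ha-2:/$ psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"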
4. Re-initialize the failed members noted in step 3. (If prompted for confirmation, approve it.)
Note: The commands below are examples that use the specific members from the output above; the actual member names will vary from case to case, so make sure to use the members identified in step 3. The patronictl reinit command forces reinitialization of the specified member (PostgreSQL pod) in a Patroni HA cluster.
postgres@postgres-bmc-pg-ha-2:/$ patronictl reinit postgres-bmc-pg-ha postgres-bmc-pg-ha-0
Are you sure you want to reinitialize members postgres-bmc-pg-ha-0? [y/N]: y
postgres@postgres-bmc-pg-ha-2:/$ patronictl reinit postgres-bmc-pg-ha postgres-bmc-pg-ha-1
Are you sure you want to reinitialize members postgres-bmc-pg-ha-1? [y/N]: y
After this, the state of the member will move through the following values:
failed/stopped --> creating replica --> starting --> "running" or "streaming" state.
Note: In this process, the leader does not change.
| Member               | Host        | Role    | State   | TL | Lag in MB |
+----------------------+-------------+---------+---------+----+-----------+
| postgres-bmc-pg-ha-0 | 10.xx.x.xxx | Replica | running | 15 | 0         |
| postgres-bmc-pg-ha-1 | 10.xx.x.xxx | Replica | running | 15 | 0         |
| postgres-bmc-pg-ha-2 | 10.xx.x.xxx | Leader  | running | 15 |           |
+----------------------+-------------+---------+---------+----+-----------+
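To monitor progress while the members are being reinitialized, you can simply re-run patronictl list from the leader pod at intervals, for example:
postgres@postgres-bmc-pg-ha-2:/$ while true; do patronictl list; sleep 30; done
Press Ctrl+C to stop once all members report the "running" or "streaming" state.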
This replication issue can occur when something is wrong either with the Kubernetes (K8s) cluster or with the network/K8s API pods, which prevents the PostgreSQL replication from electing another node as leader. To investigate the root cause, gather the following logs:
1. All the postgres pod logs
2. Cluster events
kubectl get events -A
3. K8s API pod logs
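For illustration, the commands below show one way to capture these items. The PostgreSQL pod name is the example used earlier in this article (repeat for each PostgreSQL pod), and the kube-apiserver label selector assumes a kubeadm-style cluster where the API server runs as static pods in the kube-system namespace; on managed clusters those logs may not be accessible this way:
kubectl logs postgres-bmc-pg-ha-0 -n <namespace> > postgres-bmc-pg-ha-0.log
kubectl get events -A > cluster-events.txt
kubectl logs -n kube-system -l component=kube-apiserver > kube-apiserver.log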
For detailed instructions on how to collect the logs, please refer to this link.
Please open a support case if there are any follow-up queries.
See also: BEST FAQ on resolving PostgreSQL Pod/Deployment issues in an OnPrem BMC Helix IT Operations Management (ITOM) deployment