How to check if the datastore is corrupted?
When is it recommended to execute this procedure?
Discovery: Core dumps in /usr/tideway/cores (see the root cause 1)
This procedure is also recommended for symptoms documented in our KB that list db corruption as a possible root cause. For example (non-exhaustive list):
When possible and acceptable, it is also recommended to check for db corruptions before taking a backup. Outside these cases, if not already done, it is recommended to ask BMC Customer Support to evaluate whether this procedure should be executed.
What do users have to know before executing this procedure:
The procedure below only checks the datastore of the member where it is executed. It always has to be executed on each member (including the coordinator). This can be done in parallel or not.
- If multi generational datastore is disabled: The procedure below requires stopping the services of the member where the procedure is executed.
- If multi generational datastore is enabled: The procedure below will check the entire datastore (all partitions) of the current member only if all the services are stopped. Otherwise, it will check only a part of the datastore: the "read-only partitions".
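One more point worth checking before starting: the "What if" section of this article notes that the check fails when the /usr filesystem fills up during the run. A pre-flight check along the following lines may help. This is only a sketch: the helper name has_free_space and the MIN_FREE_KB threshold are hypothetical, and the threshold value is a placeholder, not a BMC recommendation.

```shell
#!/bin/sh
# Hypothetical helper (not part of tw_support_tool): succeeds only if the
# filesystem holding the given path has at least MIN_KB kibibytes available.
has_free_space() {
    target="$1"
    min_kb="$2"
    # df -P prints one POSIX-format line per filesystem; with -k the 4th
    # column is the available space in KiB.
    avail_kb=$(df -Pk "$target" 2>/dev/null | awk 'NR==2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}

# Assumed placeholder threshold (5 GiB); adjust to the size of the datastore.
MIN_FREE_KB=${MIN_FREE_KB:-5242880}

if has_free_space /usr "$MIN_FREE_KB"; then
    echo "/usr: free space is above the chosen threshold"
else
    echo "/usr: low on free space; free some before running tw_support_tool" >&2
fi
```

The check only reports free space at the time it runs; it cannot predict how much space the tool will need.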
Procedure to check if the datastore of one member is corrupted:
1- (only if the TKU is older than Jul22) Install tw_support_tool
2- Open an ssh session with the Linux account "tideway" and execute the command below:
screen -S mySession
3- Execute the command below:
tw_support_tool --check-db-corruptions --restart-services
It will stop the services (which implies downtime for a non-FT cluster or a standalone appliance), check for db corruptions and restart the services afterwards.
Note:
- If tw_support_tool looks blocked/stuck, please refer to this article: Discovery: tw_support_tool looks stuck
- This procedure will collect customizations, IP addresses, hostnames and user names (PII).
Alternative method 1: Stop the services of the current member, then execute the command tw_support_tool --check-db-corruptions
Alternative method 2: If the user can't stop the services, execute the command tw_support_tool --check-db-corruptions. It will list the checks that can be done (if any) and will tell the user how to check more. It won't be able to check anything if the services are up and multi generational datastore is disabled.
4- Wait for the end of the step above. This can take a long time, sometimes more than 12 hours with a large datastore. The duration can't be predicted: it depends on many parameters, including the performance of the appliance, the datastore volume, the performance of the storage, the latest compaction, etc.
5- When BMC Customer Support asks for the execution of this procedure, attach the file /tmp/SendThisFileToBMCSupport-<hostname>.tgz to the corresponding support case.
6- [Optional] Review the section "db_verify" of /tmp/tw_support_tool.latest (sed -n "/db_verify/,/---/p" /tmp/tw_support_tool.latest). This file is zipped in /tmp/SendThisFileToBMCSupport-<hostname>.tgz
If no corruptions were found, the db_verify section of the file tw_support_tool.latest may look like this:
--------------------------- db_verify
[...]
1334 checkable db files
They were all checked
Unexpected messages (10 first ones):   <= unexpected messages do not mean "db corruption"
Warning ABC
ERROR 123
If a corruption was found, it may look like this:
--------------------------- db_verify
[...]
1 db corruptions detected
1 index file(s) corrupted
0 history file(s) corrupted
0 state file(s) corrupted
0 rels file(s) corrupted
Errors found (10 first ones):
db_verify: p0003_rInference_pidx: BDB0090 DB_VERIFY_BAD: Database verification failed
What if ...?
- some corruptions are reported
Review the articles below:
- the section "db_verify" contains the messages below
db_verify: BDB3018 unknown: unwritable page 51406 remaining in the cache after error 28 (error 28 is ENOSPC)
OR
db_verify: BDB0137 write: 0x246c2f0, 1024: No space left on device
db_verify: BDB3015 unknown: write failed for page 1428795
db_verify: BDB3018 unknown: unwritable page 1428795 remaining in the cache after error 28
This means that the /usr filesystem filled up during the execution of tw_support_tool. Back up the cores (if any) and log files if required to free some space. Then re-execute the procedure with more free disk space.
- tw_support_tool was interrupted by a timeout (12h by default)
The message will also explain how to increase the timeout.
- the user needs to interrupt the script
Use CTRL-C. If the typed characters are no longer displayed afterwards, type "stty echo" and press Enter. If db_verify processes are still running 1h later, reboot.
- the ssh session was disconnected while tw_support_tool was running
Open a new ssh session on the host and execute "screen -r mySession". If the session can't be found, this means that the tool has finished; the result is in /tmp/SendThisFileToBMCSupport-<hostname>.tgz
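The optional review in step 6 and the cases above can be sketched as a small classifier. It reuses the sed extraction from step 6 and matches only message strings quoted in this article; the function name classify_db_verify and the result labels are hypothetical, and the real report may contain other messages. Treat the result as a first look, not as a replacement for reading the section or for the analysis by BMC Customer Support.

```shell
#!/bin/sh
# Hypothetical helper: classify the db_verify section of a tw_support_tool
# report using only the message strings quoted in this article.
classify_db_verify() {
    report="$1"
    # Same extraction as the sed command given in step 6.
    section=$(sed -n '/db_verify/,/---/p' "$report" 2>/dev/null) || section=""
    if [ -z "$section" ]; then
        echo "NO_SECTION"        # file missing or no db_verify section
    elif printf '%s\n' "$section" | grep -Eq 'No space left on device|after error 28'; then
        echo "DISK_FULL"         # /usr filled up during the run
    elif printf '%s\n' "$section" | grep -Eq '(^|[^0-9])[1-9][0-9]* db corruptions detected'; then
        echo "CORRUPTION"        # a non-zero corruption count is reported
    elif printf '%s\n' "$section" | grep -q 'They were all checked'; then
        echo "ALL_CHECKED"       # all checkable files checked, no count reported
    else
        echo "INCONCLUSIVE"      # e.g. partial check or interrupted run
    fi
}

# Classify the latest report only when it exists on this host.
REPORT=/tmp/tw_support_tool.latest
if [ -f "$REPORT" ]; then
    classify_db_verify "$REPORT"
fi
```

The disk-full test comes first because a run that ran out of space cannot be trusted even if it printed other results.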
Please also refer to the following video: