Discovery: Best practices to monitor disk consumption and avoid disk full issues - INCLUDES VIDEO
Knowledge Article
Best practices to monitor the disk consumption and avoid disk full issues.
BMC Discovery
BMC Discovery + ITOM On Prem All Versions
What are the best practices to monitor the disk consumption and avoid disk full issues?
This use case is covered in this webinar:
As Discovery grows, it consumes more disk space, and over time the disks may become full. In many cases, disk consumption is not monitored and nothing is done until the appliance shuts down.
Disk usage can also be checked manually, either over SSH (df -h) or through the Discovery UI (Administration > Performance, Daily Disk Usage Statistics).
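The manual check can also be scripted. The following is a minimal sketch, not part of BMC Discovery, that reports how full a partition is using only the Python standard library. The helper names and the example path are assumptions; replace the path with the mount points of the datastore and log partitions. The figure may differ slightly from the Use% column of df -h, which accounts for reserved blocks differently.

```python
import shutil


def percent_full(total_bytes: int, free_bytes: int) -> float:
    """Return the used share of a partition as a percentage (0-100)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def partition_percent_full(path: str) -> float:
    """Measure how full the partition containing `path` is."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_full(usage.total, usage.free)


if __name__ == "__main__":
    # "/" is only an example; use the datastore and log mount points instead.
    print(f"/ is {partition_percent_full('/'):.1f}% full")
```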
The rest of this article applies only to versions older than 22.2:
If the datastore and transaction logs are in the same partition (or if their partitions are the same size), the recommended approach is the plan below (see "case 1" below to implement it):
When the datastore partition is 70%+ full: notify the Discovery admin with an email.
When any partition is 80%+ full: notify the Discovery admin with an email and stop the scans (except on a consolidator).
When any partition is 90%+ full: immediate shut down.
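The three tiers of this plan can be expressed as a small decision function. This is an illustrative sketch only: Discovery itself enforces these tiers through the baseline and tw_options settings, and the function name, the partition labels, and the returned strings are assumptions made for the example.

```python
from typing import Mapping


def case1_action(percent_full: Mapping[str, float],
                 datastore: str = "datastore",
                 is_consolidator: bool = False) -> str:
    """Map partition fill levels (0-100) to the action of the 70/80/90 plan."""
    worst = max(percent_full.values())
    if worst >= 90:
        return "shutdown"
    if worst >= 80:
        # Scans are not stopped on a consolidator.
        return "email" if is_consolidator else "email_and_stop_scans"
    if percent_full[datastore] >= 70:
        return "email"
    return "ok"


if __name__ == "__main__":
    print(case1_action({"datastore": 72.0, "logs": 40.0}))
    print(case1_action({"datastore": 50.0, "logs": 85.0}))
```

Note that the 70% tier looks only at the datastore partition, while the 80% and 90% tiers apply to any partition.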
If the datastore partition and the transaction logs partition are different sizes, the recommended approach is this plan (see "case 2" below to implement it):
When the datastore partition is 70%+ full: notify the Discovery admin with an email.
When the free disk space on any partition drops to a chosen threshold (in MB): notify the Discovery admin with an email and stop the scans (except on a consolidator).
When the free disk space on any partition drops to half of that threshold: immediate shut down.
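This second plan works on free space in MB rather than on percentages. The sketch below illustrates the logic under one assumption that the text does not confirm: a single alert threshold shared by all partitions. The function and parameter names are hypothetical.

```python
from typing import Mapping


def case2_action(free_mb: Mapping[str, float],
                 alert_free_mb: float,
                 datastore_percent_full: float,
                 is_consolidator: bool = False) -> str:
    """Map free space per partition (MB) to the action of the case 2 plan."""
    lowest = min(free_mb.values())
    if lowest <= alert_free_mb / 2:
        return "shutdown"
    if lowest <= alert_free_mb:
        # Scans are not stopped on a consolidator.
        return "email" if is_consolidator else "email_and_stop_scans"
    if datastore_percent_full >= 70:
        return "email"
    return "ok"


if __name__ == "__main__":
    # Hypothetical figures: 20 GB alert threshold, logs partition at 15 GB free.
    print(case2_action({"datastore": 90000, "logs": 15000}, 20000, 55.0))
```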
Case 1: the datastore and datastore transaction log partitions are the same size
1- take a backup before changing anything:
standalone (not a cluster): take a Discovery backup. It can be restored even if the services are down.
a cluster: stop all the VMs and take a VMware snapshot of each of them. If the services can't restart because of the problem above, only VMware snapshots can be restored (tw_backup needs the services to be up).
2- go to Administration > Appliance Baseline
3- verify that the severity for "Appliance Disk Space" is still CRITICAL (default value).
4- execute the 2 commands below to prepare the first threshold (DISK_FREE_REQUIRE_MB). It sets the required free disk space on the datastore partition. In a cluster, execute this step and the following ones only once, on a single member (any of them).
5- checkpoint: compare the size of the datastore partition (in megabytes) with the value in the generated command (in megabytes). Make sure that DISK_FREE_REQUIRE_MB is 10% of the size of the datastore partition like in the example above.
6- execute the command under the line "generated command". In the example above this command is "tw_options DISK_FREE_REQUIRE_MB=XXXX".
7- execute the 2 commands below to prepare the second threshold (DISK_FREE_REQUIRE_MB_LOGS). It sets the required free disk space on the log partition.
8- checkpoint: compare the size of the log partition (in megabytes) with the value in the generated command (in megabytes). Make sure that DISK_FREE_REQUIRE_MB_LOGS is 10% of the size of the log partition like in the example above.
9- execute the command under the line "generated command". In the example above this command is "tw_options DISK_FREE_REQUIRE_MB_LOGS=XXXX".
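The checkpoints in steps 5 and 8 amount to verifying that each threshold is 10% of its partition size in megabytes. The sketch below is a hypothetical cross-check, not the pair of commands referred to in steps 4 and 7: it only computes the expected value and formats the tw_options line shown in steps 6 and 9, without running anything on the appliance. The function names and the partition sizes are assumptions.

```python
def shutdown_threshold_mb(partition_size_mb: int, percent: int = 10) -> int:
    """Return `percent` of the partition size, rounded down to whole MB."""
    if partition_size_mb <= 0 or not 0 < percent < 100:
        raise ValueError("size must be positive and percent between 1 and 99")
    return partition_size_mb * percent // 100


def expected_command(option: str, partition_size_mb: int) -> str:
    """Format the tw_options line to compare against the generated command."""
    return f"tw_options {option}={shutdown_threshold_mb(partition_size_mb)}"


if __name__ == "__main__":
    # Hypothetical sizes: 200 GB datastore partition, 100 GB log partition.
    print(expected_command("DISK_FREE_REQUIRE_MB", 204800))
    print(expected_command("DISK_FREE_REQUIRE_MB_LOGS", 102400))
```

If the value printed here differs from the generated command, recheck the partition size before applying the setting.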
Step 3: Configure the actions
configure Discovery to send an email in case of MAJOR and CRITICAL problems. See this documentation.
scanners only: configure Discovery to "stop discovery" in case of CRITICAL issues. This keeps users from restarting the scans in Manage > Discovery (start all scans) before the problem is resolved. This is not recommended for consolidators, where it can lead to other issues.
What to do if the appliance goes down and/or can't restart? ("tw_service_control --start" fails with "insufficient disk space")
Read this article and if needed, open a Customer Support case.
What to do if the appliance is still running but:
discovery is stopped (Manage > Discovery)
and Administration > Appliance Baseline shows a critical error 'Appliance Disk Space'.
and the admin receives an email that contains "Appliance Disk Space: CRITICAL: This appliance has insufficient free disk space"
If the user tries to reboot or restart, the appliance won't restart. While the appliance is running, the user can use the UI to configure a new disk.
On a scanner, let it run until it has finished deleting what it has to delete (check the DDD removal statistics).
On a consolidator, stop all the scanners, then let the consolidator run until it has finished deleting what it has to delete.
Once the deletion process is done, stop the appliance and try to compact the datastore. If this is not enough to restart the appliance, read this article.
What to do if the appliance is still running, restart/reboot works but an email that contains "DataStore Size Warning: MAJOR" is sent?
Read this article to try to prevent the disk saturation from happening. Open a Customer Support case if needed.
Notes:
Once the appliance is stopped, it is not possible to change the values of DISK_FREE_REQUIRE_MB_LOGS and DISK_FREE_REQUIRE_MB until the appliance restarts with enough free disk space. This implies that the user may have to use tw_disk_utils to configure new disks, even if the appliance has up to 20% of free disk space.
DISK_FREE_REQUIRE_MB and DISK_FREE_REQUIRE_MB_LOGS match the "Free disk space shutdown threshold" documented here.
The threshold used by "Appliance Disk Space" matches the "Free disk space baseline threshold" documented in the link above.
By default, there are 40 baseline alerts using the severity MAJOR. With the steps above, an email will be sent if any of them fail. It could generate more emails than desired. The severity of some of them can be reduced if needed.
The logs and the datastore may be in the same partition. In this case, the values for DISK_FREE_REQUIRE_MB and DISK_FREE_REQUIRE_MB_LOGS will be the same.
"Appliance Disk Space" baseline check can only monitor two partitions (the ones that contain the logs and the datastore). It is not possible for now (12.1) to monitor other partitions.
It is possible to use lower thresholds, but this leaves less time to resolve the problem before the appliance shuts down. For example, DISK_FREE_REQUIRE_MB and DISK_FREE_REQUIRE_MB_LOGS could be set to 5% of the respective disk sizes. The appliance would then shut down when the disks are 95% full, and the critical alert would be sent when the disks are 90% full.
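The percentages in the last note follow from simple arithmetic. The sketch below assumes, based only on the two examples in this article (10% gives 80% and 90%, 5% gives 90% and 95%), that the critical alert fires when free space falls to twice the shutdown threshold. The function name is hypothetical.

```python
from typing import Tuple


def fill_levels(threshold_percent: int) -> Tuple[int, int]:
    """Return (critical alert % full, shutdown % full) for a shutdown
    threshold expressed as a percentage of the disk size.

    Assumption: the alert fires at twice the shutdown threshold.
    """
    if not 0 < threshold_percent < 50:
        raise ValueError("threshold_percent must be between 1 and 49")
    alert_full = 100 - 2 * threshold_percent
    shutdown_full = 100 - threshold_percent
    return alert_full, shutdown_full


if __name__ == "__main__":
    for pct in (10, 5):
        alert, shutdown = fill_levels(pct)
        print(f"threshold {pct}%: alert at {alert}% full, shutdown at {shutdown}% full")
```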
Case 2: the datastore and datastore transaction log partitions are different sizes