Discovery: Best practices to monitor disk consumption and avoid disk full issues

Knowledge Article

Article Number

000376381

Old Article Number

Article Type

FAQ/Procedural

Title

Discovery: Best practices to monitor disk consumption and avoid disk full issues - INCLUDES VIDEO

Summary

Best practices to monitor the disk consumption and avoid disk full issues.

Product

BMC Discovery

Component

BMC Discovery

Applies to

BMC Discovery + ITOM On Prem All Versions

Question

What are the best practices to monitor the disk consumption and avoid disk full issues?

Answer

This use case is covered in this webinar:

As Discovery grows, it consumes more disk space, and over time the disks may become full. In many cases, disk consumption is not monitored and nothing is done until the appliance shuts down.

To avoid that, the best practice is to:

1- Monitor the disk consumption.

If using 22.2+:

Use the documented disk monitoring options to send an email alert before the disks saturate.

The monitoring can also be done manually through an SSH session (df -h) or through the Discovery UI (Administration > Performance, Daily Disk Usage Statistics).
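For a quick manual check over SSH, the df output can be filtered for partitions at or above a chosen usage level. The sketch below is illustrative only: the function name report_full_partitions and the 70% value are assumptions (they are not part of Discovery), and it assumes mount points that contain no spaces.

```shell
# Print every mount point whose usage is at or above a threshold (percent).
# Reads `df -P` output on stdin, so it can also be fed canned data.
report_full_partitions() {
    local threshold="$1"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5                # "Capacity" column, e.g. "90%"
        sub(/%/, "", use)       # strip the percent sign
        if (use + 0 >= limit + 0) print $6 " is " use "% full"
    }'
}

# Example usage on an appliance (70 is an illustrative threshold):
# df -P | report_full_partitions 70
```

The -P option makes df print one line per filesystem in a fixed column layout, which keeps the column positions predictable.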

For versions older than 22.2 (only):

This article explains how to send regular reports about the disk consumption: How to generate an email report to monitor disk space on a Discovery appliance?

2- Automate actions in case of disk saturation.

If using 22.2+: Follow the documented disk monitoring options.

The rest of the article is for versions older than 22.2 (only):

If the datastore and its transaction logs are on the same partition (or on partitions of the same size), the best approach is the plan below (see "Case 1" below for the implementation):

- When the datastore partition is 70%+ full: notify the Discovery admin by email.
- When any partition is 80%+ full: notify the Discovery admin by email and stop the scans (except on a consolidator).
- When any partition is 90%+ full: shut down immediately.
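The plan above amounts to a three-level escalation based on the usage percentage. As a minimal sketch (the function name plan_action and the action labels are hypothetical; the function only prints which action the plan calls for and does not act on the appliance):

```shell
# Map a partition usage percentage to the action described in the plan.
# $1: usage in percent (integer)
# $2: partition kind, "datastore" or anything else (defaults to "datastore")
plan_action() {
    local used_pct="$1" kind="${2:-datastore}"
    if [ "$used_pct" -ge 90 ]; then
        echo "shutdown"                      # any partition at 90%+
    elif [ "$used_pct" -ge 80 ]; then
        echo "email and stop scans"          # any partition at 80%+
    elif [ "$used_pct" -ge 70 ] && [ "$kind" = "datastore" ]; then
        echo "email"                         # datastore partition only
    else
        echo "ok"
    fi
}
```

For instance, plan_action 75 datastore prints "email", while the same usage on a log partition prints "ok" because the 70% level only applies to the datastore partition.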

If the datastore partition and the transaction logs partition have different sizes, the best approach is this plan (see "Case 2" below for the implementation):

- When the datastore partition is 70%+ full: notify the Discovery admin by email.
- When the free disk space of any partition falls to a given threshold (in MB): notify the Discovery admin by email and stop the scans (except on a consolidator).
- When the free disk space of any partition falls to half of that threshold: shut down immediately.

Case 1: the size of both datastore and datastore transaction logs is the same

IMPORTANT: This procedure is only suitable for the specific case where both datastore data and datastore transaction logs are in the same partition (or the size of both partitions is the same). If used in other cases, the appliance may go down. Use this article if this happens: Discovery: How to restart the appliance if DISK_FREE_REQUIRE_MB was set to an unreasonable value?

Step 1: configure the "Size Warning" baseline

1- Execute the command below (in an SSH session opened as the tideway user)

tw_options DATASTORE_SOFT_LIMIT=$((`df -k /usr/tideway/var/tideway.db/data/datadir/ | tail -n 1 | sed 's/\s\+/ /g' | cut -d' ' -f2`*7/(1024*1024*10)))

2- go to Administration > Appliance Baseline and raise the severity of "Size Warning" to MAJOR
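The arithmetic in the tw_options command of this step takes the partition size in KB (second column of df -k) and yields 70% of it in whole GB. A minimal sketch of that calculation, where the function name soft_limit_gb is hypothetical and the 500 GB partition is only a sample value:

```shell
# 70% of a partition size given in KB, expressed in whole GB.
# Mirrors the expression size_kb * 7 / (1024 * 1024 * 10) used above.
soft_limit_gb() {
    local size_kb="$1"
    echo $(( size_kb * 7 / (1024 * 1024 * 10) ))
}

# Sample: a 500 GB partition is 500 * 1024 * 1024 = 524288000 KB.
soft_limit_gb 524288000   # prints 350
```

Shell arithmetic is integer-only, so the result is rounded down to a whole number.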

Step 2: configure the "Appliance Disk Space" baseline and the two "Free disk space" thresholds

Prepare a backup plan. Even if the cluster is fault tolerant, there is a risk of downtime if this procedure is applied incorrectly. For more information, see this article: Discovery: How to restart the appliance if DISK_FREE_REQUIRE_MB was set to an unreasonable value?

1- If the appliance is:

- standalone (not a cluster): take a Discovery backup. It can be restored even if the services are down.
- a cluster: stop all the VMs and take a VMware snapshot of each of them. If the services can't restart because of the problem above, only VMware snapshots can be restored (tw_backup needs the services to be up).

2- go to Administration > Appliance Baseline

3- verify that the severity for "Appliance Disk Space" is still CRITICAL (default value).

4- Execute the two commands below to prepare the first threshold (DISK_FREE_REQUIRE_MB), which applies to the free disk space of the datastore partition. In a cluster, execute this step and the following ones only once, on a single member (any of them).

df /usr/tideway/var/tideway.db/data/datadir/ --block-size=1024K

echo -e "generated command:\ntw_options DISK_FREE_REQUIRE_MB=$((`df /usr/tideway/var/tideway.db/data/datadir/ --block-size=1024K | tail -n 1 | sed 's/\s\+/ /g' | cut -d' ' -f2`/10))"

For example, if the db partition size is 500GB, the output of the commands should be:

$ df /usr/tideway/var/tideway.db/data/datadir/ --block-size=1024K
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda10 512000 100000 412000 19% /usr/tideway

$ echo -e "generated command:\ntw_options DISK_FREE_REQUIRE_MB=$((`df /usr/tideway/var/tideway.db/data/datadir/ --block-size=1024K | tail -n 1 | sed 's/\s\+/ /g' | cut -d' ' -f2`/10))"
generated command:
tw_options DISK_FREE_REQUIRE_MB=51200

5- Checkpoint: compare the size of the datastore partition (in megabytes) with the value in the generated command (also in megabytes). Make sure that DISK_FREE_REQUIRE_MB is 10% of the size of the datastore partition, as in the example above.

6- Execute the command shown under the line "generated command". In the example above, this command is "tw_options DISK_FREE_REQUIRE_MB=51200".

7- Execute the two commands below to prepare the second threshold (DISK_FREE_REQUIRE_MB_LOGS), which applies to the free disk space of the log partition.

df /usr/tideway/log --block-size=1024K

echo -e "generated command:\ntw_options DISK_FREE_REQUIRE_MB_LOGS=$((`df /usr/tideway/log --block-size=1024K | tail -n 1 | sed 's/\s\+/ /g' | cut -d' ' -f2`/10))"

For example, if the log partition size is 39GB (default), the output of the commands should be:

$ df /usr/tideway/log --block-size=1024K
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sda10 40005 2317 35600 7% /usr/tideway

$ echo -e "generated command:\ntw_options DISK_FREE_REQUIRE_MB_LOGS=$((`df /usr/tideway/log --block-size=1024K | tail -n 1 | sed 's/\s\+/ /g' | cut -d' ' -f2`/10))"
generated command:
tw_options DISK_FREE_REQUIRE_MB_LOGS=4000

8- Checkpoint: compare the size of the log partition (in megabytes) with the value in the generated command (also in megabytes). Make sure that DISK_FREE_REQUIRE_MB_LOGS is 10% of the size of the log partition, as in the example above.

9- Execute the command shown under the line "generated command". In the example above, this command is "tw_options DISK_FREE_REQUIRE_MB_LOGS=4000".
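The two checkpoints above can be cross-checked by recomputing 10% of the partition size directly from the df output. The sketch below is one possible way to do it, not part of the product: the function name ten_percent_of_size is hypothetical, and it assumes that the last line of the df output holds the partition of interest with the size in its second column.

```shell
# Read df output on stdin and print one tenth of the partition size
# (second column of the last line), rounded down like $(( size / 10 )).
ten_percent_of_size() {
    awk '{ size = $2 } END { print int(size / 10) }'
}

# Example usage on an appliance (paths taken from the steps above):
# df -P --block-size=1024K /usr/tideway/var/tideway.db/data/datadir/ | ten_percent_of_size
# df -P --block-size=1024K /usr/tideway/log | ten_percent_of_size
```

The printed number should match the value in the corresponding generated tw_options command; if it does not, the threshold should not be applied.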

Step 3: Configure the actions

- Configure Discovery to send an email in case of MAJOR and CRITICAL problems. See this documentation.
- Scanners only: configure Discovery to "stop discovery" in case of CRITICAL issues. This keeps users from restarting the scans in Manage > Discovery (start all scans) before the problem is resolved. Doing this on consolidators is not recommended, as it can lead to other issues.

What to do if the appliance goes down and/or can't restart? ("tw_service_control --start" fails with "insufficient disk space")

Read this article and if needed, open a Customer Support case.

What to do if the appliance is still running but:

- discovery is stopped (Manage > Discovery)
- and Administration > Appliance Baseline shows a critical error 'Appliance Disk Space'
- and the admin receives an email that contains "Appliance Disk Space: CRITICAL: This appliance has insufficient free disk space"

If the user tries to reboot or restart, the appliance won't restart. While the appliance is running, the user can use the UI to configure a new disk.

1. On a scanner, let it run until it finishes its pending deletions (check the DDD removal statistics).
2. On a consolidator, stop all the scanners, then let the consolidator run until it finishes its pending deletions.

Once the deletion process is done, stop the appliance and try to compact. If this is not enough to restart the appliance, read this article.

What to do if the appliance is still running, restart/reboot works but an email that contains "DataStore Size Warning: MAJOR" is sent?

Read this article to try to prevent the disk saturation from happening. Open a Customer Support case if needed.

Notes:

- Once the appliance is stopped, the values of DISK_FREE_REQUIRE_MB_LOGS and DISK_FREE_REQUIRE_MB cannot be changed until the appliance restarts with enough free disk space. This implies that the user may have to use tw_disk_utils to configure new disks, even if the appliance has up to 20% of free disk space.
- DISK_FREE_REQUIRE_MB and DISK_FREE_REQUIRE_MB_LOGS match the "Free disk space shutdown threshold" documented here.
- The threshold used by "Appliance Disk Space" matches the "Free disk space baseline threshold" documented in the link above.
- By default, there are 40 baseline alerts using the severity MAJOR. With the steps above, an email will be sent if any of them fails. This could generate more emails than desired. The severity of some of them can be reduced if needed.
- The logs and the datastore may be in the same partition. In this case, the values for DISK_FREE_REQUIRE_MB and DISK_FREE_REQUIRE_MB_LOGS will be the same.
- The "Appliance Disk Space" baseline check can only monitor two partitions (the ones that contain the logs and the datastore). It is not possible for now (12.1) to monitor other partitions.
- It is possible to use lower thresholds, but this leaves less time to resolve the problem before the appliance shuts down. For example, DISK_FREE_REQUIRE_MB and DISK_FREE_REQUIRE_MB_LOGS could be set to 5% of the respective disk sizes. The appliance would then shut down when the disks are 95% full, and the critical alert would be sent when the disks are 90% full.
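To see what a different percentage would mean in practice, the figures can be worked out before touching tw_options. The sketch below is a rough aid only: the function name threshold_summary is hypothetical, and it assumes that the critical alert fires at twice the shutdown threshold, which is inferred from the 10% and 5% examples in this article rather than taken from the product documentation.

```shell
# Given a partition size in MB and a shutdown threshold in percent,
# print the free-space value in MB and the resulting "percent full" levels.
# Assumption: the critical alert level is twice the shutdown threshold.
threshold_summary() {
    local size_mb="$1" pct="$2"
    local shutdown_mb=$(( size_mb * pct / 100 ))
    echo "shutdown_free_mb=${shutdown_mb} shutdown_at_pct_full=$(( 100 - pct )) alert_at_pct_full=$(( 100 - 2 * pct ))"
}

# Sample: a 512000 MB partition with a 5% threshold.
threshold_summary 512000 5
```

With the sample values, the shutdown threshold is 25600 MB of free space, which corresponds to the 95% and 90% levels mentioned in the last note.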

Case 2: the sizes of the datastore and datastore transaction logs are different

The best option is to upgrade to Discovery 22.2 or later and use the documented disk monitoring options.
