Notes/Reminders:
- In the ITOM context, the implementation of a DR plan/BCP requires the release of the RFE DRUD1-50337. As of now (March 2025), the release plan is not ready.
- To automate a Discovery backup, see this KA: Discovery: How to automate Discovery backups? (a minimal automation sketch is also shown after this list).
- General recommendations:
- After the initial design and a successful test, re-test DR/BC plans periodically.
- Store Discovery backups on a remote file system
- Use the backup options 'verify' and 'encrypt'
- In the article below, a Discovery "appliance" can be a clustered appliance or a standalone appliance. It can also be a scanner or a consolidator.
- Restoring a Discovery backup does not restore everything.
- Restoring a VMWare snapshot restores everything, including OS customizations and any configuration changes that were implemented without using a documented procedure. These snapshots are typically much larger than Discovery backups, so using them requires more storage.
- Automating Discovery backups is relatively simple and fast. Automating VMWare snapshots/restores requires changes outside the appliance and may require involving the IT/virtualization team. This is an additional complexity that may require more design/test effort.
- Taking a VMWare snapshot or a Discovery backup requires some downtime, as the services have to be stopped.
- Restoring a VMWare snapshot or a Discovery backup may result in some data loss. The amount of data lost depends on the age of the restored snapshot/backup and what has changed since it was taken.
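The sketch below is a minimal illustration of such an automation, not the supported procedure (which is in the KA referenced above). It assumes a cron-driven bash script running on the appliance; the backup directory, the remote target, and the exact tw_backup and tw_service_control arguments are assumptions that should be verified against your version.

    #!/bin/bash
    # Nightly backup sketch, intended to be run from cron during a maintenance
    # window. Paths, the remote target and command arguments are assumptions;
    # verify them against your appliance version and the KA referenced above.
    set -euo pipefail

    BACKUP_DIR=/usr/tideway/var/backups        # assumed local staging area
    REMOTE=backuphost:/srv/discovery-backups   # assumed remote file system

    # Services have to be stopped while the backup is taken (see the notes above).
    tw_service_control --stop

    # Take the backup; enable the 'verify' and 'encrypt' options if your version
    # exposes them here (assumption, check tw_backup --help).
    tw_backup "$BACKUP_DIR"

    tw_service_control --start

    # Store the backup on a remote file system, as recommended above.
    rsync -a "$BACKUP_DIR"/ "$REMOTE"/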
Case 1: Use a single cluster
This case does not cover the CMDB sync.
Option 1: Deploy a cluster with fault tolerance enabled and a minimum of 3 members. In case of disaster (1 of the members is corrupted/broken/lost), this option avoids data loss and downtime while you remove the broken member, reset it, and add it back into the cluster.
Option 2: Deploy a cluster with fault tolerance enabled and 2 members. In case of disaster (1 of the members is corrupted/broken/lost), this option avoids data loss. However, the cluster will be down until the broken member is removed, reset, and added back into the cluster.
Option 3: Take regular VMWare snapshots. In case of disaster, restore the latest snapshot.
Option 4: Take regular Discovery backups and restore them in case of disaster. This option only works if the appliance is healthy enough to restore a backup. If the disaster impacts files/folders that can't be fixed/replaced by the backup, deploy a fresh appliance and restore the backup on it. This implies more downtime.
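A minimal restore sketch for Option 4, assuming the backup has already been copied to the fresh or surviving appliance; the backup path and the tw_restore argument form are assumptions to verify against your version.

    #!/bin/bash
    # Restore sketch for Case 1 / Option 4. The backup location and the
    # tw_restore argument form are assumptions; check tw_restore --help.
    set -euo pipefail

    BACKUP=/usr/tideway/var/backups/discovery_backup   # assumed backup location

    tw_service_control --stop    # services must be down while restoring
    tw_restore "$BACKUP"         # argument form is an assumption
    tw_service_control --start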
Case 2: Use a pair of redundant Discovery appliances (clustered or standalone) scanning the same network at the same time.
This case does not cover the CMDB sync.
If there are enough resources to run two appliances, it is likely more reasonable to use Case 1 instead. If this case has to be implemented:
Option 1: In case of disaster impacting one appliance, use the other one until the broken appliance is repaired.
Case 3: Use a single Discovery appliance (clustered or standalone) that synchronizes a single CMDB dataset
Option 1: On a regular basis, take a backup or VMWare snapshot of both Discovery and CMDB at the same time. If either Discovery or CMDB is lost, stop both Discovery and CMDB, and restore both with the backups/snapshots that were taken at the same time.
Option 2: On a regular basis, export the Discovery root node keys. If the Discovery appliance is lost, reinstall it or reset its datastore using tw_model_wipe, as appropriate. Then import the root node keys and re-scan the estate. Once the scanning is complete, run a resync on the CMDB connection. This option does not cover disasters on the CMDB side; it only prevents Discovery from replacing existing CIs with new ones after the rescan/resync. (A command-line sketch of this sequence follows the note below.)
NOTE: If a tw_model_wipe is needed, first see KA 000140149: Best practices when running tw_model_wipe.
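The sketch below only illustrates the order of the recovery steps in Option 2 above; it is not a documented procedure. tw_model_wipe behaviour (prompts, options, required service state) varies by version, and the root node key import is shown as a hypothetical placeholder because the real step may be a UI or documented procedure.

    #!/bin/bash
    # Recovery sketch for Case 3 / Option 2, run on the reinstalled or wiped
    # appliance. Review KA 000140149 first: it describes the service state
    # tw_model_wipe expects, which varies by version.
    set -euo pipefail

    # Reset the datastore (interactive confirmation is expected).
    tw_model_wipe

    # Import the previously exported root node keys. "import_root_node_keys" is a
    # hypothetical placeholder, not a real command; follow the documented root
    # node key import procedure for your version.
    # import_root_node_keys /path/to/exported_root_node_keys

    # Then re-scan the estate and, once scanning is complete, run a resync on
    # the CMDB connection.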
Case 4: A pair of primary/secondary (or prod/non-prod) Discovery appliances (clustered or not) that synchronize to the same CMDB dataset
This case only makes sense if you have a reason not to use Case 1.
Option 1: Only the current primary appliance scans and synchronizes to the CMDB.
Perform these steps on a regular basis:
- On the primary appliance: disable the scans, stop the appliance, take a snapshot/backup, restart, and re-enable the scans (a sketch of this cycle is shown after this list).
- Restore the snapshot/backup into the secondary appliance
- If the secondary appliance is lost, reinstall it and restore the latest snapshot/backup.
- If the primary appliance is lost, even for a short period, swap the roles: the secondary appliance becomes the primary appliance and vice versa. This implies that the lost appliance must not rescan the network or synchronize to the CMDB any longer. If it is restarted later, it should only act as a secondary appliance (i.e. it should receive the backup from the primary appliance until the roles are swapped again). When a secondary appliance becomes primary, start the scans (the snapshot was taken while scans were stopped). After the scans are finished, resynchronize the CMDB.
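A minimal sketch of the regular cycle described in the list above, assuming Discovery backups (rather than VMWare snapshots) and rsync/ssh access from the primary to the secondary appliance; paths, the secondary hostname and command arguments are assumptions.

    #!/bin/bash
    # Regular-cycle sketch for Case 4 / Option 1, run on the current PRIMARY.
    # Paths, the secondary hostname and command arguments are assumptions.
    set -euo pipefail

    BACKUP_DIR=/usr/tideway/var/backups   # assumed staging area
    SECONDARY=secondary-appliance         # assumed hostname of the secondary

    # 1. Disable the scans first (UI step), so the backup is taken with scans stopped.
    # 2. Stop the services, take the backup, restart, then re-enable the scans (UI step).
    tw_service_control --stop
    tw_backup "$BACKUP_DIR"               # argument form is an assumption
    tw_service_control --start

    # 3. Ship the backup to the secondary appliance, where it is restored with
    #    tw_restore (see the restore sketch under Case 1 / Option 4).
    rsync -a "$BACKUP_DIR"/ "$SECONDARY":/usr/tideway/var/backups/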
PROS: Even after a disaster, it is still possible to access the data in Discovery and CMDB.
CONS:
- This option requires backups and/or snapshots (storage needed)
- Settings added to the primary appliance between the last restore on the secondary appliance and the following disaster will be lost. There is no deterministic way to totally avoid this risk, but its probability can be reduced by updating the secondary appliance with backups as frequently as possible. This also applies to the following three points.
- Some CIs could be deleted in the CMDB after the resync. This happens if, after the disaster, Discovery can no longer find the same data, for example because of settings that were lost between the last restore and the disaster.
- Some CIs could be replaced by new ones in the CMDB. This happens if new OSIs are found (or if identities/keys change) between the last update of the secondary appliance and the following disaster.
- Deleted CIs could be (transiently) added back. This may happen if nodes are deleted between the last update of the secondary appliance and the following disaster. In this case, the scans run after the disaster may not be enough to re-delete them before the resync; they will be re-deleted once the conditions to delete them are met again. To avoid this, do one of the following before the resync that follows a disaster:
- Rescan N times with N > "Device aging time limit" (see Administration > Model Maintenance) instead of once. This is the safest approach but implies a delay before the CMDB is resynchronized.
OR
- If it is not possible to wait that long, use a trade-off such as N/2 scans to reduce the risk.
OR
- Execute the query below, select all, Actions -> Destroy.
SEARCH Host, NetworkDevice, Printer, MFPart, SNMPManagedDevice, ManagementController WHERE age_count < 0
This may delete more nodes than desired, but it prevents this issue from happening. If the CIs are marked as deleted in the CMDB, they can be added back by the next scans.
Option 2: Full backups are not used.
Perform these steps on a regular basis:
- Prepare the secondary appliance to rescan the network and synchronize the CMDB in case of disaster (a sketch of these preparation steps is shown after this list). To do that:
- Export the credential vault from the primary appliance and import it into the secondary appliance (see the KA 000326345)
- Export the root node key from the primary appliance and import it into the secondary appliance
- Prepare the secondary appliance to scan the same ranges (manual operation) but do not start scans.
- Prepare the secondary appliance to synchronize the CMDB (manual operation) but do not start the sync.
- If the primary appliance is lost, even for a short period, swap the roles: the secondary appliance becomes the primary appliance and vice versa. This implies that the lost appliance must not rescan the network or synchronize to the CMDB anymore. If it is restarted later, it should only act as a secondary appliance (i.e. it should import the vault and root node keys exported by the primary appliance and prepare to rescan and sync the CMDB in case of disaster). When a secondary appliance becomes primary, start the scans. After the scans are finished, an incremental or complete resync will be required.
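The preparation steps above are manual or UI operations; the sketch below only shows how they could be wrapped into a regular job that ships the exports to the secondary appliance. The export commands are hypothetical placeholders: the real procedures are KA 000326345 for the credential vault and the documented root node key export for your version.

    #!/bin/bash
    # Preparation sketch for Case 4 / Option 2, run regularly on the PRIMARY.
    # export_credential_vault and export_root_node_key are HYPOTHETICAL
    # placeholders; use KA 000326345 and the product documentation instead.
    # The hostname and paths are assumptions.
    set -euo pipefail

    SECONDARY=secondary-appliance          # assumed hostname of the secondary
    EXPORT_DIR=/usr/tideway/var/dr_export  # assumed staging area

    mkdir -p "$EXPORT_DIR"
    # export_credential_vault "$EXPORT_DIR"/vault_export      # hypothetical
    # export_root_node_key    "$EXPORT_DIR"/root_node_key     # hypothetical

    # Copy the exports to the secondary appliance. Importing them there, and
    # preparing the scan ranges and the CMDB sync, remain manual steps.
    rsync -a "$EXPORT_DIR"/ "$SECONDARY":/usr/tideway/var/dr_export/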
PROS: Option 2 does not require backups or snapshots (less storage needed).
CONS:
- This option requires more manual operations than option 1.
- Settings added to the primary appliance between the last update of the secondary appliance and the following disaster will be lost. There is no deterministic way to totally avoid this risk, but its probability can be reduced by updating the secondary appliance as frequently as possible. This also applies to the following two points.
- Some CIs could be deleted in the CMDB after the resync. This happens if, after the disaster, Discovery can no longer find the same data, for example because of settings that were lost between the last update and the disaster.
- Some CIs could be replaced by new ones in the CMDB. This will happen if new OSIs are found (or if identities/keys change) between the root node key import in the secondary appliance and the following disaster.
Option 3: Use both Option 1 (on the days when a backup is possible) and Option 2 (when backups are not possible).