In Discovery SaaS or on-prem, scans are stuck
Cause 1: The status of the stuck scan is "on hold"
See this documentation page: The discovery run has not finished within its scheduled scan window. This scan is considered on hold. The run continues or restarts at the next scan window.
On prem: The command "tw_reasoningstatus --waiting" lists the "on hold" requests.
Solution: Extend the duration of the scan window.
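On prem, a minimal command-line check might look like the sketch below; it assumes shell access to the appliance as the tideway service account and only reuses the tw_reasoningstatus command mentioned above.

    # Sketch (assumption: run on the appliance as the tideway user)
    # List the waiting / "on hold" requests, then count them for a quick overview
    tw_reasoningstatus --waiting
    tw_reasoningstatus --waiting | wc -l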
See this documentation page: Discovering an endpoint requires additional discovery on another endpoint that is not currently in an open scan window.
For example: During scan A, a pattern decides to run a command on a remote ip. That ip is scanned by another scheduled scan B that starts later.
Consequence: Scan A will be blocked by that ip and will be unblocked when scan B scans it.
Solution: Click on "start all scans".
To confirm: Sometimes, the stuck scans are correlated with an outpost update. In this case, if some outposts stay "out of date" several days after the upgrade, this root cause is possible.
If the stuck scans are correlated with "many" DAs failing with ERROR/Timeout (peaks at 10-20% of the DAs), this root cause is very probable.
SEARCH FLAGS(no_segment) Host, NetworkDevice
Workaround:
1- Restart all outposts: in the UI of each outpost, go to Manage > Configuration
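As a sketch of the confirmation step above, the search can also be run from the appliance command line; tw_query and its exact invocation here are assumptions to adapt to your environment (it normally prompts for credentials).

    # Sketch: run the confirmation search from the appliance CLI (assumption: tw_query is available)
    tw_query "SEARCH FLAGS(no_segment) Host, NetworkDevice"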
To confirm: If there were no pattern changes in the last 24h, this root cause is not probable
To confirm: If the same symptoms as the root cause above are observed (a big persistence file) AND the size continues to grow until the disk is full, this root cause is probable.
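A simple way to confirm that the file keeps growing is to sample its size over time, as in the sketch below; the path is a placeholder, not the actual location of the persistence file.

    # Sketch: sample the size of the suspected persistence file every 5 minutes
    # PERSISTENCE_FILE is a placeholder - substitute the real file observed above
    PERSISTENCE_FILE=/path/to/persistence/file
    while true; do
        date
        du -sh "$PERSISTENCE_FILE"
        df -h "$(dirname "$PERSISTENCE_FILE")"   # also watch the free disk space
        sleep 300
    done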
To confirm: Monitor the scans (ideally for at least 30 minutes; see the sketch below). If some scans made some progress, this root cause is possible.
On prem: Measure the performance with this troubleshooting guide (see "How to measure the scan performance?"). If the scan rate is reasonably high, this root cause is possible.
Potential workaround: Increase the "Maximum concurrent discovery requests per engine" setting. It may not increase the performance in general (especially if the hardware is already overloaded), but it could help Discovery process more scans in parallel.
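One rough on-prem way to see whether the request queue is moving is to snapshot the waiting requests twice, about 30 minutes apart, and compare the two lists; this sketch only reuses the tw_reasoningstatus command mentioned in Cause 1, and the UI scan pages remain the primary way to monitor progress.

    # Sketch: compare the waiting requests now and 30 minutes later
    # If the two snapshots differ, the scans are making (slow) progress
    tw_reasoningstatus --waiting > /tmp/waiting_before.txt
    sleep 1800
    tw_reasoningstatus --waiting > /tmp/waiting_after.txt
    diff /tmp/waiting_before.txt /tmp/waiting_after.txt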
To confirm: Check the status of the outpost connection
Cause 9: A large number of sudo attempts failed.
To confirm: Review the session logs of an ip scan. If it contains an unreasonable number of sudo password prompts, this root cause is probable. For example: [sudo]
If the issue can no longer be reproduced after having configured Discovery not to use sudo, the root cause is confirmed.
Solution:
- Wait for this RFE to be accepted/planned/released
Workaround:
- Configure the scanned device to not ask for a password when using sudo (a sketch is provided after Cause 10 below)
Cause 10: Some ip scans are very long. This matches the article below:
Discovery: The scan duration for a single ip is unreasonably long (> 30 minutes)
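As a sketch of the Cause 9 check and workaround: the grep counts sudo password prompts in a saved session log (the log path is a placeholder), and the sudoers line is the standard way to let an account use sudo without a password prompt; the account name and command list are assumptions to adapt and restrict as tightly as possible.

    # Sketch: count sudo password prompts in a saved session log
    # SESSION_LOG is a placeholder - use the session log of the affected ip scan
    SESSION_LOG=/path/to/session.log
    grep -c '\[sudo\]' "$SESSION_LOG"

    # Workaround sketch on the scanned device: edit the sudoers file with "visudo"
    # "discouser" is a placeholder account name
    # discouser ALL=(ALL) NOPASSWD: ALL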
Cause 11: Discovery sends requests faster than AWS accepts. See the article below:
Discovery: Sometimes, AWS scans fail with TooManyRequestsException in SessionResult
Cause 12: Dynatrace data imports run too frequently.
To confirm:
search ImportRecord where type has substring "Dynatrace"
If the import runs more than once a day, this root cause is probable (a CLI variant of this check is sketched after the Solution below).
Note: Although not yet observed, this could possibly happen with other types of imports. Run "search ImportRecord" to see all types of imports.
Solution: Schedule the Dynatrace imports to run once a day only
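For the confirmation step of Cause 12, the search can also be run from the appliance command line and the listed import runs reviewed by hand; tw_query and its exact invocation here are assumptions to adapt to your environment.

    # Sketch: run the confirmation search from the appliance CLI (assumption: tw_query is available)
    tw_query "search ImportRecord where type has substring \"Dynatrace\""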
Cause 13: The appliance did not shut down while a file system ran out of disk space.
To confirm: If the logs contain "OSError: [Errno 28] No space left on device:", this cause is possible. If the error above was found in the reasoning logs, this cause is probable.
Solution: Restart the appliance after having resolved the file system saturation.
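A quick on-prem check might look like the sketch below; /usr/tideway/log is an assumed default log location (adjust if your installation differs), and the grep string is the error quoted above.

    # Sketch: check file system usage, then look for the error in the appliance logs
    # /usr/tideway/log is an assumed default log directory - adjust as needed
    df -h
    grep -R "No space left on device" /usr/tideway/log | tail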