Client Management: Troubleshooting: Why isn't my windows client connecting to its relay?

Knowledge Article

Article Number

000324481

Old Article Number

000121351

Article Type

Solutions to a Product Problem

Title

Client Management: Troubleshooting: Why isn't my windows client connecting to its relay?

Summary

There can be several reasons why an agent appears red in the console, or never shows up there, after you have installed it. This article covers the most common ones.

Product

BMC Client Management

Component

Client Management

Applies to

Any version of BMC Client Management (BCM)

Problem

One of my Windows clients doesn't connect to its parent. It appears red in the console.

I installed a new agent but I don't see it in the console yet.

Cause

Many!

Solution

Prerequisite: Enable additional/verbose logging for the Client Management agent on the device that is encountering the issue, then restart the BCM agent service.

- If you are running the 12.7 release build, there is a good chance that you are affected by defect DRZKZ-2101, "Agents stop processing incoming connections in some situations". To diagnose this, restart the agent service on the relay and see whether your client then goes green in the console. If it does, you have probably encountered this issue.
If 12.7 hotfix 1 has not yet been released by the time you read this document, set up an operational rule that regularly restarts the agent service on this device.
Then contact support, mention this defect and send the logs from the client and from its relay.
If 12.7 HF1 is out, upgrade as soon as possible. You might need to restart the service of your relays before starting the upgrade: if they are encountering the issue at that time, they will not get upgraded.

- Check that you have available licenses:

If not, you might want to delete some devices that are no longer in production, or contact the account manager in charge of your account to buy more licenses or to get a temporary license extension while you clean up or buy more.

- Check if the relay is started. Is it red in the console? You might want to restart its service first. A relay needs to be restarted regularly because it needs to vacuum/clean up its sqlite databases. If you don't, the sqlite files get bigger and bigger and end up slowing down the module or even getting corrupted. We recommend restarting a relay's service automatically once in a while (once every two weeks or once a month), depending on how many clients it has and on how often you send packages, patches, etc.

- Enable verbose logs on the relay, restart its agent service, then open its ../log/mtxagent.log and filter on the IP address of the client you are interested in and on the IP address of the master, to see whether the relay can communicate with both of them.

- IP ranges in the relay list must be the real subnet range: you must calculate it from the mask if it's not the standard mask. E.g. if your device has the IP 10.5.159.56/255.255.252.0, the subnet to add in the relay list is "10.5.156.0" and not "10.5.159.0".
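
The subnet calculation above can be checked with a few lines of Python. This is only a sketch using the standard `ipaddress` module; the function name `subnet_for` is made up for this example and is not part of BCM.

```python
import ipaddress


def subnet_for(ip: str, mask: str) -> str:
    """Return the network address of the subnet that contains `ip`."""
    # strict=False lets us pass a host address instead of a network address.
    network = ipaddress.ip_network(f"{ip}/{mask}", strict=False)
    return str(network.network_address)


# The example from this article: the third byte is 159 AND 252 = 156.
print(subnet_for("10.5.159.56", "255.255.252.0"))  # 10.5.156.0
```

The value printed is the one to enter in the relay list for that device.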

- Check whether your antivirus has an exception for our agent's install folder. If not, add one before trying anything else. Symantec and Sophos have been the most sensitive with our product lately.

- Check the client's logs in /client/log :

- start by filtering on "ERR " (for errors) and " W " (for warnings) to see if there's anything obvious in the client and in the relay logs. Do not forget the spaces inside the quotes, or the filter will match far too many lines.
- If you find the following lines, the agent is not able to send its identity to the relay because the module's sqlite is corrupted or locked:

- "2013/05/23 11:35:18 AsynchronousActions ERR SQL Error: database disk image is malformed". To solve this you'll have to:

- stop the agent
- delete everything in /client/data/AsynchronousActions/ except for the subfolder "sql" it contains
- restart the agent

- 2013/05/23 11:35:18 AsynchronousActions ERR SQL Error: database disk image is locked/corrupted
I'm unsure of the exact error message, but you might have to apply the same steps as for the previous error, or check whether there is an antivirus exception on the agent's folder.
This should also be checked on its relay (if it has one) and on its master.

- the port you have set in /client/config/HttpProtocolHandler.ini "Port=" is already used by another application, or wasn't released properly by Windows. If that is the case, you should find these lines in the logs shortly after the agent has started:

2013/05/23 11:35:18 AgentCore ERR Socket::AddListeningPort failed: cannot bind to port 1610
2013/05/23 11:35:18 HttpProtocolHandler ERR failed to bind virtual host 'HttpProtocolHandler': retrying in 7200 seconds
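
The log checks above (filter on errors and warnings, then look for the messages quoted in this article) can be sketched in Python as follows. This is an illustration only: the names `scan_log` and `KNOWN_SIGNATURES` and the hint texts are made up for the example, and the severity pattern is an assumption based on the log excerpts shown in this article.

```python
import re

# Severity column as seen in the excerpts of this article:
# "ERR" for errors, a lone "W" for warnings (assumption, not a full spec).
SEVERITY = re.compile(r"\s(ERR|W)\s")

# Messages quoted in this article, mapped to a short hint.
KNOWN_SIGNATURES = {
    "database disk image is malformed": "corrupted module sqlite",
    "cannot bind to port": "agent port already in use",
}


def scan_log(lines):
    """Yield (severity, hint, line) for each error or warning line."""
    for line in lines:
        match = SEVERITY.search(line)
        if not match:
            continue
        hint = next(
            (text for sig, text in KNOWN_SIGNATURES.items() if sig in line),
            None,
        )
        yield match.group(1), hint, line.rstrip("\n")


sample = [
    "2013/05/23 11:35:18 AgentCore ERR Socket::AddListeningPort failed: cannot bind to port 1610",
    "2013/08/22 17:20:31 Relay I Synchronized with relay 10.5.65.244:1610",
]
for severity, hint, line in scan_log(sample):
    print(severity, hint, line)
```

To run it on a real log, pass an open file instead of `sample`, for example the client's mtxagent.log.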

- You forgot to set a parent name, or your relay selection sequence is not correctly set in your rollout configuration:

- go in your agent's installation directory and edit /client/config/Relay.ini
- if there's nothing written in "Sequence=", check that "ParentName=" is set to the relay's hostname or IP address (or the master's, if the master is the device's relay) and that "ParentPort=" is not empty
- if "Sequence=" contains values, search your Relay.ini for each mechanism of that sequence and check that its settings are consistent:
- if "static" is set, check that "StaticParentName=" and "StaticParentPort=" are set correctly with the relay's name/IP and its port
- if "list" is set, check that "ListServerUrl=" is correctly set. You can compare with a device that works normally
- if "dhcp" is set, check that "DhcpExtendedOption=" is set with the correct DHCP option and that the option is set on the DHCP server. You might want to check, using Nagios or the command line, whether the option is actually available to devices.
- if "script" is set, check that "ScriptPath=" is correctly set: is it the correct path to the script? Is the script the same version as on your other devices? Also check in /client/log whether there's a chilli.log: if the script fails for some reason, there will probably be errors in this specific log.
- if "backup" is set, check that "BackupRelays=" is set like this: "Relay's_Hostname:Relay's_Port"
You will find more information on these modes in this document.
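
The Relay.ini checks above can be sketched in Python. This is a rough illustration under several assumptions that are not confirmed by this article: that Relay.ini holds plain "Key=Value" lines, that the mechanisms in "Sequence=" are separated by commas, and that lines starting with "[", ";" or "#" can be ignored. The helper names are made up for the example; the key names are the ones listed above.

```python
# Keys that must be filled in for each mechanism, as listed in this article.
REQUIRED_KEYS = {
    "static": ["StaticParentName", "StaticParentPort"],
    "list": ["ListServerUrl"],
    "dhcp": ["DhcpExtendedOption"],
    "script": ["ScriptPath"],
    "backup": ["BackupRelays"],
}


def read_keys(text):
    """Collect Key=Value pairs, skipping section headers and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line and not line.startswith(("[", ";", "#")):
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def missing_relay_settings(values):
    """Return the keys that should be set but are absent or empty."""
    # Assumption: mechanisms in Sequence= are comma separated.
    raw = values.get("Sequence", "")
    sequence = [m.strip().lower() for m in raw.split(",") if m.strip()]
    if not sequence:
        needed = ["ParentName", "ParentPort"]
    else:
        needed = [k for mode in sequence for k in REQUIRED_KEYS.get(mode, [])]
    return [key for key in needed if not values.get(key)]


# Hypothetical file content, for illustration only.
example = "Sequence=static,backup\nStaticParentName=relay01\nStaticParentPort=\nBackupRelays=relay02:1610\n"
print(missing_relay_settings(read_keys(example)))  # ['StaticParentPort']
```

An empty result only means that the expected fields are filled in; it does not prove that the values themselves are right.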

- check the client's logs to see if it:

- synchronizes with its relay:

2013/08/22 17:20:31 Relay I Synchronized with relay 10.5.65.244:1610 (self_ip=10.5.159.243, relay_guid=0001343EC363E507691E73398CEC07CD26FB, relay_tunnel=1611)

- enters the mechanisms you have set in your Relay.ini, and whether they work. As an example, these logs show that the previous relay went down, that the relay module then enters the backup mechanism, and that it manages to synchronize with the new relay:

2013/08/22 17:43:18 Relay                         W   Failed to verify the supplied relay (10.5.65.244:1610)
2013/08/22 17:43:18 Relay                         T   Entering backup mechanism
2013/08/22 17:43:18 Relay                         T   Processing <action.RelaySetValues>
2013/08/22 17:43:19 AgentActionDB                 I   Invoke action RelayCheckClient on remote host http://Numara FootPrints Asset Core Agent:****@***.no-ip.com:1610
2013/08/22 17:43:20 Relay                         I   Synchronized with relay lerch.no-ip.com:1610 (self_ip=198.147.192.8, relay_guid=00014BE9AB357DA842E62CDA291B03B3A8DE, relay_tunnel=1611)
2013/08/22 17:43:20 Relay                         T   Processing <event.agent.runtime.parent.updated>

- obtains the new relay through the newly selected mechanism. These logs show that the parent name could not be obtained through the DHCP option. This could come from a bug that was fixed by a cumulative hotfix in 11.1 and 11.5:

2013/03/19 11:42:16 Relay T Entering DHCP mechanism
2013/03/19 11:42:46 Relay W Failed to receive DHCP response (timeout)
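
To find out quickly which relay a client last synchronized with, the "Synchronized with relay" lines shown above can be parsed. This is a minimal sketch: the pattern is derived only from the two excerpts in this article, so other agent versions may log the line differently, and the name `last_sync` is made up for the example.

```python
import re

# Pattern built from the "Synchronized with relay" excerpts in this article.
SYNC = re.compile(
    r"Synchronized with relay (?P<host>\S+):(?P<port>\d+) "
    r"\(self_ip=(?P<self_ip>[\d.]+), relay_guid=(?P<guid>\w+), "
    r"relay_tunnel=(?P<tunnel>\d+)\)"
)


def last_sync(lines):
    """Return the fields of the last synchronization line, or None."""
    found = None
    for line in lines:
        match = SYNC.search(line)
        if match:
            found = match.groupdict()
    return found
```

If `last_sync` returns None for a client log, the client has not synchronized with any relay in the period covered by that log.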

- Don't forget to check the firewall configuration, ping, telnet on the agent ports and DNS resolution:

- the agent sets exceptions in the Windows firewall the first time it starts: are they set for each type of network? Have you tried to deactivate the device's firewall to be sure?
- can you ping, and especially telnet to, the agent ports of the relay from the device? Default agent ports are 1610 and 1611. A device MUST be able to telnet to its relay on these ports, but the reverse is not mandatory, thanks to tunnels. If you cannot, it is probably:
- a firewall issue, if the device can bind to that port (refer to "AgentCore ERR Socket::AddListeningPort failed: cannot bind to port 1610" above)
- an issue on the relay, if you can ping it but not telnet to it: check the relay logs (e.g. "queue full" messages, which mean the relay/master cannot queue up any additional network connections)
- can you resolve the relay's name from the client? Try to set the parent's IP address instead of the hostname in the client's Relay.ini
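
If telnet is not installed on the device, the same name resolution and port tests can be done with Python's standard `socket` module. This is a sketch only; the function names are made up for the example, and a successful TCP connection shows that the port is reachable, not that the agent behind it is healthy.

```python
import socket


def can_resolve(hostname):
    """Return True if the local resolver can turn `hostname` into an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable names.
        return False
```

Run from the client, `can_resolve` with the relay's hostname followed by `can_connect` on ports 1610 and 1611 (the default agent ports mentioned above) reproduces the manual checks.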

- The client you have deployed might have the same GUID (Globally Unique ID) as others:

- In 11.6, GUIDs were not correctly set if you had selected multiple options in the system variables to generate them. This was fixed in the first cumulative hotfix
- Are they included in an OS Deployment wim or similar? If so, the agent must not have been started before you captured the device, or all of your devices will have the same GUID (Globally Unique ID). Starting from 12.x (it might have been in 11.7 as well, but I'm unsure) the script normally manages cleaning the identity file before capturing a wim.
If you have devices in one of the two previous situations, you probably want to copy the value of the "GUID=" field from the ../config/Identity.ini of one of these devices into the GUID blacklist in Global Settings > System variables > Connection Management of the console. As a result, devices that then upload their identity with this GUID will be ordered to recalculate their GUID, which should solve the issue right away.
- You have devices with the same names on your network. In this situation you might need to change the GUID scheme in Global Settings > System variables > Connection Management of the console so that it takes into account criteria other than the hostname alone to generate it. You will also need to allow duplicate device names in the system variables.
Note: this will regenerate the GUIDs of all your devices, so you might have some unreachable devices for a while, and there's still a limited risk that the process doesn't work for some devices.
- Do you deploy the agent before having set a specific device name on it? The GUID is not recalculated when you update the name of a device, so all devices on which the service started for the first time while they had the same name will get the same GUID. If a device is renamed only after the agent was started, you'll need to either:

A- delete the device from the console, then wait for it to send a new identity update, or force an identity update by:

- stopping the agent service on the device
- deleting the value of the field "lastidentitysent=" in its ../config/Identity.ini
- restarting the agent service

B- reinitialize its ../config/Identity.ini and update the db:

- stop the device agent service
- delete the value of the field "GUID Scheme="
- delete the value of the field "GUID="
- restart the service
- wait a bit for the device to regenerate a GUID

C- open its ../config/Identity.ini and copy the value of its "GUID=" field to update the "GloballyUniqueID" column of this device in the table "Devices" of your DB:

UPDATE DEVICES SET GloballyUniqueID='_DEVICE_GUID_' WHERE DeviceName='_DEVICE_NAME_';
Where "_DEVICE_GUID_" will be the value of the "GUID=" field you just copied and "_DEVICE_NAME_" will be the name of the device to update

- Check its relay logs (the master is also a relay and the device might be a direct child of the master) for the same things as in the client's logs, to make sure the relay can connect to its own relay or to the master. Also simply assign a basic operational rule (I usually use the step "Wait" and set it to "3") to the relay, to see if it's executed. Sometimes the icon will appear red but the agent actually connects to it.

- Edit your client and your parent's /client/config/mtxagent.ini (or /client/etc/mtxagent.ini if the parent is a linux device) to see if "PAC=" and "SSL=" are set to the same values on each of your devices. If it's not the case you'll have to:

- edit the client's mtxagent.ini to set PAC= and SSL= as in the parent mtxagent.ini, then restart the client service
- edit your rollout configuration to set it so the next clients you'll install will be installed correctly:

"PAC=" corresponds to "Access Control", "SSL=" to "Secure communication".

Note:
- this document explains why it's not always a problem that the PAC and SSL settings are not the same on all devices
- if PAC=2 is set on your devices, then GMT date and time must be synchronized on all of them; communication won't be possible if they aren't!
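
The "PAC=" and "SSL=" comparison described above can be sketched in Python. It assumes that mtxagent.ini contains plain "Key=Value" lines and that each of these two keys appears only once in the file; neither point is confirmed by this article, and the function names are made up for the example.

```python
def read_keys(text):
    """Collect Key=Value pairs, skipping section headers and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" in line and not line.startswith(("[", ";", "#")):
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def mismatched_settings(client_text, parent_text, keys=("PAC", "SSL")):
    """Return {key: (client_value, parent_value)} for keys that differ."""
    client = read_keys(client_text)
    parent = read_keys(parent_text)
    return {
        key: (client.get(key), parent.get(key))
        for key in keys
        if client.get(key) != parent.get(key)
    }


# Illustrative values only, not taken from a real configuration.
print(mismatched_settings("PAC=2\nSSL=1\n", "PAC=2\nSSL=3\n"))
# {'SSL': ('1', '3')}
```

Pass the content of the client's and the parent's mtxagent.ini as the two arguments; an empty result means both keys match.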

- Check the size of the ../data/asynchronousactions/asynchronousactions.sqlite on its relay (if it has one) and on the master. In some versions this sqlite tended to grow because the module was not able to process identities anymore. This has been fixed for a while now.

- Check the master logs to see whether there are errors from the module "Vision64database" stating that the master cannot write to the database.

- Check that there is no jam in the table "Workqueue" of your master's database. If you do not remember where the database is stored, you can find some information in the master's ../config/Vision64database.ini. As a rule of thumb, if there are more than 50 entries in that table you probably have a problem and should call support for a deeper investigation.
