Agent Health Problem. The agent did not send events during the last 35 minutes. No additional information is available.

Hi,

From time to time we get the following message in our OBM (2023.05 LNX):

Agent Health Problem. The agent did not send events during the last 35 minutes. No additional information is available. 

It is a different agent each time; so far I don't see a pattern.

When we check the agents that supposedly stopped sending events, everything looks OK.

2024-09-26 05:48:31,569 [Thread-58] INFO AgentHeartbeatImpl.submitEvent(83) - sent event: db2f1e15-a2fa-445b-aa9b-50f618b525f3 for xxx|yyy|e30ee484-acf4-75e3-04d7-b7f0331e3b0d; severity CRITICAL; Agent Health Problem. The agent did not send events during the last 35 minutes. No additional information is available.
2024-09-26 05:49:31,218 [Thread-58] INFO AgentHeartbeatImpl.submitEvent(83) - sent event: c6ff3453-a22a-443b-8bed-f1053778eb26 for xxx|yyy|e30ee484-acf4-75e3-04d7-b7f0331e3b0d; severity NORMAL; Agent Health Ok.

I was wondering whether it is possible to generate more logging/debug output. Can the heartbeat mechanism be made more verbose, e.g. write to the log the first time a heartbeat from the agent is missed?

I could take that information to our network team and start a trace of what is actually sent over the LAN. It would already help to pinpoint whether the issue is in OBM or on our LAN.

Any help is much appreciated.

Reply

    I have seen a rare case where bbcutil was working fine but random agents were still buffering towards OBM at certain intervals. We did the checks below:

    1) We checked the heartbeat log and found that there was a pattern to when it happened. For example: Agent 1 buffers today at 10:00 AM, and the same agent buffers again tomorrow around the same time.

    2) We also noticed that this mainly happens when the agent sends its ASSD information, which it does every 24 hours, and that somehow gets blocked by the firewall. You can verify this by analyzing the heartbeat log: take a few servers and check when they reported the health alarm again the next day. For us it was at almost the same time every day (unless the agent was restarted in between). This was one of the reasons why it happened for different agents at different times.

    3) Later we did some network packet analysis and found that packets were being dropped at the network level on the OBM side.

    Finally, the issue was fixed by bypassing the OBM IPs at the network firewall level.

    It took a very long time to figure out the issue because these drops are hard to detect.
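
    The log check from steps 1 and 2 can be sketched as a small script. This is only a rough sketch: it assumes the heartbeat log lines look exactly like the two "sent event" lines quoted in the question, the function names are made up for illustration, and the 15-minute tolerance is an arbitrary starting value. It lists agents whose CRITICAL health events recur at roughly the same time of day on different days.

```python
import re
import sys
from collections import defaultdict
from datetime import datetime

# Assumed line format, taken from the two log lines quoted in the question:
# <date> <time>,<ms> [thread] INFO ... sent event: <id> for <agent>; severity <SEV>; <text>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*?"
    r"sent event: \S+ for (?P<agent>[^;]+); severity (?P<sev>\w+);"
)


def parse_heartbeat_events(lines):
    """Yield (timestamp, agent, severity) for every matching 'sent event' line."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            yield ts, match.group("agent"), match.group("sev")


def critical_times_by_agent(lines):
    """Map each agent to the sorted timestamps of its CRITICAL events."""
    times = defaultdict(list)
    for ts, agent, sev in parse_heartbeat_events(lines):
        if sev == "CRITICAL":
            times[agent].append(ts)
    return {agent: sorted(stamps) for agent, stamps in times.items()}


def minute_of_day(ts):
    """Minutes since midnight for a timestamp."""
    return ts.hour * 60 + ts.minute


def recurring_agents(lines, tolerance_minutes=15):
    """Return agents with consecutive CRITICAL events on different days
    that fall within tolerance_minutes of the same time of day."""
    recurring = {}
    for agent, stamps in critical_times_by_agent(lines).items():
        for earlier, later in zip(stamps, stamps[1:]):
            if earlier.date() == later.date():
                continue  # same-day repeats say nothing about a daily pattern
            diff = abs(minute_of_day(later) - minute_of_day(earlier))
            diff = min(diff, 1440 - diff)  # handle wrap-around at midnight
            if diff <= tolerance_minutes:
                recurring.setdefault(agent, []).append((earlier, later))
    return recurring


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python heartbeat_pattern.py <path to a copy of the heartbeat log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for agent, pairs in recurring_agents(handle).items():
            for earlier, later in pairs:
                print(f"{agent}: {earlier} and {later}")
```

    If the same agents show up day after day at about the same time, that points towards something periodic (such as the 24-hour transfer from step 2) rather than random agent failures, and gives the network team a concrete time window to trace.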

    Hope this helps.
