Agent Health Problem. The agent did not send events during the last 35 minutes. No additional information is available.

Hi,

From time to time we see the following message in our OBM (2023.05 on Linux):

Agent Health Problem. The agent did not send events during the last 35 minutes. No additional information is available. 

It is a different agent each time, and so far I don't see a pattern.

When we check the agents that supposedly stopped sending events, everything is OK.

2024-09-26 05:48:31,569 [Thread-58] INFO AgentHeartbeatImpl.submitEvent(83) - sent event: db2f1e15-a2fa-445b-aa9b-50f618b525f3 for xxx|yyy|e30ee484-acf4-75e3-04d7-b7f0331e3b0d; severity CRITICAL; Agent Health Problem. The agent did not send events during the last 35 minutes. No additional information is available.
2024-09-26 05:49:31,218 [Thread-58] INFO AgentHeartbeatImpl.submitEvent(83) - sent event: c6ff3453-a22a-443b-8bed-f1053778eb26 for xxx|yyy|e30ee484-acf4-75e3-04d7-b7f0331e3b0d; severity NORMAL; Agent Health Ok.

I was wondering whether more logging/debug output can be generated. Could the heartbeat mechanism be made more verbose, e.g. write to the log the first time a heartbeat from the agent is missed?

I could take this information to our network team and start a trace of what is actually sent over the LAN, just to be able to pinpoint whether the issue is in OBM or on our LAN.

Any help is much appreciated.
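As a starting point for that kind of pinpointing, here is a minimal Python sketch that pairs each CRITICAL heartbeat event with the next NORMAL event for the same node and reports how long the problem lasted. The regular expression is modelled only on the two log lines quoted above, and the function names are made up for illustration. Other OBM versions may format these lines differently.

```python
import re
from datetime import datetime

# Pattern modelled on the two heartbeat log lines quoted in this post (an assumption).
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*?"
    r"sent event: \S+ for (?P<node>\S+); severity (?P<sev>[A-Z]+);"
)


def parse_line(line):
    """Return (timestamp, node, severity), or None if the line does not match."""
    m = LINE_RE.search(line)
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    return ts, m.group("node"), m.group("sev")


def recovery_times(lines):
    """Pair each CRITICAL with the next NORMAL for the same node.

    Returns a list of (node, critical_time, seconds_until_normal).
    Milliseconds are ignored, so durations are whole seconds.
    """
    open_problems = {}
    result = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, node, sev = parsed
        if sev == "CRITICAL":
            # Keep the first CRITICAL if several arrive before a NORMAL.
            open_problems.setdefault(node, ts)
        elif sev == "NORMAL" and node in open_problems:
            start = open_problems.pop(node)
            result.append((node, start, (ts - start).total_seconds()))
    return result
```

Feeding it the heartbeat log (for example `recovery_times(open(path))`) gives one entry per health problem. A recovery within a minute or so, as in the two lines above, would point to a short interruption rather than an agent that is really down.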

  • 0  

    Please check in Monitored Nodes which type of Agent Health is configured for that agent. It can be "agent", "agent + server" or "none".
    Based on your timings I assume it is "agent + server", which means that if the agent didn't send an event for 30 minutes, the server tries to reach the agent. If the server gets no response from the agent within 5 minutes, you get that Agent Health event.

    So, based on your timings, I would suggest checking why the server couldn't reach the agent.
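One way to collect evidence for the network team is to probe the agent from the OBM server at a short interval and log every failure with a timestamp. The Python sketch below only performs a plain TCP connect. It assumes the agent's communication broker listens on the default port 383, and the function names are invented for illustration. It is a much weaker check than a real `bbcutil -ping`, so treat it as a rough indicator only.

```python
import socket
import time
from datetime import datetime


def probe(host, port=383, timeout=5.0):
    """Try one TCP connect; return (ok, detail). Port 383 is an assumed default."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"connected in {time.monotonic() - started:.3f}s"
    except OSError as exc:
        # Covers timeouts, refused connections and name resolution errors.
        return False, f"{type(exc).__name__}: {exc}"


def watch(host, port=383, interval=60, rounds=None):
    """Probe repeatedly and print a timestamped line for every failed attempt.

    rounds=None means run until interrupted.
    """
    count = 0
    while rounds is None or count < rounds:
        ok, detail = probe(host, port)
        if not ok:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} FAILED {detail}")
        count += 1
        if rounds is None or count < rounds:
            time.sleep(interval)
```

Running `watch("agent.example.com")` (a placeholder host name) on the server side produces timestamps that can be lined up with a packet trace.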

  • 0 in reply to   

    My plan is to add an automatic action or Groovy script that triggers when the message appears, but by then it may already be too late. So if there is a way to get a message earlier, that would be great.

  • 0 in reply to 

    I also notice that on our agents the following value is set: OPC_HB_MSG_INTERVAL=1800

    I wonder whether this value should be decreased to 300. I assume that each time the server gets an agent HB, the 30-minute interval is reset.
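To make the timing concrete: going by the explanation earlier in the thread, the 35 minutes in the message would be 30 minutes of silence plus a 5-minute server check. The small Python sketch below works backwards from the alert timestamp to the approximate time of the last contact, which is the window a network trace would need to cover. Treating 1800 seconds as the silence window is my assumption here, and the function name is made up.

```python
from datetime import datetime, timedelta


def last_contact_estimate(alert_time, silence_seconds=1800, server_check_seconds=300):
    """Estimate when the agent was last heard from, given the alert time.

    Both defaults are assumptions taken from this thread:
    30 minutes without events, then a 5-minute server-side check.
    """
    return alert_time - timedelta(seconds=silence_seconds + server_check_seconds)


alert = datetime(2024, 9, 26, 5, 48, 31)
print(last_contact_estimate(alert))  # 2024-09-26 05:13:31
```

With a value of 300 instead of 1800, the same calculation gives a 10-minute window, provided the server-side check stays at 5 minutes.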

  • 0 in reply to 

    Hello,

    Do you have any update on your issue? I have the same issue on my infrastructure.

  • 0   in reply to 

    Hello Rick,

    Firstly, you can enable some extra logging - these are the log4j files you’ll need to change:

    %TOPAZ_HOME%\conf\core\Tools\log4j\wde\opr-heartbeat.properties

    %TOPAZ_HOME%\conf\core\Tools\log4j\opr-backend\opr-heartbeat.properties

    The log files are described in the log4j files.  Changing OPC_HB_MSG_INTERVAL depends on a few factors, such as how important the node is.  Typically, I'd not change it from the default, but it's a long story why, and it depends on how many managed nodes you have, etc.

    Anyway, if you need a Groovy script, maybe this will help.  If nothing else, it's a good starting point for you to adapt. This script is driven by an “Event storm detected” filter, which matches the message generated when an event storm condition is met.  The filter directs the event to the Groovy script. It’s set for “before storing events”.

     

    The idea was to stop the agent if an event storm originated on that node. I think you have a similar use case. However, please note that this Groovy script takes 2990 ms to run (see the log below). That’s how long it takes for ovdeploy to stop opcmsga on the problem node. The script will be stopped after 3 seconds; in that case opcmsga will not be stopped on the problem node, but the event will still arrive in the console. Realistically, it would be better to use OBM’s action REST API, as the script gets the 200 response and can then continue. But if this script only runs once or twice a year, it’s not going to be a big deal.

    import java.util.List
    import com.hp.opr.api.scripting.Event
    import org.apache.commons.logging.Log
    import org.apache.commons.logging.LogFactory
     
    class getNodeNameFromEventStorm {
        private static Log s_log = LogFactory.getLog(getNodeNameFromEventStorm.class)
        private String DEBUG_MODE = 'TRUE' 
     
        def init() {}
        def destroy() {}
     
        def process(List<Event> events) {
            events.each { event ->
                modifyEvent(event)
            }
        }
     
        def modifyEvent(Event event) {
            if (DEBUG_MODE.equals('TRUE')) {
                s_log.error("${this.class.name} Event arrived: ${event.getTitle()}")
            }
     
            def title = event.getTitle()
            def matcher = title =~ /for '([^']+)'\./
            if (matcher.find()) {
                def domain = matcher[0][1]
                //event.addAnnotation("domain", domain)
                event.addCustomAttribute("Event Storm", domain); 
                if (DEBUG_MODE.equals('TRUE')) {
                    s_log.error("${this.class.name} Domain added: ${domain}, Event title: ${title}")
                }
     
                // Define ovdeploy
                def command = [
                    "/opt/OV/bin/ovdeploy",
                    "-ovrg", "server",
                    "-cmd", "/opt/OV/bin/ovc",
                    "-par", "-stop opcmsga",
                    "-host", domain
                ]
     
                if (DEBUG_MODE == 'TRUE') { 
                    s_log.error("${this.class.name} Running command: ${command.join(' ')}")
                }
     
                // Start ovdeploy
                if (DEBUG_MODE == 'TRUE') { 
                    s_log.error("${this.class.name} running ovdeploy")
                }
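                ProcessBuilder processBuilder = new ProcessBuilder(command)
                processBuilder.redirectErrorStream(true)
                Process process = processBuilder.start()
                // Drain the output in the background so a full pipe buffer cannot block waitFor()
                process.consumeProcessOutput()
     
                int exitCode = process.waitFor()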
                
                if (DEBUG_MODE == 'TRUE') { 
                    s_log.error("${this.class.name} Finished ovdeploy with exit code: ${exitCode}")
                }
     
            } else {
                if (DEBUG_MODE.equals('TRUE')) {
                    // No match: indexing the matcher here would throw, so log only the title
                    s_log.error("${this.class.name} No domain found")
                    s_log.error("${this.class.name} Title: $title")
                }
            }
        }
    }

    DPS:log/opr-scripting-host/opr-scripting-host.log with logging enabled:

    2024-03-06 13:20:34,104 [Thread-23 (ActiveMQ-client-global-threads)] INFO  ScriptExecutorImpl.initAndCheckinScript(147) - Loading script: 'getNodeNameFromEventStorm v2' Script ID: '3da1ea1d-439d-4490-93fc-aeddfaa6af45'

    2024-03-06 13:20:53,626 [TaskExecutor-1] ERROR getNodeNameFromEventStorm.call(-1) - getNodeNameFromEventStorm Event arrived: Event storm detected for 'puegcsvm29192.swinfra.net'. Current incoming event rate: 11 events / 60 seconds.

    2024-03-06 13:20:53,626 [TaskExecutor-1] ERROR getNodeNameFromEventStorm.call(-1) - getNodeNameFromEventStorm Domain added: puegcsvm29192.swinfra.net, Event title: Event storm detected for 'puegcsvm29192.swinfra.net'. Current incoming event rate: 11 events / 60 seconds.

    2024-03-06 13:20:53,627 [TaskExecutor-1] ERROR getNodeNameFromEventStorm.call(-1) - getNodeNameFromEventStorm Running command: /opt/OV/bin/ovdeploy -ovrg server -cmd /opt/OV/bin/ovc -par -stop opcmsga -host puegcsvm29192.swinfra.net

    2024-03-06 13:20:53,627 [TaskExecutor-1] ERROR getNodeNameFromEventStorm.call(-1) - getNodeNameFromEventStorm running ovdeploy

    2024-03-06 13:20:56,600 [TaskExecutor-1] ERROR getNodeNameFromEventStorm.call(-1) - getNodeNameFromEventStorm Finished ovdeploy with exit code: 0

    2024-03-06 13:20:56,601 [RMI TCP Connection(201)-15.119.125.229] INFO  ScriptExecutionService.execute(178) - Script execution (getNodeNameFromEventStorm v2 - 3da1ea1d-439d-4490-93fc-aeddfaa6af45) took: 2990 ms.
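Since 2990 ms is very close to the 3-second limit mentioned above, it may be worth watching these "took" lines over time. The following Python sketch extracts the script name and duration from lines shaped like the last one above and flags runs that come close to an assumed 3000 ms limit. The pattern, the 90% threshold and the function name are all illustrative assumptions.

```python
import re

# Modelled on the "Script execution (...) took: N ms." line above (an assumption).
TOOK_RE = re.compile(
    r"Script execution \((?P<name>.+) - [0-9a-f-]{36}\) took: (?P<ms>\d+) ms"
)


def near_limit(lines, limit_ms=3000, fraction=0.9):
    """Return (script_name, duration_ms) for runs at or above fraction * limit_ms."""
    hits = []
    for line in lines:
        m = TOOK_RE.search(line)
        if m and int(m.group("ms")) >= limit_ms * fraction:
            hits.append((m.group("name"), int(m.group("ms"))))
    return hits
```

Applied to the opr-scripting-host.log, this lists the runs that risk being cut off before ovdeploy finishes.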

    If you want to change what the Groovy script does, change this:

                def command = [
                    "/opt/OV/bin/ovdeploy",
                    "-ovrg", "server",
                    "-cmd", "/opt/OV/bin/ovc",
                    "-par", "-stop opcmsga",
                    "-host", domain
                ]

    I hope this helps.

  • 0

    I have seen a rare case where bbcutil was working fine, but random agents were still buffering to OBM at certain intervals. We did the following checks:

    1) We checked the heartbeat log and found a pattern in when it happens. For example: agent 1 buffers today at 10:00 AM, and the same agent buffers tomorrow at around the same time.

    2) We also noticed that this mainly happens when the agent sends its ASSD information every 24 hours and this somehow gets blocked by the firewall. You can verify this by analyzing the heartbeat log: take a few servers and check when they reported the health alarms again the next day. For us it is at almost the same time every day (unless the agent is restarted in between). This was one of the reasons why it happened for different agents at different times.

    3) Later we did some network packet analysis and found that packets were being dropped at the network level on the OBM side.

    Finally, the issue was fixed by bypassing the OBM IPs at the network firewall level.

    It took a very long time to figure out the issue because these drops are hard to detect.

    Hope this helps.
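The "same time every day" check described above can be sketched in Python roughly as follows. The code scans heartbeat log lines for CRITICAL events and reports nodes whose alarms on different days fall within a few minutes of each other. The line pattern is modelled on the log excerpt in the original post, and the tolerance and function name are arbitrary illustrative choices.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern modelled on the CRITICAL line quoted in the original post (an assumption).
CRIT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3} .*?"
    r"sent event: \S+ for (?P<node>\S+); severity CRITICAL;"
)


def recurring_nodes(lines, tolerance_minutes=15):
    """Map node -> list of (day1, day2, minutes_apart) for alarms that
    recur at about the same time of day on different dates."""
    seen = defaultdict(list)  # node -> [(date, minute_of_day)]
    for line in lines:
        m = CRIT_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M")
        seen[m.group("node")].append((ts.date(), ts.hour * 60 + ts.minute))

    hits = {}
    for node, events in seen.items():
        for i, (d1, m1) in enumerate(events):
            for d2, m2 in events[i + 1:]:
                diff = abs(m1 - m2)
                diff = min(diff, 1440 - diff)  # handle wrap-around at midnight
                if d1 != d2 and diff <= tolerance_minutes:
                    hits.setdefault(node, []).append((d1, d2, diff))
    return hits
```

Nodes that show up in the result are candidates for a 24-hour cycle like the one described above. An agent restart shifts the cycle, so a node can drop out of the result without the problem being gone.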