Agent Health Problem: The agent did not send events during the last 35 minutes, however the agent is still working fine.

Hello, 

I have had an issue since we migrated our OpsBridge to external Kubernetes (version 24.1).

I'm receiving a lot of agent health events every day (roughly 50-100 events per day for around 800 nodes).

- The event is the "classic" agent health issue:

         Agent health problem
         The agent did not send events during the last 35 minutes
         No additional information is available

         Connectivity Status: Disconnected

- All agents are configured with the default health check settings (Agent & Server, heartbeat interval 30 minutes, heartbeat grace period 5 minutes), which matches the 35 minutes in the event text (30 + 5); see the quick check after this list.


- I have upgraded all my agents to version 12.25.006.
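In case it helps, this is the quick check I run on an agent to confirm which heartbeat settings it actually has. It is only a rough sketch: I grep the full configuration dump instead of guessing the namespace, and the variable names returned may differ between agent versions.

# Dump the complete local agent configuration and keep only the
# heartbeat/health related settings, whatever namespace they live in.
/opt/OV/bin/ovconfget | grep -iE 'heartbeat|health'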

What's weird is that we receive these alerts, yet we can still perform actions on those nodes (like executing tools), and on the agent all ovc services are up and running.

One of my agents currently has this event open:

ovc -status
agtrep      OV Discovery Agent           AGENT,AgtRep   (36182)    Running
hpsensor    Compute Sensor               AGENT,OA       (36334)    Running
oacore      Operations Agent Core        AGENT,OA       (36308)    Running
oahmon      Agent Health Monitor         AGENT,EA       (241346)   Running
ompolparm   OM Parameter Handler         AGENT,EA       (36156)    Running
opcacta     OVO Action Agent             AGENT,EA       (36264)    Running
opcgeni     Generic Source Interceptor   AGENT,EA       (36109)    Running
opcle       OVO Logfile Encapsulator     AGENT,EA       (36253)    Running
opcmona     OVO Monitor Agent            AGENT,EA       (36205)    Running
opcmsga     OVO Message Agent            AGENT,EA       (36281)    Running
opcmsgi     OVO Message Interceptor      AGENT,EA       (36195)    Running
ovbbccb     OV Communication Broker      CORE           (241052)   Running
ovcd        OV Control                   CORE           (241043)   Running
ovconfd     OV Config and Deploy         COREXT         (241080)   Running

I've found in system.txt some issues with the communication between the agent and OpsBridge (which I didn't have before with version 2022 on CDF):

0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (36281/139952380274496): [genmsga.c:9931]: Forwarding message/action response to OVO message receiver failed due to server failure : (bbc-422) HttpOutputRequestImpl::ReceiveResponse() caught OvXplNet::ConnectionRefusedException_t. <null>
There is no server process active for address: https://[MYSERVEROPSBRIDGE]:383/com.hp.ov.opc.msgr/rpc/. (OpC30-36)

0: INF: Thu Sep 19 07:08:42 2024: opcmsga (36281/139952380274496): [genmsga.c:7342]: Message Agent is not buffering. (OpC30-100)

From my agent I can run bbcutil -ping MYSERVEROPSBRIDGE and spam it; I always get an answer, so everything seems to be working fine.
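For example, this is roughly what I mean by spamming the ping (just a sketch; the hostname and log path are placeholders, and I simply timestamp every attempt so an intermittent refusal would show up in the log):

# Ping the OBM gateway every 30 seconds and timestamp each result.
while true; do
    echo "=== $(date '+%F %T') ==="
    /opt/OV/bin/bbcutil -ping MYSERVEROPSBRIDGE
    sleep 30
done >> /tmp/bbcping.log 2>&1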

I have an L1 support team executing tools all day long to restart the agents, after which the alerts disappear.

Can anyone help me with this? (I also have a case open with support, but to be honest I don't have any workaround so far, and this community seems more active for this kind of issue.)


Thanks :) ! 


    Hello there,

    I think there are two or three errors in Raphael's posting. If you can specify in a few words the issue you are having, I'd be happy to make some suggestions.

    If you mean:

    0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (36281/139952380274496): [genmsga.c:9931]: Forwarding message/action response to OVO message receiver failed due to server failure : (bbc-422) HttpOutputRequestImpl::ReceiveResponse() caught OvXplNet::ConnectionRefusedException_t. <null>
    There is no server process active for address: https://[MYSERVEROPSBRIDGE]:383/com.hp.ov.opc.msgr/rpc/. (OpC30-36)

    When WDE starts, it registers its URI /com.hp.ov.opc.msgr/rpc/ with ovbbccb, which acts as a proxy for inbound connections to OBM. When an agent connects to the gateway, the connection includes /com.hp.ov.opc.msgr/rpc/. ovbbccb checks whether it has /com.hp.ov.opc.msgr/rpc/ registered. If not, or if that connection is busy, an error message is returned and the agent then buffers events until the server can start accepting events again. This can be caused either by an issue with ovbbccb or by the event pipeline (on the DPS) being full of events.
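    As a quick sanity check on the gateway, you can list what is actually registered with the local communication broker and look for the message receiver URI. This is only a sketch, and the -reg option name is from memory, so please confirm it against the bbcutil usage text on your version:

    # On the OBM gateway: list services registered with ovbbccb and
    # check that the WDE message receiver URI is among them.
    /opt/OV/bin/bbcutil -reg | grep -i "com.hp.ov.opc.msgr"

    If the URI is not registered, agents will get the "no server process active" error even though ovbbccb itself still answers pings.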

    You can check the event pipeline by running:

    # /opt/HP/BSM/opr/support/opr-jmsUtil.sh |grep -e ^queue -e opr_gateway_queue |grep -v queue/alert
    queue | total | buffered | delivering | memory | pages | consumers
    opr_gateway_queue | 424927 | 23322 | 0 | 0 | 0 | 1

    It's hard to see here, but the "buffered" column shows 23322 events buffered. If you see something like this, it means the event pipeline has an issue, and you will need to contact OpenText Support for help.
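    If you want to watch that number without reading the table by eye, something like the following works. It is just a sketch that assumes the pipe-separated column order shown above (queue | total | buffered | ...):

    # Print only the "buffered" count for opr_gateway_queue; a value that
    # stays above zero means events are piling up in the pipeline.
    /opt/HP/BSM/opr/support/opr-jmsUtil.sh | grep opr_gateway_queue | grep -v queue/alert \
      | awk -F'|' '{gsub(/ /,"",$3); print "buffered events: " $3}'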

    If nothing is buffering, and assuming there isn't a firewall involved....

    - On the Gateway Server, please run: /opt/OV/bin/ovc -kill

    - Next, please run: /opt/HP/BSM/opr/support/opr-support-utils.sh -restart wde

    - Check that the wde JVM has restarted: /opt/HP/BSM/opr/support/opr-support-utils.sh -ls

    - Finally, please run: /opt/OV/bin/ovc -start

    - Check that the issue is resolved. (The same steps are collected into a single script below.)
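    For convenience, here are the same steps as one small script. It is only a sketch: it assumes the default install paths used above and that it is run as root on the Gateway Server.

    #!/bin/sh
    # Stop the Operations Agent processes on the Gateway Server.
    /opt/OV/bin/ovc -kill

    # Restart the wde JVM and list the processes to confirm it came back.
    /opt/HP/BSM/opr/support/opr-support-utils.sh -restart wde
    /opt/HP/BSM/opr/support/opr-support-utils.sh -ls

    # Start the Operations Agent processes again.
    /opt/OV/bin/ovc -start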

    An event forwarding issue was also mentioned:

    2024-09-19 12:06:46,335 [EventSyncThread:itom-opsbridge-des-svc:MYOPSBRIDGE_239824] ERROR EventSyncForward.logForward(527) - Event forward request has expired for node itom-opsbridge-des-svc. Deleting request from queue for event with ID: 0aab5590-761b-71ef-158d-0a3603440000

    This is where one OBM server cannot forward to another connected server.  It may be caused by the same issue above.

    Please try running:

    /opt/OV/bin/bbcutil -ping itom-opsbridge-des-svc/.../ -ovrg server

    If that doesn't work, please run through the above process on the target system and see if that helps. If not, please contact OpenText Support.
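    If you want to keep an eye on that forwarding path, the same ping can be run periodically and logged, so outages can be correlated with the "Event forward request has expired" errors. A sketch only; replace the target placeholder with the full URL from the bbcutil -ping line above:

    # Ping the connected server's registered services every 5 minutes and
    # timestamp the output, to spot windows where forwarding is refused.
    TARGET="itom-opsbridge-des-svc"   # placeholder: use your real connected-server URL
    while true; do
        echo "=== $(date '+%F %T') ==="
        /opt/OV/bin/bbcutil -ping "$TARGET" -ovrg server
        sleep 300
    done >> /tmp/obm-forward-ping.log 2>&1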

    Thanks.


    Hey, 

    Yep, the issue still persists. To be honest, I gave up a bit because I didn't get any answer from support; here only Duncan (thanks again) managed to give me a few hints, but due to lack of time I didn't go any further.

    We receive slightly fewer alerts, but still around 50-60 a day, and the events still only say "No additional information is available"...

    I have found some documentation about the agent health configuration and the "NO_INFO" status: Agent Config Health - Operations Bridge Manager.

    I have worked on that, and for 800 nodes 80% are now "confirmed" and 20% are "not_confirmed". I don't know if we are talking about the same issue, but it seems to help a little.

    Btw:

    bash-4.4$ /opt/HP/BSM/opr/support/opr-jmsUtil.sh |grep -e ^queue -e opr_gateway_queue |grep -v queue/alert
    queue | total | buffered | delivering | memory | pages | consumers
    opr_gateway_queue | 7794485 | 0 | 0 | 0 | 0 | 1

    Beyond that we have no other impact. The full Kubernetes version requires a certain level of maintenance on our side to adapt functionality, but overall the solution is fairly stable (around 800 nodes).