Agent Health Problem: "The agent did not send events during the last 35 minutes", but the agent is still working fine

Hello, 

I've had an issue since we migrated our OpsBridge to external k8s (version 24.1).

I'm receiving a lot of agent health events every day (roughly 50-100 events per day for around 800 nodes).

- The event is the "classic" agent health issue:

         Agent health problem
         The agent did not send events during the last 35 minutes
         No additional information is available

         Connectivity Status: Disconnected

- All agents are configured with the default health check settings (Agent & Server, heartbeat interval 30 minutes, heartbeat grace period 5 minutes).
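With these defaults, the 35 minutes in the event text is the heartbeat interval plus the grace period (30 + 5). A minimal sketch of that arithmetic, assuming the server raises the event once an agent has been silent for longer than interval + grace; `alert_threshold` and `would_alert` are hypothetical helpers, not an OpsBridge API:

```python
from datetime import datetime, timedelta

# Default health check settings quoted in the post.
HEARTBEAT_INTERVAL = timedelta(minutes=30)
GRACE_PERIOD = timedelta(minutes=5)


def alert_threshold(interval: timedelta = HEARTBEAT_INTERVAL,
                    grace: timedelta = GRACE_PERIOD) -> timedelta:
    """Silence longer than this is assumed to raise 'Agent health problem'."""
    return interval + grace


def would_alert(last_seen: datetime, now: datetime) -> bool:
    """Hypothetical check: True once the agent has been silent past the threshold.

    Whether the server compares with '>' or '>=' is an assumption.
    """
    return now - last_seen > alert_threshold()


if __name__ == "__main__":
    print(alert_threshold())  # 0:35:00
    last = datetime(2024, 9, 19, 7, 0)
    print(would_alert(last, datetime(2024, 9, 19, 7, 34)))  # False
    print(would_alert(last, datetime(2024, 9, 19, 7, 36)))  # True
```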


- I have upgraded all my agents to version 12.25.006.

What's weird is that we receive this kind of alert even though we can still perform actions on these nodes (such as executing tools), and on the agent all ovc services are up and running.

Here is one of my agents that currently has the event open:

ovc -status
agtrep OV Discovery Agent AGENT,AgtRep (36182) Running
hpsensor Compute Sensor AGENT,OA (36334) Running
oacore Operations Agent Core AGENT,OA (36308) Running
oahmon Agent Health Monitor AGENT,EA (241346) Running
ompolparm OM Parameter Handler AGENT,EA (36156) Running
opcacta OVO Action Agent AGENT,EA (36264) Running
opcgeni Generic Source Interceptor AGENT,EA (36109) Running
opcle OVO Logfile Encapsulator AGENT,EA (36253) Running
opcmona OVO Monitor Agent AGENT,EA (36205) Running
opcmsga OVO Message Agent AGENT,EA (36281) Running
opcmsgi OVO Message Interceptor AGENT,EA (36195) Running
ovbbccb OV Communication Broker CORE (241052) Running
ovcd OV Control CORE (241043) Running
ovconfd OV Config and Deploy COREXT (241080) Running
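To spot a stopped process quickly across many nodes, the `ovc -status` output can be checked mechanically. A rough sketch; the line layout is inferred from the output above, and the assumption that stopped processes are printed without a PID is unverified. `parse_ovc_status` and `not_running` are hypothetical names:

```python
import re
from typing import List, Optional, Tuple

# Assumed line layout, inferred from the output above:
#   <name> <description> <category> [(<pid>)] <state>
# The PID is optional on the (unverified) assumption that stopped
# processes are printed without one.
STATUS_LINE = re.compile(
    r"^(\S+)\s+(.*?)\s+(\S+)\s+(?:\((\d+)\)\s+)?(\S+)\s*$"
)


def parse_ovc_status(text: str) -> List[Tuple[str, str, Optional[int], str]]:
    """Return (name, category, pid, state) for every line that matches."""
    rows = []
    for line in text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            name, _desc, category, pid, state = match.groups()
            rows.append((name, category, int(pid) if pid else None, state))
    return rows


def not_running(text: str) -> List[str]:
    """Names of processes whose state is anything other than 'Running'."""
    return [row[0] for row in parse_ovc_status(text) if row[3] != "Running"]


if __name__ == "__main__":
    sample = """\
ovc -status
opcmsga OVO Message Agent AGENT,EA (36281) Running
ovbbccb OV Communication Broker CORE (241052) Running
ovconfd OV Config and Deploy COREXT (241080) Running
"""
    print(not_running(sample))  # []
```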

I've found some errors in system.txt about agent <-> OpsBridge communication (which I didn't have before, in version 2022 with CDF):

0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (36281/139952380274496): [genmsga.c:9931]: Forwarding message/action response to OVO message
receiver failed due to server failure : (bbc-422) HttpOutputRequestImpl::ReceiveResponse() caught OvXplNet::ConnectionRefusedException_t. <null>
. (OpC30-36)
There is no server process active for address: https://[MYSERVEROPSBRIDGE]:383/com.hp.ov.opc.msgr/rpc/.

0: INF: Thu Sep 19 07:08:42 2024: opcmsga (36281/139952380274496): [genmsga.c:7342]: Message Agent is not buffering. (OpC30-100)
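To see whether these refusals cluster at the times the health events fire, the warnings can be counted per hour. A sketch assuming every system.txt entry starts with a header like the ones above and that the following non-header lines belong to that entry; the function names and the script name in the usage comment are hypothetical:

```python
import re
import sys
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple

# Assumed header of each system.txt entry, based on the excerpt above:
#   0: WRN: Thu Sep 19 07:08:41 2024: opcmsga (36281/...): [file:line]: text
HEADER = re.compile(
    r"^\d+: (\w+): (\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}): (\S+) \("
)


def entries(lines: Iterable[str]) -> List[Tuple[datetime, str, str, str]]:
    """Group lines into (time, severity, process, text) entries.

    Lines that do not start with a header are treated as continuations
    of the previous entry.
    """
    grouped: List[list] = []
    for line in lines:
        match = HEADER.match(line)
        if match:
            severity, stamp, process = match.groups()
            when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
            grouped.append([when, severity, process, line.strip()])
        elif grouped:
            grouped[-1][3] += " " + line.strip()
    return [tuple(entry) for entry in grouped]


def refused_per_hour(lines: Iterable[str]) -> Counter:
    """Count entries mentioning ConnectionRefusedException, per hour."""
    counts: Counter = Counter()
    for when, _severity, _process, text in entries(lines):
        if "ConnectionRefusedException" in text:
            counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python refused_per_hour.py /path/to/system.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for hour, count in sorted(refused_per_hour(fh).items()):
            print(hour, count)
```

Comparing these hourly counts with the creation times of the health events on the server would show whether both come from the same outage windows.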

From my agent I can run bbcutil -ping MYSERVEROPSBRIDGE repeatedly and I always get an answer; everything seems to work fine.
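Since a handful of successful pings does not rule out short outages, a loop that records only the failures can help correlate them with the alert times. A sketch, assuming a successful `bbcutil -ping` reply contains `eServiceOK` (check this against your own output and adjust `OK_MARKER`); `ping_ok`, `watch` and the script name are hypothetical:

```python
import subprocess
import sys
import time
from datetime import datetime

# Assumption: a successful `bbcutil -ping` reply contains "eServiceOK".
# Check this against your own output and adjust if needed.
OK_MARKER = "eServiceOK"


def ping_ok(output: str) -> bool:
    """Classify the text printed by one bbcutil -ping call."""
    return OK_MARKER in output


def watch(target: str, interval_s: float = 30.0, rounds: int = 120) -> None:
    """Ping `target` `rounds` times and print a timestamped line per failure."""
    for _ in range(rounds):
        proc = subprocess.run(
            ["bbcutil", "-ping", target], capture_output=True, text=True
        )
        output = proc.stdout + proc.stderr
        if not ping_ok(output):
            stamp = datetime.now().isoformat(timespec="seconds")
            print(stamp, "FAILED", output.strip())
        time.sleep(interval_s)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ping_watch.py MYSERVEROPSBRIDGE
    watch(sys.argv[1])
```

With the defaults this runs for roughly an hour (120 pings, 30 seconds apart), long enough to overlap a 35-minute alert window.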

My L1 support team spends all day executing tools to restart the agents, after which the alert disappears.

Can anyone help me with this? (I also have a case open with support, but to be honest I still don't have any workaround, and this community seems more active for this kind of issue.)


Thanks :) ! 

  •

    Hello Raphael,

    If you get several such events for multiple agents, then the problem is very likely on the receiver end, not on the agent(s).

    I would check whether wde is low on memory and, if so, increase the memory allocation for wde if possible.

    Check /opt/HP/BSM/log/wde/jvm_statistics.log in your omi-0 and omi-1 pods.

    If there are times when the free heap memory is 0 or close to 0, you will need to increase memory for wde.
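    Rather than scrolling through the file, a small parser can report the lowest FREE value. A sketch assuming HEAP lines look like `... INFO  - HEAP - [USED: ..., COMMITTED: ..., MAX: ..., FREE: ...];`, as in the sample posted in this thread; the 50 threshold, the function names and the script name are hypothetical:

```python
import re
import sys
from typing import Iterable, List, Tuple

# Assumed HEAP line format, taken from the jvm_statistics.log line in this thread:
# 2024-09-19 08:18:11,625 INFO  - HEAP - [USED: 408.2, COMMITTED: 866.1, MAX: 866.1, FREE: 457.8];
HEAP_LINE = re.compile(
    r"^(\S+ \S+) .* - HEAP - \[USED: ([\d.]+), COMMITTED: ([\d.]+), "
    r"MAX: ([\d.]+), FREE: ([\d.]+)\]"
)


def free_heap(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (timestamp, FREE) for each HEAP sample; other lines are ignored."""
    samples = []
    for line in lines:
        match = HEAP_LINE.match(line)
        if match:
            samples.append((match.group(1), float(match.group(5))))
    return samples


def low_heap(lines: Iterable[str], threshold: float = 50.0) -> List[Tuple[str, float]]:
    """Samples whose FREE value is at or below `threshold` (hypothetical cut-off,
    in the same unit the log uses)."""
    return [(ts, free) for ts, free in free_heap(lines) if free <= threshold]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python heap_check.py /opt/HP/BSM/log/wde/jvm_statistics.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        lines = fh.readlines()
    samples = free_heap(lines)
    if samples:
        ts, lowest = min(samples, key=lambda s: s[1])
        print(f"lowest FREE: {lowest} at {ts}")
    for ts, free in low_heap(lines):
        print("low:", ts, free)
```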

    Best regards,

    Tobias

  •

    Hello, 

    Yes, you are right, I think it's more on the server side!

    I have checked as you suggested:

    On both OMI pods, HEAP FREE stays between 280 and 500, and NON-HEAP is always at 498.7.

    2024-09-19 08:18:11,625 INFO  - HEAP - [USED: 408.2, COMMITTED: 866.1, MAX: 866.1, FREE: 457.8];

    On both OMI pods:

    omiuser@omi-0:/> grep Xmx /opt/HP/BSM/conf/OPR-SCRIPTING-HOST_vm_params.ini
    -Xms1024m -Xmx1024m -XX:MaxMetaspaceSize=256m

    omiuser@omi-0:/> grep Xmx /opt/HP/BSM/conf/OPR_vm_params.ini
    -Xms3072m -Xmx3072m -XX:MaxMetaspaceSize=256m



    Do you think that is enough?

  •

    Hello Raphael,

    866 M is generally on the low side, but if FREE memory doesn't go below 280 M, then that's not the problem.

    As a side note, the memory settings for wde are in this file:

    > grep Xmx /opt/HP/BSM/conf/wde_vm_params.ini
    -Xms896m -Xmx896m -XX:MaxMetaspaceSize=256m
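    To compare the heap settings of OPR, OPR-SCRIPTING-HOST and wde side by side, a small helper can pull the sizes out of these grep results. A minimal sketch; the function name is hypothetical and it only handles the k/m/g suffixes (and plain bytes) that the JVM accepts for these flags:

```python
import re
from typing import Dict

# Size suffixes the JVM accepts for these flags, converted to MiB.
# A value without a suffix is in bytes.
_UNIT_MIB = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1.0, "g": 1024.0}
_FLAG = re.compile(r"-(Xms|Xmx|XX:MaxMetaspaceSize=)(\d+)([kKmMgG]?)")


def memory_flags_mib(params: str) -> Dict[str, float]:
    """Extract Xms, Xmx and MaxMetaspaceSize (in MiB) from a vm_params line."""
    result = {}
    for flag, number, unit in _FLAG.findall(params):
        name = flag.replace("XX:", "").rstrip("=")
        result[name] = int(number) * _UNIT_MIB[unit.lower()]
    return result


if __name__ == "__main__":
    # The three lines quoted in this thread (OPR-SCRIPTING-HOST, OPR, wde).
    for line in ("-Xms1024m -Xmx1024m -XX:MaxMetaspaceSize=256m",
                 "-Xms3072m -Xmx3072m -XX:MaxMetaspaceSize=256m",
                 "-Xms896m -Xmx896m -XX:MaxMetaspaceSize=256m"):
        print(memory_flags_mib(line))
```

    For the wde line this gives 896 MiB for Xmx, which is in the same range as the MAX: 866.1 reported in jvm_statistics.log.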

    You could also check whether there are any errors in opr-gateway.log:

    grep ERROR /opt/HP/BSM/log/wde/opr-gateway.log
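    When grep returns many lines, grouping them shows which error dominates. A sketch assuming each line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp like jvm_statistics.log (the real opr-gateway.log layout may differ); numbers are collapsed to `#` so repeats of the same error land in one bucket. `top_errors` and the script name are hypothetical:

```python
import re
import sys
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: each line starts with "YYYY-MM-DD HH:MM:SS,mmm" as in
# jvm_statistics.log; the real opr-gateway.log layout may differ.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s*")
NUMBERS = re.compile(r"\d+")


def top_errors(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequent ERROR messages, with timestamps removed and digits
    collapsed to '#' so that repeats of the same error are grouped."""
    counts: Counter = Counter()
    for line in lines:
        if "ERROR" not in line:  # same filter as the grep above
            continue
        message = TIMESTAMP.sub("", line.strip())
        counts[NUMBERS.sub("#", message)] += 1
    return counts.most_common(n)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python top_errors.py /opt/HP/BSM/log/wde/opr-gateway.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for message, count in top_errors(fh):
            print(f"{count:6d}  {message}")
```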

    Best regards,

    Tobias