NA 2023.05 Linux: loss of SSH connection when trying to pull device configuration

I'm having an issue with a new deployment of NA 2023.05 on RHEL 8 VMs. Error message:

NA 2023.05 receives error: Attempt to retrieve data from device failed: Task thread was interrupted. When the snapshot starts, it connects with the svc_na account over SSHv2 on port 22. The account logs in with its password. The device can be accessed and it starts building the configuration, but the task fails when retrieving the configuration. We are also having trouble finding where on the server the session logs are located so we can analyze this further. The driver packs have been updated to the latest version.

 Thanks,  Jim

Parents
  • 0  

    Hi Jim,  

    So, I have a few questions for you that hopefully will get you what you're looking for.  

    Let's start with the logs first. When you say session logs, are you talking about the session log box that you can check for a task?  That's in the DB.  If you do a task report and look at that task's result, you'll see it.  

    If you want it in a file (log), then you could either turn up logging globally or for that single device / task (device/session -> trace).  Then it will be in the appserver_wrapper.log or the generated task log for the single device.  

    The task specific log would be in ../NA/server/ext/appserver/standalone/log/Task Type Name task id #### on device ID ###.task.log
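    Since the task log file names contain spaces and IDs, it can be easier to list the newest ones than to guess the name. A minimal sketch, assuming the install path is /opt/NA (as it is later in this thread); `newest_task_logs` and `NA_TASK_LOG_DIR` are made-up names, not part of NA:

```shell
# Assumed install path (taken from later in this thread); adjust for your core.
NA_TASK_LOG_DIR="${NA_TASK_LOG_DIR:-/opt/NA/server/ext/appserver/standalone/log}"

# Print the most recently modified *.task.log files, newest first.
# Usage: newest_task_logs [DIR] [COUNT]
newest_task_logs() {
    dir="${1:-$NA_TASK_LOG_DIR}"
    count="${2:-5}"
    # File names contain spaces, so let ls sort by mtime and keep whole lines.
    ls -t "$dir"/*.task.log 2>/dev/null | head -n "$count"
}
```

    Running `newest_task_logs` right after the task finishes, on the core that ran it, should put that task's log at the top of the list.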

    If you have more than one NA Core, just make sure you go to the core that ran the task. The same applies if you download the troubleshoot.zip.  

    Now, the take snapshot issue.  First question - how long is it saying the task is taking, or more to the point, is it taking > 60 minutes (and this is if you or someone hasn't changed the default max task timer).  

    Do you mind if I ask what device type this device is?  Do you have others and if so, are they all failing or just this one device (of many) is failing?  

    Do you see anything when you look at the task (view session log), specifically, do you see NA logging in and then executing commands?  Can you see what the last thing was?  Was it trying to get the config via a transport method (scp, ftp, sftp)?  If so, do other devices use these methods and if not, have you configured this in NA?  

    As a work-around, you could try editing the device and selecting only CLI / ssh to get the config (uncheck scp and the other methods) and see if the task is successful (granted, this may be a device where you need a transfer method, but for, say, IOS, this should work).  If that works for this device task, then look at how NA is configured (scp / sftp / ftp) and make sure there are no issues with firewalls or such.  

    Hope this helps,

    Chris

  • 0 in reply to   

    Thanks Chris, I checked /opt/NA/server/ext/appserver/standalone/log/server.log and saw there's an issue: "Error executing SCP get - Failure executing SCP command". The task is not taking more than 3 - 40 seconds to run. I can see it logging in, collecting data and building the configuration file, but not pulling it back.  The devices mainly having issues are Cisco Nexus and ASR devices.  The scp/sftp/ftp settings are configured to match our older NA 2020.08.   Thanks

  • 0   in reply to 

    OK, let's take a look at a few things....

    First, for these devices, other tasks like the standard out of the box diagnostics, they run fine, correct?  
    And the prior version of NA, two questions here:
    1) These devices had no problems with snapshots
    2) The task shows the same method used - this will be a bit difficult as you may have to hunt for session logs to view and check but am curious if "how" the task is getting the config switched with either the version change or driver change.

    OK, so you see the error with the SCP get, but nothing after that when looking at the View Session Log link? Is it first trying to do an scp get, which doesn't work, and then NA logs into the device? Or do you see NA log into the device and then, once it is in the device, you see the SCP get error? By chance, can you remove any company info, leave the rest, and post it? I'm trying to see how far the task is actually getting.

    My suggestion to dig a bit deeper:

    Pick one device, run a Take Snapshot, check the box for session log, then scroll all the way to the bottom.

    Expand Task Logging and pick:

    device/driver/javascript

    device/driver/parser

    device/driver/updater

    device/session

    Run the task, then when it finishes, do the option to download the troubleshoot.zip and check the box for this task specific file as well (again, from the core that ran the task).  

    That file should show you the details of what was going on with the task.  I would expect more unless there is an error that is causing NA to bail on the task and that would show up in the log.   

    You could do a tail -F /opt/NA/server/log/appserver_wrapper.log > some_other_file.log while this is going on...but if you have other tasks running, it just gets really messy.  
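    One way to cut down the noise is to filter for the SCP lines only. A small sketch; `scp_lines` is a made-up helper name, and the match is just a case-insensitive search for "scp", so it will also catch unrelated lines that mention it:

```shell
# Print numbered lines that mention SCP (case-insensitive) from a log file.
# Usage: scp_lines /path/to/logfile
scp_lines() {
    grep -n -i 'scp' "$1"
}

# Live variant while the task runs (GNU grep, as shipped on RHEL):
#   tail -F /opt/NA/server/log/appserver_wrapper.log | grep --line-buffered -i 'scp'
```

    For example, `scp_lines /opt/NA/server/ext/appserver/standalone/log/server.log` would list every SCP-related line with its line number, so you can jump to the surrounding context.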

    Worst case, with the TS.zip file you get, you can provide to support (let's say there's a bug), this will be a quick first step with them and they'll be able to take that and run with it.  

    Another thing you can try: see if you can log in to the NA core with the NA ID / password it uses for scp / ftp / sftp. It should work, and probably does, but it's worth trying. Can you log in? Can you transfer a file? Just make sure permissions aren't a problem, etc.  
    Or pick a different device, maybe an IOS switch, run a snapshot and check the box for session log. Do you see any problems there?
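    The login and transfer check can be sketched as a quick round trip from another host. This is only an illustration: `scp_roundtrip` is a hypothetical helper, and the user, host and remote directory are placeholders you would replace with whatever your NA file transfer settings use.

```shell
# Hypothetical helper: copy a small file to the NA core over scp and back,
# then compare the two copies. Leaves na_scp_test.txt behind on the remote side.
# Usage: scp_roundtrip USER HOST REMOTE_DIR
scp_roundtrip() {
    if [ "$#" -ne 3 ]; then
        echo "usage: scp_roundtrip USER HOST REMOTE_DIR" >&2
        return 2
    fi
    user=$1; host=$2; remote_dir=$3
    tmp=$(mktemp -d) || return 1
    echo "na scp test $(date)" > "$tmp/out.txt"
    # Upload, download, then compare byte for byte.
    scp -q "$tmp/out.txt" "$user@$host:$remote_dir/na_scp_test.txt" &&
    scp -q "$user@$host:$remote_dir/na_scp_test.txt" "$tmp/back.txt" &&
    cmp -s "$tmp/out.txt" "$tmp/back.txt"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

    A zero exit status means the account could log in, write, and read back the file. A failure on the first scp points at login or write permissions, and a failure on the second at read permissions.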

    Again, as a temp fix, it might be worth creating a group of these "problem devices", doing a batch edit to switch off scp/sftp/ftp and leave only cli, getting a snapshot in case it's been a long time, and then putting things back to the default.  You didn't say how long it hasn't worked, but if it's been a long time, this might be a good, safe idea.  

    Let's try these and see what we find.

    -Chris

Children
  • 0 in reply to   

    Thank you,

     I re-ran the snapshot with the adjustment you recommended.  It took a while for the snapshot to run, but I was able to capture the output from the screen, so I'll pass that to our network admin to see if something in his device settings needs to be changed. 

     I checked the appserver_wrapper.log for the snapshot and found the following:

    postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "rn_event_pkey"

    Detail: Key (eventid)=(2769962) already exists.
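    To tell whether this is a one-off or keeps recurring, the duplicate key errors in the log can be counted per constraint. A rough sketch, assuming the log lines carry the standard PostgreSQL wording shown above; `dup_key_counts` is a made-up helper name:

```shell
# Count "duplicate key" errors per constraint name in a log file.
# Prints one line per constraint, e.g. "rn_event_pkey: 2".
# Usage: dup_key_counts /path/to/logfile
dup_key_counts() {
    sed -n 's/.*violates unique constraint "\([^"]*\)".*/\1/p' "$1" |
        sort | uniq -c | awk '{print $2 ": " $1}'
}
```

    Running it against /opt/NA/server/log/appserver_wrapper.log on each core would show whether both cores hit the same constraint and how often.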

    We are running 2 NA cores on VMs configured for horizontal scalability, but both systems show themselves as running as a standalone core. Info does replicate between the 2 cores. I'm using Postgres 14 on a separate VM.