NA 2023.05 Linux: loss of SSH connection when trying to pull device configuration

I'm having an issue with a new deployment of NA 2023.05 on RHEL 8 VMs. Error message:

NA 2023.05 receives the error: Attempt to retrieve data from device failed: Task thread was interrupted. When the snapshot starts, it connects with the svc_na account over SSHv2 on port 22. The account logs in with the password, the device can be accessed, and the configuration gets built, but the task fails when retrieving the configuration. We are also having trouble finding where the session logs are located on the server so we can analyze this further. The driver packs have been updated to the latest version.

 Thanks,  Jim

  • 0  

    Hi Jim,  

    So, I have a few questions for you that hopefully will get you what you're looking for.  

    Let's start with the logs first. When you say session logs, are you talking about the session log box that you can check for a task?  That's in the DB.  If you do a task report and look at that task's result, you'll see it.  

    If you want it in a file (log), then you could either turn up logging globally or for that single device / task (device/session -> trace).  Then it will be in the appserver_wrapper.log or the generated task log for the single device.  

    The task specific log would be in ../NA/server/ext/appserver/standalone/log/Task Type Name task id #### on device ID ###.task.log
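
If it helps, here is a minimal sketch for listing the most recent task-specific logs on the core that ran the task. It assumes NA is installed under /opt/NA (adjust LOG_DIR to your install root) and that the files end in .task.log as described above. The name newest_task_logs is only illustrative, not anything NA ships.

```python
from pathlib import Path

# Assumption: NA lives under /opt/NA. Change this to match your install root.
LOG_DIR = Path("/opt/NA/server/ext/appserver/standalone/log")


def newest_task_logs(log_dir, limit=5):
    """Return up to `limit` *.task.log files, most recently modified first."""
    logs = sorted(
        Path(log_dir).glob("*.task.log"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return logs[:limit]


if __name__ == "__main__":
    for path in newest_task_logs(LOG_DIR):
        print(path)
```

Run it on each core if you are not sure which one picked up the task.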

    If you have more than one NA Core, just make sure you go to the core that ran the task; the same applies if you download the troubleshoot.zip.  

    Now, the take snapshot issue.  First question - how long is it saying the task is taking, or more to the point, is it taking > 60 minutes (and this is if you or someone hasn't changed the default max task timer).  

    Do you mind if I ask what device type this device is?  Do you have others and if so, are they all failing or just this one device (of many) is failing?  

    Do you see anything when you look at the task (view session log), specifically, do you see NA logging in and then executing commands?  Can you see what the last thing was?  Was it trying to get the config via a transport method (scp, ftp, sftp)?  If so, do other devices use these methods and if not, have you configured this in NA?  

    As a work-around, you could try editing the device and only selecting CLI / ssh to get the config (uncheck scp and other methods) and see if the task is successful (granted, this may be a device where you need a method, but for say IOS, this should work).  If that did work for this device task, then perhaps look at how NA is configured (scp / sftp / ftp) or make sure that there's no issues with firewalls or such.  

    Hope this helps,

    Chris

  • 0 in reply to   

    Thanks Chris, I checked /opt/NA/server/ext/appserver/standalone/log/server.log and saw this error: "Error executing SCP get - Failure executing SCP command". The tasks are not taking more than 3 - 40 seconds to run. I can see it logging in, collecting data and building the configuration file, but not pulling it back.  The devices that are mainly having issues are Cisco Nexus and ASR devices.  The scp/sftp/ftp settings are configured to match our older NA 2020.08.   Thanks

  • 0   in reply to 

    OK, let's take a look at a few things....

    First, for these devices, other tasks like the standard out of the box diagnostics, they run fine, correct?  
    And the prior version of NA, two questions here:
    1) These devices had no problems with snapshots
    2) The task shows the same method used - this will be a bit difficult as you may have to hunt for session logs to view and check but am curious if "how" the task is getting the config switched with either the version change or driver change.

    OK, so you see the error with the SCP get, but nothing after that when looking at the View Session Log link? Is NA first trying an SCP get, which fails, and then logging into the device? Or do you see NA log into the device and only then, once it is in the device, hit the SCP get error? By chance, can you remove any company info, leave the rest, and post it? I'm trying to see how far the task is actually getting.

    My suggestion to dig a bit deeper:

    Pick one device, run a Take Snapshot, check the box for session log, then scroll all the way to the bottom.

    Expand Task Logging and pick:

    device/driver/javascript

    device/driver/parser

    device/driver/updater

    device/session

    Run the task, then when it finishes, do the option to download the troubleshoot.zip and check the box for this task specific file as well (again, from the core that ran the task).  

    That file should show you the details of what was going on with the task.  I would expect more unless there is an error that is causing NA to bail on the task and that would show up in the log.   

    You could do a tail -F /opt/NA/server/log/appserver_wrapper.log > some_other_file.log while this is going on...but if you have other tasks running, it just gets really messy.  
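
If the tail output is too noisy, one option is to filter the captured file afterwards. This is a rough sketch, assuming the text to look for is the "Error executing SCP" message already seen in server.log. matching_lines is a made-up helper name and the needle list is yours to change.

```python
def matching_lines(path, needles=("Error executing SCP",)):
    """Yield (line_number, text) for lines containing any needle, ignoring case."""
    lowered = [n.lower() for n in needles]
    with open(path, encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if any(n in line.lower() for n in lowered):
                yield number, line.rstrip("\n")
```

Calling list(matching_lines("some_other_file.log")) gives you only the matching lines with their line numbers, so you can jump to the surrounding context in the full file.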

    Worst case, with the TS.zip file you get, you can provide to support (let's say there's a bug), this will be a quick first step with them and they'll be able to take that and run with it.  

    Another thing you can try: see if you can log in to the NA core with the NA ID / password it uses for scp / ftp / sftp. It should work and probably does, but it's worth trying. Can you log in, and can you transfer a file? Just make sure permissions aren't a problem, etc.  
    Or, pick a different device, maybe an IOS switch, run a snapshot and check the box for session log. Do you see any problems there?

    Again, as a temp fix, it might be worth creating a group of these "problem devices", then doing a batch edit that switches off scp/sftp/ftp and leaves only cli, and then getting a snapshot in case it's been a long time. After that, put it back to the default.  You didn't say how long it hadn't worked, but if it's been a long time, this might be a good, safe idea.  

    Let's try these and see what we find.

    -Chris

  • 0 in reply to   

    Thank you,

     I re-ran the snapshot after making the adjustment you recommended.  It took a while for the snapshot to run, but I was able to capture the output from the screen, so I'll pass that to our network admin to see if something in his device settings needs to be changed. 

     I checked the appserver_wrapper.log for the snapshot and found the following:

    postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "rn_event_pkey"

    Detail: Key (eventid)=(2769962) already exists.
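
For anyone scanning a large appserver_wrapper.log for this kind of error, here is a small sketch that pulls the constraint name and the conflicting key out of the message. It assumes the standard PostgreSQL wording for a unique violation; parse_duplicate_key is a hypothetical helper name.

```python
import re

# Matches PostgreSQL's unique-violation message and its Detail line.
DUP_KEY = re.compile(
    r'duplicate key value violates unique constraint "(?P<constraint>[^"]+)"'
    r'.*?Key \((?P<columns>[^)]*)\)=\((?P<values>[^)]*)\)',
    re.DOTALL,
)


def parse_duplicate_key(text):
    """Return constraint, columns and values from the error text, or None."""
    match = DUP_KEY.search(text)
    return match.groupdict() if match else None
```

Here the fields come out as rn_event_pkey, eventid and 2769962, meaning an insert tried to reuse an eventid that is already in the table. That concerns event records, so on its own it does not show that the device configuration failed to save.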

    We are running 2 NA cores on VMs configured for horizontal scalability, but each system shows itself running as a standalone core. Info does replicate between the 2 cores. I'm using Postgres-14 on a separate VM.

     

     

  • 0   in reply to 

    OK, let's take a look at the cores / setup.

    Do you see admin / distributed in NA Web UI?  If so, what do you see for List Cores?

    Two cores, one core?  Are you expecting both to be active or is one standby (by design)?  Something like:

    Name      Core Hostname    Status                       Timezone Offset    Realm
    Core 1    bob              Running: Fully functional    UTC -8             Default Realm
    Core 2    harry            Running: Fully functional    UTC -8             Default Realm

    Check in /opt/NA/jre - do you see distributed.rcx? 

    Easy check, query your DB:

    select * from rn_core;

    You should get two rows back, if you get just one, then someone missed some steps (you aren't the first - it happens).  If you do get two rows, then the distributed.rcx is probably what's missing.  
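
The two checks above can be combined into one quick sanity test. This is only a sketch of the reasoning in this post: you pass in the row count you got from the rn_core query yourself, and it looks for distributed.rcx in /opt/NA/jre. check_core_setup is a hypothetical name and the returned messages are my wording, not NA output.

```python
from pathlib import Path


def check_core_setup(rn_core_row_count, jre_dir="/opt/NA/jre"):
    """Summarise the two-core checks: rn_core rows and distributed.rcx presence."""
    rcx_present = (Path(jre_dir) / "distributed.rcx").is_file()
    if rn_core_row_count < 2:
        return "rn_core has fewer than two rows: install steps were probably missed"
    if not rcx_present:
        return "rn_core lists both cores but distributed.rcx is missing"
    return "rn_core lists both cores and distributed.rcx is present"
```

Run it on each core, since the file check only looks at the local filesystem.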

    You may want to shut down the core that isn't shown in the output of the sql command above (or have someone do it) until that gets resolved, don't want to introduce any problems.  

    Good luck with the snapshot task issue. 

    Again, hope this helps,

    Chris

  • 0 in reply to   

    I rechecked my system and distributed.rcx was not deployed. I remember that when I deployed it during the initial NA configuration, truecontrol did not start up correctly and the NA web page didn't come up. I copied the distributed.rcx file back to /opt/NA/jre/ and restarted truecontrol, and again the web page failed to come up. 

  • 0   in reply to 

    Hmmm, if you do a /etc/init.d/truecontrol status, do you see:

    TrueControl Management Engine is running.

    If you do, check the appserver_wrapper.log...

    do you see a line similar to:

     [ServerService Thread Pool -- 99] 75 ServerStartupThread: Startup complete for type: Postgres

    or

     [ServerService Thread Pool -- 81] 75 DistributedDaemon: Distributed Horizontal scalability is enabled  (though this is from a 24.2 instance, so it may not be seen in 2023.05)
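
A quick way to check for both lines at once is sketched below. It only does substring matching on the two messages quoted above, and as noted, the second one comes from a 24.2 instance and may not be present in 2023.05. startup_markers is a made-up name.

```python
# Substrings taken from the two log lines quoted above.
MARKERS = (
    "Startup complete for type",
    "Distributed Horizontal scalability is enabled",
)


def startup_markers(path, markers=MARKERS):
    """Return {marker: True/False} for whether each marker appears in the log."""
    found = {marker: False for marker in markers}
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            for marker in markers:
                if marker in line:
                    found[marker] = True
    return found
```

Point it at appserver_wrapper.log on the core you just restarted.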

    Is the current DB a new database or was it exported / imported to your current DB?  Trying to understand if the DB is expecting two cores or not.  If not, the distributed.rcx won't be expected.    

    Did you do the install of NA or did you inherit this or maybe you're helping someone else out.  But curious if this was done:

    Install NA in a horizontally scalable environment - Network Automation (microfocus.com)

    -Chris

  • 0 in reply to 

    Here's some additional info from what we've been working through with another support vendor. We were looking at whether it was the SSH connection.

    ---------

    Looking below, the sshd_config is currently set as follows:

     

    ClientAliveInterval 600

    ClientAliveCountMax 3
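
For reference, here is a small worked sketch of what those two values mean. In OpenSSH, sshd sends a keepalive after ClientAliveInterval seconds without data from the client and disconnects once ClientAliveCountMax of them go unanswered, so an unresponsive client is dropped after roughly interval x count seconds. The parser below is simplified (an assumption on my part): whitespace-separated keywords, plain seconds, and no Match blocks. client_alive_timeout is an illustrative name.

```python
def client_alive_timeout(sshd_config_text):
    """Approximate seconds before sshd drops an unresponsive client (0 = disabled)."""
    settings = {"clientaliveinterval": None, "clientalivecountmax": None}
    for raw in sshd_config_text.splitlines():
        parts = raw.split("#", 1)[0].split()
        if len(parts) >= 2 and parts[0].lower() in settings:
            key = parts[0].lower()
            if settings[key] is None:  # sshd keeps the first value it reads
                settings[key] = int(parts[1])
    interval = settings["clientaliveinterval"] or 0  # OpenSSH default: 0 (off)
    count_max = settings["clientalivecountmax"]
    if count_max is None:
        count_max = 3  # OpenSSH default
    return interval * count_max
```

With 600 and 3 this comes to 1800 seconds, i.e. about 30 minutes of unresponsiveness. The failing snapshot tasks end within 3 - 40 seconds, so this particular timeout looks unlikely to be what interrupts them.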

     

    There are multiple devices having issues, but the two devices I am currently investigating are two different Cisco ASRs (different models) with XR code (different versions).  The devices are pingable, we can access them via NA as well as directly, and as stated below NA builds the configuration, but I think the problem comes when it tries to write it to the database.  When this happens, we are not able to look at the verbose session log, because it disappears when the snap fails and there is no hyperlink to it.  Is there a location on the server where we can look at the verbose session log for a specific device snap?

     

    Please note: Other models such as Cisco 3560 switches are able to build and write the config to the database and we are able to see the verbose session log and see the log via displayed hyperlink.

     

    More info that you asked for below:

     Yes, Jim installed the recent DDP and the problem still exists.

     Everything (all the above) works on our legacy 2020.08 build, but these issues are happening on the new 2023.05 build that is being rolled out.

     Hope this helps and thanks for the help,

    Note: We were wondering if it could be an issue with the Postgres database not saving the configuration when trying to write?  Our old system was using Oracle 12c

  • 0 in reply to   

    I did the NA install myself. I've done other installs on other networks, which were NA with the embedded Postgres database, without issue. We lost our support due to a contract change (DoD). The DB is new, with data exported from the old NA and imported into the new NA.

  • 0   in reply to 

    Oh, OK, big question - did someone try to export / migrate the Oracle database to Postgres?  I'm curious for a couple reasons....

    The ASRs, if you can get back in NA, can you say what driver is being used?  

    Also, if you can get in NA Web UI, can you do the following:

    Search For / Tasks

    By chance, do you guys have a scheduled task for Take Snapshot that isn't called Take Snapshot?  Maybe it's called Weekly Take Snapshot or something unique?  If so, for Task Name, put that in there.

    Uncheck Schedule Date

    Check Start Date  - if the db is new (say less than a month old), leave it blank; otherwise set it to about a month back (at least a week)

    Check Complete Date

    Check Duration

    Uncheck Priority

    For Task Type - pick Take Snapshot

    Check Core

    hit Search

    What I'm looking for is to see if the tasks that are having problems are on a specific core.  
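
If the search returns a lot of rows, a tally per core makes a pattern easier to spot. This is a minimal sketch that assumes you have copied the results into a list of dicts. The "core" and "status" keys and the "Failed" status text are hypothetical and should be matched to whatever your search results actually show.

```python
from collections import Counter


def failures_by_core(tasks):
    """Return {core: (failed_count, total_count)} for the given task rows."""
    totals, failed = Counter(), Counter()
    for task in tasks:
        totals[task["core"]] += 1
        if task["status"].lower() == "failed":
            failed[task["core"]] += 1
    return {core: (failed[core], totals[core]) for core in totals}
```

If nearly all failures land on one core, that points at that core's setup rather than at the devices.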

    Also curious what driver the ASR is using, but more interested in task - core info.  

    Let's see what comes from  this.

    -Chris