NSS Free Tree block is corrupt on pool

OES 2018 SP2 on VMware

One of the pools cannot be initialized:

2024-11-22T17:08:09.460603+01:00 fs kernel: [ 34.958925] Pool "DATA1" - MSAP activate.
2024-11-22T17:08:09.460604+01:00 fs kernel: [ 34.958925] Server(1c6bb2d4-d010-11e1-91-04-005056ab1587) Cluster(00000000-0000-0000-00-00-000000000000)
2024-11-22T17:08:09.460605+01:00 fs kernel: [ 34.958927] nsslibrary: [MSAP] comnLog[201]
2024-11-22T17:08:09.460606+01:00 fs kernel: [ 34.958927] Pool "DATA1" - Watching pool.
2024-11-22T17:08:09.655744+01:00 fs kernel: [ 35.400145] nsszlss64: NSS Free Tree block 673053475(0x281dfb23) is corrupt on pool "DATA1"
2024-11-22T17:08:09.655761+01:00 fs kernel: [ 35.400148] libnss: NSS Free Tree(3) is corrupt.
2024-11-22T17:08:09.655762+01:00 fs kernel: [ 35.400161] ------------[ cut here ]------------
2024-11-22T17:08:09.655764+01:00 fs kernel: [ 35.400163] kernel BUG at /home/abuild/rpmbuild/BUILD/nss/obj/default/public_core/libnss/misc/Abend.c:40!
nss /PoolMaintenance="DATA1" hangs, so there is no possibility to rebuild the pool.
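
For reference, this is the sequence I am attempting (a minimal sketch only, assuming the standard OES Linux NSS tools and the pool name from the log above):

# From the NSS console (nsscon) on the OES server
nss /PoolMaintenance=DATA1     # hangs at this point
# If maintenance mode succeeded, the next step would be a rebuild:
ravsui rebuild DATA1
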
  • 0  

    I would be confirming what the VMware storage and the underlying physical storage (SAN?) are saying about it. It is one of those cases where a full power cycle of those systems may help or give you more clues. More details on that front would be useful for us to assist further.

    If those layers all check out and restarts of the storage don't resolve it, we may have a Maxim 41 situation: "Do you have a backup?" means "I can't fix this." So checking your backup status might be a valuable parallel task (one that should be performed periodically anyway).
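
    As a starting point, something like the following on the ESXi host can show whether the lower layers are reporting trouble. This is only a rough sketch; device names and log contents will vary in your environment.

    # On the ESXi host (SSH enabled): list devices and paths, then scan the kernel log for SCSI errors
    esxcli storage core device list
    esxcli storage core path list
    grep -i scsi /var/log/vmkernel.log | tail -n 50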

    ________________________

    Andy of KonecnyConsulting.ca in Toronto
    Please use the "Like" and/or "Verified Answers" as appropriate as that helps us all.

  • 0 in reply to   

    The pool is on a virtual VMDK disk.

    The customer's administrator has a backup, but a small part of the data happens not to have been backed up, so it would be welcome if the pool could be rebuilt.

  • 0   in reply to 

    Under that VMDK is a datastore, which is provisioned on some actual physical drives in some fashion, hopefully a SAN with redundancy. The health of those layers needs to be looked at. I have seen issues happen there that look like what you are seeing, and finding and fixing those issues has helped me before. At least once, a full power shutdown (a convenient power outage beyond the UPS) resolved this.

    The location of the VM, such as which host it is on relative to the datastore, has been an issue before. Once, just bringing up the VM on a different host made a big difference. While that "shouldn't" make a difference, reality likes laughing at and thumbing its nose at the "shoulds".
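
    If you can get a maintenance window, it may also be worth checking the virtual disk itself from the ESXi shell. The path below is only an illustration and I am going from memory on the exact syntax, so treat it as a sketch:

    # With the VM powered off, check the virtual disk metadata (take a copy of the vmdk first!)
    vmkfstools -x check /vmfs/volumes/<datastore>/<vm-folder>/DATA1_disk.vmdk
    # Only if the check reports problems, a repair pass can be attempted
    vmkfstools -x repair /vmfs/volumes/<datastore>/<vm-folder>/DATA1_disk.vmdk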

    ________________________

    Andy of KonecnyConsulting.ca in Toronto
    Please use the "Like" and/or "Verified Answers" as appropriate as that helps us all.

  • 0 in reply to   

    The virtual server was copied to a Synology environment and a clone was created in the VMware environment. The same problem with the pool was identified on both migrated servers, so I think a hardware problem is not probable.

  • 0   in reply to 

    Ah,  that is a good elimination.

    The only other thought would be to bring the volume up on a different version of OES in case it is just a code issue, or even just updating the server after taking a good snapshot of it first.

    ________________________

    Andy of KonecnyConsulting.ca in Toronto
    Please use the "Like" and/or "Verified Answers" as appropriate as that helps us all.

  • 0 in reply to   

    Yes, on Monday (I must consult with the customer's administrator) I am going to install a new OES 24.4 server and try to connect the problematic pool, at first outside of the production tree, to investigate whether or not the problem is caused by the program code.
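
    The rough plan on the test server would look like this (a sketch only, assuming the usual nlvm/nsscon/ravsui tools on OES 24.4 and that the existing VMDK is simply attached to the new VM):

    # On the new OES 24.4 server, after attaching the disk
    nlvm list devices           # confirm the disk is visible
    nlvm list pools             # DATA1 should show up here
    nsscon                      # then: nss /PoolMaintenance=DATA1
    ravsui verify DATA1         # read-only check before attempting any rebuild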

  • 0  

    I have seen such a situation with NSS. In any case, you need to look deep into the logs of the VMware hosts in question. I had iSCSI and SAN multipath issues that caused problems in the virtual volumes in the layers above, which then affected the application.

    One issue concerned certain NAS systems that were used for VMware; block allocation errors were caused by incorrect SCSI block commands at a low level. What can also happen from time to time in an iSCSI environment is that packet loss leads to problems in the upper layers.

    To be able to make a concrete statement, I would have to have logs from the VMware side and use a sniffer to look into the SAN or iSCSI fabric or whatever is connected to it. One last thing that comes to mind: I also had a misconfigured vSAN that caused errors in the devices.
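
    For the multipath/iSCSI side, the usual quick checks on the ESXi host would be along these lines (just an illustration of where to look, not a complete diagnosis):

    # Check path states and the multipathing policy per device
    esxcli storage nmp device list
    # For software iSCSI, check that sessions are established
    esxcli iscsi session list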

    George

    “You can't teach a person anything, you can only help them to discover it within themselves.” Galileo Galilei

  • Verified Answer

    +1 in reply to   

    The problem was resolved:
    nssstart.cfg:
    /PoolAutoDeactivate=all
    reboot

    Then /PoolMaintenance=DATA1 was successful and "ravsui rebuild" repaired the pool.
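
    For anyone finding this thread later, the whole sequence was roughly as follows (the nssstart.cfg path shown is the default one on OES Linux; adjust if yours differs):

    # 1. Stop pools from auto-activating at boot:
    #    add this line to /etc/opt/novell/nss/nssstart.cfg
    /PoolAutoDeactivate=all

    # 2. Reboot the server
    reboot

    # 3. From the NSS console (nsscon), put the pool into maintenance mode
    nss /PoolMaintenance=DATA1

    # 4. Rebuild the pool, then reactivate it
    ravsui rebuild DATA1
    nss /PoolActivate=DATA1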