GPF in GWPOA with the latest and greatest 24.3 build but very old Groupwise data (oldest user creation date 2002)

Hello there to the dying breed of Groupwise Admins!

The post office has been running on an 18.4.2 build since 2023.

Some GPFs happened in the past; most of the time we rebooted the VM and the post office kept running for months. But on Monday the 26th of August, the GWPOA of two of the four large post offices started to GPF continuously on startup of the Groupwise service.

After 3 days of nearly continuous downtime (many thanks to the sleepy support engineers of the Groupwise front line), a back line engineer from Rotterdam renamed the NGWDFR.DB, which tracks messages sent with the delay delivery send option, and one of the two large post offices has been stable since.

The other one, with 470 users, more than 1.2 million files and more than 2 terabytes of disk space used for the /grpwise/po data files, kept running into a GPF every few minutes.

I was able to stop the GPF (General Protection Fault) on my own, with a 24-hour TeamViewer dial-in from the hotel during my vacation: I ran a standalone GWCHECK with all options while the Groupwise service was stopped on the host, which took about 12 hours.

But I can reproduce the GPF on a test host after taking over the data with dbcopy: it occurs as soon as the GWCHECK contents check with fix problems starts to rebuild the NGWDFR.DB defer database.

My assumption is that there must be "dangerous messages" in the post office, with a delivery date still in the future, that corrupt the defer database as soon as the GWCHECK contents check with fix problems finds them and populates the defer database with the information from those weird messages.

We have a lot of users who are really fond of the delay delivery send option, and they complain about the C05D error they get when they try to send a message with delay delivery active while the NGWDFR.DB is not there.

Any idea how to come out of this s(h)ituation?

My only desperate last resort would be to create a new post office and move every user from the corrupted post office to the new one, but the effort is huge, since after each user move you have to recheck whether the error has travelled from the old post office to the new one.

so far - so good (or bad), Stefano

  • 0

    Hi all, here is the standalone GWCHECK log with ngwdfr.db renamed to ngwdfr.dba:

    STRUCTURAL VERIFICATION of system databases
    STRUCTURAL VERIFICATION of database ngwguard.db
    - Database is structurally consistent
    Reading Guardian Database store catalog info
    Processing Post Office = PO02, Store Catalog Path = /grpwise/po02prod
    STRUCTURAL VERIFICATION of database /grpwise/po02prod/ofmsg/ngwdfr.db
    - Attempting to correct structural problem in database
    Problem 39- Unknown file ngwdfr.dba - 77824 bytes, 09/30/24 10:15
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbb - 217088 bytes, 09/24/24 17:35
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbc - 73728 bytes, 09/24/24 10:38
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbd - 77824 bytes, 09/27/24 13:56
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbe - 217088 bytes, 09/24/24 17:35
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Problem 39- Unknown file ngwdfr.dbf - 221184 bytes, 09/30/24 00:41
    NOTE- the timestamp on this file is recent, and may reflect a
    temporary mismatch between the file system and the databases.
    - File is too recent- will not be deleted
    Error 0x8209 opening /grpwise/po02prod/ofmsg/ngwdfr.db
    - Beginning rebuild for database ngwdfr.db
    Error 26- DbRebuild error STORE_FILE_NOT_FOUND (0xC05D)
    - Store will be dropped from guardian catalog so it can be re-created
    *WARNING*: no records were recovered from database during
    rebuild process. Try to restore an earlier backup of the
    file, or else run CONTENTS check to repair system folders.
    Validating file references in database:
    Error 18- MESSAGE database open error INVALID_STORE_NUM (0xC067) on n
    Suggestion- Try physical check/rebuild of database
    PROCESSING COMPLETED- total processing time: 0:00:00

    *********************************************************************
    Uncorrectable conditions encountered:
    CODE  DESCRIPTION                                         COUNT
    ----  --------------------------------------------------  -----
    18    Message database open errors.......................     1
    26    Errors trying to do structural database rebuild....     1
    Correctable conditions encountered:
    CODE  DESCRIPTION                                         COUNT
    ----  --------------------------------------------------  -----
    39    Unrecognized or invalid files in mail directories..     6
    *********************************************************************
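
For longer GWCHECK logs, the numbered conditions can be tallied with a small script and compared against the summary block at the end. The following is only a minimal sketch: the helper name `tally_gwcheck_codes` is hypothetical, and the line pattern is inferred from the log lines pasted above, not from any GWCHECK documentation.

```python
import re
from collections import Counter

# Matches lines such as "Problem 39- Unknown file ..." or
# "Error 26- DbRebuild error ...". Lines like "Error 0x8209 opening ..."
# carry no decimal condition code followed by "-" and are not matched.
CODE_LINE = re.compile(r"^\s*(Problem|Error)\s+(\d+)-\s*(.*)$")


def tally_gwcheck_codes(log_text):
    """Count how often each numbered condition appears (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CODE_LINE.match(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# A few lines copied from the log above, used as sample input.
sample = """\
Problem 39- Unknown file ngwdfr.dba - 77824 bytes, 09/30/24 10:15
Problem 39- Unknown file ngwdfr.dbb - 217088 bytes, 09/24/24 17:35
Error 0x8209 opening /grpwise/po02prod/ofmsg/ngwdfr.db
Error 26- DbRebuild error STORE_FILE_NOT_FOUND (0xC05D)
Error 18- MESSAGE database open error INVALID_STORE_NUM (0xC067) on n
"""

if __name__ == "__main__":
    for code, count in sorted(tally_gwcheck_codes(sample).items()):
        print(code, count)
```

Run against the full log, the tally for code 39 should match the count of 6 in the summary; a mismatch would suggest the log was truncated when it was pasted.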

  • 0 in reply to 

    Hi there,

    just so I understand: if I now send a message with the delay delivery send option, a new ngwdfr.db will be created, correct?

    Many thanks in advance, Stefano

  • 0   in reply to 

    Which file system are your DB and GroupWise running on? What is under your VMs? iSCSI? TrueNAS? Rob and Diethmar have given valuable hints, but I also know such behavior from cases where hardware under the VMs is defective or something else is going on. With TrueNAS and related systems, there is, or recently was, a severe problem with the SCSI protocol under high load.

    “You can't teach a person anything, you can only help them to discover it within themselves.” Galileo Galilei

  • 0 in reply to   

    Hi there,

    We have a Dell VxRail HCI 8-node VMware vSphere environment with a vSAN stretched cluster across 2 data centers.

    We moved the data with a host migration just on Sunday the 22nd of September.

    The file system is XFS, with the mkfs.xfs settings recommended by Veeam.

    We have two GWDR servers in the two data centers with copies of the live post offices. I can reproduce the problem on different copies of the live data, so I am quite sure the corruption is in the Groupwise system and not in the underlying infrastructure.

    Open Text support spent days making false claims about the culprit, saying it was the SLES OS and the old 18.4.2 Groupwise patch level. I have reproduced the problem with GW 24.3 and the latest SLES 15 SP5 patch level; it was all only a diversionary tactic to avoid spending hours and hours on a Linux coredump file analysis.

  • 0 in reply to 

    New test with a dbcopy that copies only the last month:

    /opt/novell/groupwise/agents/bin/dbcopy -i 8-30-2024 -w /grpwise/po /grpwise/mnt/po02prod-30days
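
To repeat this kind of partial copy with a rolling cutoff, the date argument can be computed instead of typed by hand. This is a minimal sketch under stated assumptions: the month-day-year layout without zero padding is inferred only from the `8-30-2024` in the command above, the helper names are hypothetical, and only the `-i` and `-w` switches already shown in that command are used.

```python
from datetime import date, timedelta


def dbcopy_since_arg(today, days_back):
    """Format the cutoff date like the '8-30-2024' used in the command above.

    Assumption: month-day-year without zero padding, inferred from that
    single example and not checked against the dbcopy documentation.
    """
    cutoff = today - timedelta(days=days_back)
    return f"{cutoff.month}-{cutoff.day}-{cutoff.year}"


def build_dbcopy_command(source, target, today, days_back):
    """Assemble the argument list in the same order as the command above."""
    return [
        "/opt/novell/groupwise/agents/bin/dbcopy",
        "-i", dbcopy_since_arg(today, days_back),
        "-w", source, target,
    ]


if __name__ == "__main__":
    # The run date below is an illustrative value, not taken from the post.
    cmd = build_dbcopy_command(
        "/grpwise/po", "/grpwise/mnt/po02prod-30days", date(2024, 9, 29), 30
    )
    print(" ".join(cmd))
```

The list form can be passed to `subprocess.run` unchanged, which avoids shell quoting issues if a path ever contains spaces.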

    I ran a contents check with the GWPOA running, and in the middle of the GWCHECK I got this:

    Core was generated by `./gwpoa -noconfig -nomtp -home /grpwise/po02prod-30days -ip 10.17.65.249 -port'.
    Program terminated with signal SIGABRT, Aborted.
    #0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
    [Current thread is 1 (Thread 0x7f0b387a1700 (LWP 17026))]
    #backtrace
    #0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
    #1 0x00007f0b3d0c43e5 in abort () from ./lib64/libc.so.6
    #2 0x00007f0b3d108c87 in __libc_message () from ./lib64/libc.so.6
    #3 0x00007f0b3d110d2a in malloc_printerr () from ./lib64/libc.so.6
    #4 0x00007f0b3d11178c in malloc_consolidate () from ./lib64/libc.so.6
    #5 0x00007f0b3d113d90 in _int_malloc () from ./lib64/libc.so.6
    #6 0x00007f0b3d1157d8 in malloc () from ./lib64/libc.so.6
    #7 0x00007f0b3e4f397f in ?? ()
    #8 0x0000000100000008 in ?? ()
    #9 0x00000000000007b8 in ?? ()
    #10 0x00007f0b3e78b4ee in ?? ()
    #11 0xfffffffb000007b8 in ?? ()
    #12 0x0000000000000000 in ?? ()
    #

    So the corruption is not in some older messages; it is somewhere else.
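
One detail of the backtrace is worth spelling out: `abort` reached through `malloc_printerr` means glibc's allocator found its own heap bookkeeping inconsistent during a `malloc` call. The crashing frame is therefore where the damage was noticed, not necessarily where it was caused. A minimal sketch for spotting this pattern in a pasted backtrace follows; the helper names are hypothetical and the frame pattern is inferred from the gdb output above.

```python
import re

# Frame lines look like:
#   "#3 0x00007f0b3d110d2a in malloc_printerr () from ./lib64/libc.so.6"
FRAME = re.compile(r"^\s*#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s+\(")


def frames(backtrace_text):
    """Return (frame_number, function_name) pairs from gdb backtrace text."""
    result = []
    for line in backtrace_text.splitlines():
        match = FRAME.match(line)
        if match:
            result.append((int(match.group(1)), match.group(2)))
    return result


def looks_like_heap_corruption(backtrace_text):
    """True if the process aborted from inside glibc's malloc error path."""
    names = {name for _, name in frames(backtrace_text)}
    return "abort" in names and "malloc_printerr" in names


# Frames copied from the backtrace above, used as sample input.
sample = """\
#0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
#1 0x00007f0b3d0c43e5 in abort () from ./lib64/libc.so.6
#2 0x00007f0b3d108c87 in __libc_message () from ./lib64/libc.so.6
#3 0x00007f0b3d110d2a in malloc_printerr () from ./lib64/libc.so.6
#4 0x00007f0b3d11178c in malloc_consolidate () from ./lib64/libc.so.6
#7 0x00007f0b3e4f397f in ?? ()
"""

if __name__ == "__main__":
    print(looks_like_heap_corruption(sample))
```

The unresolved `??` frames are the agent's own code, so the symbols needed to name the actual caller would have to come from the vendor.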

  • 0   in reply to 

    If this happens with the new ngwdfr.db that you created, you saw it with recent 24.3 code running, and you made sure this TID was followed, then please provide the core file that the POA creates.

    You can provide this core via the case.

  • 0   in reply to   

    Good idea to consider the noelision part.


    Use "Verified Answers" if your problem/issue has been solved!

  • 0 in reply to   

    OK, this is what I will do:

    A) Repeat the dbcopy from the live PO to the test VM, so the test can run outside production.

    B) Rename ngwdfr.db.

    C) Run GWCHECK with user ngwdfr.db structural rebuild.

    D) Start GWPOA, log in as a user whose password I know, and recreate a new ngwdfr.db by sending a new message with the delay delivery send option.

    E) Stop GWPOA.

    F) Run GWCHECK for Content with Fix Problems.

    G) Start GWPOA, let it run into the GPF, and obtain the coredump scc<whatever>.txz archive.
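
For step B, a timestamped rename keeps successive test runs from overwriting each other's set-aside copies. This is only a minimal sketch: the helper name `set_aside` and the `.bak` naming scheme are assumptions, and it is meant for the test copy with the agent stopped.

```python
import tempfile
import time
from pathlib import Path


def set_aside(db_path, stamp=None):
    """Rename a database file to '<name>.<stamp>.bak' and return the new path.

    Hypothetical helper; the naming scheme is an assumption. It refuses to
    overwrite an existing set-aside copy.
    """
    db_path = Path(db_path)
    if not db_path.is_file():
        raise FileNotFoundError(db_path)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    target = db_path.with_name(f"{db_path.name}.{stamp}.bak")
    if target.exists():
        raise FileExistsError(target)
    db_path.rename(target)
    return target


if __name__ == "__main__":
    # Demonstration on a throwaway directory, not on a real post office.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "ngwdfr.db"
        demo.write_bytes(b"placeholder")
        print(set_aside(demo).name)
```

In the GWCHECK log earlier in this thread, renamed copies left in the ofmsg directory were reported as Problem 39 (unknown file), so the same is to be expected for files set aside this way.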

  • 0   in reply to 

    You might first request the latest GW 24.3 build in the case as well.

    You can use case 02969073, as this is still open.
