GPF in GWPOA with the latest and greatest 24.3 build but very old GroupWise data (oldest user creation date 2002)

Hello there to the dying breed of Groupwise Admins!

The post office has run since 2023 with an 18.4.2 build.

Some GPFs happened in the past; most of the time we rebooted the VM and the post office kept on running for months. But on Monday the 26th of August, the GWPOA of two of the four large post offices started to GPF continuously on startup of the GroupWise service.

After 3 days of nearly continuous downtime (many thanks to the sleepy support engineers of the GroupWise front line), a back line engineer from Rotterdam renamed the NGWDFR.DB, which tracks messages sent with the delay delivery send option, and one of the two large post offices became stable again.

The other one, with 470 users, more than 1.2 million files and more than 2 terabytes of disk space used for the /grpwise/po data files, kept on GPFing every few minutes.

I could solve the GPF (General Protection Fault) on my own, with a 24-hour TeamViewer session from the hotel during vacation, by running a standalone GWCHECK with all options while the GroupWise service was stopped on the host. It took about 12 hours.

But I can reproduce the GPF on a test host, taking over the data with dbcopy, as soon as the GWCHECK content check with fix problems starts to rebuild the NGWDFR.DB defer database.

My assumption is that there must be "dangerous messages" in the post office, with a delivery date still in the future, that corrupt the defer database as soon as the GWCHECK content check with fix problems finds them and populates the defer database with the information from those weird messages.

We have a lot of users who are really fond of the delay delivery send option and who complain about the C0D5 error when they try to send a message with delay delivery active while the NGWDFR.DB is not there.

Any idea how to come out of this s(h)ituation?

My only desperate last resort would be to create a new post office and move every user from the corrupted post office to the new one, but the effort is huge, since you have to keep rechecking to find out with which user move the error travels from one post office to the other.

so far - so good (or bad), Stefano

  • 0 in reply to 

    New test with a dbcopy covering only one month:

    /opt/novell/groupwise/agents/bin/dbcopy -i 8-30-2024 -w /grpwise/po /grpwise/mnt/po02prod-30days

    I ran a content check with the GWPOA running, and in the middle of the GWCHECK I got this:

    Core was generated by `./gwpoa -noconfig -nomtp -home /grpwise/po02prod-30days -ip 10.17.65.249 -port'.
    Program terminated with signal SIGABRT, Aborted.
    #0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
    [Current thread is 1 (Thread 0x7f0b387a1700 (LWP 17026))]
    #backtrace
    #0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
    #1 0x00007f0b3d0c43e5 in abort () from ./lib64/libc.so.6
    #2 0x00007f0b3d108c87 in __libc_message () from ./lib64/libc.so.6
    #3 0x00007f0b3d110d2a in malloc_printerr () from ./lib64/libc.so.6
    #4 0x00007f0b3d11178c in malloc_consolidate () from ./lib64/libc.so.6
    #5 0x00007f0b3d113d90 in _int_malloc () from ./lib64/libc.so.6
    #6 0x00007f0b3d1157d8 in malloc () from ./lib64/libc.so.6
    #7 0x00007f0b3e4f397f in ?? ()
    #8 0x0000000100000008 in ?? ()
    #9 0x00000000000007b8 in ?? ()
    #10 0x00007f0b3e78b4ee in ?? ()
    #11 0xfffffffb000007b8 in ?? ()
    #12 0x0000000000000000 in ?? ()
    #

    So the corruption is not in some older messages; it is somewhere else.
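A minimal sketch for reading the backtrace above: it splits the gdb frames into those with a resolved symbol and those shown as `??`. The helper names `parse_frames` and `split_frames` are made up for illustration; only the trace text is taken from the post. The call chain malloc → malloc_consolidate → malloc_printerr → abort is what glibc does when it notices inconsistent heap bookkeeping, so the damage was most likely done earlier than the allocation that aborted.

```python
import re

# One gdb frame line, e.g.
# "#4 0x00007f0b3d11178c in malloc_consolidate () from ./lib64/libc.so.6"
FRAME_RE = re.compile(
    r"^\s*#(?P<num>\d+)\s+(?P<addr>0x[0-9a-fA-F]+) in (?P<func>\S+) \(\)"
    r"(?: from (?P<lib>\S+))?"
)

# Trace text as posted (gdb crash summary followed by the backtrace).
TRACE = """\
#0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
[Current thread is 1 (Thread 0x7f0b387a1700 (LWP 17026))]
#backtrace
#0 0x00007f0b3d0c2d2b in raise () from ./lib64/libc.so.6
#1 0x00007f0b3d0c43e5 in abort () from ./lib64/libc.so.6
#2 0x00007f0b3d108c87 in __libc_message () from ./lib64/libc.so.6
#3 0x00007f0b3d110d2a in malloc_printerr () from ./lib64/libc.so.6
#4 0x00007f0b3d11178c in malloc_consolidate () from ./lib64/libc.so.6
#5 0x00007f0b3d113d90 in _int_malloc () from ./lib64/libc.so.6
#6 0x00007f0b3d1157d8 in malloc () from ./lib64/libc.so.6
#7 0x00007f0b3e4f397f in ?? ()
#8 0x0000000100000008 in ?? ()
#9 0x00000000000007b8 in ?? ()
#10 0x00007f0b3e78b4ee in ?? ()
#11 0xfffffffb000007b8 in ?? ()
#12 0x0000000000000000 in ?? ()
"""


def parse_frames(text):
    """Return one dict per distinct frame number, in order of appearance."""
    frames, seen = [], set()
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue  # prompt residue such as "#backtrace" or a bare "#"
        num = int(match.group("num"))
        if num in seen:
            continue  # frame #0 is printed twice: crash summary and backtrace
        seen.add(num)
        func = match.group("func")
        frames.append({
            "num": num,
            "addr": match.group("addr"),
            "func": None if func == "??" else func,  # "??" means no symbol
            "lib": match.group("lib"),
        })
    return frames


def split_frames(frames):
    """Split frames into (resolved, unresolved) by whether gdb found a symbol."""
    resolved = [f for f in frames if f["func"] is not None]
    unresolved = [f for f in frames if f["func"] is None]
    return resolved, unresolved


if __name__ == "__main__":
    resolved, unresolved = split_frames(parse_frames(TRACE))
    print(f"resolved frames:   {len(resolved)}")
    print(f"unresolved frames: {len(unresolved)}")
    if resolved:
        last = resolved[-1]
        print(f"outermost resolved frame: #{last['num']} {last['func']} ({last['lib']})")
```

For this trace it reports 7 resolved and 6 unresolved frames, with malloc as the outermost resolved one. Everything above malloc is `??`, so the trace alone does not say which part of the POA asked for the memory.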

  • 0   in reply to 

    Unfortunately we do not see where this activity has been aborted.

    Just a slight idea ... can you check if there are any files larger than 4 GB in your GW directory?
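One way to run that check, as a minimal sketch: walk the post office directory and list every regular file above a size threshold. The function name `find_large_files` is hypothetical, the default root `/grpwise/po` is taken from the first post, and "4G" is interpreted here as 4 GiB, which is an assumption.

```python
import os
import stat
import sys

FOUR_GIB = 4 * 1024**3  # assumption: "4G" is read as 4 GiB, in bytes


def find_large_files(root, min_bytes=FOUR_GIB):
    """Return (size, path) pairs for regular files under root that are
    larger than min_bytes, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_size > min_bytes:
                hits.append((info.st_size, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    # Usage: python3 find_large.py [post_office_dir]
    root = sys.argv[1] if len(sys.argv) > 1 else "/grpwise/po"
    for size, path in find_large_files(root):
        print(f"{size / 1024**3:8.2f} GiB  {path}")
```

With more than a million files the walk can take a while, so it is probably better run against the dbcopy target on the test host than against production.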


    Use "Verified Answers" if your problem/issue has been solved!

  • 0   in reply to 

    If this happens with the new ngwdfr.db that you created, you saw it with recent 24.3 code running, and you made sure this TID was followed, then please provide the core file that the POA creates.

    You can provide this core via the case.

  • 0   in reply to   

    ulimit problem?
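To follow up on the ulimit question, a small sketch that prints the soft and hard resource limits of the current process using Python's standard `resource` module. The selection of limits and the names `LIMITS`, `fmt` and `current_limits` are my own choice for illustration, not something taken from GroupWise documentation.

```python
import resource

# Limits that often matter for a long-running agent with many open files.
LIMITS = {
    "core file size (ulimit -c)": resource.RLIMIT_CORE,
    "open files (ulimit -n)": resource.RLIMIT_NOFILE,
    "max file size (ulimit -f)": resource.RLIMIT_FSIZE,
    "address space (ulimit -v)": resource.RLIMIT_AS,
    "stack size (ulimit -s)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a raw limit value; RLIM_INFINITY means no limit is set."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for this process. Values are raw
    getrlimit numbers (bytes or counts), not the block units that the
    shell's ulimit builtin prints for some limits."""
    return {
        label: tuple(fmt(v) for v in resource.getrlimit(res))
        for label, res in LIMITS.items()
    }


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label:30} soft={soft:>14} hard={hard:>14}")
```

This only shows what a process started from the same shell would inherit. For an agent that is already running, the limits actually in effect can be read from `/proc/<pid>/limits` on Linux. A core file size of 0 would also explain a missing core dump.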

    “You can't teach a person anything, you can only help them to discover it within themselves.” Galileo Galilei

  • 0   in reply to   

    Good idea to consider the noelision part.



  • 0 in reply to   

    OK, this is what I will do:

    A) Repeat the dbcopy action from the live PO to the test VM, to be able to run the test outside production.

    B) Rename ngwdfr.db.

    C) Run GWCHECK with user ngwdfr.db structural rebuild.

    D) Start the GWPOA, log in with a user whose password I know, and recreate a new ngwdfr.db by sending a new message with the delay delivery send option.

    E) Stop the GWPOA.

    F) Run GWCHECK for Content with Fix Problems.

    G) Start the GWPOA, let it run into the GPF, and obtain the core dump as an scc<whatever>.txz archive.
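For step B, a minimal sketch of a rename that keeps the old defer database next to the original under a timestamped name, so each test run leaves a distinct copy behind. The helper name `set_aside` and the `.bak` naming scheme are hypothetical, the location of ngwdfr.db is passed in rather than assumed, and the agent should be stopped before the file is touched.

```python
from datetime import datetime
from pathlib import Path


def set_aside(db_path, now=None):
    """Rename db_path to <name>.<YYYYmmdd-HHMMSS>.bak in the same directory
    and return the new path. Refuses to overwrite an existing backup."""
    src = Path(db_path)
    if not src.is_file():
        raise FileNotFoundError(src)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.{stamp}.bak")
    if dst.exists():
        raise FileExistsError(dst)
    src.rename(dst)  # same directory, so this is a plain rename, not a copy
    return dst
```

Called as `set_aside("<post office dir>/ngwdfr.db")`, it returns the new path, which makes it easy to note in the case which copy belongs to which test run.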

  • 0   in reply to 

    You might first request the latest GW 24.3 build in the case as well.

    You can use case 02969073 as this is still open .....