OES/SLES15 multipath boot issues after kernel update.

Hi.

I recently ran into an odd issue at a customer site. We had just done a rolling upgrade of a four-node cluster from OES2018 to OES 24.2, without major problems. Then the first round of patches after the installation arrived, and we hit the worst kind of issue: after patching, the server (a Dell) would not boot.

This is what happened:

During the original installation, we answered "yes" when asked whether we wanted to enable multipath. Without further configuration, this means the local boot device is also accessed through device-mapper, which usually isn't a problem. The servers were rebooted countless times after the initial installation and always came up fine.

That held until the first kernel update, which we have just applied. The initrd built for the new kernel is not configured to load multipath at boot time. Result: the server doesn't boot after the update, because it cannot find the root partition/boot device.

To add insult to injury: when OES is patched the regular way, the kernel patch actually rebuilds the initrd for *all* currently installed kernels. Yes, you read that right. Result: after patching, not even the previous kernel will boot, because dracut has overwritten its working initrd too. You can clearly see this in the dracut output at the end of a regular "zypper patch" run: dracut builds new initrds for *all* kernels. I consider this a *HUGE* bug. Under absolutely *no* circumstances should an update to a new kernel touch the previous one (and break it).
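To see which of the installed initrds actually contain dracut's multipath module, for example right after a "zypper patch" run and before rebooting, a check along these lines may help. This is only a sketch. It assumes SUSE-style /boot/initrd-<version> image names and that lsinitrd (shipped with dracut) is available. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for every initrd-<version> image in the boot directory,
# report whether "lsinitrd -m" lists the multipath dracut module inside it.
# Assumes SUSE-style image names; the directory can be passed for testing.
report_multipath_in_initrds() {
    boot_dir=${1:-/boot}
    for img in "$boot_dir"/initrd-*; do
        # Skip the literal glob pattern when no image matches.
        [ -e "$img" ] || continue
        # lsinitrd -m lists the dracut modules included in the image.
        if lsinitrd -m "$img" 2>/dev/null | grep -qE '^[[:space:]]*multipath[[:space:]]*$'; then
            echo "$img: multipath"
        else
            echo "$img: NO multipath"
        fi
    done
}
```

Running `report_multipath_in_initrds` without arguments checks /boot. Any image marked "NO multipath" would, by the reasoning above, be a candidate for the same boot failure.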


Fix: start the rescue system, chroot into the installed system on the boot disk (www.suse.com/.../), and have dracut rebuild the initrd *with* multipath support:

dracut -f --kver 5.3.18-150300.59.147-default --add multipath

(replace "5.3.18-150300.59.147-default" above with your most recently installed kernel version as visible in /boot)

Then exit the chroot and reboot.
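The rescue-system repair above can also be scripted without an interactive chroot shell. This is a sketch under several assumptions. The root file system device is passed in by hand (it may show up as a /dev/mapper/... or /dev/sdXN device, depending on the system). It also assumes /boot lives on that file system; a separate /boot partition or additional Btrfs subvolumes would have to be mounted as well. Finally, it uses GNU `sort -V` to find the newest kernel. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for the rescue-system repair.

# Print the newest kernel version that has a vmlinuz-<version> image in
# the given boot directory, using version-aware sorting (GNU sort -V).
latest_kver() {
    boot_dir=${1:-/boot}
    for f in "$boot_dir"/vmlinuz-*; do
        [ -e "$f" ] || continue
        printf '%s\n' "${f##*/vmlinuz-}"
    done | sort -V | tail -n 1
}

# Mount the installed system under /mnt, bind-mount /dev, /proc and /sys,
# then rebuild the newest kernel's initrd with the multipath module.
# ROOT_DEV is an assumption: pass the real root device of your system.
repair_initrd() {
    root_dev=$1
    [ -n "$root_dev" ] || { echo "usage: repair_initrd ROOT_DEV" >&2; return 1; }
    mount "$root_dev" /mnt || return 1
    for fs in dev proc sys; do
        mount --bind "/$fs" "/mnt/$fs" || return 1
    done
    kver=$(latest_kver /mnt/boot)
    [ -n "$kver" ] || { echo "no kernel found in /mnt/boot" >&2; return 1; }
    # Same command as above, run inside the installed system.
    chroot /mnt dracut -f --kver "$kver" --add multipath
}
```

Usage from the rescue shell would be something like `repair_initrd /dev/sdXN` (device name hypothetical), followed by unmounting and rebooting.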

What is still unknown is whether this fix will persist, or whether it will break again when the next kernel patch is released.

Attempts to "fix" this before patching the kernel, by adding multipath to the dracut configuration through several documented methods, have been unsuccessful so far. We tried:

www.suse.com/.../

And even though we *do* see "rd.driver.pre=dm_multipath" in the dracut output during the regular patch run, the server still won't boot and has to be fixed using the method above.
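For anyone who wants to try the configuration route themselves: dracut reads drop-in files from /etc/dracut.conf.d/ (file names must end in .conf). An `add_dracutmodules+=" multipath "` line there is the configuration-file counterpart of `--add multipath`. The elided link above doesn't show whether this is one of the methods that was tried. Since configuration-based attempts were unsuccessful here, the resulting images should be checked after patching. The helper name and the file name are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: write a dracut drop-in that requests the multipath
# module for every future initrd build (config counterpart of "--add multipath").
# The file name 10-multipath.conf is an assumption; any *.conf name is read.
write_multipath_dropin() {
    conf_dir=${1:-/etc/dracut.conf.d}
    mkdir -p "$conf_dir" || return 1
    # Leading and trailing spaces keep the module list space-separated
    # when several drop-ins append to add_dracutmodules.
    printf '%s\n' 'add_dracutmodules+=" multipath "' > "$conf_dir/10-multipath.conf"
}
```

After writing the drop-in, one could rebuild with `dracut -f --kver <version>` and inspect the image with `lsinitrd -m` before rebooting.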

We also tried running "dracut --force --add multipath" before patching. Again we saw multipath being added, but after patching the server still wouldn't boot and once more needed the manual procedure above.

So apparently the only approach that works (or rather, the only one I currently know works) is to run dracut manually with the "--add multipath" option, which produces an initrd with multipath support at boot time.

  •

    It's hard to say for sure, but in my experience it is the multipath configuration itself that leads to errors. Also, time and again there are issues with udev rules, or wrong (phantom) entries in /dev/disk, after patching.

    What does dmesg say about multipath during the early boot phase, and what is in multipath.conf? Is SAN boot or iSCSI boot configured?

    The multipath tools themselves are a good help for diagnostics. Have you tried deliberately deleting and recreating the maps with the multipath tools?

    Does the output of multipath -ll or multipath -t show any errors or mapping errors?
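The checks suggested above could look roughly like this. It is a sketch that assumes a running system with multipath-tools installed; the function names are hypothetical. `multipath -F` only flushes maps that are not in use, so on a system booted from a multipath device the map holding the root file system should stay in place.

```shell
#!/bin/sh
# Hypothetical helper wrapping the read-only multipath-tools checks.
multipath_diag() {
    # Show the current multipath topology and path states.
    multipath -ll
    # Show the configuration currently in use (in multipath.conf format).
    multipath -t
}

# Hypothetical helper: flush all unused maps, then recreate them verbosely.
multipath_recreate_maps() {
    multipath -F
    multipath -v2
}
```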

    Georg

    “You can't teach a person anything, you can only help them to discover it within themselves.” Galileo Galilei

  •

    Georg,
    It can't be udev or multipath.conf issues, as neither would be fixed simply by rebuilding the initrd with the --add multipath parameter. The server doesn't get far enough for dmesg or other debug options to be available.

    Also noteworthy: the dracut emergency shell doesn't work either. Manually adding rd.shell to the kernel line in GRUB has no effect. Manually editing the kernel line in GRUB to use /dev/sdX notation also fails. I don't know how or why, but it *seems* as if editing from the GRUB menu (ctrl-e) has no effect at all.

  •

    Does the Dell system use NVMe?
