[Bug 245430] ZFS boot failure following memory exhaustion

bugzilla-noreply at freebsd.org
Tue Apr 7 19:10:08 UTC 2020


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=245430

            Bug ID: 245430
           Summary: ZFS boot failure following memory exhaustion
           Product: Base System
           Version: 12.1-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs at FreeBSD.org
          Reporter: jwb at freebsd.org

The error message here is about the same as in PR 221077, but I'm opening a new PR
due to differences in configuration and some new information.

The exact error is as follows:

ZFS: i/o error - all block copies unavailable
ZFS: can't read object set for dataset u
ZFS: can't open root filesystem
gptzfsboot: failed to mount default pool zroot

This occurred following a memory exhaustion event that flooded my messages file
with entries like the following:

Apr  6 09:56:14 compute-001 kernel: swap_pager_getswapspace(11): failed

This was caused by some computational jobs exhausting available memory.
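
For reference, the flood of those events can be confirmed and counted with
something like the following (assuming the default /var/log/messages location;
rotated logs would have to be searched separately):

# grep -c swap_pager_getswapspace /var/log/messages   # count of failed swap allocations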

The system was still up and accepting logins, but it was no longer reading or
writing the ZFS pool.  All attempts to reboot resulted in the error messages above.

The system was installed recently with the default ZFS configuration provided by
bsdinstall on 4 disks, each exposed as a single-drive RAID-0 volume by a Dell PERC
H700 (which does not support JBOD).  I.e. ZFS is running a RAID-Z across mfid0
through mfid3.
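
For anyone comparing setups, the pool layout and the controller's view of the
volumes can be checked with something like this (mfiutil(8) is the management
tool for the mfi(4) driver behind the mfidN devices; exact vdev and partition
names may differ):

# zpool status zroot       # shows the raidz vdev built from the mfidN partitions
# mfiutil show volumes     # the single-drive RAID-0 volumes exported by the PERC H700
# mfiutil show drives      # physical drive state behind those volumes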

I updated an old forum thread on the issue and included my successful fix:

https://forums.freebsd.org/threads/10-1-doesnt-boot-anymore-from-zroot-after-applying-p25.54422/

Unlike that thread, this did not appear to be triggered by an upgrade.

The gist of it is that some of the datasets (filesystems) appear to have been
corrupted, and the out-of-swap errors seem likely to be the cause.

zpool scrub did not find any errors.  All drives are reported as online and
optimal by the RAID controller.
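
For anyone repeating that check, it amounts to something like the following
(pool name per the default bsdinstall layout):

# zpool scrub zroot
# zpool status -v zroot    # reports "scrub repaired 0B ... with 0 errors" when clean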

My fix was as follows:

Boot from a USB installer, drop into the live environment, and log in as root.

# mount -u -o rw /                  # Allow creating directories on the USB live image
# zpool import -R /mnt -fF zroot
# zfs mount zroot/ROOT/default      # Not mounted automatically; canmount defaults to noauto
# cd /mnt
# mv boot boot.orig
# cp -Rp /boot .                    # -p ensures permissions are correct in the new /boot
# cp boot.orig/loader.conf boot/    # Restore customizations
# cp boot.orig/zfs/zpool.cache boot/zfs/
# cd
# zfs get canmount
# zfs set canmount=on zroot/var/log # Plus a couple of other datasets that did not match the defaults
# zpool export zroot
# reboot

After a successful reboot, I ran "freebsd-update fetch install" and rebooted again
so that my /boot would be up to date.

Everything seems fine now.

I've made a backup of my /boot directory and plan to do so after every
freebsd-update, so I can hopefully recover quickly if this happens again.
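
A minimal sketch of that backup step, assuming a date-stamped tarball under /root
(the path and naming are just illustrative):

# tar -czf /root/boot-backup-$(date +%Y%m%d).tar.gz -C / boot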

I am seeing the same error on a workstation using 4 vanilla SATA ports, but I have
not had physical access to it due to COVID-19.  This is the first time I've seen
the error without an underlying hardware RAID controller.  I'll update this PR
when I can gather more information.
