Root volume renumbered unexpectedly, no longer boots

David Christensen dpchrist at holgerdanske.com
Sat Dec 14 07:09:06 UTC 2019


On 2019-12-13 12:57, Matthew Pounsett wrote:
> We have a large-ish FreeBSD 11.2-p7 file server with two 24-disk
> ZFS pools (20 live, 4 spares each) and a single SSD boot volume.
> Yesterday we pulled some dead drives from the ZFS pools and replaced 
> them with new drives intended to become spares.  After powering the 
> system back up, it looks like the boot volume has been renumbered 
> from da0 to da4.
> 
> I thought renumbering like this wasn't supposed to happen for at 
> least the last decade, since ATA_STATIC_ID was introduced to the 
> kernel, but there's little doubt that's what's happened.
> 
> Automatic boot now fails and drops to the third stage loader prompt 
> when the kernel tries to mount the root volume from ufs:/dev/da0p2. I
> can manually try to mount the root volume as ufs:/dev/da4p2, and the
> system begins to load the root volume, but then hangs.  The only two
> lines printed after loading da4 are related to loading up the ZFS 
> pools.  I can't reproduce the messages again now (explained below), 
> so quoting them verbatim isn't possible, but they're related to the 
> ZFS version being behind and suggesting I upgrade the pools.  The 
> messages themselves are not unusual, and I'm used to seeing similar
> messages in the 'zpool status' output for a while now.  What is 
> unusual is that the system seems to hang at this point.  I'm 
> concerned that the re-ordering of drives might be causing problems 
> for the system trying to put the ZFS pools back together. I don't 
> really know, though.  Does anyone have any insight into what's going 
> on here?
> 
> There is a new wrinkle... since booting from a USB stick so that I 
> could get into the box and double-check some things, and confirm the 
> location of the root volume, the BIOS no longer seems to see da4 as
> a potential boot volume.  I'm hoping that goes back to the way it
> was once the USB stick is removed.  At the moment I have no way to
> even get the box to try/fail to boot from its normal boot volume.
> The machine is many thousands of miles remote, so I haven't tried to
> do this yet... I can invoke some remote help once that's necessary.
> 
> BIOS issue aside, I'm hoping there's a way I can pin this drive back 
> to da0.  I don't know how that could be done, but if anyone has any 
> suggestions I'd happily try them.  Failing that, I suppose I can just
> insert a vfs.root.mountfrom option in loader.conf.
> 
> Can anyone clue me into what's happening here, or suggest some 
> further troubleshooting that will help me gain some insight?
> 
> Thanks!

On 2019-12-13 13:57, Matthew Pounsett wrote:
> The SSD is on the mainboard controller.  I have no idea what the SATA
> controller's original mode was, but just now it was set to IDE. I
> tried switching it to AHCI, but that didn't improve anything, and 
> generated a new "AHCI BIOS not installed" error during boot, so I
> switched it back to IDE.  Either of the RAID settings seems like a bad
> choice, since I want direct access to the physical drives for ZFS.

On 2019-12-13 15:26, Matthew Pounsett wrote:
> 48 disks, actually. :)   Half of those are on an external JBOD 
> connected via an LSI FC controller.  This server is significantly 
> older than my association with it, so I'm uncertain about how the 
> internal 24 drives are connected.  If it helps, dmesg only reports 
> ses0 and ses1 drivers.  The boot disk and the first 24 ZFS drives
> are all on ses0.  I think that implies only two controllers in use,
> not three.


One of my favorite Debian tricks is to keep my system drive images
smaller than 16.0E+09 bytes, partition my system drives with MBR, and
use an unencrypted boot partition, an encrypted swap partition, and an
encrypted root partition.  I can then dd the raw system drive image
between HDD's, SSD's, and USB drives, select the boot drive in the BIOS,
power up, and the system boots from the new device.  My guess is that
this works because I use UUID's in /etc/crypttab and /etc/fstab, GRUB
reads that information when it is configured/reconfigured, and GRUB puts
that information into /boot/initrd.img-* for use by GRUB during
subsequent boots (?).
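
For illustration, here is roughly what such UUID-based entries look
like (the UUID values and mapper names below are made up):

  # /etc/fstab -- UUIDs are placeholders
  UUID=1111aaaa-2222-3333-4444-555566667777  /boot  ext2  defaults           0  2
  /dev/mapper/sda3_crypt                     /      ext4  errors=remount-ro  0  1
  /dev/mapper/sda2_crypt                     none   swap  sw                 0  0

  # /etc/crypttab -- UUIDs are placeholders
  sda3_crypt  UUID=8888aaaa-9999-0000-1111-222233334444  none  luks
  sda2_crypt  UUID=5555bbbb-6666-7777-8888-999900001111  none  luks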


When I tried the above with FreeBSD, swap and root were not found if
the system drive's device node changed after dd'ing the image to
another device.  Apparently, FreeBSD uses device node names in
/boot/loader.conf.  The work-around is to boot an installer or live
disk/memstick to a shell, mount the system drive's boot filesystem, and
edit /boot/loader.conf, changing the old device node names to the new
device node names.  As I use ZFS, I also found it was necessary to move
aside /boot/zfs/zpool.cache.
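
In rough outline, that work-around looks something like the following
from the installer/live shell (the device names are only examples;
substitute whatever the live system actually assigns):

  # mount the installed system's boot/root filesystem
  mount /dev/da4p2 /mnt
  # change the old device node names to the new ones
  vi /mnt/boot/loader.conf
  # move the stale pool cache aside
  mv /mnt/boot/zfs/zpool.cache /mnt/boot/zfs/zpool.cache.bak
  umount /mnt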


Perhaps if you edited /boot/loader.conf on your system drive, changed
the 'da0' entries to 'da4', moved aside /boot/zfs/zpool.cache, and
configured your BIOS to boot from da4, the system would boot.
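
That is also where the vfs.root.mountfrom option you mentioned would
go; something like this, assuming the root filesystem really is on
da4p2:

  # /boot/loader.conf
  vfs.root.mountfrom="ufs:/dev/da4p2"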


Alternatively, put an HBA into the machine, connect the system drive to
it, and hope the drive comes up as da0.  (AIUI FreeBSD will scan all the
other drives, look for ZFS signatures/metadata, and assemble your
pool(s) regardless of device node names.)
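
Once the system is up, the usual commands should confirm whether the
pools were reassembled correctly:

  zpool status    # pools that are already imported
  zpool import    # scans for pools that were found but not yet imported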


David

