Can't mount root from raidz2 after r255763 in stable/9

Artem Belevich art at freebsd.org
Sun Nov 3 16:57:11 UTC 2013


TL;DR version -- Solved.

The failure was caused by zombie ZFS volume labels left over from the
disks' previous life in another pool. For some reason the kernel now
picks up labels from the raw devices first and tries to boot from a pool
that no longer exists. Nuking the old labels with dd solved my booting
issues.

On Sun, Nov 3, 2013 at 1:02 AM, Andriy Gapon <avg at freebsd.org> wrote:
> on 03/11/2013 05:22 Artem Belevich said the following:
>> Hi,
>>
>> I have a box with root mounted from 8-disk raidz2 ZFS volume.
>> After a recent buildworld I ran into an issue where the kernel fails to
>> mount root with error 6.
>> r255763 on stable/9 is the first revision that fails to mount root on
>> my box. The preceding r255749 boots fine.
>>
>> Commit r255763 (http://svnweb.freebsd.org/base?view=revision&revision=255763)
>> MFCs a bunch of changes from 10, but I don't see anything that obviously
>> impacts ZFS.
>
> Indeed.
>
>> Attempting to boot with vfs.zfs.debug=1 shows that the order in which GEOM
>> providers are probed by ZFS has apparently changed. Kernels that boot
>> show "guid match for provider /dev/gpt/<valid pool slice>", while
>> failing kernels show "guid match for provider /dev/daX" -- the raw
>> disks, which are *not* the right GEOM providers for my pool slices. Beats
>> me why ZFS picks raw disks over the GPT partitions it should be using.
>
> Perhaps the kernel gpart code fails to recognize the partitions and thus ZFS
> can't see them?
>
>> Pool configuration:
>> #zpool status z0
>>   pool: z0
>>  state: ONLINE
>>   scan: scrub repaired 0 in 8h57m with 0 errors on Sat Oct 19 20:23:52 2013
>> config:
>>
>>         NAME                 STATE     READ WRITE CKSUM
>>         z0                   ONLINE       0     0     0
>>           raidz2-0           ONLINE       0     0     0
>>             gpt/da0p4-z0     ONLINE       0     0     0
>>             gpt/da1p4-z0     ONLINE       0     0     0
>>             gpt/da2p4-z0     ONLINE       0     0     0
>>             gpt/da3p4-z0     ONLINE       0     0     0
>>             gpt/da4p4-z0     ONLINE       0     0     0
>>             gpt/da5p4-z0     ONLINE       0     0     0
>>             gpt/da6p4-z0     ONLINE       0     0     0
>>             gpt/da7p4-z0     ONLINE       0     0     0
>>         logs
>>           mirror-1           ONLINE       0     0     0
>>             gpt/ssd-zil-z0   ONLINE       0     0     0
>>             gpt/ssd1-zil-z0  ONLINE       0     0     0
>>         cache
>>           gpt/ssd1-l2arc-z0  ONLINE       0     0     0
>>
>> errors: No known data errors
>>
>> Here are screen captures from a failed boot:
>> https://plus.google.com/photos/+ArtemBelevich/albums/5941857781891332785
>
> I don't have permission to view this album.

Argh. Copy-paste error. Try these:
https://plus.google.com/photos/101142993171487001774/albums/5941857781891332785?authkey=CPm-4YnarsXhKg
https://plus.google.com/photos/+ArtemBelevich/albums/5941857781891332785?authkey=CPm-4YnarsXhKg

>
>> And here's boot log from successful boot on the same system:
>> http://pastebin.com/XCwebsh7
>>
>> Removing ZIL and L2ARC makes no difference -- r255763 still fails to mount root.
>>
>> I'm thoroughly baffled. Is there something wrong with the pool --
>> some junk metadata somewhere on the disk that now screws with mounting
>> root? A changed order in GEOM provider enumeration? Something else?
>> Any suggestions on what I can do to debug this further?
>
> gpart.

Long version of the story: It was stale metadata after all.

'zdb -l /dev/daN' showed that one of the four pool labels was still
present on every drive in the pool.
Long ago the drives were temporarily used as raw devices in a ZFS pool
on a test box. Then I destroyed that pool, sliced the drives into GPT
partitions, and used one of the partitions to build the current pool.
Apparently not all of the old pool's labels were overwritten by the new
pool, but that went unnoticed until now because the new pool happened to
be detected first. Now the detection order has changed (I'm still not
sure how or why), which resurrected the old, non-existent pool and
caused the boot failures.
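
For anyone who wants to check their own disks, the test is roughly this
(device names are the ones from my box -- adjust to taste):

  # the partition the pool actually lives on -- should show all four labels
  zdb -l /dev/gpt/da0p4-z0

  # the raw disk underneath should show none; on my drives it still
  # turned up one stale label from the long-gone whole-disk pool
  zdb -l /dev/da0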

After finding the location of the stale labels on the disk and nuking
them with dd, the boot issues went away. The scary part was that the
label sat *inside* the current pool's slice, so I had to overwrite what
is nominally current pool data. I figured that since the old label had
survived intact, the current pool hadn't written anything there yet, and
it should therefore be safe to overwrite it. I did it on one drive
first; had I been wrong, ZFS should still have been able to rebuild the
data (it's a raidz2, after all). Luckily no vital data was hurt in the
process, and a ZFS scrub reported zero errors. Nuking the old labels on
the remaining drives fixed booting for good.
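
For the record, the nuking went roughly like this. Treat it as a sketch
rather than a recipe: as far as I understand the on-disk format, the two
"end" labels of the old whole-disk pool (256 KiB each) sit in the last
512 KiB of the raw device after rounding its size down to a 256 KiB
boundary, and I'm assuming, for illustration, that the survivor is label
2. Confirm the label index and offsets with zdb -l and diskinfo before
writing anything, and do one disk at a time so raidz2 can save you if
you get it wrong:

  # raw device size in bytes (third field of diskinfo output)
  sz=$(diskinfo /dev/da0 | awk '{print $3}')
  lblsz=262144                        # one ZFS label is 256 KiB
  end=$(( sz / lblsz * lblsz ))       # device size aligned down to 256 KiB
  # offset of old label 2; label 3 would be at $((end - lblsz)), but the
  # very last sectors also hold the backup GPT, so stay clear of those
  off=$(( end - 2 * lblsz ))
  # GEOM normally refuses writes to an in-use disk; flag 16 ("allow foot
  # shooting") temporarily disables that protection
  sysctl kern.geom.debugflags=16
  dd if=/dev/zero of=/dev/da0 bs=$lblsz oseek=$(( off / lblsz )) count=1
  sysctl kern.geom.debugflags=0
  # make sure nothing the current pool cared about was underneath
  zpool scrub z0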

Even though my problem has been dealt with, I still wonder whether pool
detection should be more robust. I was lucky that it was the kernel's
pool detection that changed and not the boot loader's -- that would have
made troubleshooting even more interesting.

Would it make sense to prefer partitions over whole drives?
Or perhaps to prefer pools with all of their labels intact over devices
that only have a small fraction of valid labels?

--Artem

