[bug] fsck refuses to repair damaged UFS using backup superblock

Rick Macklem rmacklem at uoguelph.ca
Sun Nov 25 15:25:25 UTC 2018


Kirk McKusick wrote:
>> To: soralx at cydem.org
>> Subject: Re: [bug] fsck refuses to repair damaged UFS using backup superblock
>> From: "Julian H. Stacey" <jhs at berklix.com>
>> Organization: http://berklix.eu BSD Unix Linux Consultants, Munich Germany
>> Date: Fri, 23 Nov 2018 02:17:20 +0100
>>
>> Hi soralx at cydem.org,
>> Added cc: <freebsd-fs at freebsd.org> to ensure file system specialists see this.
>>
>> Reference:
>>> From:                <soralx at cydem.org>
>>> Date:                Tue, 20 Nov 2018 05:30:00 -0800
>>
>> soralx at cydem.org wrote:
>>>
>>> Howdy!
>>>
>>>  Since send-pr(1) is now gone, I guess the next option is to send a
>>>  message directly to the developers...
>>>
>>>  Yesterday, I ran into a bug in fsck_ffs that gave me a little scare.
>>>
>>>  Short story: on -CURRENT, fsck refuses to check a FS with a corrupted
>>>  superblock, even when an alternate (backup) SB location is given.
>>>
>>>  Long story. I've been testing a newly-built system based on an X399
>>>  platform with a 2950X CPU and an Optane 905P 480GB U.2 drive. The
>>>  system ran a ~2-day old -CURRENT; when compiling newest world and
>>>  kernel, I found the machine in a locked-up state. After a hard reset,
>>>  boot failed because the root FS became corrupted & was not available:
>>>    kernel: Superblock check-hash failed: recorded check-hash XXX != computed check-hash YYY
>>>
>>>  I have not yet figured out why the corruption happened... bad hardware?
>>>  bug in the NVMe driver?
>>>
All I did was boot a pre-r339671 kernel that mounted the file systems and then, bingo...

>>>  "OK", I thought, "No worries. We'll just boot using another disk, fsck
>>>  the corrupted FS with a backup superblock, and be up in a moment".
>>>  The machine was doing nothing but compiling, so no valuable data loss.
>>>
>>>  So I did `dumpfs -m /dev/ada0p3` on the spare disk (which was the
>>>  source for the new disk image, thus had almost identical partitions
>>>  and filesystems) to get the FS details, then did `newfs -N [...]
>>>  /dev/ada0p3` to find locations of superblock backups, then finally
>>>  ran `fsck_ffs -b 192 /dev/nvd0p3` -- only to get the same
>>>  "check-hash failed" message, plus another strange message: "Can't open
>>>  /dev/nvd0p3: [...]". Then fsck quits.
>>>  Note that `fsck_ffs -b ...` on a FS with good superblock works OK.
>>>
>>>  After fiddling with a debugger for a bit, I commented out the line
>>>  "return (0);" in /usr/src/sbin/fsck_ffs/setup.c:136, recompiled fsck,
>>>  and the FS was recovered successfully.
>>>
>>>  What was actually happening: fsck's setup.c calls ufs_disk_fillout()
>>>  from libufs' type.c, which in turn calls sbread() from the same
>>>  library, which then calls sbget(disk->d_fd, &fs, -1) [[where '-1'
>>>  is hard-coded to indicate the primary superblock]] that then simply
>>>  invokes ffs_sbget from ffs kernel driver -- and this returns ENOENT,
>>>  which eventually causes fsck to give up before even looking at the
>>>  specified backup superblock.
>>>
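
The call chain described above can be illustrated with a small toy model. This is only a sketch in Python, not the real libufs or fsck_ffs code: the function names are borrowed from the report, while the Disk class, the "ok"/"bad-hash" strings, and the two setup_* helpers are hypothetical and exist purely to show why the backup superblock at 192 is never examined when the primary read fails first.

```python
import errno

# Toy model only: names mirror the functions mentioned in the report,
# but none of this is the real libufs or fsck_ffs code.

PRIMARY = -1  # stand-in for the hard-coded "primary superblock" argument


class Disk:
    """Hypothetical disk: maps a superblock location to 'ok' or 'bad-hash'."""

    def __init__(self, superblocks):
        self.superblocks = superblocks


def sbget(disk, location):
    """Return 0 if the superblock at `location` is usable, else ENOENT."""
    if disk.superblocks.get(location) == "ok":
        return 0
    return errno.ENOENT


def ufs_disk_fillout(disk):
    """Models the reported behaviour: only the primary superblock is read."""
    return sbget(disk, PRIMARY)


def setup_as_reported(disk, backup_location):
    """Gives up as soon as the primary read fails; the backup is never tried."""
    if ufs_disk_fillout(disk) != 0:
        return False  # models the early exit the reporter commented out
    return sbget(disk, backup_location) == 0


def setup_with_workaround(disk, backup_location):
    """Models the workaround: ignore the primary failure, try the backup."""
    ufs_disk_fillout(disk)  # result deliberately ignored
    return sbget(disk, backup_location) == 0


damaged = Disk({PRIMARY: "bad-hash", 192: "ok"})
print(setup_as_reported(damaged, 192))      # False
print(setup_with_workaround(damaged, 192))  # True
```

In this model the only difference between the two paths is whether a failed primary read is allowed to end setup before the location given with -b is consulted.
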
>>>  I don't know what exactly ufs_disk_fillout() does, but fortunately
>>>  for me, fsck worked without the "sbread(disk)" part of that function
>>>  having much luck on a disk with corrupted superblock. Also, I have a
>>>  feeling that calling a kernel's ffs driver function when using fsck
>>>  to fix a broken filesystem is not the best thing to do...
>>>
>>>  Please CC, as I am not subscribed.
>>>
>>> --
>>> [SorAlx]  ridin' VN2000 Classic LT
>>
>> Cheers,
>> Julian
>
>Below is a proposed fix for fsck_ffs to properly handle superblock
>check-hash failures (notably to optionally search for a usable
>alternate superblock). Let me know if you still have a filesystem
>on which you can test it, and if so whether it works correctly.

As above, I think you can reproduce this by running an older kernel that
mounts the file system. I ended up re-installing when I ran into this yesterday
(no biggy, it was just a test machine). It happened after I had been running
a kernel built from stable/12 on the system and then tried to boot it.
(Since the root fs got these errors, I couldn't boot any kernel on the root fs.)

It would be nice if there was a way to override the check and boot the system.
(Is a loader tunable reasonable for this?)

rick


