UFS not handling errors correctly

Johannes Totz jo_t at gmx.net
Sun Sep 9 10:07:35 PDT 2007


Hi!

It seems that UFS does not handle disk/write errors properly: they cause
silent corruption, which in turn causes a panic later during snapshot creation.

> #uname -a
> FreeBSD alfred 6.2-STABLE FreeBSD 6.2-STABLE #0: Thu Jul 12 20:40:55 CEST 2007     root at alfred:/usr/obj/usr/src/sys/ALFRED  i386

One day an I/O error occurred on one of my disks (the kernel logged it as failed reads):

> Aug 22 05:24:39 alfred kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=469004995
> Aug 22 05:24:40 alfred kernel: ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=469004995
> Aug 22 05:24:40 alfred kernel: g_vfs_done():ufs/home[READ(offset=240130525184, length=2048)]error = 5
> Aug 22 05:25:08 alfred kernel: ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=490974155
> Aug 22 05:25:08 alfred kernel: ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=490974155
> Aug 22 05:25:08 alfred kernel: g_vfs_done():ufs/home[READ(offset=251378735104, length=2048)]error = 5
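
As a cross-check, the disk-level LBA and the filesystem-level byte offset in
the log above can be compared. The following is only a rough sketch: it
assumes 512-byte sectors, and the function name correlate() and the embedded
LOG string (copied from the lines above, timestamps removed) are just for
illustration. If the assumption holds, both failures show the same constant
difference, which would be consistent with ufs/home starting a fixed number
of sectors into ad0. That reading of the layout is a guess, not something
the log states.

```python
import re

SECTOR_SIZE = 512  # assumption: 512-byte sectors on ad0

# Log lines copied from the report above (timestamps and hostname removed).
LOG = """\
ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=469004995
g_vfs_done():ufs/home[READ(offset=240130525184, length=2048)]error = 5
ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=490974155
g_vfs_done():ufs/home[READ(offset=251378735104, length=2048)]error = 5
"""


def correlate(log):
    """Pair each failed disk LBA with the following filesystem offset.

    Returns a list of (lba, byte_offset, lba - byte_offset // SECTOR_SIZE).
    """
    lbas = [int(m) for m in re.findall(r"FAILURE .* LBA=(\d+)", log)]
    offsets = [int(m) for m in re.findall(r"g_vfs_done\(\).*offset=(\d+)", log)]
    return [(lba, off, lba - off // SECTOR_SIZE) for lba, off in zip(lbas, offsets)]


if __name__ == "__main__":
    for lba, off, delta in correlate(LOG):
        print(f"disk LBA {lba}  fs offset {off}  difference {delta} sectors")
```

A constant difference across both entries suggests that each g_vfs_done()
error belongs to the ad0 failure logged just before it, i.e. the filesystem
did see the failed reads (error = 5 is EIO).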

This had never happened before and has not happened again so far. A long
test with smartctl reports "all fine". So let's attribute that to a
cosmic ray (or neutrino, pick your favorite) hitting the controller.

The system continued to run fine afterwards.
But the next morning, during automatic snapshot creation, it panicked with:

> Aug 23 06:38:14 alfred kernel: ffs_snapshot_mount: old format snapshot inode 8
> Aug 23 06:38:14 alfred savecore: reboot after panic: snapacct_ufs2: bad block

So of course it restarted, tried to do a background fsck, and failed
again... and again... and again...

> Aug 23 07:08:15 alfred kernel: ffs_snapshot_mount: old format snapshot inode 4
> Aug 23 07:08:15 alfred savecore: reboot after panic: snapacct_ufs2: bad block

The reported inode varies, but "bad block" is always the same.
This went on about ten times until I had a chance to interrupt it (i.e.
woke from slumber) and boot into single-user mode.
Multiple runs of fsck fixed the problem. I deleted all old snapshot files
and the system is fine now, with no further problems. Maybe some files got
lost; I can't tell, there are a few million on it.

Also see:
http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/114676

Unfortunately I don't have time to dig into this, but I wanted to report
it. Maybe someone has already fixed it...



Cheers,
Johannes



More information about the freebsd-fs mailing list