kern/106030: panic while rebooting with a dead disk

Wed Nov 29 14:51:01 PST 2006

The following reply was made to PR kern/106030; it has been noted by GNATS.

From: Robert Watson <rwatson at FreeBSD.org>
To: mjacob at freebsd.org
Cc: Remko Lodder <remko at freebsd.org>, bug-followup at FreeBSD.org
Subject: Re: kern/106030: panic while rebooting with a dead disk
Date: Wed, 29 Nov 2006 22:45:53 +0000 (GMT)

 On Wed, 29 Nov 2006, mjacob at freebsd.org wrote:

 > On Wed, 29 Nov 2006, Remko Lodder wrote:
 >> > I had a mounted ufs disk that went away. I rebooted so as to avoid a
 >> > panic. Too bad. Geom panicked on me anyway:
 >> >
 >> > Syncing disks, vnodes remaining...2 (da8:isp1:0:6:2): Invalidating pack
 >> > g_vfs_done():da8a[WRITE(offset=81920, length=4096)]error = 6
 >> 
 >> Well, it wants to synchronise the data in the caches to the disk and cannot 
 >> find it. I think a panic is the best thing to do to prevent any weird 
 >> things happening.  What else
 >
 > A panic should be the last resort. If I/O is returned indicating the device 
 > has gone, a binval on all cached data and a forced close of the file table 
 > entry and notification of all user processes is the reasonable thing to do. 
 > Most real Unixes that were hardened from the original v7 product learned to 
 > do this. FreeBSD hasn't.

 This is a panic on shutdown in the file system.  All user processes have 
 exited, and UFS is unable to sync cached data to disk, so there is no way to 
 report the error to a user process.

 >> should be done when the disk it once had mounted goes away? You already 
 >> have different problems when that happens.
 >
 > As I've repeatedly said, mostly to deaf ears in FreeBSD, a device error 
 > should never be the cause for panic *unless* there is absolutely no way to 
 > notify user processes of the error *and* data corruption may have silently 
 > occurred. Inconvenience to an existing design is not really a good argument.

 The context of your panic note appears to be system shutdown, during the 
 final syncing of vnode data before unmount -- is this not the case?

 > A read error to a device that has disappeared shouldn't cause a panic, even 
 > with a filesystem mounted. A write error to same shouldn't cause a panic - 
 > the error propagates back up the stack to the actual I/O invocation. If it 
 > was writebehind or dirty paging activity that can no longer be associated 
 > with any thread, then a panic is a policy decision that only the invoker of 
 > the I/O can make. Not the device driver. Not the volume manager (which is 
 > what GEOM is).

 There are certainly situations where FreeBSD panics rather than tolerating 
 invalid file system data, but I believe those problems are entirely at the 
 file system layer.  There is a kernel printf from GEOM, but the panic occurs 
 in the buffer cache code, presumably when UFS discovers life sucks more than 
 it thought.  I'd like to see UFS grow more tolerant of this sort of thing, and 
 simply lose the data rather than panicking.
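
As a rough illustration of the "lose the data rather than panic" policy
discussed above, here is a toy user-space model in Python. It is not FreeBSD
kernel code, and the names GoneDevice, MemDevice, and sync_dirty_buffers are
hypothetical. The only detail taken from the report is error 6, which is ENXIO
on FreeBSD. The flush routine leaves the decision to its caller and, by
default, drops the buffers it cannot write.

```python
import errno


class GoneDevice:
    """Hypothetical stand-in for a disk that has disappeared."""

    def write(self, offset, data):
        # Every write fails the way the report shows: error = 6 (ENXIO).
        raise OSError(errno.ENXIO, "Device not configured")


class MemDevice:
    """Hypothetical healthy device that keeps blocks in a dict."""

    def __init__(self):
        self.blocks = {}

    def write(self, offset, data):
        self.blocks[offset] = data


def sync_dirty_buffers(device, dirty, on_error="discard"):
    """Flush dirty buffers (offset -> bytes) to device.

    on_error is chosen by the caller, i.e. the invoker of the I/O:
      "discard": drop buffers whose device is gone and report their offsets
      "panic":   give up on the first such failure
    Returns the sorted list of offsets whose data was lost.
    """
    lost = []
    for offset, data in sorted(dirty.items()):
        try:
            device.write(offset, data)
        except OSError as exc:
            if exc.errno != errno.ENXIO:
                raise  # only "device gone" is handled in this sketch
            if on_error == "panic":
                raise RuntimeError(
                    f"panic: write at offset {offset} failed"
                ) from exc
            lost.append(offset)
    # Written or lost, nothing stays dirty, so shutdown can proceed.
    dirty.clear()
    return lost
```

Calling sync_dirty_buffers(GoneDevice(), {81920: bytes(4096)}) returns the
offsets that were lost instead of raising, which is the behaviour the
paragraph above asks of UFS. Passing on_error="panic" models the stricter
policy.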

 That said, I think the more pressing issue is actually with FAT, since 
 reliable server configurations frequently run UFS over RAID, but most FAT 
 devices are not only unreliable but also removable, and we currently fail 
 to tolerate removal at all while the FAT file system is mounted.  A practice run 
 on tolerating device removal for FAT would probably prepare us to address the 
 UFS issues more competently, as well as shake out issues in VM, etc., that 
 might arise.  For example, I believe we currently fail rather poorly when 
 paging in data from a failing swap device.  Certainly there's no good way to 
 get out of the situation, but I think we perform one of the less good bad 
 ways.

 Robert N M Watson
 Computer Laboratory
 University of Cambridge