kern/106030: panic while rebooting with a dead disk

Wed Nov 29 15:30:22 PST 2006

The following reply was made to PR kern/106030; it has been noted by GNATS.

From: mjacob at freebsd.org
To: Robert Watson <rwatson at freebsd.org>
Cc: bug-followup at freebsd.org
Subject: Re: kern/106030: panic while rebooting with a dead disk
Date: Wed, 29 Nov 2006 15:08:54 -0800 (PST)

 > This is a panic on shutdown in the file system.  All user processes have 
 > exited, and UFS is unable to sync cached data to disk, so there is no way to 
 > report the error to a user process.

 Yes- but it is also true that this would happen at a time other than 
 reboot. In fact, I rebooted rather than try to run with a dead disk 
 mounted and much to my annoyance I *still* couldn't avoid a panic. My 
 only other choice would have been to do a 'reboot -n'. Bad in either 
 case.

 >
 > There are certainly situations where FreeBSD panics rather than tolerating 
 > invalid file system data, but I believe those problems are entirely at the 
 > file system layer.  There is a kernel printf from GEOM, but the panic occurs 
 > in the buffer cache code, presumably when UFS discovers life sucks more than 
 > it thought.  I'd like to see UFS grow more tolerant of this sort of thing, 
 > and simply lose the data rather than panicking.

 Yes.

 > That said, I think the more pressing issue is actually with FAT, since 
 > reliable server configurations frequently run UFS over RAID, but most FAT 
 > devices are not only not reliable, but also removable, which we currently 
 > fail to tolerate at all when the FAT file system is mounted.  A practice run 
 > on tolerating device removal for FAT would probably prepare us to address the 
 > UFS issues more competently, as well as shake out issues in VM, etc, that 
 > might arise.  For example, I believe we currently fail rather poorly when 
 > paging in data from a failing swap device.  Certainly there's no good way to 
 > get out of the situation, but I think we perform one of the less good bad 
 > ways.

 Uhh- this conversation just took a rather bizarre twist. It's not just a 
 question of making UFS more fault tolerant- UFS is sort of a dead horse 
 by now and RAID may not help when it's a channel failure (e.g., fibre 
 channel or iSCSI). I'd rather see efforts put into ZFS (and fixing the 
 XFS port to actually work)- but that is beside the point. It's more a 
 matter of making sure that we don't panic when we don't have to. Right 
 now we panic far too often.

 But these are very good points- thanks for the review of my somewhat 
 botched bug report.