ZFS panic under extreme circumstances (2/3 disks corrupted)

Thomas Backman serenity at exscape.org
Sun May 24 19:34:00 UTC 2009

On May 24, 2009, at 09:02 PM, Thomas Backman wrote:

> So, I was playing around with RAID-Z and self-healing, when I  
> decided to take it another step and corrupt the data on *two* disks  
> (well, files via ggate) and see what happened. I obviously expected  
> the pool to go offline, but I didn't expect a kernel panic to follow!
> What I did was something resembling:
> 1) create three 100MB files, ggatel create to create GEOM providers  
> from them
> 2) zpool create test raidz ggate{1..3}
> 3) create a 100MB file inside the pool, md5 the file
> 4) overwrite 10~20MB (IIRC) of disk2 with /dev/random, with
> dd if=/dev/random of=./disk2 bs=1000k count=20 skip=40, or so (I now
> know that I wanted *seek*, not *skip*, but it still shouldn't panic!)
> 5) Check the md5 of the file: everything OK, zpool status shows a
> degraded pool.
> 6) Repeat step #4, but with disk 3.
> 7) zpool scrub test
> 8) Panic!
> [...]
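
A dry-run sketch of the quoted steps, for anyone who wants to retrace them. It only prints the commands. The unit numbers, the backing file names, the use of truncate to create the files and the /test/file path are assumptions, since the original steps were given as "something resembling". Running the commands for real needs root on a disposable FreeBSD machine and may panic it.

```shell
#!/bin/sh
# Dry run: print each command instead of executing it.
run() { echo "$*"; }

repro_steps() {
    for i in 1 2 3; do
        run truncate -s 100m "disk$i"         # step 1: backing files (assumed method)
        run ggatel create -u "$i" "disk$i"     # step 1: GEOM providers ggate1..3
    done
    run zpool create test raidz ggate1 ggate2 ggate3            # step 2
    run dd if=/dev/random of=/test/file bs=1000k count=100      # step 3 (assumed path)
    run md5 /test/file
    run dd if=/dev/random of=./disk2 bs=1000k count=20 skip=40  # step 4, as quoted
    run md5 /test/file                                          # step 5
    run zpool status test
    run dd if=/dev/random of=./disk3 bs=1000k count=20 skip=40  # step 6
    run zpool scrub test                                        # step 7
}

plan=$(repro_steps)
echo "$plan"
```

Replacing the body of run with "$@" would execute the steps instead of printing them.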
FWIW, I couldn't replicate this when using seek (i.e. when corrupting
the middle of the "disk" rather than the beginning):

[root@clone ~/zfscrash]# zpool status test
   pool: test
  state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
    see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: scrub in progress for 0h0m, 7.72% done, 0h6m to go

	test        ONLINE       0     0    18
	  raidz1    ONLINE       0     0   161
	    ggate0  ONLINE       0     0     0  512 repaired
	    ## note that I did *not* touch this "disk" at all, so why "512 repaired"?
	    ggate1  ONLINE       0     0   702  73K repaired
	    ggate2  ONLINE       0     0    62  64.5K repaired

errors: 9 data errors, use '-v' for a list

After overwriting the *beginning* of disk2 and disk3 as well, "zpool
scrub" appears to hang. Two vdev failures show up on the console, and
"zpool status" hangs as well. No panic this time around (I've waited 5
minutes and nothing appears to happen, but the computer is still usable
on other ttys). The failmode property was set to the default, i.e.
wait, in both cases.
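
For reference, a minimal sketch of the skip/seek difference mentioned in step 4 of the quoted text, run on small scratch files rather than the real 100MB images. skip= discards input blocks and leaves the output offset at 0, while seek= moves the output offset. The sketch also assumes the usual dd behaviour that a regular output file is truncated unless conv=notrunc is given. File names and sizes are made up for illustration.

```shell
#!/bin/sh
# Scratch directory so nothing real is touched.
work=$(mktemp -d)

# A 100 KiB zero-filled stand-in for a "disk" file, plus two copies.
dd if=/dev/zero of="$work/disk" bs=1024 count=100 2>/dev/null
cp "$work/disk" "$work/disk_skip"
cp "$work/disk" "$work/disk_seek"

# skip=40 skips 40 *input* blocks; writing still starts at output offset 0,
# and without conv=notrunc the output file is truncated first.
dd if=/dev/urandom of="$work/disk_skip" bs=1024 count=20 skip=40 2>/dev/null

# seek=40 skips 40 *output* blocks; conv=notrunc keeps the rest of the file.
dd if=/dev/urandom of="$work/disk_seek" bs=1024 count=20 seek=40 \
    conv=notrunc 2>/dev/null

size_skip=$(wc -c < "$work/disk_skip" | tr -d ' ')
size_seek=$(wc -c < "$work/disk_seek" | tr -d ' ')

# The first 40 KiB of the seek copy should be untouched.
head_orig=$(head -c 40960 "$work/disk" | cksum)
head_seek=$(head -c 40960 "$work/disk_seek" | cksum)

echo "skip copy: $size_skip bytes, seek copy: $size_seek bytes"
rm -rf "$work"
```

The skip copy ends up 20 KiB long with its start overwritten, while the seek copy keeps its 100 KiB size and only blocks 40-59 change. If the same applied to the ggate-backed files, the skip= run would also have shortened them, which might be part of the difference in behaviour, though that is a guess.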


