kern/160777: RAID-Z3 causes fatal hang upon scrub/import on 9.0-BETA2/amd64
3zstbn24xn at snkmail.com
Sat Sep 17 01:30:13 UTC 2011
>Synopsis: RAID-Z3 causes fatal hang upon scrub/import on 9.0-BETA2/amd64
>Arrival-Date: Sat Sep 17 01:30:11 UTC 2011
FreeBSD 9.0-BETA2 FreeBSD 9.0-BETA2 #0: Wed Aug 31 18:07:44 UTC 2011 root at farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
RAID-Z3 causes fatal hang upon scrub/import on 9.0-BETA2/amd64.
By fatal hang, I mean: (1) the hard drive LEDs freeze in a static on or off state (rather than flashing to indicate drive activity) and stay that way; (2) the console no longer responds to any keypress, such as the space bar or Control-Alt-F2; (3) the system stops responding to pings entirely.
I noticed this initially when I ran "zdb pool" during a "zpool scrub pool", and the system crashed. I had assumed "zdb pool" was a read-only operation that would just give me some interesting metadata to page through. But, rest assured, when I later tried to narrow down what was faulty, I didn't touch that command with a ten foot pole (although on a configuration I had confirmed was working, such as RAID-Z2, "zdb pool" caused no problem). My guess is that "zdb pool" consumed too much memory and so the machine crashed. This was the first time the machine had been up, and I had created the array during that boot.
So, the first time I attempted "zpool import pool" after the initial creation, I could see all drives being accessed for about a minute (positive activity), but after that minute the system fatally stalled as described above. I had tried "zpool scrub -s pool", and was only able to see the data at all by running "zpool export pool && zpool import -o readonly=on pool". When I then tried importing it read-write again, it stalled. An unclean dismount is not required to trigger this: when I repeated the problem with a freshly created zpool (after a proper "zpool destroy" of the old one), I found that a "zpool import" or "zpool scrub" alone triggered the fatal stall.
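To make the recovery path concrete, the sequence that left the data readable, reconstructed from the description above (the pool name "pool" matches the commands later in this report; this is a sketch, not a verbatim transcript), was roughly:

```shell
# Stop the in-progress scrub that appears to wedge the system
# (sequence reconstructed from the report; pool name "pool" assumed).
zpool scrub -s pool

# Detach the pool, then re-import it read-only; per the report this was
# the only way the data stayed accessible without the fatal stall.
zpool export pool
zpool import -o readonly=on pool

# Importing read-write again reproduced the stall:
#   zpool export pool && zpool import pool
```

These commands require root and a live ZFS pool, so they are shown for reference only.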
I sincerely hope this is helpful. I've switched to RAID-Z2 for now, unfortunately. Rest assured, I would be able to do much more rigorous testing on ZFS: if this problem is confirmed and fixed by 9.0, I can contribute by uncovering more bugs with a debugging kernel enabled. In the meantime I need to move forward.
zpool create -O checksum=sha256 -O compression=gzip-9 pool raidz3 gpt/foo*.eli
zfs create -o checksum=sha256 -o compression=gzip-9 -o copies=3 pool/pond
zpool scrub pool
zpool export pool && zpool import pool
(Both of the last two commands seem to trigger the fatal stall described above.)
The following conditions may or may not be relevant; I don't have the resources or time to check. (1) The drives are 3TB each; (2) I partitioned each drive using GPT, with one large labelled partition occupying about 99% of its capacity; (3) I am using geli on that large partition. If these factors appear to be causing the problem, note that when I create a RAID-Z2 pool instead of RAID-Z3, there is no problem at all. I can also confirm that the entirety of each drive is accessible, since I did a full dd to the whole drive (partition sector, metadata and all), so it is not a matter of the kernel misreading the drive size. In any case I would expect a graceful error from the kernel instead of this kind of stall. I haven't attempted to investigate past the actual stall condition, such as with kernel debugging, but the reproducibility of the problem leads me to suspect that might not be necessary.
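For reference, the per-drive layout described in (2) and (3) would look roughly like the following. The device name "ada0", the label "foo0", and the partition size are my guesses to match the "gpt/foo*.eli" providers used in the zpool create command; the report does not give the exact commands:

```shell
# Sketch of one drive's setup: GPT scheme, one large labelled partition
# (~99% of a 3TB drive), then geli on top of it. "ada0", "foo0", and the
# size are hypothetical placeholders.
gpart create -s gpt ada0
gpart add -t freebsd-zfs -l foo0 -s 2980G ada0
geli init /dev/gpt/foo0      # prompts for a passphrase
geli attach /dev/gpt/foo0    # yields /dev/gpt/foo0.eli for the pool
```

Repeating this across the drives would produce the gpt/foo*.eli providers the pool is built from.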
Unknown. I can confirm that with RAID-Z2, running many "zpool import" and "zpool export" commands back to back, as well as "zpool scrub", causes no problem at all.
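The RAID-Z2 control test can be approximated by a loop along these lines (a sketch of what "many import and export commands back to back" would look like; the iteration count is arbitrary):

```shell
# Repeated export/import/scrub cycles on a RAID-Z2 pool named "pool".
# Per the report, none of these operations triggered the stall on RAID-Z2.
for i in 1 2 3 4 5; do
    zpool export pool
    zpool import pool
    zpool scrub pool
done
```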