zpool import hanging on unexpectedly-rebooted machine

Wed Aug 20 17:23:45 UTC 2008

Pawel Jakub Dawidek wrote:
> On Mon, Aug 18, 2008 at 04:26:39AM -0700, Colin Moller wrote:
>   
>> Hey all,
>>
>> I've got an interestingly frustrating problem on my hands with our 
>> 7.0-STABLE boxes running ZFS.  Sun X4500 box running amd64, 16GB of 
>> RAM, 46x1TB disks in RAIDZ1 (the other two are for the OS).
>>
>> Uname for the box is:
>> FreeBSD sf-nas1-c160a.storefront.com 7.0-STABLE FreeBSD 7.0-STABLE #1: 
>> Sat May 31 14:54:22 PDT 2008     
>> root at sf-nas1-c160a.storefront.com:/usr/obj/usr/src/sys/X4500  amd64
>>
>> The box has been running relatively reliably for some months now, but 
>> our hosting provider decided to reboot it on us without asking.  After 
>> the box came back, it had lost /boot/zfs/zpool.cache, so I needed to 
>> reimport the only zpool on the machine (named zfsdata).
>>
>> Running zpool import gives me the output I'm expecting, showing a single 
>> zpool called zfsdata, status of ONLINE, and all the disks are showing up.
>>
>> However, when I run zpool import -f <numerical_pool_id>, the zpool 
>> command simply hangs up with no disk and no CPU activity.  I've run 
>> truss on the zpool import, and the last thing I see happening is:
>>
>> open("/dev/ad96",O_RDONLY,030115000)             = 6 (0x6)
>> ioctl(6,DIOCGIDENT,0xffff9480)                   = 0 (0x0)
>> close(6)                                         = 0 (0x0)
>>
>> After turning on vfs.zfs.debug, I also see this on the console:
>>
>> zfs_ereport_post:293[1]: time=1219057172.795893475 ereport_version=0 
>> class=fs.zfs.checksum zfs_scheme_version=0 pool=zfsdata 
>> pool_guid=316648131406719055 pool_context=2 
>> vdev_guid=7326417523786577584 vdev_type=disk vdev_path=/dev/ad12 
>> vdev_devid=ad:GTF000PAHX5TMF parent_guid=6708978418893991394 
>> parent_type=raidz zio_err=0 zio_offset=89290496000 zio_size=512 
>> zio_object=132 zio_level=0 zio_blkid=244
>>     
>
> If I read this correctly, it reports a checksum error on disk /dev/ad12,
> but because this is RAIDZ, it probably tries to self-heal and maybe
> something goes wrong there. I have never seen a similar problem, so I'm
> not sure how to help you. Even if upgrading to -CURRENT is not an option
> for you, maybe you can still install -CURRENT on a USB pen drive and
> recompile it
> with new patch? You may also try to remove this disk (ad12) and see if
> it behaves any better. Anyway, please keep me informed on what's going
> on.
>
>   
Turns out it was indeed a failed disk, but in the end we had to boot an 
OpenSolaris live CD to diagnose it.  Once we did that, it reported soft 
errors on that disk.  We manually offlined the bad disk and the zpool 
immediately started a resilver to one of the spares.

Strange thing was, FreeBSD didn't report any soft failures or DMA 
timeouts or anything else I'd normally see with a failed disk; the zpool 
process would just hang.  I wouldn't say this was a bug in the ZFS code 
itself, more of an OS failure that only manifested when we actually 
tried to use the disk.
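
For anyone who ends up staring at a similar console line, here is a 
minimal Python sketch that pulls the fields out of the ereport quoted 
above so the suspect device is easy to spot.  The function name 
parse_ereport is made up for illustration, and the parsing simply assumes 
whitespace-separated key=value pairs, as they appear in that one log 
line; it is not based on any documented format.

```python
def parse_ereport(line):
    """Return the key=value fields of a ZFS debug ereport line as a dict.

    Assumption: fields are separated by whitespace and look like key=value.
    Tokens without '=' (such as the leading function/line prefix) are skipped.
    """
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


# The console line from the original report, joined back into one string.
sample = (
    "zfs_ereport_post:293[1]: time=1219057172.795893475 ereport_version=0 "
    "class=fs.zfs.checksum zfs_scheme_version=0 pool=zfsdata "
    "pool_guid=316648131406719055 pool_context=2 "
    "vdev_guid=7326417523786577584 vdev_type=disk vdev_path=/dev/ad12 "
    "vdev_devid=ad:GTF000PAHX5TMF parent_guid=6708978418893991394 "
    "parent_type=raidz zio_err=0 zio_offset=89290496000 zio_size=512 "
    "zio_object=132 zio_level=0 zio_blkid=244"
)

report = parse_ereport(sample)

# Show the fields that matter when hunting for the misbehaving disk.
print(report["class"], report["pool"], report["vdev_path"])
```

In our case the interesting fields were class (a checksum error) and 
vdev_path (/dev/ad12), which is the disk that turned out to be bad.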

We've learned a lot about ZFS troubleshooting in the last couple of 
days, though!

Thanks for the response :)

Colin

--
Colin Moller
colin at lefty.tv