ZFS RAID-Z panic on vdev failure + subsequent panics and hangs

Boris Kochergin spawk at acm.poly.edu
Wed Aug 26 00:14:41 UTC 2009


Boris Kochergin wrote:
> Boris Kochergin wrote:
>> Pawel Jakub Dawidek wrote:
>>> On Fri, Aug 07, 2009 at 04:00:05PM -0400, Boris Kochergin wrote:
>>>  
>>>> Pawel Jakub Dawidek wrote:
>>>>   
>>>>> On Fri, Aug 07, 2009 at 03:34:34PM -0400, Boris Kochergin wrote:
>>>>>  
>>>>>     
>>>>>> Pawel Jakub Dawidek wrote:
>>>>>>          
>>>>>>> Yeah, that's strange indeed. Could you try:
>>>>>>>
>>>>>>>     print ab->b_arc_node.list_prev
>>>>>>>     print ab->b_arc_node.list_next
>>>>>>>
>>>>>>>
>>>>>>>                
>>>>>> (kgdb) print ab->b_arc_node.list_prev
>>>>>> $1 = (struct list_node *) 0x1
>>>>>>            
>>>>> Yeah, list_prev is corrupted. If it panics on you every time, I could
>>>>> send you a patch which will try to catch where the corruption occurs.
>>>>>
>>>>>  
>>>>>       
>>>> I eventually get the arc_evict panic every time I successfully 
>>>> manage to mount the filesystem, but it usually panics (with the 
>>>> other backtrace) as soon as I try to mount it, or mount just hangs. 
>>>> I'll gladly try the patch, though--the data on the array is 
>>>> important to me. Thanks.
>>>>     
>>>
>>> To get the data from there you could also try to 'zfs send' it without
>>> mounting the dataset at all (just in case).
>>>
>>>   
>> Sorry for the delay. I had to find another machine to move the disks 
>> into so that I could continue experimenting. Anyway, the filesystem 
>> didn't have any snapshots I could send, so I tried creating one with 
>> "zfs snapshot home@1" and the machine hung.
>>
>> FYI, in the new machine, all disks (including the one with the / 
>> filesystem) retain their device names.
>>
>> -Boris
>> _______________________________________________
>> freebsd-fs at freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
> Some more panics using RELENG_8 sources from yesterday: 
> http://acm.poly.edu/~spawk/zfs/. The one in panic3.txt happens much 
> more often than the other ones. If any brave soul wants to look into 
> it, I can provide NFS/geom_gate/whatever access to the disk images (or 
> actual disks, if there's a difference) so that they can recreate the 
> problem on a local machine.
>
> -Boris
For the archives: pjd@ took some time to examine the disk images I made 
of the RAID-Z pool, but found heavy corruption in the metadata. As it 
turns out, the machine had bad RAM during the incident, which is 
probably what caused the corruption. Unfortunately, I only started to 
suspect the RAM recently, as random userland application crashes and 
kernel panics became frequent. This is good news for ZFS users, as it 
indicates that ZFS did not corrupt my pool on its own. I do, however, 
advise you to be mindful of the problems bad memory can cause for ZFS. 
Personally, I will be shelling out a few more bucks for ECC memory from 
now on.
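
For anyone hitting this later: since the corruption here came from bad 
RAM rather than from ZFS itself, the usual check after replacing the 
memory is a scrub, which re-reads every allocated block and verifies 
its checksum. A minimal sketch ("tank" is a placeholder pool name, not 
the pool from this thread):

```shell
# Sketch only; "tank" is a hypothetical pool name. A scrub re-reads all
# allocated blocks, verifies checksums, and repairs whatever the pool's
# redundancy (RAID-Z parity, in this case) allows.
zpool scrub tank

# Show scrub progress and list any files with unrecoverable errors.
zpool status -v tank
```

Note that a scrub can only confirm or repair on-disk state; it cannot 
undo metadata that was already written out corrupted, which is why the 
images above were unrecoverable.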

(Eagerly awaiting the read-only offline recovery functionality described 
at http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20092.html).

-Boris

