Damaged directory on ZFS

Mon Oct 31 22:33:20 UTC 2011

Hi,

On 30.10.2011 08:46, Harold Paulson wrote:
> Pawel,
>
> On Oct 23, 2011, at 7:02 AM, Pawel Jakub Dawidek wrote:
>
>> On Mon, Oct 17, 2011 at 05:17:31PM -0700, Harold Paulson wrote:
>>> Hello,
>>>
>>> I've had a server that boots from ZFS panicking for a couple days.  I have worked around the problem for now, but I hope someone can give me some insight into what's going on, and how I can solve it properly.
>>>
>>> The server is running 8.2-STABLE (zfs v28) with 8G of ram and 4 SATA disks in a raid10 type arrangement:
>>>
>>> # uname -a
>>> FreeBSD jane.sierraweb.com 8.2-STABLE-201105 FreeBSD 8.2-STABLE-201105 #0: Tue May 17 05:18:48 UTC 2011     root at mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
>>>
>>> And zpool status:
>>>
>>> 	NAME           STATE     READ WRITE CKSUM
>>> 	tank           ONLINE       0     0     0
>>> 	  mirror       ONLINE       0     0     0
>>> 	    gpt/disk0  ONLINE       0     0     0
>>> 	    gpt/disk1  ONLINE       0     0     0
>>> 	  mirror       ONLINE       0     0     0
>>> 	    gpt/disk2  ONLINE       0     0     0
>>> 	    gpt/disk3  ONLINE       0     0     0
>>>
>>> It started panicking under load a couple days ago.  We replaced RAM and motherboard, but problems persisted.  I don't know if a hardware issue originally caused the problem or what.  When it panics, I get the usual panic message, but I don't get a core file, and it never reboots itself.
>>>
>>> http://pastebin.com/F1J2AjSF
>>>
>>> While I was trying to figure out the source of the problem, I noticed various stuck processes that peg a CPU and can't be killed, such as:
>>>
>>>   PID JID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>>> 48735   0 root        1  46    0 11972K   924K CPU3    3 415:14 100.00% find
>>>
>>> They are not marked zombie, but I can't kill them, and restarting the jail they are in won't even get rid of them.  truss just hangs with no output on them.  On different occasions, I noticed pop3d processes for the same user getting stuck in this way.  On a hunch I ran a "find" through the files in the user's Maildir and got a panic.  I disabled this account and now the server is stable again.  At least until locate.updatedb walks through that directory, I suppose.   Evidently, there is some kind of hole in the file system below that directory tree causing the panic.
>>>
>>> I can move that directory out of the way, and carry on, but is there anything I can do to really *repair* the problem?

I see the same thing here (I'm running a CURRENT system dated Sun
Oct 16 14:53:49 EEST 2011). Any file command (ls, find, etc.) run on the
affected directory (in my case /usr/local/include/dirac) hangs and
cannot be killed, even with kill -9. My system doesn't panic, though.

>> Could you run these commands:
>>
>> 	objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16}<fzap_cursor_retrieve>' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
>>
>> They should convert fzap_cursor_retrieve+0x111 into file:line. Send it
>> here once you obtain it.
>
> % objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16}<fzap_cursor_retrieve>' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
> /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:1158

The output of the suggested command is the same as Harold's:

objdump -D /boot/kernel/zfs.ko.symbols | egrep '^[0-9a-f]{8,16} <fzap_cursor_retrieve>' | awk '{printf("0x%s\n", $1)}' | xargs -J ADDR printf "%u + %u\n" ADDR 0x111 | bc | xargs printf "0x%x\n" | xargs addr2line -e /boot/kernel/zfs.ko.symbols
/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:1158
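
For anyone following along, the middle of that pipeline only adds the
offset (0x111) to the start address of fzap_cursor_retrieve and prints
the sum in hex for addr2line. A minimal sketch of that one step, assuming
a POSIX sh; the helper name sym_plus_off and the base address 0x1000 are
made up for illustration and are not taken from any real zfs.ko.symbols:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): add a hex offset to a hex
# symbol start address and print the sum as a hex address.
sym_plus_off() {
    # $1 = symbol start address in hex, with or without a leading 0x
    # $2 = offset in hex, with or without a leading 0x
    printf '0x%x\n' "$(( 0x${1#0x} + 0x${2#0x} ))"
}

# Example with a made-up start address; 0x111 is the offset from the panic.
sym_plus_off 0x1000 0x111
```

The printed address is what the last stage of the pipeline hands to
addr2line -e /boot/kernel/zfs.ko.symbols to get the file:line shown above.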

--
WBR,
Andrey Kosachenko