misc/148368: ZFS hanging forever on 8.1-PRERELEASE
Rich Ercolani
admins at acm.jhu.edu
Sun Jul 4 23:10:05 UTC 2010
>Number: 148368
>Category: misc
>Synopsis: ZFS hanging forever on 8.1-PRERELEASE
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Sun Jul 04 23:10:04 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator: Rich Ercolani
>Release: RELENG_8 from June 15th
>Organization:
JHU ACM
>Environment:
FreeBSD manticore.acm.jhu.edu 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0: Wed Jun 16 17:10:42 UTC 2010 root@[removed]:/usr/obj/usr/local/ncvs/src/sys/DTRACE amd64
>Description:
Occasionally, much to our chagrin, drives malfunction.
When this happens, ZFS and company appear to "handle" the errors correctly, but in practice they often require a reboot to become responsive again [e.g. "zpool scrub [affected pool]" hangs forever without returning to a shell, and eventually "zpool status" hangs forever as well].
I've seen this problem before, but we were running an old kernel [circa November 2009] from RELENG_8, and presumed it would go away on upgrade.
The kernel config is the GENERIC config with the following modifications:
# diff GENERIC DTRACE
19c19
< # $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.13 2010/05/02 06:24:17 imp Exp $
---
> # $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.8 2010/01/18 00:53:21 imp Exp $
22c22
< ident GENERIC
---
> ident DTRACE
57c57
< options COMPAT_FREEBSD32 # Compatible with i386 binaries
---
> options COMPAT_IA32 # Compatible with i386 binaries
76,77c76,78
< #options KDTRACE_FRAME # Ensure frames are compiled in
< #options KDTRACE_HOOKS # Kernel DTrace hooks
---
> options KDTRACE_FRAME # Ensure frames are compiled in
> options KDTRACE_HOOKS # Kernel DTrace hooks
> options DDB_CTF # Still more Dtrace-related hooks
227d227
< device sge # Silicon Integrated Systems SiS190/191
284d283
< options USB_DEBUG # enable debug msgs
I'm sorry I can't include a precise revision number for the kernel source; I used cvsup to pull it, and I don't know how to extract the revision number.
I'm going to try pulling and installing latest RELENG_8 and see if that helps.
For reference, these are the errors printed in the kernel log when the zpool reported read/write errors on a disk:
Jul 4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= a, TargetId=1, Lun=4
Jul 4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= a, TargetId=1, Lun=4
Jul 4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= 8, TargetId=1, Lun=4
Status of the pool now:
  pool: cannoli
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h13m, 0.87% done, 25h56m to go
config:

        NAME        STATE     READ WRITE CKSUM
        cannoli     ONLINE       0     0     0
          da5       ONLINE       0     0     0
          da6       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da4       ONLINE       0     0     4

errors: 1 data errors, use '-v' for a list
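For anyone trying to script a check against output like the above, here is a minimal sketch that pulls out the rows of the config table with a nonzero READ, WRITE or CKSUM counter. The function name devices_with_errors and the SAMPLE constant are made up for illustration, and the sketch assumes the counters are plain integers as in this report; rows whose counters are not plain integers are skipped.

```python
# Hypothetical helper, not part of ZFS: scan "zpool status" text for
# config rows whose READ/WRITE/CKSUM counters are not all zero.
SAMPLE = """\
config:
NAME STATE READ WRITE CKSUM
cannoli ONLINE 0 0 0
da5 ONLINE 0 0 0
da6 ONLINE 0 0 0
da2 ONLINE 0 0 0
da4 ONLINE 0 0 4
errors: 1 data errors, use '-v' for a list
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter."""
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        # The table starts right after the column header line.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        # The "errors:" summary line follows the table.
        if fields and fields[0] == "errors:":
            break
        # Assumption: counters are plain integers; other rows are skipped.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            read, write, cksum = (int(f) for f in fields[2:5])
            if read or write or cksum:
                bad[fields[0]] = (read, write, cksum)
    return bad


if __name__ == "__main__":
    print(devices_with_errors(SAMPLE))
```

Splitting on whitespace means the sketch does not depend on the indentation of the table, only on the column order.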
At this point, the system will fail to reboot cleanly, as it spends forever waiting [presumably] for the ZFS filesystems to unmount cleanly.
My next kernel will have DDB built in.
>How-To-Repeat:
1) Have a disk which occasionally reports uncorrected read/write errors with a ZFS filesystem on it.
2) ZFS will eventually completely cease to respond to all queries using the "zpool" or "zfs" commands. [Traffic to the mounted filesystems continues to work for much longer, until the point where the entire system becomes unresponsive.]
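Since the symptom is a command that never returns, one way to notice it without tying up a shell is to run the command with a deadline. The following is only a sketch: the helper name command_hangs and the timeout value are arbitrary choices, not anything provided by ZFS or FreeBSD. It deliberately does not wait for the child after the timeout, on the assumption that a process stuck inside the kernel may not die when killed, in which case waiting for it would hang the checker too.

```python
import subprocess


def command_hangs(argv, timeout_s):
    """Return True if argv has not exited within timeout_s seconds."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,  # discard output so a full pipe cannot block the child
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Best-effort kill; do NOT wait afterwards, since a process
        # blocked in the kernel may never exit and we would block too.
        proc.kill()
        return True


if __name__ == "__main__":
    # The 30 second limit is a guess, not a measured value.
    if command_hangs(["zpool", "status"], 30):
        print("zpool status did not return within 30 seconds")
```

A slow but healthy pool could also exceed the limit, so a timeout here is a hint that something is stuck rather than proof.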
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted: