misc/148368: ZFS hanging forever on 8.1-PRERELEASE

Rich Ercolani admins at acm.jhu.edu
Sun Jul 4 23:10:05 UTC 2010


>Number:         148368
>Category:       misc
>Synopsis:       ZFS hanging forever on 8.1-PRERELEASE
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sun Jul 04 23:10:04 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Rich Ercolani
>Release:        RELENG_8 from June 15th
>Organization:
JHU ACM
>Environment:
FreeBSD manticore.acm.jhu.edu 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0: Wed Jun 16 17:10:42 UTC 2010     root@[removed]:/usr/obj/usr/local/ncvs/src/sys/DTRACE  amd64

>Description:
Occasionally, much to our chagrin, drives malfunction.

When this happens, ZFS and company appear to "handle" the errors correctly, but in practice they often need a reboot before they become responsive again [e.g. "zpool scrub [affected pool]" hangs forever without returning to a shell, and eventually "zpool status" hangs forever as well].

I've seen this problem before, but at the time we were running an old RELENG_8 kernel [circa November 2009] and presumed the problem would go away on upgrade.

The kernel config is the GENERIC config with the following modifications:
# diff GENERIC DTRACE
19c19
< # $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.13 2010/05/02 06:24:17 imp Exp $
---
> # $FreeBSD: src/sys/amd64/conf/GENERIC,v 1.531.2.8 2010/01/18 00:53:21 imp Exp $
22c22
< ident         GENERIC
---
> ident         DTRACE
57c57
< options       COMPAT_FREEBSD32        # Compatible with i386 binaries
---
> options       COMPAT_IA32             # Compatible with i386 binaries
76,77c76,78
< #options      KDTRACE_FRAME           # Ensure frames are compiled in
< #options      KDTRACE_HOOKS           # Kernel DTrace hooks
---
> options       KDTRACE_FRAME           # Ensure frames are compiled in
> options       KDTRACE_HOOKS           # Kernel DTrace hooks
> options       DDB_CTF                 # Still more Dtrace-related hooks
227d227
< device                sge             # Silicon Integrated Systems SiS190/191
284d283
< options       USB_DEBUG       # enable debug msgs

I'm sorry I can't include a precise revision number for the kernel; I used cvsup to pull the source, and I don't know how to extract the revision number.

I'm going to try pulling and installing the latest RELENG_8 to see if that helps.

For reference, these errors were printed in the kernel log when the zpool reported read/write errors on a disk:
Jul  4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= a, TargetId=1, Lun=4
Jul  4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= a, TargetId=1, Lun=4
Jul  4 05:03:29 manticore kernel: arcmsr0:block 'read/write' commandwith gone raid volume Cmd= 8, TargetId=1, Lun=4

Status of the pool now:
  pool: cannoli
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub in progress for 0h13m, 0.87% done, 25h56m to go
config:

        NAME        STATE     READ WRITE CKSUM
        cannoli     ONLINE       0     0     0
          da5       ONLINE       0     0     0
          da6       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da4       ONLINE       0     0     4

errors: 1 data errors, use '-v' for a list
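
For anyone triaging output like the above, a minimal Python sketch of picking out the devices with nonzero error counters follows. The function name `devices_with_errors` is hypothetical, and the sample text is copied from the config table in this report. It assumes plain integer counters; it does not handle abbreviated counts or any other zpool output layout.

```python
# Sample taken from the "config:" table in this report.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        cannoli     ONLINE       0     0     0
          da5       ONLINE       0     0     0
          da6       ONLINE       0     0     0
          da2       ONLINE       0     0     0
          da4       ONLINE       0     0     4

errors: 1 data errors, use '-v' for a list
"""


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any nonzero counter.

    Hypothetical helper: assumes the five-column table shown above and
    plain integer counters.
    """
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # the header row marks the start of the table
            continue
        if not in_table:
            continue
        if fields and fields[0] == "errors:":
            break  # the summary line follows the table
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            counts = tuple(int(f) for f in fields[2:])
            if any(counts):
                flagged[fields[0]] = counts
    return flagged


print(devices_with_errors(SAMPLE))  # prints {'da4': (0, 0, 4)}
```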


At this point, the system will fail to reboot cleanly, as it waits forever, presumably for the ZFS filesystems to unmount cleanly.

My next kernel will have DDB built in.
>How-To-Repeat:
1) Have a disk with a ZFS filesystem on it that occasionally reports uncorrected read/write errors.
2) ZFS will eventually stop responding to all queries made with the "zpool" or "zfs" commands [traffic to the mounted filesystems continues to work for much longer, until the entire system becomes unresponsive].
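
The hang in step 2 can be checked for without tying up a shell. What follows is a minimal Python sketch under stated assumptions, not part of any ZFS tooling: the helper name `command_hangs` and the 30-second default are illustrative. A process stuck inside the kernel may not die on a kill signal, so the sketch sends the signal but never waits on the child afterwards.

```python
import subprocess
import time


def command_hangs(argv, timeout_s=30.0, poll_s=0.1):
    """Return True if argv has not exited within timeout_s seconds.

    Hypothetical helper for illustration; the timeout is an arbitrary choice.
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return False  # the command returned on its own
        time.sleep(poll_s)
    # Best-effort cleanup only: a process blocked in the kernel can ignore
    # the signal, so do not call wait() here or this check would hang too.
    proc.kill()
    return True
```

Calling `command_hangs(["zpool", "status"])` from cron or a monitoring loop would then flag the condition described above; the choice of command is an assumption based on the symptoms in this report.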
>Fix:


>Release-Note:
>Audit-Trail:
>Unformatted:

