kern/145339: [zfs] deadlock after detaching block device from
raidz pool
Alex Bakhtin
alex.bakhtin at gmail.com
Mon May 3 23:30:04 UTC 2010
The following reply was made to PR kern/145339; it has been noted by GNATS.
From: Alex Bakhtin <alex.bakhtin at gmail.com>
To: Andriy Gapon <avg at icyb.net.ua>
Cc: bug-followup at freebsd.org, Pawel Jakub Dawidek <pjd at freebsd.org>
Subject: Re: kern/145339: [zfs] deadlock after detaching block device from
raidz pool
Date: Tue, 4 May 2010 03:23:35 +0400
Andriy,
Upgraded to today's stable. Reproduced the problem. On GENERIC
the system just hangs with the following output:
==========================
ad12: FAILURE - WRITE_DMA48
status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=0
LBA=2312588250
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x48
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80593e95
stack pointer = 0x28:0xffffff8000065ba0
frame pointer = 0x28:0xffffff8000065bb0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 3 (g_up)
trap number = 12
panic: page fault
cpuid = 1
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80545a28
stack pointer = 0x28:0xffffff80eada2a40
frame pointer = 0x28:0xffffff80eada2a90
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (spa_zio)
trap number = 12
==========================
With GENERIC + DDB/KDB enabled I got the following (it seems that
the first time I detached the device there was no active transaction;
I can try to reproduce this):
==========================
ad12: FAILURE - device detached
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x48
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff805a0345
Fatal double fault
stack pointer = 0x28:0xffffff800006aba0
rip = 0xffffffff808085ad
frame pointer = 0x28:0xffffff800006abb0
rsp = 0xffffff80ead87000
code segment = base 0x0, limit 0xfffff, type 0x1b
rbp = 0xffffff80ead87070
= DPL 0, pres 1, long 1, def32 0, gran 1
cpuid = 0; processor eflags = apic id = 00
interrupt enabled, panic: double fault
resume, cpuid = 0
IOPL = 0
KDB: enter: panic
c[thread pid 0 tid 100113 ]
Stopped at kdb_enter+0x3d: movq $0,0x69cee0(%rip)
==========================
And another one:
==========================
ad12: FAILURE - WRITE_DMA
status=7f<READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR> error=0
LBA=111033498
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x48
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff805a0345
stack pointer = 0x28:0xffffff800006aba0
frame pointer = 0x28:0xffffff800006abb0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 3 (g_up)
[thread pid 3 tid 100011 ]
Stopped at _mtx_lock_flags+0x15: lock cmpxchgq %rsi,0x18(%rdi)
db:0:kdb.enter.default> capture on
==========================
And with your patch the system doesn't detect that the device is detached
and seems to be deadlocked (it doesn't respond to the power button):
==========================
acpi0: suspend request ignored (not ready yet)
acpi0: request to enter state S5 failed (err 6)
acpi0: suspend request ignored (not ready yet)
acpi0: request to enter state S5 failed (err 6)
==========================
So, I can still easily reproduce this problem on 8-STABLE. Your
simple patch avoids the page fault but deadlocks the system. Are
you sure that you can just return at this point? Would it make
sense to set some error flag before returning?
Alex Bakhtin
2010/4/23 Andriy Gapon <avg at icyb.net.ua>:
>
> Can you try this patch?
>
> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
> @@ -603,6 +603,9 @@ vdev_geom_io_intr(struct bio *bp)
>        zio = bp->bio_caller1;
>        ctx = zio->io_vd->vdev_tsd;
>
> +       if (ctx == NULL)
> +               return;
> +
>        if ((zio->io_error = bp->bio_error) == 0 && bp->bio_resid != 0)
>                zio->io_error = EIO;
>
>
> --
> Andriy Gapon
>