Locked up processes after upgrade to ZFS v15

Thu Oct 7 00:15:36 UTC 2010

On 06.10.2010 at 14:28, Kai Gallasch wrote:

> Hi.
> 
> Two days ago I upgraded my server to 8.1-STABLE (amd64) and upgraded ZFS from v14 to v15.
> After the zpool and zfs upgrades the server ran stable for about half a day, but then apache processes running inside jails began to lock up and could no longer be terminated.
> 
> In the end apache (both worker and prefork) itself locked up, because it lost control of its child processes.

Sorry for replying to my own mail, but there is new information on this issue:
'zfs send' triggered a panic:

<Oct/06 10:32 pm>MCA: Bank 0, Status 0xf600000000010015
<Oct/06 10:32 pm>MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
<Oct/06 10:32 pm>MCA: Vendor "AuthenticAMD", ID 0x100f23, APIC ID 2
<Oct/06 10:32 pm>MCA: CPU 2 UNCOR PCC OVER DTLB L1 error
<Oct/06 10:32 pm>MCA: Address 0xff80d4611000
<Oct/06 10:32 pm>
<Oct/06 10:32 pm>
<Oct/06 10:32 pm>Fatal trap 28: machine check trap while in kernel mode
<Oct/06 10:32 pm>cpuid = 2; apic id = 02
<Oct/06 10:32 pm>instruction pointer	= 0x20:0xffffffff80e60f25
<Oct/06 10:32 pm>stack pointer	        = 0x28:0xffffff832a2e17d0
<Oct/06 10:32 pm>frame pointer	        = 0x28:0xffffff832a2e1a40
<Oct/06 10:32 pm>code segment		= base 0x0, limit 0xfffff, type 0x1b
<Oct/06 10:32 pm>			= DPL 0, pres 1, long 1, def32 0, gran 1
<Oct/06 10:32 pm>processor eflags	= interrupt enabled, IOPL = 0
<Oct/06 10:32 pm>current process		= 0 (zio_write_issue_0)
<Oct/06 10:32 pm>[thread pid 0 tid 101159 ]
<Oct/06 10:32 pm>Stopped at      lzjb_compress+0x165:    addq    $0x1,%rdx

db> bt
Tracing pid 0 tid 101159 td 0xffffff00aa64a3e0
lzjb_compress() at lzjb_compress+0x165
zio_compress_data() at zio_compress_data+0xbe
zio_write_bp_init() at zio_write_bp_init+0xc2
zio_execute() at zio_execute+0x77
zio_ready() at zio_ready+0x162
zio_execute() at zio_execute+0x77
taskq_run_safe() at taskq_run_safe+0x13
taskqueue_run() at taskqueue_run+0x91
taskqueue_thread_loop() at taskqueue_thread_loop+0x3f
fork_exit() at fork_exit+0x12a
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffff832a2e1d30, rbp = 0 ---

I know this one well:

"CPU 2 UNCOR PCC OVER DTLB L1 error", because this particular
server has had problems in the past with FreeBSD 8.0-RELEASE when
superpages were enabled.
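For readers who have not met these MCA lines before: the "UNCOR PCC OVER DTLB L1 error" text is derived from the 64-bit MCi_STATUS word printed as "Status 0xf600000000010015". The sketch below decodes that word using the architectural x86 bit layout (VAL = bit 63, OVER = bit 62, UC = bit 61, PCC = bit 57; the low 16 bits are the MCA error code, where 0000 0000 0001 TTLL denotes a TLB error). It is a minimal illustration modeled on the format of the log above, not FreeBSD's own mca.c code; the helper name `decode_mca_status` and its output format are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural flag bits of the x86 MCi_STATUS register. */
#define MC_STATUS_VAL  (1ULL << 63) /* register holds a valid error */
#define MC_STATUS_OVER (1ULL << 62) /* an earlier error was overwritten */
#define MC_STATUS_UC   (1ULL << 61) /* error was not corrected */
#define MC_STATUS_PCC  (1ULL << 57) /* processor context may be corrupt */

/* TLB compound error code 0000 0000 0001 TTLL: TT = transaction type, LL = level. */
static const char *const tlb_type[]  = { "I", "D", "G", "?" };
static const char *const tlb_level[] = { "L0", "L1", "L2", "LG" };

/* Hypothetical helper: render a status word in the style of the console log. */
static void decode_mca_status(uint64_t status, char *buf, size_t len)
{
    uint16_t code = (uint16_t)(status & 0xffff); /* MCA error code field */
    size_t n;

    assert(buf != NULL && len > 0);
    if (!(status & MC_STATUS_VAL)) {
        snprintf(buf, len, "no valid error");
        return;
    }
    /* Severity flags, in the order they appear in the log line. */
    snprintf(buf, len, "%s%s%s",
        (status & MC_STATUS_UC) ? "UNCOR " : "COR ",
        (status & MC_STATUS_PCC) ? "PCC " : "",
        (status & MC_STATUS_OVER) ? "OVER " : "");
    n = strlen(buf);
    if ((code & 0xfff0) == 0x0010) /* TLB error class */
        snprintf(buf + n, len - n, "%sTLB %s error",
            tlb_type[(code >> 2) & 3], tlb_level[code & 3]);
    else
        snprintf(buf + n, len - n, "error code 0x%04x", code);
}
```

Feeding it 0xf600000000010015 yields "UNCOR PCC OVER DTLB L1 error": the top byte 0xf6 sets VAL, OVER, UC and PCC (plus EN and ADDRV, the latter matching the "Address" line), and the error code 0x0015 decodes to TT = 01 (data) and LL = 01 (level 1).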

The workaround was to set vm.pmap.pg_ps_enabled="0" in /boot/loader.conf.
Later, with 8.0-STABLE, setting the tunable was no longer necessary,
because a workaround for this was committed to src/sys.

So, just to test this, I have set vm.pmap.pg_ps_enabled="0" again and will see whether processes still lock up.

Regards,
Kai.