9.2-PRERELEASE ZFS panic in lzjb_compress

Fri Sep 20 21:32:11 UTC 2013

One last piece of information I just got: the problem is not specific to
LZJB compression. I switched to LZ4 and get the same sort of panic:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 28
fault virtual address = 0xffffff8581c48000
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8195f6d1
stack pointer        = 0x28:0xffffffcf950ee850
frame pointer        = 0x28:0xffffffcf950ee8f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (zio_write_issue_hig)
trap number = 12
panic: page fault
cpuid = 8
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffcf950ee2e0
kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffcf950ee3a0
panic() at panic+0x1ce/frame 0xffffffcf950ee4a0
trap_fatal() at trap_fatal+0x290/frame 0xffffffcf950ee500
trap_pfault() at trap_pfault+0x211/frame 0xffffffcf950ee590
trap() at trap+0x344/frame 0xffffffcf950ee790
calltrap() at calltrap+0x8/frame 0xffffffcf950ee790
--- trap 0xc, rip = 0xffffffff8195f6d1, rsp = 0xffffffcf950ee850, rbp = 0xffffffcf950ee8f0 ---
lz4_compress() at lz4_compress+0x81/frame 0xffffffcf950ee8f0
zio_compress_data() at zio_compress_data+0x92/frame 0xffffffcf950ee920
zio_write_bp_init() at zio_write_bp_init+0x24b/frame 0xffffffcf950ee970
zio_execute() at zio_execute+0xc3/frame 0xffffffcf950ee9b0
taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xffffffcf950eea00
taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xffffffcf950eea20
fork_exit() at fork_exit+0x11f/frame 0xffffffcf950eea70
fork_trampoline() at fork_trampoline+0xe/frame 0xffffffcf950eea70
--- trap 0, rip = 0, rsp = 0xffffffcf950eeb30, rbp = 0 ---

(I am now trying without any compression.)

On Fri, Sep 20, 2013 at 11:25 AM, olivier <olivier777a7 at gmail.com> wrote:

> Got another, very similar panic again on recent 9-STABLE (r255602); I
> assume the latest 9.2 release candidate is affected too. Anybody have any
> idea of what could be causing this, and of a workaround other than turning
> compression off?
> Unlike the last panic I reported, this one did not occur during a zfs
> send/receive operation. There were just a number of processes potentially
> writing to disk at the same time.
> All hardware is healthy as far as I can tell (memory is ECC and no errors
> in logs; zpool status and smartctl show no problems).
>
> Fatal trap 12: page fault while in kernel mode
>
>
> cpuid = 4; apic id = 24
> cpuid = 51; apic id = 83
> fault virtual address = 0xffffff8700a9cc65
> fault virtual address = 0xffffff8700ab0ea9
> fault code = supervisor read data, page not present
>
> instruction pointer = 0x20:0xffffffff8195ff47
> fault code = supervisor read data, page not present
> stack pointer        = 0x28:0xffffffcf951390a0
> Fatal trap 12: page fault while in kernel mode
> frame pointer        = 0x28:0xffffffcf951398f0
> Fatal trap 12: page fault while in kernel mode
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> instruction pointer = 0x20:0xffffffff8195ffa4
> stack pointer        = 0x28:0xffffffcf951250a0
> processor eflags = frame pointer        = 0x28:0xffffffcf951258f0
> interrupt enabled, code segment = base 0x0, limit 0xfffff, type 0x1b
>
> resume, IOPL = 0
> cpuid = 28; apic id = 4c
> Fatal trap 12: page fault while in kernel mode
>  = DPL 0, pres 1, long 1, def32 0, gran 1
> current process = 0 (zio_write_issue_hig)
> processor eflags = fault virtual address = 0xffffff8700aa22ac
> interrupt enabled, fault code = supervisor read data, page not present
> resume, IOPL = 0
> trap number = 12
> instruction pointer = 0x20:0xffffffff8195ffa4
> current process = 0 (zio_write_issue_hig)
> panic: page fault
> cpuid = 4
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffcf95138b30
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffcf95138bf0
> panic() at panic+0x1ce/frame 0xffffffcf95138cf0
> trap_fatal() at trap_fatal+0x290/frame 0xffffffcf95138d50
> trap_pfault() at trap_pfault+0x211/frame 0xffffffcf95138de0
> trap() at trap+0x344/frame 0xffffffcf95138fe0
> calltrap() at calltrap+0x8/frame 0xffffffcf95138fe0
> --- trap 0xc, rip = 0xffffffff8195ff47, rsp = 0xffffffcf951390a0, rbp = 0xffffffcf951398f0 ---
> lzjb_compress() at lzjb_compress+0xa7/frame 0xffffffcf951398f0
> zio_compress_data() at zio_compress_data+0x92/frame 0xffffffcf95139920
> zio_write_bp_init() at zio_write_bp_init+0x24b/frame 0xffffffcf95139970
> zio_execute() at zio_execute+0xc3/frame 0xffffffcf951399b0
> taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xffffffcf95139a00
> taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xffffffcf95139a20
> fork_exit() at fork_exit+0x11f/frame 0xffffffcf95139a70
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffcf95139a70
> --- trap 0, rip = 0, rsp = 0xffffffcf95139b30, rbp = 0 ---
>
>
> 0x51f47 is in lzjb_compress
> (/usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c:74).
> 69 }
> 70 if (src > (uchar_t *)s_start + s_len - MATCH_MAX) {
> 71 *dst++ = *src++;
> 72 continue;
> 73 }
> 74 hash = (src[0] << 16) + (src[1] << 8) + src[2];
> 75 hash += hash >> 9;
> 76 hash += hash >> 5;
> 77 hp = &lempel[hash & (LEMPEL_SIZE - 1)];
> 78 offset = (intptr_t)(src - *hp) & OFFSET_MASK;
>
> dmesg output is at http://pastebin.com/U34fwJ5f
> kernel config is at http://pastebin.com/c9HKfcsz
> I can provide more information if useful.
> Thanks
>
>
> On Fri, Jul 19, 2013 at 6:52 AM, Volodymyr Kostyrko <c.kworr at gmail.com> wrote:
>
>> 19.07.2013 07:04, olivier wrote:
>>
>>> Hi,
>>> Running 9.2-PRERELEASE #19 r253313 I got the following panic
>>>
>>> Fatal trap 12: page fault while in kernel mode
>>> cpuid = 22; apic id = 46
>>> fault virtual address   = 0xffffff827ebca30c
>>> fault code              = supervisor read data, page not present
>>> instruction pointer     = 0x20:0xffffffff81983055
>>> stack pointer           = 0x28:0xffffffcf75bd60a0
>>> frame pointer           = 0x28:0xffffffcf75bd68f0
>>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>>                          = DPL 0, pres 1, long 1, def32 0, gran 1
>>> processor eflags        = interrupt enabled, resume, IOPL = 0
>>> current process         = 0 (zio_write_issue_hig)
>>> trap number             = 12
>>> panic: page fault
>>> cpuid = 22
>>> KDB: stack backtrace:
>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xffffffcf75bd5b30
>>> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffcf75bd5bf0
>>> panic() at panic+0x1ce/frame 0xffffffcf75bd5cf0
>>> trap_fatal() at trap_fatal+0x290/frame 0xffffffcf75bd5d50
>>> trap_pfault() at trap_pfault+0x211/frame 0xffffffcf75bd5de0
>>> trap() at trap+0x344/frame 0xffffffcf75bd5fe0
>>> calltrap() at calltrap+0x8/frame 0xffffffcf75bd5fe0
>>> --- trap 0xc, rip = 0xffffffff81983055, rsp = 0xffffffcf75bd60a0, rbp = 0xffffffcf75bd68f0 ---
>>> lzjb_compress() at lzjb_compress+0x185/frame 0xffffffcf75bd68f0
>>> zio_compress_data() at zio_compress_data+0x92/frame 0xffffffcf75bd6920
>>> zio_write_bp_init() at zio_write_bp_init+0x24b/frame 0xffffffcf75bd6970
>>> zio_execute() at zio_execute+0xc3/frame 0xffffffcf75bd69b0
>>> taskqueue_run_locked() at taskqueue_run_locked+0x74/frame 0xffffffcf75bd6a00
>>> taskqueue_thread_loop() at taskqueue_thread_loop+0x46/frame 0xffffffcf75bd6a20
>>> fork_exit() at fork_exit+0x11f/frame 0xffffffcf75bd6a70
>>> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffcf75bd6a70
>>> --- trap 0, rip = 0, rsp = 0xffffffcf75bd6b30, rbp = 0 ---
>>>
>>> lzjb_compress+0x185 corresponds to line 85 in
>>> 80 cpy = src - offset;
>>> 81 if (cpy >= (uchar_t *)s_start && cpy != src &&
>>> 82    src[0] == cpy[0] && src[1] == cpy[1] && src[2] == cpy[2]) {
>>> 83 *copymap |= copymask;
>>> 84 for (mlen = MATCH_MIN; mlen < MATCH_MAX; mlen++)
>>> 85 if (src[mlen] != cpy[mlen])
>>> 86 break;
>>> 87 *dst++ = ((mlen - MATCH_MIN) << (NBBY - MATCH_BITS)) |
>>> 88    (offset >> NBBY);
>>> 89 *dst++ = (uchar_t)offset;
>>>
>>> I think it's the first time I've seen this panic. It happened while doing a
>>> send/receive. I have two pools with lzjb compression; I don't know which of
>>> these pools caused the problem, but one of them was the source of the
>>> send/receive.
>>>
>>> I only have a textdump but I'm happy to try to provide more information
>>> that could help anyone look into this.
>>> Thanks
>>> Olivier
>>>
>>
>> Oh, I can add to this one. I have a full core dump of the same problem,
>> caused by copying a large set of files from an lzjb-compressed pool to an
>> lz4-compressed pool. vfs.zfs.recover was set.
>>
>> #1  0xffffffff8039d954 in kern_reboot (howto=260)
>>     at /usr/src/sys/kern/kern_shutdown.c:449
>> #2  0xffffffff8039ddce in panic (fmt=<value optimized out>)
>>     at /usr/src/sys/kern/kern_shutdown.c:637
>> #3  0xffffffff80620a6a in trap_fatal (frame=<value optimized out>,
>>     eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:879
>> #4  0xffffffff80620d25 in trap_pfault (frame=0x0, usermode=0)
>>     at /usr/src/sys/amd64/amd64/trap.c:700
>> #5  0xffffffff806204f6 in trap (frame=0xffffff821ca43600)
>>     at /usr/src/sys/amd64/amd64/trap.c:463
>> #6  0xffffffff8060a032 in calltrap ()
>>     at /usr/src/sys/amd64/amd64/exception.S:232
>> #7  0xffffffff805a9367 in vm_page_alloc (object=0xffffffff80a34030,
>>     pindex=16633, req=97) at /usr/src/sys/vm/vm_page.c:1445
>> #8  0xffffffff8059c42e in kmem_back (map=0xfffffe00010000e8,
>>     addr=18446743524021862400, size=16384, flags=<value optimized out>)
>>     at /usr/src/sys/vm/vm_kern.c:362
>> #9  0xffffffff8059c2ac in kmem_malloc (map=0xfffffe00010000e8, size=16384,
>>     flags=257) at /usr/src/sys/vm/vm_kern.c:313
>> #10 0xffffffff80595104 in uma_large_malloc (size=<value optimized out>,
>>     wait=257) at /usr/src/sys/vm/uma_core.c:994
>> #11 0xffffffff80386b80 in malloc (size=16384, mtp=0xffffffff80ea7c40, flags=0)
>>     at /usr/src/sys/kern/kern_malloc.c:492
>> #12 0xffffffff80c9e13c in lz4_compress (s_start=0xffffff80d0b19000,
>>     d_start=0xffffff8159445000, s_len=131072, d_len=114688, n=-2)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/lz4.c:843
>> #13 0xffffffff80cdde25 in zio_compress_data (c=<value optimized out>,
>>     src=<value optimized out>, dst=0xffffff8159445000, s_len=131072)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:109
>> #14 0xffffffff80cda012 in zio_write_bp_init (zio=0xfffffe0143a12000)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1107
>> #15 0xffffffff80cd8ec6 in zio_execute (zio=0xfffffe0143a12000)
>>     at /usr/src/sys/modules/zfs/../../cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1305
>> #16 0xffffffff803e25e6 in taskqueue_run_locked (queue=0xfffffe00060ca300)
>>     at /usr/src/sys/kern/subr_taskqueue.c:312
>> #17 0xffffffff803e2e38 in taskqueue_thread_loop (arg=<value optimized out>)
>>     at /usr/src/sys/kern/subr_taskqueue.c:501
>> #18 0xffffffff8036f40a in fork_exit (
>>     callout=0xffffffff803e2da0 <taskqueue_thread_loop>,
>>     arg=0xfffffe00060cc3d0, frame=0xffffff821ca43a80)
>>     at /usr/src/sys/kern/kern_fork.c:988
>> #19 0xffffffff8060a56e in fork_trampoline ()
>>     at /usr/src/sys/amd64/amd64/exception.S:606
>>
>> I have a full crash dump in case someone wants to look at it.
>>
>> --
>> Sphinx of black quartz, judge my vow.
>>
>
>