12.1-RELEASE-p7 panic in zio_free_issue_4_6 (fwd)

Thu Oct 29 07:34:13 UTC 2020

Hi,

On Thu, 29 Oct 2020, Andriy Gapon wrote:
> On 28/10/2020 15:41, Christian Kratzer wrote:
>> Hi,
>> 
>> one of my servers with 12.1-RELEASE-p7 started crashing with the following:
>> 
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 19; apic id = 31
>> fault virtual address   = 0x30
>> fault code              = supervisor write data, page not present
>> instruction pointer     = 0x20:0xffffffff826877f4
>> stack pointer           = 0x28:0xfffffe011cefeaa0
>> frame pointer           = 0x28:0xfffffe011cefeaa0
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, long 1, def32 0, gran 1
>> processor eflags        = interrupt enabled, resume, IOPL = 0
>> current process         = 0 (zio_free_issue_2_3)
>> trap number             = 12
>> panic: page fault
>> cpuid = 19
>> time = 1603797129
>> KDB: stack backtrace:
>> #0 0xffffffff80c1d2f7 at kdb_backtrace+0x67
>> #1 0xffffffff80bd062d at vpanic+0x19d
>> #2 0xffffffff80bd0483 at panic+0x43
>> #3 0xffffffff810a8dcc at trap_fatal+0x39c
>> #4 0xffffffff810a8e19 at trap_pfault+0x49
>> #5 0xffffffff810a840f at trap+0x29f
>> #6 0xffffffff81081c9c at calltrap+0x8
>> #7 0xffffffff8272a903 at zio_ddt_free+0x53
>> #8 0xffffffff82727b7c at zio_execute+0xac
>> #9 0xffffffff80c2fad4 at taskqueue_run_locked+0x154
>> #10 0xffffffff80c30e08 at taskqueue_thread_loop+0x98
>> #11 0xffffffff80b90c43 at fork_exit+0x83
>> #12 0xffffffff81082cde at fork_trampoline+0xe
>> Uptime: 1m12s
>> Automatic reboot in 15 seconds - press a key on the console to abort
>> 
>> 
>> I traced things down to importing one of the zpools.
> 
> I suspect that you have a silent corruption on that pool (perhaps because of
> non-ECC RAM?).

This is on a DL380 G7 with 128GB of ECC RAM.  I have run memtest on this server
before without any defects being found.

The SAS disks are on an LSI HBA. They also do not have defects according to
smartctl.

This of course does not rule out that there might be an issue with RAM and
I will need to recheck.

Also I suspect the server might not have enough RAM for doing dedup on this
2 x 7 disk raid-z2 of 1.2TB drives.
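
A rough way to sanity-check the RAM question is a back-of-the-envelope
estimate of the dedup table (DDT) size. The sketch below is only that: it
assumes 1.2 TB drives, about 320 bytes of RAM per unique block (a commonly
quoted rule of thumb, not a measured value for this pool), a worst case where
every block is unique, and it ignores raid-z padding and metadata overhead.
The function names are made up for illustration.

```python
# Back-of-the-envelope estimate of ZFS dedup table (DDT) memory use.
# Assumptions (not taken from the pool itself):
#   - ~320 bytes of RAM per unique block (rule of thumb)
#   - 1.2 TB (decimal) drives, 2 raid-z2 vdevs of 7 disks each
#   - worst case: the pool is full and every block is unique
#   - raid-z padding and metadata overhead are ignored

DDT_BYTES_PER_ENTRY = 320  # rule-of-thumb assumption


def usable_bytes(vdevs, disks_per_vdev, parity, disk_bytes):
    """Approximate data capacity: data disks per vdev times disk size."""
    return vdevs * (disks_per_vdev - parity) * disk_bytes


def ddt_ram_bytes(pool_bytes, avg_block_bytes, entry_bytes=DDT_BYTES_PER_ENTRY):
    """Estimated DDT size if every block of the given size is unique."""
    blocks = pool_bytes // avg_block_bytes
    return blocks * entry_bytes


pool = usable_bytes(vdevs=2, disks_per_vdev=7, parity=2, disk_bytes=1200 * 10**9)

for kib in (128, 64, 16):
    ram = ddt_ram_bytes(pool, kib * 1024)
    print(f"avg block {kib:>3} KiB: ~{ram / 2**30:.0f} GiB of DDT")
```

Under these assumptions the estimate is roughly 27 GiB at a 128 KiB average
block size and grows as the average block gets smaller, so whether 128GB is
enough depends heavily on the actual block size distribution of the backups.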

The pool was mostly in use for storing backups rsynced overnight from two
other servers.

> What you see can happen if a block pointer has a deduplication bit set, but
> the block is not actually deduplicated or deduplication has never been
> enabled at all.

Could I have run into a bug by trying to do too much dedup on this pool?

> It would help -- with analysis -- to get a vmcore (kernel crash dump) and to
> install the corresponding kernel debug symbols (if not already).

I need to see why this server is not producing kernel crash dumps. My other
setup does, so I should be able to get this done.

> As to recovery, I think that the best solution is to import the pool
> read-only and to copy important data elsewhere.  Then re-create the pool.

I was about to do that but the crash also happens when trying to import
read-only.

I will investigate if I can import based on an older snapshot or checkpoint but
I am not sure if that will do what I want.

I will keep this pool around for a couple of days and will try to get a crash
dump from the system.  After that I will have to delete and recreate the pool
and just wait for backups to roll back in.

Greetings
Christian

-- 
Christian Kratzer                   CK Software GmbH
Email:   ck at cksoft.de               Wildberger Weg 24/2
Phone:   +49 7032 893 997 - 0       D-71126 Gaeufelden
Fax:     +49 7032 893 997 - 9       HRB 245288, Amtsgericht Stuttgart
Mobile:  +49 171 1947 843           Geschaeftsfuehrer: Christian Kratzer
Web:     http://www.cksoft.de/