ZFS Kernel Panic on 10.0-RELEASE

Mike Carlson mike at bayphoto.com
Mon Jun 2 00:01:03 UTC 2014



On 5/30/2014 1:10 PM, Mike Carlson wrote:
> On 5/30/2014 12:48 PM, Jordan Hubbard wrote:
>> On May 30, 2014, at 12:04 PM, Mike Carlson <mike at bayphoto.com> wrote:
>>
>>> Over the weekend, we had upgraded one of our servers from 
>>> 9.1-RELEASE to 10.0-RELEASE, and then the zpool was upgraded (from 
>>> 28 to 5000)
>>>
>>> Tuesday afternoon, the server suddenly rebooted (kernel panic), and 
>>> as soon as it tried to remount all of its ZFS volumes, it panic'd 
>>> again.
>> What’s the panic text?  That’s pretty crucial in figuring out whether 
>> this is recoverable (e.g. if it’s spacemap corruption related, 
>> probably not).
>>
>> - Jordan
>>
>>
>>
> I had linked the pictures I took of the console, but here is my manual 
> reproduction:
>
>    Fatal trap 12: page fault while in kernel mode
>    cpuid = 7; apic id = 07
>    fault virtual address    = 0x4a0
>    fault code               = supervisor read data, page not present
>    instruction pointer      = 0x20:0xffffffff81a7f39f
>    stack pointer            = 0x28:0xfffffe1834789570
>    frame pointer            = 0x28:0xfffffe18347895b0
>    code segment             = base 0x0, limit 0xfffff, type 0x1b
>                              = DPL 0, pres 1, long 1, def32 0, gran 1
>    processor eflags         = interrupt enabled, resume, IOPL = 0
>    current process          = 1849 (txg_thread_enter)
>    trap number              = 12
>    panic: page fault
>    cpuid = 7
>    KDB: stack backtrace:
>    #0 0xffffffff808e7dd0 at kdb_backtrace+0x60
>    #1 0xffffffff808af8b5 at panic+0x155
>    #2 0xffffffff80c8e629 at trap_fatal+0x3a2
>    #3 0xffffffff80c8e969 at trap_pfault+0x2c9
>    #4 0xffffffff80c8e0f6 at trap+0x5e6
>    #5 0xffffffff80c75392 at calltrap+0x8
>    #6 0xffffffff81a53b5a at dsl_dataset_block_kill+0x3a
>    #7 0xffffffff81a50967 at dnode_sync+0x237
>    #8 0xffffffff81a48fcb at dmu_objset_sync_dnodes+0x2b
>    #9 0xffffffff81a48e4d at dmu_objset_sync+0x1ed
>    #10 0xffffffff81a5d29a at dsl_pool_sync+0xca
>    #11 0xffffffff81a78a4e at spa_sync+0x52e
>    #12 0xffffffff81a81925 at txg_sync_thread+0x375
>    #13 0xffffffff8088198a at fork_exit+0x9a
>    #14 0xffffffff80c758ce at fork_trampoline+0xe
>    uptime: 46s
>    Automatic reboot in 15 seconds - press a key on the console to abort
>
This just happened again on another server. We upgraded two servers on 
the same morning, and now both of them exhibit the same corrupted ZFS 
volume and panic behavior.

Out of all the volumes, only one is causing the panic, and the panic 
message is nearly identical to the one above.

I have 4 snapshots from the last 24 hours, so hopefully the snapshot 
from noon today can be sent to a new volume (zfs send | zfs recv), 
roughly as sketched below.
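The recovery I have in mind looks something like this (the pool and 
dataset names here are placeholders, not our actual layout):

    # list the snapshots available on the affected volume
    zfs list -t snapshot -r tank/data

    # replicate the noon snapshot to a fresh dataset on the same pool
    zfs send tank/data@noon | zfs recv tank/data-recovered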

I guess I can now rule out a hardware issue; this is clearly a problem 
related to the upgrade (freebsd-update was used). I first thought the 
first system had a bad upgrade, perhaps a mix and match of 9.2 binaries 
running on a 10 kernel, but I used the 'freebsd-update IDS' command to 
verify the integrity of the install, and it looked good; the only 
differences were config files in /etc/ that we manage.
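
For anyone who wants to run the same integrity check, it amounts to 
this (the comments reflect my own reading of the output):

    # compare every installed file against the signed hashes for the
    # installed release; files that differ are printed
    freebsd-update IDS

On these systems the only lines reported were for config files under 
/etc/ that we maintain ourselves, which is why I believe the installs 
themselves are intact.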


