Troubleshooting kernel panic with zfs

Josh Gitlin jgitlin at goboomtown.com
Wed Oct 3 14:32:29 UTC 2018


Following up on this, a bug was just posted to the stable at freebsd.org list where the stack trace exactly matches what I was seeing. See: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231296

On our end we reduced the ARC and other memory tunables and have not seen a panic *yet*, but the panics were unpredictable before, so I am not 100% sure that we've resolved the issue.
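
For reference, "reduced ARC" here means capping it in /boot/loader.conf via vfs.zfs.arc_max; the values below are illustrative only, not our exact production numbers:

vfs.zfs.arc_max="8589934592"         # example: cap ARC at 8 GiB on this 15.6 GiB box
vfs.zfs.arc_meta_limit="2147483648"  # example: also cap ARC metadata at 2 GiB

Newer releases also let you lower vfs.zfs.arc_max at runtime via sysctl, which is handy for testing a smaller cap without a reboot.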

CC'ing rainer at ultra-secure.de who posted the similar bug to stable@

--
Josh Gitlin
Senior Full Stack Developer
(415) 690-1610 x155

Stay up to date and join the conversation in Relay <http://relay.goboomtown.com/>.

> On Sep 20, 2018, at 8:27 PM, Josh Gitlin <jgitlin at goboomtown.com> wrote:
> 
> I am working to debug/troubleshoot a kernel panic on a FreeBSD ZFS iSCSI server, specifically trying to determine whether it's a bug or (more likely) a misconfiguration in our settings. The server is running 11.2-RELEASE-p2 with 15.6 GiB of RAM and has a single zpool with four 2-disk mirrored vdevs, a 2-disk mirrored ZIL (SLOG) and two L2ARC devices. It runs pretty much nothing other than SSH and iSCSI (via ctld) and serves VM virtual disks to hypervisor servers over a 10 GbE LAN.
> 
> The server experienced a kernel panic, and unfortunately we did not have dumpdev set in /etc/rc.conf (we have since corrected this), so the only info I have is what was on the screen before I rebooted it. (Because it's a production system, I couldn't spend time poking at it and had to reboot ASAP.)
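> 
> (For reference, enabling dumps is just a couple of lines in /etc/rc.conf; the values below are the stock defaults, shown only as an example.)
> 
> dumpdev="AUTO"        # use a suitable swap device for kernel crash dumps
> dumpdir="/var/crash"  # where savecore(8) writes the dump at next boot
> 
> With that set, savecore(8) recovers a minidump at the next boot and it can be inspected with kgdb(1).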
> 
> trap number = 12
> panic: page fault
> cpuid = 6
> KDB: stack backtrace:
> #0 0xffffffff80b3d567 at kdb_backtrace+0x67
> #1 0xffffffff80af6b07 at vpanic+0x177
> #2 0xffffffff80af6983 at panic+0x43
> #3 0xffffffff80f77fcf at trap_fatal+0x35f
> #4 0xffffffff80f78029 at trap_pfault+0x49
> #5 0xffffffff80f777f7 at trap+0x2c7
> #6 0xffffffff80f57dac at calltrap+0x8
> #7 0xffffffff80dee7e2 at kmem_back+0xf2
> #8 0xffffffff80dee6c0 at kmem_malloc+0x60
> #9 0xffffffff80de6172 at keg_alloc_slab+0xe2
> #10 0xffffffff80de8b7e at keg_fetch_slab+0x14e
> #11 0xffffffff80de8364 at zone_fetch_slab+0x64
> #12 0xffffffff80de848f at zone_import+0x3f
> #13 0xffffffff80de4b99 at uma_zalloc_arg+0x3d9
> #14 0xffffffff826e6ab2 at zio_write_compress+0x1e2
> #15 0xffffffff826e574c at zio_execute+0xac
> #16 0xffffffff80b1ed74 at taskqueue_run_locked+0x154
> #17 0xffffffff80b4fed8 at taskqueue_thread_loop+0x98
> Uptime: 18d18h31m6s
> mpr0: Sending StopUnit: path (xpt0:mpr0:0:10:ffffffff): handle 10 
> mpr0: Incrementing SSU count
> mpr0: Sending StopUnit: path (xpt0:mpr0:0:13:ffffffff): handle 13 
> mpr0: Incrementing SSU count
> mpr0: Sending StopUnit: path (xpt0:mpr0:0:16:ffffffff): handle 16 
> mpr0: Incrementing SSU count
> 
> My hunch is that, given this was inside kmem_malloc, we were unable to allocate memory for a zio_write_compress call (the pool does have ZFS compression on) and hence this is a tuning issue and not a bug... but I am looking for confirmation and/or suggested changes/troubleshooting steps. The ZFS tuning configuration has been stable for years, so it may be a change in behavior or traffic... If this looks like it might be a bug, I will be able to get more information from a minidump if it reoccurs and can follow up on this thread.
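> 
> For reference, the ARC and kernel-memory counters that bear on that theory can be watched with standard sysctls, e.g. something like:
> 
> sysctl vfs.zfs.arc_max kstat.zfs.misc.arcstats.size   # ARC cap vs. current ARC size
> sysctl vm.kmem_size vm.kmem_size_max                  # kernel memory (kmem) limits
> sysctl vm.kmem_map_size vm.kmem_map_free              # kmem map usage and free space
> 
> A vm.kmem_map_free that keeps shrinking under load would support the memory-exhaustion/tuning theory over a bug.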
> 
> Any advice or suggestions are welcome!
> 
> [jgitlin at zfs3 ~]$ zpool status
>   pool: srv
>  state: ONLINE
>   scan: scrub repaired 0 in 2h32m with 0 errors on Tue Sep 11 20:32:18 2018
> config:
> 
> 	NAME            STATE     READ WRITE CKSUM
> 	srv             ONLINE       0     0     0
> 	  mirror-0      ONLINE       0     0     0
> 	    gpt/s5      ONLINE       0     0     0
> 	    gpt/s9      ONLINE       0     0     0
> 	  mirror-1      ONLINE       0     0     0
> 	    gpt/s6      ONLINE       0     0     0
> 	    gpt/s10     ONLINE       0     0     0
> 	  mirror-2      ONLINE       0     0     0
> 	    gpt/s7      ONLINE       0     0     0
> 	    gpt/s11     ONLINE       0     0     0
> 	  mirror-3      ONLINE       0     0     0
> 	    gpt/s8      ONLINE       0     0     0
> 	    gpt/s12     ONLINE       0     0     0
> 	logs
> 	  mirror-4      ONLINE       0     0     0
> 	    gpt/s2-zil  ONLINE       0     0     0
> 	    gpt/s3-zil  ONLINE       0     0     0
> 	cache
> 	  gpt/s2-cache  ONLINE       0     0     0
> 	  gpt/s3-cache  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> ZFS tuning:
> 
> vfs.zfs.delay_min_dirty_percent=90
> vfs.zfs.dirty_data_max=4294967296
> vfs.zfs.dirty_data_sync=3221225472
> vfs.zfs.prefetch_disable=1
> vfs.zfs.top_maxinflight=128
> vfs.zfs.trim.txg_delay=8
> vfs.zfs.txg.timeout=20
> vfs.zfs.vdev.aggregation_limit=524288
> vfs.zfs.vdev.scrub_max_active=3
> vfs.zfs.l2arc_write_boost=134217728
> vfs.zfs.l2arc_write_max=134217728
> vfs.zfs.l2arc_feed_min_ms=200
> vfs.zfs.min_auto_ashift=12
> 
> 
> --
> Josh Gitlin
> Senior DevOps Engineer
> (415) 690-1610 x155
> 
> Stay up to date and join the conversation in Relay <http://relay.goboomtown.com/>.
> 


