[Bug 236220] ZFS vnode deadlock

Mon Mar 4 14:46:33 UTC 2019

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236220

            Bug ID: 236220
           Summary: ZFS vnode deadlock
           Product: Base System
           Version: 12.0-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs at FreeBSD.org
          Reporter: ncrogers at gmail.com

Created attachment 202551
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=202551&action=edit
procstat + gdb

Recently a number of my production 12.0 systems have experienced what appears
to be a ZFS deadlock related to vnodes. It seems similar to the relatively
recent FreeBSD-EN-18:18.zfs (ZFS vnode reclaim deadlock) problem. The same
systems previously ran 11.1-RELEASE without problems.

Threads are always stuck with the stack around
vn_lock->zfs_root->lookup->namei. When the system is in this state, a simple
`ls /` or `ls /tmp` always hangs, but other datasets seem unaffected. I have a
fairly straightforward ZFS root setup on a single pool with one SSD. The
workload is a ruby/rails/nginx/postgresql backed web application combined with
some data warehousing and other periodic tasks.

Sometimes I can still log in over SSH. Other times the login fails because the
user shell fails to load, but I can still run commands via `ssh ... command`.
Sometimes the system is not accessible remotely at all, or it eventually
becomes inaccessible if left in this state long enough. I always have to
physically reboot the machine because the shutdown procedure fails. The network
stack (e.g. ping) seems to work completely fine while this is going on, until
you try to interact with or spawn a process that hits the deadlock.

As with previous similar ZFS deadlock issues, increasing kern.maxvnodes seems
to make the system last longer, by up to a few weeks, but it is still only a
band-aid. However, I have yet to see vnode usage actually get close to the
maximum.
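
For anyone who wants to track how close vnode usage gets to the limit, here is
a minimal sketch. It assumes the counters are read from the vfs.numvnodes and
kern.maxvnodes sysctls as printed by sysctl(8); the function names are
hypothetical and not part of any existing tool.

```python
import subprocess


def parse_sysctl(text):
    """Parse `name: value` lines, as printed by sysctl(8), into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            values[name.strip()] = int(value)
    return values


def vnode_usage(values):
    """Return vnodes in use as a fraction of the configured maximum."""
    return values["vfs.numvnodes"] / values["kern.maxvnodes"]


def current_vnode_usage():
    """Query the running system (assumes a FreeBSD host with sysctl in PATH)."""
    out = subprocess.run(
        ["sysctl", "vfs.numvnodes", "kern.maxvnodes"],
        capture_output=True, text=True, check=True,
    ).stdout
    return vnode_usage(parse_sysctl(out))
```

Calling current_vnode_usage() periodically (e.g. from cron) and logging the
result would show whether usage ever approaches 1.0 before a hang.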

I haven't had any luck reproducing this reliably, but eventually it happens
after a few days or a few weeks... I managed to connect to a system in this
state and grab a procstat and get (hopefully) something useful out of kgdb. I
will note that although I was able to install debug symbols, I couldn't manage
to get the source files onto it for kgdb purposes before I lost SSH access.

Attached is an abbreviated procstat and what I was able to get out of kgdb for
an affected thread. Note that the thread backtrace is from a simple `ls`
command.
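
To pick the affected threads out of a full dump, something like the following
could be used. It is only a sketch: it assumes the input is the text output of
`procstat -kk -a` (columns PID, TID, COMM, TDNAME, KSTACK) and does a crude
substring match on each line; the function name is hypothetical.

```python
def stuck_threads(procstat_text, frames=("vn_lock", "zfs_root")):
    """Return (pid, tid, command) for threads whose stack line mentions every frame."""
    hits = []
    for line in procstat_text.splitlines():
        fields = line.split()
        # Skip the header row and anything that does not start with PID and TID.
        if len(fields) < 4 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        # Crude match: every frame name must appear somewhere on the line.
        if all(frame in line for frame in frames):
            hits.append((int(fields[0]), int(fields[1]), fields[2]))
    return hits
```

Running it over dumps taken at intervals would show whether the set of threads
blocked in vn_lock under zfs_root keeps growing.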
