[Bug 229614] ZFS lockup in zil_commit_impl

bugzilla-noreply at freebsd.org
Sun Jul 8 20:42:57 UTC 2018


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229614

            Bug ID: 229614
           Summary: ZFS lockup in zil_commit_impl
           Product: Base System
           Version: 11.2-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: kern
          Assignee: bugs at FreeBSD.org
          Reporter: andreas.sommer87 at googlemail.com
                CC: avg at FreeBSD.org, grembo at FreeBSD.org

Created attachment 194962
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=194962&action=edit
Debugging attempts (command line output)

Relevant part of my investigation so far (the attached file contains the output
of some more commands I tried while debugging):

# procstat -kk 69994
  PID    TID COMM                TDNAME              KSTACK
[...]
69994 101224 python3.6           -                   mi_switch+0xe6
sleepq_wait+0x2c _sx_xlock_hard+0x306 zil_commit_impl+0x11d
zfs_freebsd_putpages+0x635 VOP_PUTPAGES_APV+0x82 vnode_pager_putpages+0x8e
vm_pageout_flush+0xea vm_object_page_collect_flush+0x213
vm_object_page_clean+0x146 vm_object_terminate+0x93 zfs_freebsd_reclaim+0x1e
VOP_RECLAIM_APV+0x82 vgonel+0x208 vrecycle+0x4a zfs_freebsd_inactive+0xd
VOP_INACTIVE_APV+0x82 vinactive+0xfc
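
For anyone who wants to check whether other threads are blocked at the same
spot, here is a minimal sketch in Python that scans `procstat -kk` output for
kernel stacks containing zil_commit_impl. The helper name find_blocked_threads
is hypothetical and not part of the attached output, and the parsing assumes
the column layout shown above (PID, TID, COMM, TDNAME without spaces, then the
stack, possibly wrapped onto following lines).

```python
import re

# Hypothetical helper, written for illustration only. It assumes the
# `procstat -kk` layout shown in this report: PID, TID, COMM, TDNAME
# (without embedded spaces), then the kernel stack frames.
THREAD_LINE = re.compile(r"\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s*(.*)")


def find_blocked_threads(procstat_output, frame="zil_commit_impl"):
    """Return (pid, tid, command) for each thread whose stack contains `frame`."""
    frame_pattern = re.compile(r"\b" + re.escape(frame) + r"\+0x")
    blocked = []
    current = None  # thread whose stack frames are currently being read
    for line in procstat_output.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            # A new thread starts; the rest of the line holds its first frames.
            current = (int(match.group(1)), int(match.group(2)), match.group(3))
            frames = match.group(5)
        else:
            # Header, "[...]" marker, or a wrapped continuation of the stack.
            frames = line
        if current and frame_pattern.search(frames) and current not in blocked:
            blocked.append(current)
    return blocked
```

One way to use it would be to save the output of `procstat -kk -a` to a file,
read that file into a string, and pass the string to find_blocked_threads().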

Luckily, this is a CI instance in AWS EC2, not a production machine. It has
happened *multiple* times in the last few weeks, roughly once per week. I will
probably reset the machine very soon, but I expect to run into the problem
again, so let me know if you want me to debug something hands-on. The earliest
occurrence that I can still see in the monitoring graphs was on 2018-06-24,
i.e. two days before I upgraded to 11.2. Before that, I had run 10.3 until the
upgrade to 11.1 on 2018-06-13. Honestly, I don't recall this happening while we
were still on 10.3, but I'm human and could be mistaken. A hard restart
resolves the problem. In my specific case, I noticed it because the
builders/workers were no longer showing up in my Buildbot web interface, and a
quick look showed that the buildbot master process was hanging. Other things
like SSH and the web interface were still working. Running `sync` manually
also hangs; see the attached command line output.

I've found these possibly related issues:
* Sporadic system hang -
https://github.com/zfsonlinux/zfs/issues/7425#issuecomment-403312992
* Process hang in state “zilog->zl_writer_lock” on Unstable -
https://discourse.trueos.org/t/process-hang-in-state-zilog-zl-writer-lock-on-unstable/2193/20
