Experiences with ZFS v28 - including deadlock

Luke Marsden luke-lists at hybrid-logic.co.uk
Fri Jul 15 12:30:44 UTC 2011


Hi all,

Having just quite extensively tested the v28 patchset contained in
http://mfsbsd.vx.sk/iso/mfsbsd-se-8.2-zfsv28-amd64.iso (updated
19.06.2011), I wanted to share my experiences in the hope that the
issues I encountered can be fixed before 8.3 ;-)

The biggest issue was a DEADLOCK, which occurs quite reliably when the
following sequence of events happens in short succession on a chrooted
filesystem with many snapshots, a MySQL socket, and nullfs mounts
inside it:

     1. Force unmount the nullfs mounts which are mounted on top of it
     2. Close the MySQL socket in /tmp
     3. Force unmount the actual filesystem (even if there are open FDs)
     4. 'zfs rename' the filesystem into our 'trash' filesystem (which I
        understand consists of a clone, promote and destroy)

The entire ZFS subsystem then hangs on any new I/O.
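For reference, the four steps above can be sketched as a shell
function; the dataset and mountpoint names here are hypothetical (our
real layout differs), and the mysqld stop mechanism will vary:

```shell
# Sketch only -- hypothetical names, wrapped in a function so the
# sequence can be reviewed as one unit. Do not run against a live pool.
repro_deadlock() {
    fs=hpool/sites/example        # the chrooted filesystem (assumed name)
    mnt=/hpool/sites/example      # its mountpoint (assumed)

    umount -f "$mnt/dev"          # 1. force-unmount the nullfs mounts
    # 2. close the MySQL socket in /tmp (here by stopping mysqld;
    #    the rc.d path is the usual FreeBSD convention, not verified)
    chroot "$mnt" /usr/local/etc/rc.d/mysql-server stop
    zfs umount -f "$fs"           # 3. force unmount despite open FDs
    zfs rename "$fs" hpool/trash/example   # 4. hangs here under load
}
```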

Here is a procstat backtrace of the 'zfs rename' process, which hangs
after the force unmount:

25674 100871 zfs              initial thread   mi_switch+0x176
sleepq_wait+0x42 _cv_wait+0x129 txg_wait_synced+0x85
dsl_sync_task_group_wait+0x128 dsl_sync_task_do+0x54 dsl_dir_rename+0x8f
dsl_dataset_rename+0x272 zfsdev_ioctl+0xe6 devfs_ioctl_f+0x7b kern_ioctl
+0x102 ioctl+0xfd syscallenter+0x1e5 syscall+0x4b Xfast_syscall+0xe2 

Unfortunately it's not easy to reproduce: it only seems to happen in an
environment that is under load, with a lot of datasets and a lot of zfs
operations happening concurrently on other datasets.  I spent two days
trying to reproduce it in self-contained test environments but had no
luck, so I'm reporting it anyway.

There were two other issues which came up:

     1. http://www.freebsd.org/cgi/query-pr.cgi?pr=157728 - we worked
        around this with a semaphore on 'zfs list' and 'zfs recv' so
        they never ran simultaneously.
     2. After an incremental receive, v28 mounts the filesystem even if
        it was unmounted at the start of the receive. (Notably, on
        previous versions of ZFS this only happened for non-incremental
        receives, where the filesystem was being created by the receive
        -- incremental receives correctly left the filesystem in the
        mount state it started in.) This interacts very badly with our
        workflow: if the filesystem gets modified before we can force
        unmount it (which we do immediately), the next receive fails
        with "filesystem has modifications" -- which we handle, but
        doing so on every incremental receive is expensive.
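The semaphore in (1) can be sketched as a simple cross-process mutex
around the zfs commands. This is a minimal sh sketch, assuming a
mkdir-based lock; the lock path, helper name, and dataset names are
illustrative, not from our actual scripts:

```shell
# Workaround sketch for PR 157728: serialize 'zfs list' and 'zfs recv'
# so they never run simultaneously.
LOCKDIR=/tmp/zfs-op.lock   # illustrative lock path

with_zfs_lock() {
    # mkdir is atomic, so it serves as a crude mutex between processes
    until mkdir "$LOCKDIR" 2>/dev/null; do
        sleep 1
    done
    "$@"
    rc=$?
    rmdir "$LOCKDIR"
    return $rc
}

# Usage (assumed dataset names):
#   with_zfs_lock zfs list -t snapshot -r hpool/sites
#   with_zfs_lock zfs recv -F hpool/sites/example
```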
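For issue (2), the post-receive cleanup we run looks roughly like the
following; the dataset name is illustrative, and this is a sketch of
the idea rather than our production code:

```shell
# After an incremental receive, v28 can leave the filesystem mounted,
# so check the read-only 'mounted' property and force-unmount straight
# away, before anything can write to it.
post_recv_unmount() {
    ds=$1
    if [ "$(zfs get -H -o value mounted "$ds")" = "yes" ]; then
        zfs umount -f "$ds"
    fi
}

# Usage (assumed names):
#   zfs recv -F hpool/sites/example < /tmp/incr.zfs \
#       && post_recv_unmount hpool/sites/example
```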

I had a conversation with jhell on IRC about this and he had this to
say:

<jhell> its happened twice before with ZFS basically a lock being held
and never free'd
<jhell> something there is happening between the snapshots and datasets
though. seems that it for some reason is able to destroy the dataset
before it destroys all the snapshots properly
<jhell> then tries to do the renaming of the snapshots and leads to a
lock not being free()'d or similar

Maybe this can offer a hint for someone to go looking in the right
direction to solve this?

Thank you for working on ZFS in FreeBSD!  v15 is working very well for
us.

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Mobile: +447791750420

www.hybrid-cluster.com - Cloud web hosting platform 
