kern/184677 / ZFS snapshot handling deadlocks

Mon Dec 23 16:55:36 UTC 2013

Dear Pawel,

Thanks for your response. I hope someone will review my work.

Secondly, I think I;ve found another deadlock scenario:

When 'zfs send -R' finishes, in zfsdev_close() it has spa_namespace_lock 
held. Then it does an unmount of all snapshots, which in turn goes to 
zfsctl_snapshot_inactive(), in which sdp->sd_lock is being locked.

Concurrently, a zfsctl_snapdir_lookup(), holding the same sdp->sd_lock 
does a snapshot mount, which somewhere tries to acquire 
spa_namespace_lock. So it deadlocks. Currently I dont know how to 
workaround this.

Is spa_namespace_lock need to be held in zfsctl_snapshot_inactive()?

What if we release it function enter, and re-acquire upon exit?

Thanks in advance,

Kojedzinszky Richard
Euronet Magyarorszag Informatikai Zrt.

On Fri, 20 Dec 2013, Pawel Jakub Dawidek wrote:

> Date: Fri, 20 Dec 2013 14:44:05 +0100
> From: Pawel Jakub Dawidek <pjd at FreeBSD.org>
> To: krichy at tvnetwork.hu
> Cc: freebsd-fs at freebsd.org
> Subject: Re: kern/184677 / ZFS snapshot handling deadlocks
> 
> On Thu, Dec 19, 2013 at 04:46:11PM +0100, krichy at tvnetwork.hu wrote:
>> Dear devs,
>>
>> I am attaching a more clear patch/fix for my snapshot handling issues
>> (0002), but I would be happy if some ZFS expert would comment it. I am
>> trying to solve it at least for two weeks now, and an ACK or a NACK would
>> be nice from someone. Also a commit is reverted since it also caused
>> deadlocks. I've read its comments, which also eliminates deadlocks, but I
>> did not find any reference how to produce that deadlock. In my view
>> reverting that makes my issues disappear, but I dont know what new cases
>> will it rise.
>
> Richard,
>
> I won't be able to analyse it myself anytime soon, because of other
> obligations, but I forwarded you e-mail to the zfs-devel@ mailing list
> (it is closed, but gathers FreeBSD ZFS devs). Hopefully someone from
> there will be able to help you.
>
>> I've rewritten traverse() to make more like upstream, added two extra
>> VN_HOLD()s to snapdir_lookup() when traverse returned the same vnode what
>> was passed to it (I dont know even that upon creating a snapshot vnode why
>> is that extra two holds needed for the vnode.) And unfortunately, due to
>> FreeBSD calls vop_inactive callbacks with vnodes locked, that could also
>> cause deadlocks, so zfsctl_snapshot_inactive() and
>> zfsctl_snapshot_vptocnp() has been rewritten to work that around.
>>
>> After this, one may also get a deadlock, when a simple access would call
>> into zfsctl_snapshot_lookup(). The documentation says, that those vnodes
>> should always be covered, but after some stress test, sometimes we hit
>> that call, and that can cause again deadlocks. In our environment I've
>> just uncommented that callback, which returns ENODIR on some calls, but at
>> least not a deadlock.
>>
>> The attached script can be used to reproduce my cases (would one confirm
>> that?), and after the patches applied, they disappear (confirm?).
>>
>> Thanks,
>>
>>
>> Kojedzinszky Richard
>> Euronet Magyarorszag Informatikai Zrt.
>>
>> On Tue, 17 Dec 2013, krichy at tvnetwork.hu wrote:
>>
>>> Date: Tue, 17 Dec 2013 14:50:16 +0100 (CET)
>>> From: krichy at tvnetwork.hu
>>> To: pjd at freebsd.org
>>> Cc: freebsd-fs at freebsd.org
>>> Subject: Re: kern/184677 (fwd)
>>>
>>> Dear devs,
>>>
>>> I will sum up my experience regarding the issue:
>>>
>>> The sympton is that a concurrent 'zfs send -R' and some activity on the
>>> snapshot dir (or in the snapshot) may cause a deadlock.
>>>
>>> After investigating the problem, I found that zfs send umounts the snapshots,
>>> and that causes the deadlock, so later I tested only with concurrent umount
>>> and the "activity". More later I found that listing the snapshots in
>>> .zfs/snapshot/ and unounting them can cause the found deadlock, so I used
>>> them for the tests. But for my surprise, instead of a deadlock, a recursive
>>> lock panic has arised.
>>>
>>> The vnode for the ".zfs/snapshot/" directory contains zfs's zfsctl_snapdir_t
>>> structure (sdp). This contains a tree of mounted snapshots, and each entry
>>> (sep) contains the vnode of entry on which the snapshot is mounted on top
>>> (se_root). The strange is that the se_root member does not hold a reference
>>> for the vnode, just a simple pointer to it.
>>>
>>> Upon entry lookup (zfsctl_snapdir_lookup()) the "snapshot" vnode is locked,
>>> the zfsctl_snapdir_t's tree is locked, and searched for the mount if it
>>> exists already. If it founds no entry, does the mount. In the case of an
>>> entry was found, the se_root member contains the vnode which the snapshot is
>>> mounted on. Thus, a reference is taken for it, and the traverse() call will
>>> resolve to the real root vnode of the mounted snapshot, returning it as
>>> locked. (Examining the traverse() code I've found that it did not follow
>>> FreeBSD's lock order recommendation described in sys/kern/vfs_subr.c.)
>>>
>>> On the other way, when an umount is issued, the se_root vnode looses its last
>>> reference (as only the mountpoint holds one for it), it goes through the
>>> vinactive() path, to zfsctl_snapshot_inactive(). In FreeBSD this is called
>>> with a locked vnode, so this is a deadlock race condition. While
>>> zfsctl_snapdir_lookup() holds the mutex for the sdp tree, and traverse()
>>> tries to acquire the se_root, zfsctl_snapshot_inactive() holds the lock on
>>> se_root while tries to access the sdp lock.
>>>
>>> The zfsctl_snapshot_inactive() has an if statement checking the v_usecount,
>>> which is incremented in zfsctl_snapdir_lookup(), but in that context it is
>>> not covered by VI_LOCK. And it seems to me that FreeBSD's vinactive() path
>>> assumes that the vnode remains inactive (as opposed to illumos, at least how
>>> i read the code). So zfsctl_snapshot_inactive() must free resources while in
>>> a locked state. I was a bit confused, and probably that is why the previously
>>> posted patch is as is.
>>>
>>> Maybe if I had some clues on the directions of this problem, I could have
>>> worked more for a nicer, shorter solution.
>>>
>>> Please someone comment on my post.
>>>
>>> Regards,
>>>
>>>
>>>
>>> Kojedzinszky Richard
>>> Euronet Magyarorszag Informatikai Zrt.
>>>
>>> On Mon, 16 Dec 2013, krichy at tvnetwork.hu wrote:
>>>
>>>> Date: Mon, 16 Dec 2013 16:52:16 +0100 (CET)
>>>> From: krichy at tvnetwork.hu
>>>> To: pjd at freebsd.org
>>>> Cc: freebsd-fs at freebsd.org
>>>> Subject: Re: kern/184677 (fwd)
>>>>
>>>> Dear PJD,
>>>>
>>>> I am a happy FreeBSD user, I am sure you've read my previous posts
>>>> regarding some issues in ZFS. Please give some advice for me, where to look
>>>> for solutions, or how could I help to resolve those issues.
>>>>
>>>> Regards,
>>>> Kojedzinszky Richard
>>>> Euronet Magyarorszag Informatikai Zrt.
>>>>
>>>> ---------- Forwarded message ----------
>>>> Date: Mon, 16 Dec 2013 15:23:06 +0100 (CET)
>>>> From: krichy at tvnetwork.hu
>>>> To: freebsd-fs at freebsd.org
>>>> Subject: Re: kern/184677
>>>>
>>>>
>>>> Seems that pjd did a change which eliminated the zfsdev_state_lock on Fri
>>>> Aug 12 07:04:16 2011 +0000, which might introduced a new deadlock
>>>> situation. Any comments on this?
>>>>
>>>>
>>>> Kojedzinszky Richard
>>>> Euronet Magyarorszag Informatikai Zrt.
>>>>
>>>> On Mon, 16 Dec 2013, krichy at tvnetwork.hu wrote:
>>>>
>>>>> Date: Mon, 16 Dec 2013 11:08:11 +0100 (CET)
>>>>> From: krichy at tvnetwork.hu
>>>>> To: freebsd-fs at freebsd.org
>>>>> Subject: kern/184677
>>>>>
>>>>> Dear devs,
>>>>>
>>>>> I've attached a patch, which makes the recursive lockmgr disappear, and
>>>>> makes the reported bug to disappear. I dont know if I followed any
>>>>> guidelines well, or not, but at least it works for me. Please some
>>>>> ZFS/FreeBSD fs expert review it, and fix it where it needed.
>>>>>
>>>>> But unfortunately, my original problem is still not solved, maybe the same
>>>>> as Ryan's:
>>>>> http://lists.freebsd.org/pipermail/freebsd-fs/2013-December/018707.html
>>>>>
>>>>> Tracing the problem down is that zfsctl_snapdir_lookup() tries to acquire
>>>>> spa_namespace_lock while when finishing a zfs send -R does a
>>>>> zfsdev_close(), and that also holds the same mutex. And this causes a
>>>>> deadlock scenario. I looked at illumos's code, and for some reason they
>>>>> use another mutex on zfsdev_close(), which therefore may not deadlock with
>>>>> zfsctl_snapdir_lookup(). But I am still investigating the problem.
>>>>>
>>>>> I would like to help making ZFS more stable on freebsd also with its whole
>>>>> functionality. I would be very thankful if some expert would give some
>>>>> advice, how to solve these bugs. PJD, Steven, Xin?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>>
>>>>> Kojedzinszky Richard
>>>>> Euronet Magyarorszag Informatikai Zrt.
>>>> _______________________________________________
>>>> freebsd-fs at freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>>>>
>>>
>
>
>> From 39298da838d006ad225e41529d7b7f240fccfe73 Mon Sep 17 00:00:00 2001
>> From: Richard Kojedzinszky <krichy at cflinux.hu>
>> Date: Mon, 16 Dec 2013 15:39:11 +0100
>> Subject: [PATCH 1/2] Revert "Eliminate the zfsdev_state_lock entirely and
>>  replace it with the"
>>
>> This reverts commit 1d8972b3f353f986eb5b85bc108b1c0d946d3218.
>>
>> Conflicts:
>> 	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>> ---
>>  .../opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h  |   1 +
>>  .../opensolaris/uts/common/fs/zfs/vdev_geom.c      |  14 ++-
>>  .../opensolaris/uts/common/fs/zfs/zfs_ioctl.c      |  16 +--
>>  .../contrib/opensolaris/uts/common/fs/zfs/zvol.c   | 119 +++++++++------------
>>  4 files changed, 70 insertions(+), 80 deletions(-)
>>
>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
>> index af2def2..8272c4d 100644
>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
>> @@ -383,6 +383,7 @@ extern void *zfsdev_get_soft_state(minor_t minor,
>>  extern minor_t zfsdev_minor_alloc(void);
>>
>>  extern void *zfsdev_state;
>> +extern kmutex_t zfsdev_state_lock;
>>
>>  #endif	/* _KERNEL */
>>
>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>> index 15685a5..5c3e9f3 100644
>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>> @@ -581,7 +581,7 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
>>  	struct g_provider *pp;
>>  	struct g_consumer *cp;
>>  	size_t bufsize;
>> -	int error;
>> +	int error, lock;
>>
>>  	/*
>>  	 * We must have a pathname, and it must be absolute.
>> @@ -593,6 +593,12 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
>>
>>  	vd->vdev_tsd = NULL;
>>
>> +	if (mutex_owned(&spa_namespace_lock)) {
>> +		mutex_exit(&spa_namespace_lock);
>> +		lock = 1;
>> +	} else {
>> +		lock = 0;
>> +	}
>>  	DROP_GIANT();
>>  	g_topology_lock();
>>  	error = 0;
>> @@ -624,7 +630,11 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
>>  	    !ISP2(cp->provider->sectorsize)) {
>>  		ZFS_LOG(1, "Provider %s has unsupported sectorsize.",
>>  		    vd->vdev_path);
>> +
>> +		g_topology_lock();
>>  		vdev_geom_detach(cp, 0);
>> +		g_topology_unlock();
>> +
>>  		error = EINVAL;
>>  		cp = NULL;
>>  	} else if (cp->acw == 0 && (spa_mode(vd->vdev_spa) & FWRITE) != 0) {
>> @@ -647,6 +657,8 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize,
>>  	}
>>  	g_topology_unlock();
>>  	PICKUP_GIANT();
>> +	if (lock)
>> +		mutex_enter(&spa_namespace_lock);
>>  	if (cp == NULL) {
>>  		vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
>>  		return (error);
>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
>> index e9fba26..91becde 100644
>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
>> @@ -5635,7 +5635,7 @@ zfsdev_minor_alloc(void)
>>  	static minor_t last_minor;
>>  	minor_t m;
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>
>>  	for (m = last_minor + 1; m != last_minor; m++) {
>>  		if (m > ZFSDEV_MAX_MINOR)
>> @@ -5655,7 +5655,7 @@ zfs_ctldev_init(struct cdev *devp)
>>  	minor_t minor;
>>  	zfs_soft_state_t *zs;
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>
>>  	minor = zfsdev_minor_alloc();
>>  	if (minor == 0)
>> @@ -5676,7 +5676,7 @@ zfs_ctldev_init(struct cdev *devp)
>>  static void
>>  zfs_ctldev_destroy(zfs_onexit_t *zo, minor_t minor)
>>  {
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>
>>  	zfs_onexit_destroy(zo);
>>  	ddi_soft_state_free(zfsdev_state, minor);
>> @@ -5706,9 +5706,9 @@ zfsdev_open(struct cdev *devp, int flag, int mode, struct thread *td)
>>
>>  	/* This is the control device. Allocate a new minor if requested. */
>>  	if (flag & FEXCL) {
>> -		mutex_enter(&spa_namespace_lock);
>> +		mutex_enter(&zfsdev_state_lock);
>>  		error = zfs_ctldev_init(devp);
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  	}
>>
>>  	return (error);
>> @@ -5723,14 +5723,14 @@ zfsdev_close(void *data)
>>  	if (minor == 0)
>>  		return;
>>
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>  	zo = zfsdev_get_soft_state(minor, ZSST_CTLDEV);
>>  	if (zo == NULL) {
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return;
>>  	}
>>  	zfs_ctldev_destroy(zo, minor);
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  }
>>
>>  static int
>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>> index 72d4502..aec5219 100644
>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>> @@ -104,11 +104,12 @@ static char *zvol_tag = "zvol_tag";
>>  #define	ZVOL_DUMPSIZE		"dumpsize"
>>
>>  /*
>> - * The spa_namespace_lock protects the zfsdev_state structure from being
>> - * modified while it's being used, e.g. an open that comes in before a
>> - * create finishes.  It also protects temporary opens of the dataset so that,
>> + * This lock protects the zfsdev_state structure from being modified
>> + * while it's being used, e.g. an open that comes in before a create
>> + * finishes.  It also protects temporary opens of the dataset so that,
>>   * e.g., an open doesn't get a spurious EBUSY.
>>   */
>> +kmutex_t zfsdev_state_lock;
>>  static uint32_t zvol_minors;
>>
>>  typedef struct zvol_extent {
>> @@ -249,7 +250,7 @@ zvol_minor_lookup(const char *name)
>>  	struct g_geom *gp;
>>  	zvol_state_t *zv = NULL;
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>
>>  	g_topology_lock();
>>  	LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) {
>> @@ -465,11 +466,11 @@ zvol_name2minor(const char *name, minor_t *minor)
>>  {
>>  	zvol_state_t *zv;
>>
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>  	zv = zvol_minor_lookup(name);
>>  	if (minor && zv)
>>  		*minor = zv->zv_minor;
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	return (zv ? 0 : -1);
>>  }
>>  #endif	/* sun */
>> @@ -489,10 +490,10 @@ zvol_create_minor(const char *name)
>>
>>  	ZFS_LOG(1, "Creating ZVOL %s...", name);
>>
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>
>>  	if (zvol_minor_lookup(name) != NULL) {
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(EEXIST));
>>  	}
>>
>> @@ -500,20 +501,20 @@ zvol_create_minor(const char *name)
>>  	error = dmu_objset_own(name, DMU_OST_ZVOL, B_TRUE, FTAG, &os);
>>
>>  	if (error) {
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (error);
>>  	}
>>
>>  #ifdef sun
>>  	if ((minor = zfsdev_minor_alloc()) == 0) {
>>  		dmu_objset_disown(os, FTAG);
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(ENXIO));
>>  	}
>>
>>  	if (ddi_soft_state_zalloc(zfsdev_state, minor) != DDI_SUCCESS) {
>>  		dmu_objset_disown(os, FTAG);
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(EAGAIN));
>>  	}
>>  	(void) ddi_prop_update_string(minor, zfs_dip, ZVOL_PROP_NAME,
>> @@ -525,7 +526,7 @@ zvol_create_minor(const char *name)
>>  	    minor, DDI_PSEUDO, 0) == DDI_FAILURE) {
>>  		ddi_soft_state_free(zfsdev_state, minor);
>>  		dmu_objset_disown(os, FTAG);
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(EAGAIN));
>>  	}
>>
>> @@ -536,7 +537,7 @@ zvol_create_minor(const char *name)
>>  		ddi_remove_minor_node(zfs_dip, chrbuf);
>>  		ddi_soft_state_free(zfsdev_state, minor);
>>  		dmu_objset_disown(os, FTAG);
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(EAGAIN));
>>  	}
>>
>> @@ -587,7 +588,7 @@ zvol_create_minor(const char *name)
>>
>>  	zvol_minors++;
>>
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>
>>  	zvol_geom_run(zv);
>>
>> @@ -609,7 +610,7 @@ zvol_remove_zv(zvol_state_t *zv)
>>  	minor_t minor = zv->zv_minor;
>>  #endif
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>  	if (zv->zv_total_opens != 0)
>>  		return (SET_ERROR(EBUSY));
>>
>> @@ -635,15 +636,15 @@ zvol_remove_minor(const char *name)
>>  	zvol_state_t *zv;
>>  	int rc;
>>
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>  	if ((zv = zvol_minor_lookup(name)) == NULL) {
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(ENXIO));
>>  	}
>>  	g_topology_lock();
>>  	rc = zvol_remove_zv(zv);
>>  	g_topology_unlock();
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	return (rc);
>>  }
>>
>> @@ -755,7 +756,7 @@ zvol_update_volsize(objset_t *os, uint64_t volsize)
>>  	dmu_tx_t *tx;
>>  	int error;
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>
>>  	tx = dmu_tx_create(os);
>>  	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
>> @@ -786,7 +787,7 @@ zvol_remove_minors(const char *name)
>>  	namelen = strlen(name);
>>
>>  	DROP_GIANT();
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>  	g_topology_lock();
>>
>>  	LIST_FOREACH_SAFE(gp, &zfs_zvol_class.geom, geom, gptmp) {
>> @@ -804,7 +805,7 @@ zvol_remove_minors(const char *name)
>>  	}
>>
>>  	g_topology_unlock();
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	PICKUP_GIANT();
>>  }
>>
>> @@ -818,10 +819,10 @@ zvol_set_volsize(const char *name, major_t maj, uint64_t volsize)
>>  	uint64_t old_volsize = 0ULL;
>>  	uint64_t readonly;
>>
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>  	zv = zvol_minor_lookup(name);
>>  	if ((error = dmu_objset_hold(name, FTAG, &os)) != 0) {
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (error);
>>  	}
>>
>> @@ -888,7 +889,7 @@ zvol_set_volsize(const char *name, major_t maj, uint64_t volsize)
>>  out:
>>  	dmu_objset_rele(os, FTAG);
>>
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>
>>  	return (error);
>>  }
>> @@ -899,36 +900,19 @@ zvol_open(struct g_provider *pp, int flag, int count)
>>  {
>>  	zvol_state_t *zv;
>>  	int err = 0;
>> -	boolean_t locked = B_FALSE;
>>
>> -	/*
>> -	 * Protect against recursively entering spa_namespace_lock
>> -	 * when spa_open() is used for a pool on a (local) ZVOL(s).
>> -	 * This is needed since we replaced upstream zfsdev_state_lock
>> -	 * with spa_namespace_lock in the ZVOL code.
>> -	 * We are using the same trick as spa_open().
>> -	 * Note that calls in zvol_first_open which need to resolve
>> -	 * pool name to a spa object will enter spa_open()
>> -	 * recursively, but that function already has all the
>> -	 * necessary protection.
>> -	 */
>> -	if (!MUTEX_HELD(&spa_namespace_lock)) {
>> -		mutex_enter(&spa_namespace_lock);
>> -		locked = B_TRUE;
>> -	}
>> +	mutex_enter(&zfsdev_state_lock);
>>
>>  	zv = pp->private;
>>  	if (zv == NULL) {
>> -		if (locked)
>> -			mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(ENXIO));
>>  	}
>>
>>  	if (zv->zv_total_opens == 0)
>>  		err = zvol_first_open(zv);
>>  	if (err) {
>> -		if (locked)
>> -			mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (err);
>>  	}
>>  	if ((flag & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) {
>> @@ -950,15 +934,13 @@ zvol_open(struct g_provider *pp, int flag, int count)
>>  #endif
>>
>>  	zv->zv_total_opens += count;
>> -	if (locked)
>> -		mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>
>>  	return (err);
>>  out:
>>  	if (zv->zv_total_opens == 0)
>>  		zvol_last_close(zv);
>> -	if (locked)
>> -		mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	return (err);
>>  }
>>
>> @@ -968,18 +950,12 @@ zvol_close(struct g_provider *pp, int flag, int count)
>>  {
>>  	zvol_state_t *zv;
>>  	int error = 0;
>> -	boolean_t locked = B_FALSE;
>>
>> -	/* See comment in zvol_open(). */
>> -	if (!MUTEX_HELD(&spa_namespace_lock)) {
>> -		mutex_enter(&spa_namespace_lock);
>> -		locked = B_TRUE;
>> -	}
>> +	mutex_enter(&zfsdev_state_lock);
>>
>>  	zv = pp->private;
>>  	if (zv == NULL) {
>> -		if (locked)
>> -			mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(ENXIO));
>>  	}
>>
>> @@ -1002,8 +978,7 @@ zvol_close(struct g_provider *pp, int flag, int count)
>>  	if (zv->zv_total_opens == 0)
>>  		zvol_last_close(zv);
>>
>> -	if (locked)
>> -		mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	return (error);
>>  }
>>
>> @@ -1658,12 +1633,12 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
>>  	int error = 0;
>>  	rl_t *rl;
>>
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>
>>  	zv = zfsdev_get_soft_state(getminor(dev), ZSST_ZVOL);
>>
>>  	if (zv == NULL) {
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		return (SET_ERROR(ENXIO));
>>  	}
>>  	ASSERT(zv->zv_total_opens > 0);
>> @@ -1677,7 +1652,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
>>  		dki.dki_ctype = DKC_UNKNOWN;
>>  		dki.dki_unit = getminor(dev);
>>  		dki.dki_maxtransfer = 1 << (SPA_MAXBLOCKSHIFT - zv->zv_min_bs);
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag))
>>  			error = SET_ERROR(EFAULT);
>>  		return (error);
>> @@ -1687,7 +1662,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
>>  		dkm.dki_lbsize = 1U << zv->zv_min_bs;
>>  		dkm.dki_capacity = zv->zv_volsize >> zv->zv_min_bs;
>>  		dkm.dki_media_type = DK_UNKNOWN;
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		if (ddi_copyout(&dkm, (void *)arg, sizeof (dkm), flag))
>>  			error = SET_ERROR(EFAULT);
>>  		return (error);
>> @@ -1697,14 +1672,14 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
>>  			uint64_t vs = zv->zv_volsize;
>>  			uint8_t bs = zv->zv_min_bs;
>>
>> -			mutex_exit(&spa_namespace_lock);
>> +			mutex_exit(&zfsdev_state_lock);
>>  			error = zvol_getefi((void *)arg, flag, vs, bs);
>>  			return (error);
>>  		}
>>
>>  	case DKIOCFLUSHWRITECACHE:
>>  		dkc = (struct dk_callback *)arg;
>> -		mutex_exit(&spa_namespace_lock);
>> +		mutex_exit(&zfsdev_state_lock);
>>  		zil_commit(zv->zv_zilog, ZVOL_OBJ);
>>  		if ((flag & FKIOCTL) && dkc != NULL && dkc->dkc_callback) {
>>  			(*dkc->dkc_callback)(dkc->dkc_cookie, error);
>> @@ -1730,10 +1705,10 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
>>  			}
>>  			if (wce) {
>>  				zv->zv_flags |= ZVOL_WCE;
>> -				mutex_exit(&spa_namespace_lock);
>> +				mutex_exit(&zfsdev_state_lock);
>>  			} else {
>>  				zv->zv_flags &= ~ZVOL_WCE;
>> -				mutex_exit(&spa_namespace_lock);
>> +				mutex_exit(&zfsdev_state_lock);
>>  				zil_commit(zv->zv_zilog, ZVOL_OBJ);
>>  			}
>>  			return (0);
>> @@ -1828,7 +1803,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int flag, cred_t *cr, int *rvalp)
>>  		break;
>>
>>  	}
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	return (error);
>>  }
>>  #endif	/* sun */
>> @@ -1844,12 +1819,14 @@ zvol_init(void)
>>  {
>>  	VERIFY(ddi_soft_state_init(&zfsdev_state, sizeof (zfs_soft_state_t),
>>  	    1) == 0);
>> +	mutex_init(&zfsdev_state_lock, NULL, MUTEX_DEFAULT, NULL);
>>  	ZFS_LOG(1, "ZVOL Initialized.");
>>  }
>>
>>  void
>>  zvol_fini(void)
>>  {
>> +	mutex_destroy(&zfsdev_state_lock);
>>  	ddi_soft_state_fini(&zfsdev_state);
>>  	ZFS_LOG(1, "ZVOL Deinitialized.");
>>  }
>> @@ -1889,7 +1866,7 @@ zvol_dump_init(zvol_state_t *zv, boolean_t resize)
>>  	uint64_t version = spa_version(spa);
>>  	enum zio_checksum checksum;
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>  	ASSERT(vd->vdev_ops == &vdev_root_ops);
>>
>>  	error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, 0,
>> @@ -2437,7 +2414,7 @@ zvol_rename_minor(struct g_geom *gp, const char *newname)
>>  	struct g_provider *pp;
>>  	zvol_state_t *zv;
>>
>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>  	g_topology_assert();
>>
>>  	pp = LIST_FIRST(&gp->provider);
>> @@ -2471,7 +2448,7 @@ zvol_rename_minors(const char *oldname, const char *newname)
>>  	newnamelen = strlen(newname);
>>
>>  	DROP_GIANT();
>> -	mutex_enter(&spa_namespace_lock);
>> +	mutex_enter(&zfsdev_state_lock);
>>  	g_topology_lock();
>>
>>  	LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) {
>> @@ -2494,6 +2471,6 @@ zvol_rename_minors(const char *oldname, const char *newname)
>>  	}
>>
>>  	g_topology_unlock();
>> -	mutex_exit(&spa_namespace_lock);
>> +	mutex_exit(&zfsdev_state_lock);
>>  	PICKUP_GIANT();
>>  }
>> --
>> 1.8.4.2
>>
>
>> From 57d5a383b585c32c77af54e8cdacaddf8ce7584f Mon Sep 17 00:00:00 2001
>> From: Richard Kojedzinszky <krichy at cflinux.hu>
>> Date: Wed, 18 Dec 2013 22:11:21 +0100
>> Subject: [PATCH 2/2] ZFS snapshot handling fix
>>
>> ---
>>  .../compat/opensolaris/kern/opensolaris_lookup.c   | 13 +++---
>>  .../opensolaris/uts/common/fs/zfs/zfs_ctldir.c     | 53 +++++++++++++++-------
>>  2 files changed, 42 insertions(+), 24 deletions(-)
>>
>> diff --git a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c b/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
>> index 94383d6..4cac053 100644
>> --- a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
>> +++ b/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
>> @@ -81,6 +81,8 @@ traverse(vnode_t **cvpp, int lktype)
>>  	 * progress on this vnode.
>>  	 */
>>
>> +	vn_lock(cvp, lktype);
>> +
>>  	for (;;) {
>>  		/*
>>  		 * Reached the end of the mount chain?
>> @@ -89,13 +91,7 @@ traverse(vnode_t **cvpp, int lktype)
>>  		if (vfsp == NULL)
>>  			break;
>>  		error = vfs_busy(vfsp, 0);
>> -		/*
>> -		 * tvp is NULL for *cvpp vnode, which we can't unlock.
>> -		 */
>> -		if (tvp != NULL)
>> -			vput(cvp);
>> -		else
>> -			vrele(cvp);
>> +		VOP_UNLOCK(cvp, 0);
>>  		if (error)
>>  			return (error);
>>
>> @@ -107,6 +103,9 @@ traverse(vnode_t **cvpp, int lktype)
>>  		vfs_unbusy(vfsp);
>>  		if (error != 0)
>>  			return (error);
>> +
>> +		VN_RELE(cvp);
>> +
>>  		cvp = tvp;
>>  	}
>>
>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
>> index 28ab1fa..d3464b4 100644
>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
>> @@ -112,6 +112,25 @@ snapentry_compare(const void *a, const void *b)
>>  		return (0);
>>  }
>>
>> +/* Return the zfsctl_snapdir_t object from current vnode, following
>> + * the lock orders in zfsctl_snapdir_lookup() to avoid deadlocks.
>> + * On return the passed in vp is unlocked */
>> +static zfsctl_snapdir_t *
>> +zfsctl_snapshot_get_snapdir(vnode_t *vp, vnode_t **dvpp)
>> +{
>> +	gfs_dir_t *dp = vp->v_data;
>> +	*dvpp = dp->gfsd_file.gfs_parent;
>> +	zfsctl_snapdir_t *sdp;
>> +
>> +	VN_HOLD(*dvpp);
>> +	VOP_UNLOCK(vp, 0);
>> +	vn_lock(*dvpp, LK_SHARED | LK_RETRY | LK_CANRECURSE);
>> +	sdp = (*dvpp)->v_data;
>> +	VOP_UNLOCK(*dvpp, 0);
>> +
>> +	return (sdp);
>> +}
>> +
>>  #ifdef sun
>>  vnodeops_t *zfsctl_ops_root;
>>  vnodeops_t *zfsctl_ops_snapdir;
>> @@ -1013,6 +1032,8 @@ zfsctl_snapdir_lookup(ap)
>>  			 * The snapshot was unmounted behind our backs,
>>  			 * try to remount it.
>>  			 */
>> +			VOP_UNLOCK(*vpp, 0);
>> +			VN_HOLD(*vpp);
>>  			VERIFY(zfsctl_snapshot_zname(dvp, nm, MAXNAMELEN, snapname) == 0);
>>  			goto domount;
>>  		} else {
>> @@ -1064,7 +1085,6 @@ zfsctl_snapdir_lookup(ap)
>>  	sep->se_name = kmem_alloc(strlen(nm) + 1, KM_SLEEP);
>>  	(void) strcpy(sep->se_name, nm);
>>  	*vpp = sep->se_root = zfsctl_snapshot_mknode(dvp, dmu_objset_id(snap));
>> -	VN_HOLD(*vpp);
>>  	avl_insert(&sdp->sd_snaps, sep, where);
>>
>>  	dmu_objset_rele(snap, FTAG);
>> @@ -1075,6 +1095,7 @@ domount:
>>  	(void) snprintf(mountpoint, mountpoint_len,
>>  	    "%s/" ZFS_CTLDIR_NAME "/snapshot/%s",
>>  	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
>> +	VN_HOLD(*vpp);
>>  	err = mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
>>  	kmem_free(mountpoint, mountpoint_len);
>>  	if (err == 0) {
>> @@ -1464,16 +1485,18 @@ zfsctl_snapshot_inactive(ap)
>>  	int locked;
>>  	vnode_t *dvp;
>>
>> -	if (vp->v_count > 0)
>> -		goto end;
>> -
>> -	VERIFY(gfs_dir_lookup(vp, "..", &dvp, cr, 0, NULL, NULL) == 0);
>> -	sdp = dvp->v_data;
>> -	VOP_UNLOCK(dvp, 0);
>> +	sdp = zfsctl_snapshot_get_snapdir(vp, &dvp);
>>
>>  	if (!(locked = MUTEX_HELD(&sdp->sd_lock)))
>>  		mutex_enter(&sdp->sd_lock);
>>
>> +	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>> +
>> +	if (vp->v_count > 0) {
>> +		mutex_exit(&sdp->sd_lock);
>> +		return (0);
>> +	}
>> +
>>  	ASSERT(!vn_ismntpt(vp));
>>
>>  	sep = avl_first(&sdp->sd_snaps);
>> @@ -1494,7 +1517,6 @@ zfsctl_snapshot_inactive(ap)
>>  		mutex_exit(&sdp->sd_lock);
>>  	VN_RELE(dvp);
>>
>> -end:
>>  	/*
>>  	 * Dispose of the vnode for the snapshot mount point.
>>  	 * This is safe to do because once this entry has been removed
>> @@ -1595,20 +1617,17 @@ zfsctl_snapshot_lookup(ap)
>>  static int
>>  zfsctl_snapshot_vptocnp(struct vop_vptocnp_args *ap)
>>  {
>> -	zfsvfs_t *zfsvfs = ap->a_vp->v_vfsp->vfs_data;
>> -	vnode_t *dvp, *vp;
>> +	vnode_t *dvp, *vp = ap->a_vp;
>>  	zfsctl_snapdir_t *sdp;
>>  	zfs_snapentry_t *sep;
>> -	int error;
>> +	int error = 0;
>>
>> -	ASSERT(zfsvfs->z_ctldir != NULL);
>> -	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
>> -	    NULL, 0, NULL, kcred, NULL, NULL, NULL);
>> -	if (error != 0)
>> -		return (error);
>> -	sdp = dvp->v_data;
>> +	sdp = zfsctl_snapshot_get_snapdir(vp, &dvp);
>>
>>  	mutex_enter(&sdp->sd_lock);
>> +
>> +	vn_lock(vp, LK_SHARED | LK_RETRY);
>> +
>>  	sep = avl_first(&sdp->sd_snaps);
>>  	while (sep != NULL) {
>>  		vp = sep->se_root;
>> --
>> 1.8.4.2
>>
>
> -- 
> Pawel Jakub Dawidek                       http://www.wheelsystems.com
> FreeBSD committer                         http://www.FreeBSD.org
> Am I Evil? Yes, I Am!                     http://mobter.com
>