kern/184677 / ZFS snapshot handling deadlocks

krichy at tvnetwork.hu krichy at tvnetwork.hu
Tue Dec 24 06:10:28 UTC 2013


Dear devs,

A third one, again the snapshot dirs:

# cd /tmp/test/.zfs/snapshot && yes .. | xargs stat > /dev/null &
# yes /tmp/test/.zfs/snapshot/snap | xargs umount > /dev/null

Will deadlock in a few seconds.

The first command holds the lock on the .zfs/snapshot vnode and, when looking up 
'..', tries to lock .zfs, while the second tries to take the same two locks in the reverse order.

In a vop_lookup, why must both the passed-in vnode and the returned vnode 
be locked? Why is it not enough to have them held?

Thanks in advance,
Kojedzinszky Richard
Euronet Magyarorszag Informatikai Zrt.

On Mon, 23 Dec 2013, krichy at tvnetwork.hu wrote:

> Date: Mon, 23 Dec 2013 17:55:08 +0100 (CET)
> From: krichy at tvnetwork.hu
> To: Pawel Jakub Dawidek <pjd at FreeBSD.org>
> Cc: freebsd-fs at freebsd.org, avg at freebsd.org
> Subject: Re: kern/184677 / ZFS snapshot handling deadlocks
> 
> Dear Pawel,
>
> Thanks for your response. I hope someone will review my work.
>
> Secondly, I think I've found another deadlock scenario:
>
> When 'zfs send -R' finishes, zfsdev_close() runs with spa_namespace_lock 
> held. It then unmounts all snapshots, which in turn reaches 
> zfsctl_snapshot_inactive(), where sdp->sd_lock is taken.
>
> Concurrently, a zfsctl_snapdir_lookup() holding that same sdp->sd_lock does a 
> snapshot mount, which at some point tries to acquire spa_namespace_lock. So it 
> deadlocks. Currently I don't know how to work around this.
>
> Does spa_namespace_lock need to be held in zfsctl_snapshot_inactive()?
>
> What if we released it on function entry and re-acquired it upon exit?
>
> Thanks in advance,
>
>
> Kojedzinszky Richard
> Euronet Magyarorszag Informatikai Zrt.
>
> On Fri, 20 Dec 2013, Pawel Jakub Dawidek wrote:
>
>> Date: Fri, 20 Dec 2013 14:44:05 +0100
>> From: Pawel Jakub Dawidek <pjd at FreeBSD.org>
>> To: krichy at tvnetwork.hu
>> Cc: freebsd-fs at freebsd.org
>> Subject: Re: kern/184677 / ZFS snapshot handling deadlocks
>> 
>> On Thu, Dec 19, 2013 at 04:46:11PM +0100, krichy at tvnetwork.hu wrote:
>>> Dear devs,
>>> 
>>> I am attaching a cleaner patch/fix for my snapshot handling issues
>>> (0002), but I would be happy if some ZFS expert would comment on it. I have
>>> been trying to solve this for at least two weeks now, and an ACK or a NACK
>>> from someone would be nice. A commit is also reverted, since it too caused
>>> deadlocks. I've read its comments, which say it also eliminates deadlocks, but I
>>> did not find any reference on how to reproduce that deadlock. In my view,
>>> reverting it makes my issues disappear, but I don't know what new cases
>>> it will raise.
>> 
>> Richard,
>> 
>> I won't be able to analyse it myself anytime soon, because of other
>> obligations, but I forwarded you e-mail to the zfs-devel@ mailing list
>> (it is closed, but gathers FreeBSD ZFS devs). Hopefully someone from
>> there will be able to help you.
>> 
>>> I've rewritten traverse() to make it more like upstream, and added two extra
>>> VN_HOLD()s to snapdir_lookup() for when traverse() returns the same vnode that
>>> was passed to it (I don't even know why those two extra holds are needed
>>> when creating a snapshot vnode). Unfortunately, because FreeBSD calls
>>> vop_inactive callbacks with vnodes locked, that could also
>>> cause deadlocks, so zfsctl_snapshot_inactive() and
>>> zfsctl_snapshot_vptocnp() have been rewritten to work around that.
>>> 
>>> After this, one may still get a deadlock when a simple access calls
>>> into zfsctl_snapshot_lookup(). The documentation says that those vnodes
>>> should always be covered, but after some stress testing we sometimes hit
>>> that call, and that can again cause deadlocks. In our environment I've
>>> just uncommented that callback, which returns ENOTDIR on some calls, but at
>>> least it does not deadlock.
>>> 
>>> The attached script can be used to reproduce my cases (would someone confirm
>>> that?), and after the patches are applied, they disappear (confirm?).
>>> 
>>> Thanks,
>>> 
>>> 
>>> Kojedzinszky Richard
>>> Euronet Magyarorszag Informatikai Zrt.
>>> 
>>> On Tue, 17 Dec 2013, krichy at tvnetwork.hu wrote:
>>> 
>>>> Date: Tue, 17 Dec 2013 14:50:16 +0100 (CET)
>>>> From: krichy at tvnetwork.hu
>>>> To: pjd at freebsd.org
>>>> Cc: freebsd-fs at freebsd.org
>>>> Subject: Re: kern/184677 (fwd)
>>>> 
>>>> Dear devs,
>>>> 
>>>> I will sum up my experience regarding the issue:
>>>> 
>>>> The symptom is that a concurrent 'zfs send -R' and some activity on the
>>>> snapshot dir (or in the snapshot) may cause a deadlock.
>>>> 
>>>> After investigating the problem, I found that zfs send unmounts the
>>>> snapshots, and that causes the deadlock, so later I tested only with a
>>>> concurrent umount and the "activity". Later I found that listing the
>>>> snapshots in .zfs/snapshot/ and unmounting them can cause the same
>>>> deadlock, so I used those for the tests. But to my surprise, instead of
>>>> a deadlock, a recursive lock panic arose.
>>>> 
>>>> The vnode for the ".zfs/snapshot/" directory contains zfs's
>>>> zfsctl_snapdir_t structure (sdp). This contains a tree of mounted
>>>> snapshots, and each entry (sep) contains the vnode on which the snapshot
>>>> is mounted (se_root). The strange thing is that the se_root member does
>>>> not hold a reference to the vnode, just a simple pointer to it.
>>>> 
>>>> Upon entry lookup (zfsctl_snapdir_lookup()) the "snapshot" vnode is
>>>> locked, and the zfsctl_snapdir_t's tree is locked and searched for an
>>>> existing mount. If no entry is found, it does the mount. If an entry
>>>> was found, its se_root member contains the vnode the snapshot is
>>>> mounted on. Thus, a reference is taken on it, and the traverse() call
>>>> resolves it to the real root vnode of the mounted snapshot, returning
>>>> it locked. (Examining the traverse() code I've found that it does not
>>>> follow FreeBSD's lock order recommendation described in
>>>> sys/kern/vfs_subr.c.)
>>>> 
>>>> On the other path, when an umount is issued, the se_root vnode loses its
>>>> last reference (as only the mountpoint holds one for it) and goes through
>>>> the vinactive() path to zfsctl_snapshot_inactive(). In FreeBSD this is
>>>> called with the vnode locked, so this is a deadlock race: while
>>>> zfsctl_snapdir_lookup() holds the mutex for the sdp tree and traverse()
>>>> tries to lock se_root, zfsctl_snapshot_inactive() holds the lock on
>>>> se_root while it tries to acquire the sdp lock.
>>>> 
>>>> zfsctl_snapshot_inactive() has an if statement checking v_usecount,
>>>> which is incremented in zfsctl_snapdir_lookup(), but in that context it
>>>> is not covered by VI_LOCK. And it seems to me that FreeBSD's vinactive()
>>>> path assumes that the vnode remains inactive (as opposed to illumos, at
>>>> least as I read the code). So zfsctl_snapshot_inactive() must free
>>>> resources while in a locked state. I was a bit confused, and that is
>>>> probably why the previously posted patch is as it is.
>>>> 
>>>> Maybe with some guidance on this problem I could have worked out a nicer,
>>>> shorter solution.
>>>> 
>>>> Would someone please comment on my post?
>>>> 
>>>> Regards,
>>>> 
>>>> 
>>>> 
>>>> Kojedzinszky Richard
>>>> Euronet Magyarorszag Informatikai Zrt.
>>>> 
>>>> On Mon, 16 Dec 2013, krichy at tvnetwork.hu wrote:
>>>> 
>>>>> Date: Mon, 16 Dec 2013 16:52:16 +0100 (CET)
>>>>> From: krichy at tvnetwork.hu
>>>>> To: pjd at freebsd.org
>>>>> Cc: freebsd-fs at freebsd.org
>>>>> Subject: Re: kern/184677 (fwd)
>>>>> 
>>>>> Dear PJD,
>>>>> 
>>>>> I am a happy FreeBSD user, and I am sure you've read my previous posts
>>>>> regarding some issues in ZFS. Please give me some advice on where to look
>>>>> for solutions, or on how I could help to resolve those issues.
>>>>> 
>>>>> Regards,
>>>>> Kojedzinszky Richard
>>>>> Euronet Magyarorszag Informatikai Zrt.
>>>>> 
>>>>> ---------- Forwarded message ----------
>>>>> Date: Mon, 16 Dec 2013 15:23:06 +0100 (CET)
>>>>> From: krichy at tvnetwork.hu
>>>>> To: freebsd-fs at freebsd.org
>>>>> Subject: Re: kern/184677
>>>>> 
>>>>> 
>>>>> It seems that pjd made a change which eliminated zfsdev_state_lock on Fri
>>>>> Aug 12 07:04:16 2011 +0000, which might have introduced a new deadlock
>>>>> situation. Any comments on this?
>>>>> 
>>>>> 
>>>>> Kojedzinszky Richard
>>>>> Euronet Magyarorszag Informatikai Zrt.
>>>>> 
>>>>> On Mon, 16 Dec 2013, krichy at tvnetwork.hu wrote:
>>>>> 
>>>>>> Date: Mon, 16 Dec 2013 11:08:11 +0100 (CET)
>>>>>> From: krichy at tvnetwork.hu
>>>>>> To: freebsd-fs at freebsd.org
>>>>>> Subject: kern/184677
>>>>>> 
>>>>>> Dear devs,
>>>>>> 
>>>>>> I've attached a patch which makes the recursive lockmgr panic disappear,
>>>>>> and makes the reported bug disappear as well. I don't know whether I
>>>>>> followed the guidelines well or not, but at least it works for me. Please,
>>>>>> some ZFS/FreeBSD fs expert, review it and fix it where needed.
>>>>>> 
>>>>>> But unfortunately, my original problem is still not solved; it may be
>>>>>> the same as Ryan's:
>>>>>> http://lists.freebsd.org/pipermail/freebsd-fs/2013-December/018707.html
>>>>>> 
>>>>>> Tracing the problem down shows that zfsctl_snapdir_lookup() tries to
>>>>>> acquire spa_namespace_lock, while a finishing zfs send -R does a
>>>>>> zfsdev_close(), which holds the same mutex. And this causes a
>>>>>> deadlock scenario. I looked at illumos's code, and for some reason they
>>>>>> use another mutex in zfsdev_close(), which therefore cannot deadlock with
>>>>>> zfsctl_snapdir_lookup(). But I am still investigating the problem.
>>>>>> 
>>>>>> I would like to help make ZFS, with its whole feature set, more stable
>>>>>> on FreeBSD. I would be very thankful if some expert would give some
>>>>>> advice on how to solve these bugs. PJD, Steven, Xin?
>>>>>> 
>>>>>> Thanks in advance,
>>>>>> 
>>>>>> 
>>>>>> Kojedzinszky Richard
>>>>>> Euronet Magyarorszag Informatikai Zrt.
>>>>> _______________________________________________
>>>>> freebsd-fs at freebsd.org mailing list
>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>>>>> 
>>>> 
>> 
>> 
>>> From 39298da838d006ad225e41529d7b7f240fccfe73 Mon Sep 17 00:00:00 2001
>>> From: Richard Kojedzinszky <krichy at cflinux.hu>
>>> Date: Mon, 16 Dec 2013 15:39:11 +0100
>>> Subject: [PATCH 1/2] Revert "Eliminate the zfsdev_state_lock entirely and
>>>  replace it with the"
>>> 
>>> This reverts commit 1d8972b3f353f986eb5b85bc108b1c0d946d3218.
>>> 
>>> Conflicts:
>>> 	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>>> ---
>>>  .../opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h  |   1 +
>>>  .../opensolaris/uts/common/fs/zfs/vdev_geom.c      |  14 ++-
>>>  .../opensolaris/uts/common/fs/zfs/zfs_ioctl.c      |  16 +--
>>>  .../contrib/opensolaris/uts/common/fs/zfs/zvol.c   | 119 
>>> +++++++++------------
>>>  4 files changed, 70 insertions(+), 80 deletions(-)
>>> 
>>> diff --git 
>>> a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h 
>>> b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
>>> index af2def2..8272c4d 100644
>>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
>>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zfs_ioctl.h
>>> @@ -383,6 +383,7 @@ extern void *zfsdev_get_soft_state(minor_t minor,
>>>  extern minor_t zfsdev_minor_alloc(void);
>>>
>>>  extern void *zfsdev_state;
>>> +extern kmutex_t zfsdev_state_lock;
>>>
>>>  #endif	/* _KERNEL */
>>> 
>>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c 
>>> b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>>> index 15685a5..5c3e9f3 100644
>>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
>>> @@ -581,7 +581,7 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t 
>>> *max_psize,
>>>  	struct g_provider *pp;
>>>  	struct g_consumer *cp;
>>>  	size_t bufsize;
>>> -	int error;
>>> +	int error, lock;
>>>
>>>  	/*
>>>  	 * We must have a pathname, and it must be absolute.
>>> @@ -593,6 +593,12 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t 
>>> *max_psize,
>>>
>>>  	vd->vdev_tsd = NULL;
>>> 
>>> +	if (mutex_owned(&spa_namespace_lock)) {
>>> +		mutex_exit(&spa_namespace_lock);
>>> +		lock = 1;
>>> +	} else {
>>> +		lock = 0;
>>> +	}
>>>  	DROP_GIANT();
>>>  	g_topology_lock();
>>>  	error = 0;
>>> @@ -624,7 +630,11 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t 
>>> *max_psize,
>>>  	    !ISP2(cp->provider->sectorsize)) {
>>>  		ZFS_LOG(1, "Provider %s has unsupported sectorsize.",
>>>  		    vd->vdev_path);
>>> +
>>> +		g_topology_lock();
>>>  		vdev_geom_detach(cp, 0);
>>> +		g_topology_unlock();
>>> +
>>>  		error = EINVAL;
>>>  		cp = NULL;
>>>  	} else if (cp->acw == 0 && (spa_mode(vd->vdev_spa) & FWRITE) != 0) {
>>> @@ -647,6 +657,8 @@ vdev_geom_open(vdev_t *vd, uint64_t *psize, uint64_t 
>>> *max_psize,
>>>  	}
>>>  	g_topology_unlock();
>>>  	PICKUP_GIANT();
>>> +	if (lock)
>>> +		mutex_enter(&spa_namespace_lock);
>>>  	if (cp == NULL) {
>>>  		vd->vdev_stat.vs_aux = VDEV_AUX_OPEN_FAILED;
>>>  		return (error);
>>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c 
>>> b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
>>> index e9fba26..91becde 100644
>>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
>>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c
>>> @@ -5635,7 +5635,7 @@ zfsdev_minor_alloc(void)
>>>  	static minor_t last_minor;
>>>  	minor_t m;
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>
>>>  	for (m = last_minor + 1; m != last_minor; m++) {
>>>  		if (m > ZFSDEV_MAX_MINOR)
>>> @@ -5655,7 +5655,7 @@ zfs_ctldev_init(struct cdev *devp)
>>>  	minor_t minor;
>>>  	zfs_soft_state_t *zs;
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>
>>>  	minor = zfsdev_minor_alloc();
>>>  	if (minor == 0)
>>> @@ -5676,7 +5676,7 @@ zfs_ctldev_init(struct cdev *devp)
>>>  static void
>>>  zfs_ctldev_destroy(zfs_onexit_t *zo, minor_t minor)
>>>  {
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>
>>>  	zfs_onexit_destroy(zo);
>>>  	ddi_soft_state_free(zfsdev_state, minor);
>>> @@ -5706,9 +5706,9 @@ zfsdev_open(struct cdev *devp, int flag, int mode, 
>>> struct thread *td)
>>>
>>>  	/* This is the control device. Allocate a new minor if requested. */
>>>  	if (flag & FEXCL) {
>>> -		mutex_enter(&spa_namespace_lock);
>>> +		mutex_enter(&zfsdev_state_lock);
>>>  		error = zfs_ctldev_init(devp);
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  	}
>>>
>>>  	return (error);
>>> @@ -5723,14 +5723,14 @@ zfsdev_close(void *data)
>>>  	if (minor == 0)
>>>  		return;
>>> 
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>  	zo = zfsdev_get_soft_state(minor, ZSST_CTLDEV);
>>>  	if (zo == NULL) {
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return;
>>>  	}
>>>  	zfs_ctldev_destroy(zo, minor);
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  }
>>>
>>>  static int
>>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c 
>>> b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>>> index 72d4502..aec5219 100644
>>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zvol.c
>>> @@ -104,11 +104,12 @@ static char *zvol_tag = "zvol_tag";
>>>  #define	ZVOL_DUMPSIZE		"dumpsize"
>>>
>>>  /*
>>> - * The spa_namespace_lock protects the zfsdev_state structure from being
>>> - * modified while it's being used, e.g. an open that comes in before a
>>> - * create finishes.  It also protects temporary opens of the dataset so 
>>> that,
>>> + * This lock protects the zfsdev_state structure from being modified
>>> + * while it's being used, e.g. an open that comes in before a create
>>> + * finishes.  It also protects temporary opens of the dataset so that,
>>>   * e.g., an open doesn't get a spurious EBUSY.
>>>   */
>>> +kmutex_t zfsdev_state_lock;
>>>  static uint32_t zvol_minors;
>>>
>>>  typedef struct zvol_extent {
>>> @@ -249,7 +250,7 @@ zvol_minor_lookup(const char *name)
>>>  	struct g_geom *gp;
>>>  	zvol_state_t *zv = NULL;
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>
>>>  	g_topology_lock();
>>>  	LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) {
>>> @@ -465,11 +466,11 @@ zvol_name2minor(const char *name, minor_t *minor)
>>>  {
>>>  	zvol_state_t *zv;
>>> 
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>  	zv = zvol_minor_lookup(name);
>>>  	if (minor && zv)
>>>  		*minor = zv->zv_minor;
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	return (zv ? 0 : -1);
>>>  }
>>>  #endif	/* sun */
>>> @@ -489,10 +490,10 @@ zvol_create_minor(const char *name)
>>>
>>>  	ZFS_LOG(1, "Creating ZVOL %s...", name);
>>> 
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>
>>>  	if (zvol_minor_lookup(name) != NULL) {
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(EEXIST));
>>>  	}
>>> 
>>> @@ -500,20 +501,20 @@ zvol_create_minor(const char *name)
>>>  	error = dmu_objset_own(name, DMU_OST_ZVOL, B_TRUE, FTAG, &os);
>>>
>>>  	if (error) {
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (error);
>>>  	}
>>>
>>>  #ifdef sun
>>>  	if ((minor = zfsdev_minor_alloc()) == 0) {
>>>  		dmu_objset_disown(os, FTAG);
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(ENXIO));
>>>  	}
>>>
>>>  	if (ddi_soft_state_zalloc(zfsdev_state, minor) != DDI_SUCCESS) {
>>>  		dmu_objset_disown(os, FTAG);
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(EAGAIN));
>>>  	}
>>>  	(void) ddi_prop_update_string(minor, zfs_dip, ZVOL_PROP_NAME,
>>> @@ -525,7 +526,7 @@ zvol_create_minor(const char *name)
>>>  	    minor, DDI_PSEUDO, 0) == DDI_FAILURE) {
>>>  		ddi_soft_state_free(zfsdev_state, minor);
>>>  		dmu_objset_disown(os, FTAG);
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(EAGAIN));
>>>  	}
>>> 
>>> @@ -536,7 +537,7 @@ zvol_create_minor(const char *name)
>>>  		ddi_remove_minor_node(zfs_dip, chrbuf);
>>>  		ddi_soft_state_free(zfsdev_state, minor);
>>>  		dmu_objset_disown(os, FTAG);
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(EAGAIN));
>>>  	}
>>> 
>>> @@ -587,7 +588,7 @@ zvol_create_minor(const char *name)
>>>
>>>  	zvol_minors++;
>>> 
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>
>>>  	zvol_geom_run(zv);
>>> 
>>> @@ -609,7 +610,7 @@ zvol_remove_zv(zvol_state_t *zv)
>>>  	minor_t minor = zv->zv_minor;
>>>  #endif
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>  	if (zv->zv_total_opens != 0)
>>>  		return (SET_ERROR(EBUSY));
>>> 
>>> @@ -635,15 +636,15 @@ zvol_remove_minor(const char *name)
>>>  	zvol_state_t *zv;
>>>  	int rc;
>>> 
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>  	if ((zv = zvol_minor_lookup(name)) == NULL) {
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(ENXIO));
>>>  	}
>>>  	g_topology_lock();
>>>  	rc = zvol_remove_zv(zv);
>>>  	g_topology_unlock();
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	return (rc);
>>>  }
>>> 
>>> @@ -755,7 +756,7 @@ zvol_update_volsize(objset_t *os, uint64_t volsize)
>>>  	dmu_tx_t *tx;
>>>  	int error;
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>
>>>  	tx = dmu_tx_create(os);
>>>  	dmu_tx_hold_zap(tx, ZVOL_ZAP_OBJ, TRUE, NULL);
>>> @@ -786,7 +787,7 @@ zvol_remove_minors(const char *name)
>>>  	namelen = strlen(name);
>>>
>>>  	DROP_GIANT();
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>  	g_topology_lock();
>>>
>>>  	LIST_FOREACH_SAFE(gp, &zfs_zvol_class.geom, geom, gptmp) {
>>> @@ -804,7 +805,7 @@ zvol_remove_minors(const char *name)
>>>  	}
>>>
>>>  	g_topology_unlock();
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	PICKUP_GIANT();
>>>  }
>>> 
>>> @@ -818,10 +819,10 @@ zvol_set_volsize(const char *name, major_t maj, 
>>> uint64_t volsize)
>>>  	uint64_t old_volsize = 0ULL;
>>>  	uint64_t readonly;
>>> 
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>  	zv = zvol_minor_lookup(name);
>>>  	if ((error = dmu_objset_hold(name, FTAG, &os)) != 0) {
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (error);
>>>  	}
>>> 
>>> @@ -888,7 +889,7 @@ zvol_set_volsize(const char *name, major_t maj, 
>>> uint64_t volsize)
>>>  out:
>>>  	dmu_objset_rele(os, FTAG);
>>> 
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>
>>>  	return (error);
>>>  }
>>> @@ -899,36 +900,19 @@ zvol_open(struct g_provider *pp, int flag, int 
>>> count)
>>>  {
>>>  	zvol_state_t *zv;
>>>  	int err = 0;
>>> -	boolean_t locked = B_FALSE;
>>> 
>>> -	/*
>>> -	 * Protect against recursively entering spa_namespace_lock
>>> -	 * when spa_open() is used for a pool on a (local) ZVOL(s).
>>> -	 * This is needed since we replaced upstream zfsdev_state_lock
>>> -	 * with spa_namespace_lock in the ZVOL code.
>>> -	 * We are using the same trick as spa_open().
>>> -	 * Note that calls in zvol_first_open which need to resolve
>>> -	 * pool name to a spa object will enter spa_open()
>>> -	 * recursively, but that function already has all the
>>> -	 * necessary protection.
>>> -	 */
>>> -	if (!MUTEX_HELD(&spa_namespace_lock)) {
>>> -		mutex_enter(&spa_namespace_lock);
>>> -		locked = B_TRUE;
>>> -	}
>>> +	mutex_enter(&zfsdev_state_lock);
>>>
>>>  	zv = pp->private;
>>>  	if (zv == NULL) {
>>> -		if (locked)
>>> -			mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(ENXIO));
>>>  	}
>>>
>>>  	if (zv->zv_total_opens == 0)
>>>  		err = zvol_first_open(zv);
>>>  	if (err) {
>>> -		if (locked)
>>> -			mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (err);
>>>  	}
>>>  	if ((flag & FWRITE) && (zv->zv_flags & ZVOL_RDONLY)) {
>>> @@ -950,15 +934,13 @@ zvol_open(struct g_provider *pp, int flag, int 
>>> count)
>>>  #endif
>>>
>>>  	zv->zv_total_opens += count;
>>> -	if (locked)
>>> -		mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>
>>>  	return (err);
>>>  out:
>>>  	if (zv->zv_total_opens == 0)
>>>  		zvol_last_close(zv);
>>> -	if (locked)
>>> -		mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	return (err);
>>>  }
>>> 
>>> @@ -968,18 +950,12 @@ zvol_close(struct g_provider *pp, int flag, int 
>>> count)
>>>  {
>>>  	zvol_state_t *zv;
>>>  	int error = 0;
>>> -	boolean_t locked = B_FALSE;
>>> 
>>> -	/* See comment in zvol_open(). */
>>> -	if (!MUTEX_HELD(&spa_namespace_lock)) {
>>> -		mutex_enter(&spa_namespace_lock);
>>> -		locked = B_TRUE;
>>> -	}
>>> +	mutex_enter(&zfsdev_state_lock);
>>>
>>>  	zv = pp->private;
>>>  	if (zv == NULL) {
>>> -		if (locked)
>>> -			mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(ENXIO));
>>>  	}
>>> 
>>> @@ -1002,8 +978,7 @@ zvol_close(struct g_provider *pp, int flag, int 
>>> count)
>>>  	if (zv->zv_total_opens == 0)
>>>  		zvol_last_close(zv);
>>> 
>>> -	if (locked)
>>> -		mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	return (error);
>>>  }
>>> 
>>> @@ -1658,12 +1633,12 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int 
>>> flag, cred_t *cr, int *rvalp)
>>>  	int error = 0;
>>>  	rl_t *rl;
>>> 
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>
>>>  	zv = zfsdev_get_soft_state(getminor(dev), ZSST_ZVOL);
>>>
>>>  	if (zv == NULL) {
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		return (SET_ERROR(ENXIO));
>>>  	}
>>>  	ASSERT(zv->zv_total_opens > 0);
>>> @@ -1677,7 +1652,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int 
>>> flag, cred_t *cr, int *rvalp)
>>>  		dki.dki_ctype = DKC_UNKNOWN;
>>>  		dki.dki_unit = getminor(dev);
>>>  		dki.dki_maxtransfer = 1 << (SPA_MAXBLOCKSHIFT - 
>>> zv->zv_min_bs);
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		if (ddi_copyout(&dki, (void *)arg, sizeof (dki), flag))
>>>  			error = SET_ERROR(EFAULT);
>>>  		return (error);
>>> @@ -1687,7 +1662,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int 
>>> flag, cred_t *cr, int *rvalp)
>>>  		dkm.dki_lbsize = 1U << zv->zv_min_bs;
>>>  		dkm.dki_capacity = zv->zv_volsize >> zv->zv_min_bs;
>>>  		dkm.dki_media_type = DK_UNKNOWN;
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		if (ddi_copyout(&dkm, (void *)arg, sizeof (dkm), flag))
>>>  			error = SET_ERROR(EFAULT);
>>>  		return (error);
>>> @@ -1697,14 +1672,14 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int 
>>> flag, cred_t *cr, int *rvalp)
>>>  			uint64_t vs = zv->zv_volsize;
>>>  			uint8_t bs = zv->zv_min_bs;
>>> 
>>> -			mutex_exit(&spa_namespace_lock);
>>> +			mutex_exit(&zfsdev_state_lock);
>>>  			error = zvol_getefi((void *)arg, flag, vs, bs);
>>>  			return (error);
>>>  		}
>>>
>>>  	case DKIOCFLUSHWRITECACHE:
>>>  		dkc = (struct dk_callback *)arg;
>>> -		mutex_exit(&spa_namespace_lock);
>>> +		mutex_exit(&zfsdev_state_lock);
>>>  		zil_commit(zv->zv_zilog, ZVOL_OBJ);
>>>  		if ((flag & FKIOCTL) && dkc != NULL && dkc->dkc_callback) {
>>>  			(*dkc->dkc_callback)(dkc->dkc_cookie, error);
>>> @@ -1730,10 +1705,10 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int 
>>> flag, cred_t *cr, int *rvalp)
>>>  			}
>>>  			if (wce) {
>>>  				zv->zv_flags |= ZVOL_WCE;
>>> -				mutex_exit(&spa_namespace_lock);
>>> +				mutex_exit(&zfsdev_state_lock);
>>>  			} else {
>>>  				zv->zv_flags &= ~ZVOL_WCE;
>>> -				mutex_exit(&spa_namespace_lock);
>>> +				mutex_exit(&zfsdev_state_lock);
>>>  				zil_commit(zv->zv_zilog, ZVOL_OBJ);
>>>  			}
>>>  			return (0);
>>> @@ -1828,7 +1803,7 @@ zvol_ioctl(dev_t dev, int cmd, intptr_t arg, int 
>>> flag, cred_t *cr, int *rvalp)
>>>  		break;
>>>
>>>  	}
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	return (error);
>>>  }
>>>  #endif	/* sun */
>>> @@ -1844,12 +1819,14 @@ zvol_init(void)
>>>  {
>>>  	VERIFY(ddi_soft_state_init(&zfsdev_state, sizeof (zfs_soft_state_t),
>>>  	    1) == 0);
>>> +	mutex_init(&zfsdev_state_lock, NULL, MUTEX_DEFAULT, NULL);
>>>  	ZFS_LOG(1, "ZVOL Initialized.");
>>>  }
>>>
>>>  void
>>>  zvol_fini(void)
>>>  {
>>> +	mutex_destroy(&zfsdev_state_lock);
>>>  	ddi_soft_state_fini(&zfsdev_state);
>>>  	ZFS_LOG(1, "ZVOL Deinitialized.");
>>>  }
>>> @@ -1889,7 +1866,7 @@ zvol_dump_init(zvol_state_t *zv, boolean_t resize)
>>>  	uint64_t version = spa_version(spa);
>>>  	enum zio_checksum checksum;
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>  	ASSERT(vd->vdev_ops == &vdev_root_ops);
>>>
>>>  	error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, 0,
>>> @@ -2437,7 +2414,7 @@ zvol_rename_minor(struct g_geom *gp, const char 
>>> *newname)
>>>  	struct g_provider *pp;
>>>  	zvol_state_t *zv;
>>> 
>>> -	ASSERT(MUTEX_HELD(&spa_namespace_lock));
>>> +	ASSERT(MUTEX_HELD(&zfsdev_state_lock));
>>>  	g_topology_assert();
>>>
>>>  	pp = LIST_FIRST(&gp->provider);
>>> @@ -2471,7 +2448,7 @@ zvol_rename_minors(const char *oldname, const char 
>>> *newname)
>>>  	newnamelen = strlen(newname);
>>>
>>>  	DROP_GIANT();
>>> -	mutex_enter(&spa_namespace_lock);
>>> +	mutex_enter(&zfsdev_state_lock);
>>>  	g_topology_lock();
>>>
>>>  	LIST_FOREACH(gp, &zfs_zvol_class.geom, geom) {
>>> @@ -2494,6 +2471,6 @@ zvol_rename_minors(const char *oldname, const char 
>>> *newname)
>>>  	}
>>>
>>>  	g_topology_unlock();
>>> -	mutex_exit(&spa_namespace_lock);
>>> +	mutex_exit(&zfsdev_state_lock);
>>>  	PICKUP_GIANT();
>>>  }
>>> --
>>> 1.8.4.2
>>> 
>> 
>>> From 57d5a383b585c32c77af54e8cdacaddf8ce7584f Mon Sep 17 00:00:00 2001
>>> From: Richard Kojedzinszky <krichy at cflinux.hu>
>>> Date: Wed, 18 Dec 2013 22:11:21 +0100
>>> Subject: [PATCH 2/2] ZFS snapshot handling fix
>>> 
>>> ---
>>>  .../compat/opensolaris/kern/opensolaris_lookup.c   | 13 +++---
>>>  .../opensolaris/uts/common/fs/zfs/zfs_ctldir.c     | 53 
>>> +++++++++++++++-------
>>>  2 files changed, 42 insertions(+), 24 deletions(-)
>>> 
>>> diff --git a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c 
>>> b/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
>>> index 94383d6..4cac053 100644
>>> --- a/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
>>> +++ b/sys/cddl/compat/opensolaris/kern/opensolaris_lookup.c
>>> @@ -81,6 +81,8 @@ traverse(vnode_t **cvpp, int lktype)
>>>  	 * progress on this vnode.
>>>  	 */
>>> 
>>> +	vn_lock(cvp, lktype);
>>> +
>>>  	for (;;) {
>>>  		/*
>>>  		 * Reached the end of the mount chain?
>>> @@ -89,13 +91,7 @@ traverse(vnode_t **cvpp, int lktype)
>>>  		if (vfsp == NULL)
>>>  			break;
>>>  		error = vfs_busy(vfsp, 0);
>>> -		/*
>>> -		 * tvp is NULL for *cvpp vnode, which we can't unlock.
>>> -		 */
>>> -		if (tvp != NULL)
>>> -			vput(cvp);
>>> -		else
>>> -			vrele(cvp);
>>> +		VOP_UNLOCK(cvp, 0);
>>>  		if (error)
>>>  			return (error);
>>> 
>>> @@ -107,6 +103,9 @@ traverse(vnode_t **cvpp, int lktype)
>>>  		vfs_unbusy(vfsp);
>>>  		if (error != 0)
>>>  			return (error);
>>> +
>>> +		VN_RELE(cvp);
>>> +
>>>  		cvp = tvp;
>>>  	}
>>> 
>>> diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c 
>>> b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
>>> index 28ab1fa..d3464b4 100644
>>> --- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
>>> +++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
>>> @@ -112,6 +112,25 @@ snapentry_compare(const void *a, const void *b)
>>>  		return (0);
>>>  }
>>> 
>>> +/* Return the zfsctl_snapdir_t object from current vnode, following
>>> + * the lock orders in zfsctl_snapdir_lookup() to avoid deadlocks.
>>> + * On return the passed in vp is unlocked */
>>> +static zfsctl_snapdir_t *
>>> +zfsctl_snapshot_get_snapdir(vnode_t *vp, vnode_t **dvpp)
>>> +{
>>> +	gfs_dir_t *dp = vp->v_data;
>>> +	*dvpp = dp->gfsd_file.gfs_parent;
>>> +	zfsctl_snapdir_t *sdp;
>>> +
>>> +	VN_HOLD(*dvpp);
>>> +	VOP_UNLOCK(vp, 0);
>>> +	vn_lock(*dvpp, LK_SHARED | LK_RETRY | LK_CANRECURSE);
>>> +	sdp = (*dvpp)->v_data;
>>> +	VOP_UNLOCK(*dvpp, 0);
>>> +
>>> +	return (sdp);
>>> +}
>>> +
>>>  #ifdef sun
>>>  vnodeops_t *zfsctl_ops_root;
>>>  vnodeops_t *zfsctl_ops_snapdir;
>>> @@ -1013,6 +1032,8 @@ zfsctl_snapdir_lookup(ap)
>>>  			 * The snapshot was unmounted behind our backs,
>>>  			 * try to remount it.
>>>  			 */
>>> +			VOP_UNLOCK(*vpp, 0);
>>> +			VN_HOLD(*vpp);
>>>  			VERIFY(zfsctl_snapshot_zname(dvp, nm, MAXNAMELEN, 
>>> snapname) == 0);
>>>  			goto domount;
>>>  		} else {
>>> @@ -1064,7 +1085,6 @@ zfsctl_snapdir_lookup(ap)
>>>  	sep->se_name = kmem_alloc(strlen(nm) + 1, KM_SLEEP);
>>>  	(void) strcpy(sep->se_name, nm);
>>>  	*vpp = sep->se_root = zfsctl_snapshot_mknode(dvp, 
>>> dmu_objset_id(snap));
>>> -	VN_HOLD(*vpp);
>>>  	avl_insert(&sdp->sd_snaps, sep, where);
>>>
>>>  	dmu_objset_rele(snap, FTAG);
>>> @@ -1075,6 +1095,7 @@ domount:
>>>  	(void) snprintf(mountpoint, mountpoint_len,
>>>  	    "%s/" ZFS_CTLDIR_NAME "/snapshot/%s",
>>>  	    dvp->v_vfsp->mnt_stat.f_mntonname, nm);
>>> +	VN_HOLD(*vpp);
>>>  	err = mount_snapshot(curthread, vpp, "zfs", mountpoint, snapname, 0);
>>>  	kmem_free(mountpoint, mountpoint_len);
>>>  	if (err == 0) {
>>> @@ -1464,16 +1485,18 @@ zfsctl_snapshot_inactive(ap)
>>>  	int locked;
>>>  	vnode_t *dvp;
>>> 
>>> -	if (vp->v_count > 0)
>>> -		goto end;
>>> -
>>> -	VERIFY(gfs_dir_lookup(vp, "..", &dvp, cr, 0, NULL, NULL) == 0);
>>> -	sdp = dvp->v_data;
>>> -	VOP_UNLOCK(dvp, 0);
>>> +	sdp = zfsctl_snapshot_get_snapdir(vp, &dvp);
>>>
>>>  	if (!(locked = MUTEX_HELD(&sdp->sd_lock)))
>>>  		mutex_enter(&sdp->sd_lock);
>>> 
>>> +	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
>>> +
>>> +	if (vp->v_count > 0) {
>>> +		mutex_exit(&sdp->sd_lock);
>>> +		return (0);
>>> +	}
>>> +
>>>  	ASSERT(!vn_ismntpt(vp));
>>>
>>>  	sep = avl_first(&sdp->sd_snaps);
>>> @@ -1494,7 +1517,6 @@ zfsctl_snapshot_inactive(ap)
>>>  		mutex_exit(&sdp->sd_lock);
>>>  	VN_RELE(dvp);
>>> 
>>> -end:
>>>  	/*
>>>  	 * Dispose of the vnode for the snapshot mount point.
>>>  	 * This is safe to do because once this entry has been removed
>>> @@ -1595,20 +1617,17 @@ zfsctl_snapshot_lookup(ap)
>>>  static int
>>>  zfsctl_snapshot_vptocnp(struct vop_vptocnp_args *ap)
>>>  {
>>> -	zfsvfs_t *zfsvfs = ap->a_vp->v_vfsp->vfs_data;
>>> -	vnode_t *dvp, *vp;
>>> +	vnode_t *dvp, *vp = ap->a_vp;
>>>  	zfsctl_snapdir_t *sdp;
>>>  	zfs_snapentry_t *sep;
>>> -	int error;
>>> +	int error = 0;
>>> 
>>> -	ASSERT(zfsvfs->z_ctldir != NULL);
>>> -	error = zfsctl_root_lookup(zfsvfs->z_ctldir, "snapshot", &dvp,
>>> -	    NULL, 0, NULL, kcred, NULL, NULL, NULL);
>>> -	if (error != 0)
>>> -		return (error);
>>> -	sdp = dvp->v_data;
>>> +	sdp = zfsctl_snapshot_get_snapdir(vp, &dvp);
>>>
>>>  	mutex_enter(&sdp->sd_lock);
>>> +
>>> +	vn_lock(vp, LK_SHARED | LK_RETRY);
>>> +
>>>  	sep = avl_first(&sdp->sd_snaps);
>>>  	while (sep != NULL) {
>>>  		vp = sep->se_root;
>>> --
>>> 1.8.4.2
>>> 
>> 
>> -- 
>> Pawel Jakub Dawidek                       http://www.wheelsystems.com
>> FreeBSD committer                         http://www.FreeBSD.org
>> Am I Evil? Yes, I Am!                     http://mobter.com
>> 
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>

