[PATCH] microoptimize by trying to avoid locking a locked mutex

Fri Nov 6 00:55:29 UTC 2015

On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
> > On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
> > > On 5 November 2015 at 11:26, Mateusz Guzik <mjguzik at gmail.com>
> > > wrote:
> > > > On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
> > > > > On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov
> > > > > wrote:
> > > > > > On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik
> > > > > > wrote:
> > > > > > > mtx_lock will unconditionally try to grab the lock and if
> > > > > > > that fails,
> > > > > > > will call __mtx_lock_sleep which will immediately try to do
> > > > > > > the same
> > > > > > > atomic op again.
> > > > > > > 
> > > > > > > So, the obvious microoptimization is to check the state in
> > > > > > > __mtx_lock_sleep and avoid the operation if the lock is not
> > > > > > > free.
> > > > > > > 
> > > > > > > This gives me ~40% speedup in a microbenchmark of 40 find
> > > > > > > processes
> > > > > > > traversing tmpfs and contending on mount mtx (only used as
> > > > > > > an easy
> > > > > > > benchmark, I have WIP patches to get rid of it).
> > > > > > > 
> > > > > > > Second part of the patch is optional and just checks the
> > > > > > > state of the
> > > > > > > lock prior to doing any atomic operations, but it gives a
> > > > > > > very modest
> > > > > > > speed up when applied on top of the __mtx_lock_sleep
> > > > > > > change. As such,
> > > > > > > I'm not going to defend this part.
> > > > > > Shouldn't the same consideration applied to all spinning
> > > > > > loops, i.e.
> > > > > > also to the spin/thread mutexes, and to the spinning parts of
> > > > > > sx and
> > > > > > lockmgr ?
> > > > > 
> > > > > I agree.  I think both changes are good and worth doing in our
> > > > > other
> > > > > primitives.
> > > > > 
> > > > 
> > > > I glanced over e.g. rw_rlock and it did not have the issue, now
> > > > that I
> > > > see _sx_xlock_hard it wuld indeed use fixing.
> > > > 
> > > > Expect a patch in few h for all primitives I'll find. I'll stress
> > > > test
> > > > the kernel, but it is unlikely I'll do microbenchmarks for
> > > > remaining
> > > > primitives.
> > > 
> > > Is this stuff you're proposing still valid for non-x86 platforms?
> > 
> > Yes.  It just does a read before trying the atomic compare and swap
> > and
> > falls through to the hard case as if the atomic op failed if the
> > result
> > of the read would result in a compare failure.
> > 
> 
> The atomic ops include barriers, the new pre-read of the variable
> doesn't.  Will that cause problems, especially for code inside a loop
> where the compiler may decide to shuffle things around?
> 

Lock value is volatile, so the compiler should not screw us up. I removed
the store to the variable due to a different concern. In worst case we
would have slept instead of spinning. (see below)

> I suspect the performance gain will be biggest on the platforms where
> atomic ops are expensive (I gather from various code comments that's
> the case on x86).
> 

I don't know about other architectures, on x86 atomic op performance on
contended cachelines deteriorates drastically.

=======================================================

For the most port the patch below does 2 simple kinds of changes:
1. in macros for lock/unlock primitives the lock value is fetched and
  compared against relevant _UNLOCKED macro prior to the atomic op
2. in functions:

before:
while (!_mtx_obtain_lock(m, tid)) {
    	v = m->mtx_lock;
	if (v != MTX_UNOWNED) {
		....
	}
	.....
}

after:
for (;;) {
	if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
		break;
	v = m->mtx_lock;
	if (v != MTX_UNOWNED) {
		....
	}
	.....
}

The original patch preloaded the 'v' variable. If the value was
MTX_UNOWNED and _mtx_obtain_lock failed, we just lost the race.  In
order to spin waiting for the other thread to release the lock, we have
to re-read the variable. Thus for simplicity it is no longer stored in
'v'. Note this is contrary to kib's earlier suggestion which would put
such threads directly to sleep. I don't have proper measurements to have
any strong opinion here, I went the aforementioned route to minimise
changes.

Note this is a trivial patch to improve the situation a little bit, I
have no interest in trying to polish these primitives at this time.

For overall results, on a machine with 32GB of RAM and the following:
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class
CPU)
FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads

this reduced make -j 40 kernel in a 10.2 jail from ~2:15 to ~2:05.

=======================================================

diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index aa67180..198e93e 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
 			break;
 		}
 
-		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
-		    tid)) {
+		for (;;) {
+			if (lk->lk_lock == LK_UNLOCKED &&
+			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+				break;
 #ifdef HWPMC_HOOKS
 			PMC_SOFT_CALL( , , lock, failed);
 #endif
@@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
 			    __func__, iwmesg, file, line);
 		}
 
-		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
+		for (;;) {
+			if (lk->lk_lock == LK_UNLOCKED &&
+			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+				break;
+
 #ifdef HWPMC_HOOKS
 			PMC_SOFT_CALL( , , lock, failed);
 #endif
diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
index bec8f6b..51e82a3 100644
--- a/sys/kern/kern_mutex.c
+++ b/sys/kern/kern_mutex.c
@@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
 	all_time -= lockstat_nsecs(&m->lock_object);
 #endif
 
-	while (!_mtx_obtain_lock(m, tid)) {
+	for (;;) {
+		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
@@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
 #ifdef KDTRACE_HOOKS
 	spin_time -= lockstat_nsecs(&m->lock_object);
 #endif
-	while (!_mtx_obtain_lock(m, tid)) {
-
+	for (;;) {
+		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+			break;
 		/* Give interrupts a chance while we spin. */
 		spinlock_exit();
 		while (m->mtx_lock != MTX_UNOWNED) {
@@ -675,7 +678,9 @@ retry:
 			    m->lock_object.lo_name, file, line));
 		WITNESS_CHECKORDER(&m->lock_object,
 		    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
-		while (!_mtx_obtain_lock(m, tid)) {
+		for (;;) {
+			if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+				break;
 			if (m->mtx_lock == tid) {
 				m->mtx_recurse++;
 				break;
diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index 6541724..6a904d2 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
 	all_time -= lockstat_nsecs(&rw->lock_object);
 	state = rw->rw_lock;
 #endif
-	while (!_rw_write_lock(rw, tid)) {
+	for (;;) {
+		if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
index 96e117b..2a81c04 100644
--- a/sys/kern/kern_sx.c
+++ b/sys/kern/kern_sx.c
@@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
 	all_time -= lockstat_nsecs(&sx->lock_object);
 	state = sx->sx_lock;
 #endif
-	while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
+	for (;;) {
+		if (sx->sx_lock == SX_LOCK_UNLOCKED &&
+		    atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
index a9ec072..0443922 100644
--- a/sys/sys/mutex.h
+++ b/sys/sys/mutex.h
@@ -185,7 +185,7 @@ void	thread_lock_flags_(struct thread *, int, const char *, int);
 #define __mtx_lock(mp, tid, opts, file, line) do {			\
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
-	if (!_mtx_obtain_lock((mp), _tid))				\
+	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
 		_mtx_lock_sleep((mp), _tid, (opts), (file), (line));	\
 	else								\
 		LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire,	\
@@ -203,7 +203,7 @@ void	thread_lock_flags_(struct thread *, int, const char *, int);
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
 	spinlock_enter();						\
-	if (!_mtx_obtain_lock((mp), _tid)) {				\
+	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
 		if ((mp)->mtx_lock == _tid)				\
 			(mp)->mtx_recurse++;				\
 		else							\
@@ -232,7 +232,7 @@ void	thread_lock_flags_(struct thread *, int, const char *, int);
 									\
 	if ((mp)->mtx_recurse == 0)					\
 		LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);	\
-	if (!_mtx_release_lock((mp), _tid))				\
+	if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))	\
 		_mtx_unlock_sleep((mp), (opts), (file), (line));	\
 } while (0)
 
diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
index f8947c5..6b4f505 100644
--- a/sys/sys/rwlock.h
+++ b/sys/sys/rwlock.h
@@ -96,7 +96,7 @@
 #define	__rw_wlock(rw, tid, file, line) do {				\
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
-	if (!_rw_write_lock((rw), _tid))				\
+	if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
 		_rw_wlock_hard((rw), _tid, (file), (line));		\
 	else 								\
 		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw,	\
@@ -112,7 +112,7 @@
 	else {								\
 		LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,	\
 		    LOCKSTAT_WRITER);					\
-		if (!_rw_write_unlock((rw), _tid))			\
+		if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
 			_rw_wunlock_hard((rw), _tid, (file), (line));	\
 	}								\
 } while (0)
diff --git a/sys/sys/sx.h b/sys/sys/sx.h
index 96a664f..144cab5 100644
--- a/sys/sys/sx.h
+++ b/sys/sys/sx.h
@@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
 	uintptr_t tid = (uintptr_t)td;
 	int error = 0;
 
-	if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+	if (sx->sx_lock != SX_LOCK_UNLOCKED ||
+	    !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
 		error = _sx_xlock_hard(sx, tid, opts, file, line);
 	else 
 		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
@@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
 	if (sx->sx_recurse == 0)
 		LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
 		    LOCKSTAT_WRITER);
-	if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
+	if (sx->sx_lock != tid ||
+	    !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
 		_sx_xunlock_hard(sx, tid, file, line);
 }