[PATCH] microoptimize by trying to avoid locking a locked mutex
Mateusz Guzik
mjguzik at gmail.com
Fri Nov 6 00:55:29 UTC 2015
On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
> > On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
> > > On 5 November 2015 at 11:26, Mateusz Guzik <mjguzik at gmail.com>
> > > wrote:
> > > > On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
> > > > > On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov
> > > > > wrote:
> > > > > > On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik
> > > > > > wrote:
> > > > > > > mtx_lock will unconditionally try to grab the lock and if
> > > > > > > that fails,
> > > > > > > will call __mtx_lock_sleep which will immediately try to do
> > > > > > > the same
> > > > > > > atomic op again.
> > > > > > >
> > > > > > > So, the obvious microoptimization is to check the state in
> > > > > > > __mtx_lock_sleep and avoid the operation if the lock is not
> > > > > > > free.
> > > > > > >
> > > > > > > This gives me ~40% speedup in a microbenchmark of 40 find
> > > > > > > processes
> > > > > > > traversing tmpfs and contending on mount mtx (only used as
> > > > > > > an easy
> > > > > > > benchmark, I have WIP patches to get rid of it).
> > > > > > >
> > > > > > > Second part of the patch is optional and just checks the
> > > > > > > state of the
> > > > > > > lock prior to doing any atomic operations, but it gives a
> > > > > > > very modest
> > > > > > > speed up when applied on top of the __mtx_lock_sleep
> > > > > > > change. As such,
> > > > > > > I'm not going to defend this part.
> > > > > > Shouldn't the same consideration applied to all spinning
> > > > > > loops, i.e.
> > > > > > also to the spin/thread mutexes, and to the spinning parts of
> > > > > > sx and
> > > > > > lockmgr ?
> > > > >
> > > > > I agree. I think both changes are good and worth doing in our
> > > > > other
> > > > > primitives.
> > > > >
> > > >
> > > > I glanced over e.g. rw_rlock and it did not have the issue, now
> > > > that I
> > > > see _sx_xlock_hard it wuld indeed use fixing.
> > > >
> > > > Expect a patch in few h for all primitives I'll find. I'll stress
> > > > test
> > > > the kernel, but it is unlikely I'll do microbenchmarks for
> > > > remaining
> > > > primitives.
> > >
> > > Is this stuff you're proposing still valid for non-x86 platforms?
> >
> > Yes. It just does a read before trying the atomic compare and swap
> > and
> > falls through to the hard case as if the atomic op failed if the
> > result
> > of the read would result in a compare failure.
> >
>
> The atomic ops include barriers, the new pre-read of the variable
> doesn't. Will that cause problems, especially for code inside a loop
> where the compiler may decide to shuffle things around?
>
Lock value is volatile, so the compiler should not screw us up. I removed
the store to the variable due to a different concern. In worst case we
would have slept instead of spinning. (see below)
> I suspect the performance gain will be biggest on the platforms where
> atomic ops are expensive (I gather from various code comments that's
> the case on x86).
>
I don't know about other architectures, on x86 atomic op performance on
contended cachelines deteriorates drastically.
=======================================================
For the most port the patch below does 2 simple kinds of changes:
1. in macros for lock/unlock primitives the lock value is fetched and
compared against relevant _UNLOCKED macro prior to the atomic op
2. in functions:
before:
while (!_mtx_obtain_lock(m, tid)) {
v = m->mtx_lock;
if (v != MTX_UNOWNED) {
....
}
.....
}
after:
for (;;) {
if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
break;
v = m->mtx_lock;
if (v != MTX_UNOWNED) {
....
}
.....
}
The original patch preloaded the 'v' variable. If the value was
MTX_UNOWNED and _mtx_obtain_lock failed, we just lost the race. In
order to spin waiting for the other thread to release the lock, we have
to re-read the variable. Thus for simplicity it is no longer stored in
'v'. Note this is contrary to kib's earlier suggestion which would put
such threads directly to sleep. I don't have proper measurements to have
any strong opinion here, I went the aforementioned route to minimise
changes.
Note this is a trivial patch to improve the situation a little bit, I
have no interest in trying to polish these primitives at this time.
For overall results, on a machine with 32GB of RAM and the following:
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class
CPU)
FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads
this reduced make -j 40 kernel in a 10.2 jail from ~2:15 to ~2:05.
=======================================================
diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index aa67180..198e93e 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
break;
}
- while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
- tid)) {
+ for (;;) {
+ if (lk->lk_lock == LK_UNLOCKED &&
+ atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+ break;
#ifdef HWPMC_HOOKS
PMC_SOFT_CALL( , , lock, failed);
#endif
@@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
__func__, iwmesg, file, line);
}
- while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
+ for (;;) {
+ if (lk->lk_lock == LK_UNLOCKED &&
+ atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+ break;
+
#ifdef HWPMC_HOOKS
PMC_SOFT_CALL( , , lock, failed);
#endif
diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
index bec8f6b..51e82a3 100644
--- a/sys/kern/kern_mutex.c
+++ b/sys/kern/kern_mutex.c
@@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
all_time -= lockstat_nsecs(&m->lock_object);
#endif
- while (!_mtx_obtain_lock(m, tid)) {
+ for (;;) {
+ if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+ break;
#ifdef KDTRACE_HOOKS
spin_cnt++;
#endif
@@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
#ifdef KDTRACE_HOOKS
spin_time -= lockstat_nsecs(&m->lock_object);
#endif
- while (!_mtx_obtain_lock(m, tid)) {
-
+ for (;;) {
+ if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+ break;
/* Give interrupts a chance while we spin. */
spinlock_exit();
while (m->mtx_lock != MTX_UNOWNED) {
@@ -675,7 +678,9 @@ retry:
m->lock_object.lo_name, file, line));
WITNESS_CHECKORDER(&m->lock_object,
opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
- while (!_mtx_obtain_lock(m, tid)) {
+ for (;;) {
+ if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+ break;
if (m->mtx_lock == tid) {
m->mtx_recurse++;
break;
diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index 6541724..6a904d2 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
all_time -= lockstat_nsecs(&rw->lock_object);
state = rw->rw_lock;
#endif
- while (!_rw_write_lock(rw, tid)) {
+ for (;;) {
+ if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
+ break;
#ifdef KDTRACE_HOOKS
spin_cnt++;
#endif
diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
index 96e117b..2a81c04 100644
--- a/sys/kern/kern_sx.c
+++ b/sys/kern/kern_sx.c
@@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
all_time -= lockstat_nsecs(&sx->lock_object);
state = sx->sx_lock;
#endif
- while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
+ for (;;) {
+ if (sx->sx_lock == SX_LOCK_UNLOCKED &&
+ atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+ break;
#ifdef KDTRACE_HOOKS
spin_cnt++;
#endif
diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
index a9ec072..0443922 100644
--- a/sys/sys/mutex.h
+++ b/sys/sys/mutex.h
@@ -185,7 +185,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
#define __mtx_lock(mp, tid, opts, file, line) do { \
uintptr_t _tid = (uintptr_t)(tid); \
\
- if (!_mtx_obtain_lock((mp), _tid)) \
+ if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
_mtx_lock_sleep((mp), _tid, (opts), (file), (line)); \
else \
LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire, \
@@ -203,7 +203,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
uintptr_t _tid = (uintptr_t)(tid); \
\
spinlock_enter(); \
- if (!_mtx_obtain_lock((mp), _tid)) { \
+ if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
if ((mp)->mtx_lock == _tid) \
(mp)->mtx_recurse++; \
else \
@@ -232,7 +232,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
\
if ((mp)->mtx_recurse == 0) \
LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp); \
- if (!_mtx_release_lock((mp), _tid)) \
+ if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid)) \
_mtx_unlock_sleep((mp), (opts), (file), (line)); \
} while (0)
diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
index f8947c5..6b4f505 100644
--- a/sys/sys/rwlock.h
+++ b/sys/sys/rwlock.h
@@ -96,7 +96,7 @@
#define __rw_wlock(rw, tid, file, line) do { \
uintptr_t _tid = (uintptr_t)(tid); \
\
- if (!_rw_write_lock((rw), _tid)) \
+ if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
_rw_wlock_hard((rw), _tid, (file), (line)); \
else \
LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw, \
@@ -112,7 +112,7 @@
else { \
LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw, \
LOCKSTAT_WRITER); \
- if (!_rw_write_unlock((rw), _tid)) \
+ if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
_rw_wunlock_hard((rw), _tid, (file), (line)); \
} \
} while (0)
diff --git a/sys/sys/sx.h b/sys/sys/sx.h
index 96a664f..144cab5 100644
--- a/sys/sys/sx.h
+++ b/sys/sys/sx.h
@@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
uintptr_t tid = (uintptr_t)td;
int error = 0;
- if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+ if (sx->sx_lock != SX_LOCK_UNLOCKED ||
+ !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
error = _sx_xlock_hard(sx, tid, opts, file, line);
else
LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
@@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
if (sx->sx_recurse == 0)
LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
LOCKSTAT_WRITER);
- if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
+ if (sx->sx_lock != tid ||
+ !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
_sx_xunlock_hard(sx, tid, file, line);
}
More information about the freebsd-current
mailing list