Re: git: ff1ae3b3639d - main - rangelocks: restore caching of the single rl entry in the struct thread
Date: Thu, 29 Aug 2024 20:22:51 UTC
On Thu, Aug 29, 2024 at 04:04:10PM -0400, Mark Johnston wrote:
> On Tue, Aug 06, 2024 at 04:06:39AM +0000, Konstantin Belousov wrote:
> > The branch main has been updated by kib:
> >
> > URL: https://cgit.FreeBSD.org/src/commit/?id=ff1ae3b3639d39a6485cfc655bf565cd04b9caa6
> >
> > commit ff1ae3b3639d39a6485cfc655bf565cd04b9caa6
> > Author: Konstantin Belousov <kib@FreeBSD.org>
> > AuthorDate: 2023-08-23 23:29:50 +0000
> > Commit: Konstantin Belousov <kib@FreeBSD.org>
> > CommitDate: 2024-08-06 04:05:58 +0000
> >
> > rangelocks: restore caching of the single rl entry in the struct thread
> >
> > Reviewed by: markj, Olivier Certner <olce.freebsd@certner.fr>
> > Tested by: pho
> > Sponsored by: The FreeBSD Foundation
> > Differential revision: https://reviews.freebsd.org/D41787
> > ---
> > sys/kern/kern_rangelock.c | 33 +++++++++++++++++++++++++++------
> > sys/kern/kern_thread.c | 1 +
> > sys/sys/rangelock.h | 1 +
> > 3 files changed, 29 insertions(+), 6 deletions(-)
> >
> > diff --git a/sys/kern/kern_rangelock.c b/sys/kern/kern_rangelock.c
> > index 355b2125dd9b..2f16569b19aa 100644
> > --- a/sys/kern/kern_rangelock.c
> > +++ b/sys/kern/kern_rangelock.c
> > @@ -82,8 +82,15 @@ static struct rl_q_entry *
> > rlqentry_alloc(vm_ooffset_t start, vm_ooffset_t end, int flags)
> > {
> > struct rl_q_entry *e;
> > -
> > - e = uma_zalloc_smr(rl_entry_zone, M_WAITOK);
> > + struct thread *td;
> > +
> > + td = curthread;
> > + if (td->td_rlqe != NULL) {
> > + e = td->td_rlqe;
> > + td->td_rlqe = NULL;
> > + } else {
> > + e = uma_zalloc_smr(rl_entry_zone, M_WAITOK);
> > + }
> > e->rl_q_next = NULL;
> > e->rl_q_free = NULL;
> > e->rl_q_start = start;
> > @@ -95,6 +102,12 @@ rlqentry_alloc(vm_ooffset_t start, vm_ooffset_t end, int flags)
> > return (e);
> > }
> >
> > +void
> > +rangelock_entry_free(struct rl_q_entry *e)
> > +{
> > + uma_zfree_smr(rl_entry_zone, e);
> > +}
> > +
> > void
> > rangelock_init(struct rangelock *lock)
> > {
> > @@ -106,6 +119,7 @@ void
> > rangelock_destroy(struct rangelock *lock)
> > {
> > struct rl_q_entry *e, *ep;
> > + struct thread *td;
> >
> > MPASS(!lock->sleepers);
> > for (e = (struct rl_q_entry *)atomic_load_ptr(&lock->head);
> > @@ -386,8 +400,10 @@ rangelock_lock_int(struct rangelock *lock, bool trylock, vm_ooffset_t start,
> > vm_ooffset_t end, int locktype)
> > {
> > struct rl_q_entry *e, *free, *x, *xp;
> > + struct thread *td;
> > enum RL_INSERT_RES res;
> >
> > + td = curthread;
> > for (res = RL_LOCK_RETRY; res == RL_LOCK_RETRY;) {
> > free = NULL;
> > e = rlqentry_alloc(start, end, locktype);
> > @@ -401,10 +417,15 @@ rangelock_lock_int(struct rangelock *lock, bool trylock, vm_ooffset_t start,
> > e = NULL;
> > }
> > for (x = free; x != NULL; x = xp) {
> > - MPASS(!rl_e_is_marked(x));
> > - xp = x->rl_q_free;
> > - MPASS(!rl_e_is_marked(xp));
> > - uma_zfree_smr(rl_entry_zone, x);
> > + MPASS(!rl_e_is_marked(x));
> > + xp = x->rl_q_free;
> > + MPASS(!rl_e_is_marked(xp));
> > + if (td->td_rlqe == NULL) {
> > + smr_synchronize(rl_smr);
>
> I think I had the impression that this was a rare case, but empirically
> it is not. As far as I can see, this smr_synchronize() call happens
> every time an entry is freed, which could be very frequent.
> smr_synchronize() bumps the global sequence counter and blocks until all
> CPUs have had a chance to observe the new value, so is quite expensive.
>
> I didn't try benchmarking yet, but pwrite3 from will-it-scale should be
> a good candidate.
>
> Why do we maintain this per-thread cache at all? IMO it would make more
> sense to unconditionally free the entry here. Or perhaps I'm missing
> something here.
Normally threads perform i/o syscalls, and each such syscall needs just one
rangelock operation.  In the absence of races with conflicting locks, the
rangelock acquisition does not need more than one rl_q_entry item.
In the previous rangelock implementation, caching that single entry was a
significant improvement, esp. in microbenchmarks where the rangelock
overhead becomes visible, like repeated writes or reads of the same single
byte.
Remembering that, I decided to keep the caching.
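
To illustrate, here is a minimal sketch of the free-path caching being
discussed.  The quoted diff is trimmed right after the smr_synchronize()
call, so the function name below and the stash into td->td_rlqe are
assumptions about the elided lines, not a verbatim copy of the committed
code:

/*
 * Sketch only: returning a freed rl_q_entry to the per-thread cache.
 * If the cache slot is empty, smr_synchronize() waits for all lockless
 * readers before the entry is stashed for reuse outside of an SMR
 * section; otherwise the entry goes back to the SMR zone.
 * The assignment to td->td_rlqe is assumed, not quoted.
 */
static void
rlqentry_release_sketch(struct rl_q_entry *x)
{
	struct thread *td;

	td = curthread;
	if (td->td_rlqe == NULL) {
		/* The call noted above as expensive. */
		smr_synchronize(rl_smr);
		td->td_rlqe = x;
	} else {
		uma_zfree_smr(rl_entry_zone, x);
	}
}

Since rlqentry_alloc() consumes td->td_rlqe first, the slot is usually
empty again by the time the entry is freed, which is why the
smr_synchronize() ends up running on essentially every free, as Mark
observed.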