[PATCH] randomized delay in locking primitives, take 2

Mon Aug 1 16:55:27 UTC 2016

On Mon, Aug 01, 2016 at 09:38:09AM -0700, Maxim Sobolev wrote:
> On Sun, Jul 31, 2016 at 1:36 PM, Mateusz Guzik <mjguzik at gmail.com> wrote:
> 
> > On Sun, Jul 31, 2016 at 07:03:08AM -0700, Adrian Chadd wrote:
> > > Hi,
> > >
> > > Did you test on any 1, 2, 4, 8 cpu machines? just to see if there are
> > > any performance degradations on lower count CPUs?
> > >
> >
> > I did not test on machines which physically have that few cpus, but I did
> > test the impact on microbenchmark with 2 and 4 threads on the 80-way
> > machine. There was no difference.
> >
> 
> Well, arguably running 4 threads on an 80-way machine is not quite the same
> as running the same 4 threads on a 4-way or 8-way machine. Unless you
> actually bind your test threads to specific CPUs, on a bigger system the
> scheduler is free to migrate your thread to another CPU if all 4 are
> spinning; this might not be an option on a smaller box. I suggest you at the
> very least re-run your benchmark on a virtual machine with a small CPU count
> assigned; it should be quite easy to do so on your monster box.
> 

The test does bind threads to cpus. Further, the delay is autotuned
based on the number of cpus in the system. I also once more stress that
the backoff is extremely small and we don't take full advantage of it
specifically so that detrimental behaviour is very unlikely.

So even on the 80-way machine, with an initial delay of 80 * 25, there
was no performance loss with only a few threads competing. The worry
would have been that the backoff helps only if all cpu threads compete
and hurts performance otherwise.

An actual 4-way machine will have only a 4 * 25 initial delay configured,
with a 4 * 25 * 10 upper limit. But note the actual delay is cpu_ticks() %
that value, so it is often even smaller.
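
To make the arithmetic concrete, here is a rough sketch in C. It is not
the code from the patch: the names backoff_initial, backoff_max and
backoff_pick are made up for illustration, and the ticks argument merely
stands in for the value cpu_ticks() would return. How the bound grows
from the initial value towards the upper limit is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants mirroring the numbers quoted above. */
#define BACKOFF_PER_CPU		25u	/* initial delay units per cpu */
#define BACKOFF_MAX_FACTOR	10u	/* upper limit = initial * 10 */

/* Initial delay bound: ncpus * 25. */
static uint32_t
backoff_initial(uint32_t ncpus)
{
	assert(ncpus > 0);
	return (ncpus * BACKOFF_PER_CPU);
}

/* Upper limit for the delay bound: ncpus * 25 * 10. */
static uint32_t
backoff_max(uint32_t ncpus)
{
	return (backoff_initial(ncpus) * BACKOFF_MAX_FACTOR);
}

/*
 * Actual delay for one round: ticks % bound, so always in [0, bound).
 * The caller passes in the tick counter so the sketch stays testable.
 */
static uint32_t
backoff_pick(uint64_t ticks, uint32_t bound)
{
	assert(bound > 0);
	return ((uint32_t)(ticks % bound));
}
```

With these assumptions a 4-way box gets backoff_initial(4) == 100 and
backoff_max(4) == 1000, while the 80-way box gets 2000 and 20000. The
delay picked for any single round lies anywhere below the current bound.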

tl;dr: this is really a simple patch and, at least on amd64, a pure win.

-- 
Mateusz Guzik <mjguzik gmail.com>