svn commit: r309143 - head/sys/libkern

Mateusz Guzik mjguzik at gmail.com
Sat Dec 10 21:11:41 UTC 2016


On Wed, Dec 07, 2016 at 11:41:21AM -0800, Gleb Smirnoff wrote:
> On Fri, Nov 25, 2016 at 05:00:10PM +0100, Mateusz Guzik wrote:
> M> On Fri, Nov 25, 2016 at 01:49:33PM +0000, Fabien Thomas wrote:
> M> > Author: fabient
> M> > Date: Fri Nov 25 13:49:33 2016
> M> > New Revision: 309143
> M> > URL: https://svnweb.freebsd.org/changeset/base/309143
> M> > 
> M> > Log:
> M> >   In a dual processor system (2*6 cores) during IPSec throughput tests,
> M> >   we see a lot of contention on the arc4 lock, used to generate the IV
> M> >   of the ESP output packets.
> M> >   
> M> >   The idea of this patch is to split this mutex in order to reduce the
> M> >   contention on this lock.
> M> >   
> M> > +MALLOC_DEFINE(M_ARC4RANDOM, "arc4random", "arc4random structures");
> M> >  
> M> > -static u_int8_t arc4_randbyte(void);
> M> > +struct arc4_s {
> M> > +	u_int8_t i, j;
> M> > +	int numruns;
> M> > +	u_int8_t sbox[256];
> M> > +	time_t t_reseed;
> M> > +
> M> > +	struct mtx mtx;
> M> > +};
> M> > +
> M> 
> M> Why is the lock at the end? Right now you got false sharing with the
> M> next entry.
> M> 
> M> That said, I suggest:
> M> 1. moving the lock to the beginning
> M> 2. annotating the struct with __aligned(CACHE_LINE_SIZE)
> 
> It could be even better not to allocate them with regular malloc at all,
> but to put them into the per cpu areas. I haven't tried, but looks like
> the DPCPU mechanism defined in sys/pcpu.h would fit that well.
> 

Oh. Now that I read your comment I re-read the patch, and indeed it
allocates the struct for each online cpu, so perhaps it should indeed
migrate to DPCPU and automagically get NUMA-friendly at some point.
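
Not that I tried it, but a rough sketch of the sys/pcpu.h route (names are
invented just to illustrate the interface, this is not the committed code):

	#include <sys/pcpu.h>

	DPCPU_DEFINE(struct arc4_s, arc4inst_pcpu);

	[..]
		struct arc4_s *arc4;

		critical_enter();	/* stay on this cpu's instance */
		arc4 = DPCPU_PTR(arc4inst_pcpu);
		/* consume arc4->sbox, update arc4->numruns, etc. */
		critical_exit();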

This prompted me to look some more, and I think the patch does not really
scale; it just got rid of the issue immediately visible with lock profiling.

arc4rand contains:
[..]
        if (reseed || atomic_cmpset_int(&arc4rand_iniseed_state,
                        ARC4_ENTR_HAVE, ARC4_ENTR_SEED)) {
                ARC4_FOREACH(arc4)
                        arc4_randomstir(arc4);
        }

On amd64 this always dirties the cacheline containing
arc4rand_iniseed_state, whether or not the flag actually changes.

In my kernel the relevant cacheline is:
ffffffff81583f80 b vn_io_fault_prefault
ffffffff81583f88 b vn_io_faults_cnt
ffffffff81583f90 B arc4rand_iniseed_state
ffffffff81583f98 b arc4inst
		  ^^^^^^ the array
ffffffff81583fa0 b inet_ntoa.buf
ffffffff81583fb0 b static_ltree

That is, not only do competing cpus perform the atomic op, they additionally
ping-pong the cacheline just to compute the address of the target object,
since the arc4inst pointer sits on the same line.
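
For illustration only, padding the flag out to its own line would stop that
part (the name and the extern plumbing would obviously need sorting out,
this is just a sketch):

	/* sketch: give the seed-state flag a cacheline of its own */
	static struct {
		volatile u_int	state;
		char		pad[CACHE_LINE_SIZE - sizeof(u_int)];
	} arc4_seedstate __aligned(CACHE_LINE_SIZE);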

At the very least, attempting atomic_cmpset_int only if the initial
condition holds should improve things. The lock array pointer can be
__read_mostly-ified after the macro gets sorted out.
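
Roughly (untested):

        /* only pay for the write when the flag can actually flip */
        if (reseed ||
           (atomic_load_acq_int(&arc4rand_iniseed_state) == ARC4_ENTR_HAVE &&
            atomic_cmpset_int(&arc4rand_iniseed_state,
                        ARC4_ENTR_HAVE, ARC4_ENTR_SEED))) {
                ARC4_FOREACH(arc4)
                        arc4_randomstir(arc4);
        }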

-- 
Mateusz Guzik <mjguzik gmail.com>

