svn commit: r323393 - in head/sys: sys vm

Mateusz Guzik mjguzik at gmail.com
Mon Sep 11 07:30:12 UTC 2017


On Mon, Sep 11, 2017 at 11:56:21AM +1000, Bruce Evans wrote:
> On Sun, 10 Sep 2017, Mateusz Guzik wrote:
>
> > Log:
> >  Move vmmeter atomic counters into dedicated cache lines
> >
> >  Prior to the change they were subject to extreme false sharing.
> >  In particular this change shaves about 3 seconds real time off
> >  -j 80 buildkernel.
>
> Changes that small are hard to measure.  Kernel builds take about 3
> seconds real time altogether (cheating slightly -- this is for a
> FreeBSD-4 kernel where the sources are held constant to provide a
> benchmark and a bloatometer for newer kernels and userlands which take
> 10-20 times longer).
>

Well, for me building 11.0 on tmpfs with 11.0 world reliably takes 1:25
(and now 1:22).

> > Modified: head/sys/sys/vmmeter.h
> > ==============================================================================
> > --- head/sys/sys/vmmeter.h  Sun Sep 10 18:08:25 2017        (r323392)
> > +++ head/sys/sys/vmmeter.h  Sun Sep 10 19:00:38 2017        (r323393)
> > @@ -60,6 +60,12 @@ struct vmtotal {
> > #if defined(_KERNEL) || defined(_WANT_VMMETER)
> > #include <sys/counter.h>
> >
> > +#ifdef _KERNEL
> > +#define VMMETER_ALIGNED    __aligned(CACHE_LINE_SIZE)
> > +#else
> > +#define VMMETER_ALIGNED
> > +#endif
> > +
>
> Using this breaks the "ABI" which is used by at least vmstat(1).
>
> This has some style bugs:
> - verbose macro name.  Using this unimproves the formatting later.
> - space instead of tab after #define.  This file is vaguely
>   KNF-conformant and had only 1 instance of this style bug before (for
>   the idempotency ifndef, which has 3 other style bugs:
>   - missing blank line before its #endif
>   - tab instead of space before the comment on its #endif
>   - backwards comment on its #endif).
>
> > /*
> >  * System wide statistics counters.
> >  * Locking:
> > @@ -126,14 +132,15 @@ struct vmmeter {
> >     u_int v_free_target;    /* (c) pages desired free */
> >     u_int v_free_min;       /* (c) pages desired free */
> >     u_int v_free_count;     /* (f) pages free */
> > -   u_int v_wire_count;     /* (a) pages wired down */
> > -   u_int v_active_count;   /* (q) pages active */
> >     u_int v_inactive_target; /* (c) pages desired inactive */
> > -   u_int v_inactive_count; /* (q) pages inactive */
> > -   u_int v_laundry_count;  /* (q) pages eligible for laundering */
>
> Moving these also breaks the "ABI".
>

The original patch attempted to preserve it, but I was told it does
not matter. Apparently that's indeed true, as evidenced by another
change made a few months ago which introduced the counter type.

> >     u_int v_pageout_free_min;   /* (c) min pages reserved for kernel */
> >     u_int v_interrupt_free_min; /* (c) reserved pages for int code */
> >     u_int v_free_severe;    /* (c) severe page depletion point */
> > +   u_int v_wire_count VMMETER_ALIGNED; /* (a) pages wired down */
>
> The more complicated declaration unimproves the formatting, especially
> since the macro name is so long.

See way below.

> Is there a technical reason to move to the end?  style(9) requires
> sorting on alignment, with the most aligned fields first, not last
> (it says to sort on size, but means alignment).  Is there a reason
> to reverse this here?
>

There is. The annotation only guarantees that the address of the field
will be aligned. Whatever follows in the struct will start filling the
same cacheline unless it is annotated as well. That is, in order to
prevent the next field from collapsing into the line, the top of the
struct would have to look like this:

        u_int v_wire_count VMMETER_ALIGNED;     /* (a) pages wired down */
        u_int v_active_count VMMETER_ALIGNED;   /* (a) pages active */
        u_int v_inactive_count VMMETER_ALIGNED; /* (a) pages inactive */
        u_int v_laundry_count VMMETER_ALIGNED;  /* (a) pages eligible for
                                                   laundering */
        counter_u64_t v_swtch VMMETER_ALIGNED;

That's significantly crappier.

The __exclusive_cache_line annotation on the vm_cnt object itself puts
it into a dedicated section where the linker guarantees that whatever
follows will not share the line.
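
For reference, the annotation goes on the object definition itself,
roughly like this (a sketch, not the verbatim source; the real
definition lives in sys/vm/vm_meter.c):

        /* the annotation is applied to the object, not the type */
        struct vmmeter __exclusive_cache_line vm_cnt;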

> > +   u_int v_active_count VMMETER_ALIGNED; /* (a) pages active */
> > +   u_int v_inactive_count VMMETER_ALIGNED; /* (a) pages inactive */
> > +   u_int v_laundry_count VMMETER_ALIGNED; /* (a) pages eligible for
> > +                                             laundering */
>
> This does more than move the other fields and unimprove their formatting.
> It also changes the lock annotations from (q) to (a).
>
> The formatting is ugliest for the last field.
>

Agreed.

> Naming the macro something like VMMA would make things fit better.  It is
> stupid to abbreviate VIRTUAL_MEMORY to VM but spell out ALIGNED,
> especially when the former is global and the latter is local.  If
> the namespace were documented, then the documentation would say that
> vm, v_ and VM* are reserved, perhaps without underscores, so it would
> not be namespace pollution to use VMMA.
>

I'm beyond indifferent on how to name the macro. This name was suggested
in the review and I just rolled with it. If you want it changed, please
argue with alc and/or markj.

That said, looking now at the struct I think its use should be retired
from the kernel. It can remain in headers for userspace use.

First, there are a bunch of counter(9) fields. I don't know the original
reasoning; I would expect these counters to be statically defined in a
per-cpu struct.
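
For illustration, counter(9) allocates its per-CPU storage dynamically;
usage looks roughly like this (a sketch with a made-up counter name, but
counter_u64_alloc/counter_u64_add/counter_u64_fetch are the stock
counter(9) interface):

        static counter_u64_t demo_cnt;          /* hypothetical */

        demo_cnt = counter_u64_alloc(M_WAITOK); /* per-CPU slots */
        counter_u64_add(demo_cnt, 1);           /* updates this CPU's slot */
        total = counter_u64_fetch(demo_cnt);    /* sums across all CPUs */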

Then we are left with a bunch of read-only fields. They can be grouped
together in a __read_mostly annotated object.
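
Say, something like this (hypothetical name, field selection only
indicative):

        /* the (c) constants from the struct, packed into one object */
        static struct {
                u_int v_free_target;    /* (c) pages desired free */
                u_int v_free_min;       /* (c) pages desired free */
                u_int v_free_severe;    /* (c) severe depletion point */
        } vm_params __read_mostly;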

One field, v_free_count, is protected by the free page queue lock. It
can be annotated by hand in whatever way makes sense.

And then there are the 4 atomics. They can just be regular globals
with __exclusive_cache_line.
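
Roughly (a sketch; atomic_add_int is the stock atomic(9) primitive):

        /* each hot counter gets a dedicated cache line */
        u_int __exclusive_cache_line v_wire_count;

        atomic_add_int(&v_wire_count, 1);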

However, I'm not interested in pursuing this. My primary motivation for
the change here was to get rid of an obvious & important bottleneck and
to stabilize performance over time. Interestingly, as different kernel
versions produce different binaries, the vm_cnt object's offset within
whatever cache line it occupied kept changing. This in turn altered the
false sharing, making performance fluctuate a little bit.

-- 
Mateusz Guzik <mjguzik gmail.com>

