svn commit: r358733 - head/sys/sys

Mateusz Guzik mjguzik at gmail.com
Thu Mar 12 02:16:21 UTC 2020


On 3/10/20, Konstantin Belousov <kostikbel at gmail.com> wrote:
> On Tue, Mar 10, 2020 at 12:22:09AM +0100, Mateusz Guzik wrote:
>> On 3/9/20, Konstantin Belousov <kostikbel at gmail.com> wrote:
>> > On Mon, Mar 09, 2020 at 01:56:17AM +0100, Mateusz Guzik wrote:
>> >> On 3/8/20, Mateusz Guzik <mjg at freebsd.org> wrote:
>> >> > Author: mjg
>> >> > Date: Sun Mar  8 00:22:32 2020
>> >> > New Revision: 358733
>> >> > URL: https://svnweb.freebsd.org/changeset/base/358733
>> >> >
>> >> > Log:
>> >> >   seqc: tidy up
>> >> >
>> >> >   - avoid hand-rolled read
>> >> >   - match begin/end in terms of fence style
>> >> >
>> >>
>> >> There were off lists questions about this so let me clarify.
>> >>
>> >> The first bit is a cosmetic change, but the second one is not.
>> >>
>> >> > Modified:
>> >> >   head/sys/sys/seqc.h
>> >> >
>> >> > Modified: head/sys/sys/seqc.h
>> >> > ==============================================================================
>> >> > --- head/sys/sys/seqc.h	Sat Mar  7 15:37:23 2020	(r358732)
>> >> > +++ head/sys/sys/seqc.h	Sun Mar  8 00:22:32 2020	(r358733)
>> >> > @@ -66,7 +66,8 @@ static __inline void
>> >> >  seqc_write_end(seqc_t *seqcp)
>> >> >  {
>> >> >
>> >> > -	atomic_store_rel_int(seqcp, *seqcp + 1);
>> >> > +	atomic_thread_fence_rel();
>> >> > +	*seqcp += 1;
>> >> >  	MPASS(!seqc_in_modify(*seqcp));
>> >> >  	critical_exit();
>> >> >  }
>> >>
>> >> For correct operation the counter has to be modified *before* any work
>> >> is done and *after* it is completed.
>> >>
>> >> Consider a trivial case:
>> >> seqc++;
>> >> foo[0] = 0;
>> >> foo[1] = 1;
>> >> seqc++;
>> >>
>> >> There are 2 ways in which this can be mucked with:
>> >> - the compiler can be looking at reordering or reworking the
>> >> increments (e.g., collapsing them into just one += 2 instead). A
>> >> compiler barrier prevents that.
>> >> - even with generated machine code which increments in the correct
>> >> places, the CPU can be looking at reordering the operations (which is
>> >> heavily dependent on the architecture); in particular it could end up
>> >> making the stores visible in the order seqc++; foo[1] = 1;, which
>> >> would defeat the point. This is where release fences come in.
>> >>
>> >> With the patched code there is:
>> >> seqc++;
>> >> atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
>> >> starts getting modified */
>> >> foo[0] = 0;
>> >> foo[1] = 1;
>> >> atomic_thread_fence_rel(); /* make sure modifications to foo don't
>> >> leak past this spot */
>> >> seqc++;
>> >>
>> >> In comparison, the previous code was:
>> >> seqc++;
>> >> atomic_thread_fence_rel(); /* make sure seqc++ is issued before foo
>> >> starts getting modified */
>> >> foo[0] = 0;
>> >> foo[1] = 1;
>> >> atomic_store_rel_int(seqcp, *seqcp + 1); /* make sure modifications to
>> >> foo don't leak past this spot and update seqc immediately */
>> >>
>> >> There are 2 differences here -- one is an improvement and the second
>> >> one does not matter.
>> >> The win: the previous code forces the compiler to compute the
>> >> incremented value and then store it. On amd64 this translates to the
>> >> following (with rdx holding the address of the counter):
>> >>
>> >> mov    (%rdx),%eax /* load the value */
>> >> add    $0x1,%eax    /* add 1 */
>> >> mov    %eax,(%rdx) /* store it */
>> >>
>> >> On patched kernel this is:
>> >> addl   $0x1,(%rdx)
>> >>
>> >> which is clearly much nicer.
>> > I am not sure that the new code on amd64 is much nicer.
>> >
>> > But the point was that on non-amd64, i.e. armv8 and potentially on
>> > powerpc 3.0, the patch replaces the release store with a full barrier.
>> > The latter is (much) slower.
>> >
>>
>> If an arch performs something significantly more expensive here, it's
>> probably a performance bug in the port.
> It is a question of how much hardware support there is for fine-grained
> fences. Relaxed architectures already add some optimized operations, but
> they have not (yet) implemented the complete C11 optimized set.
> I think they will eventually converge.
>
> Until they do, I think it is more important to not pessimize the
> to-be Tier1 arches than to collapse three fast integer instructions
> on amd64 into one rmw op.
>
>> But perhaps more importantly
>> there are significantly more frequent users of
>> atomic_thread_fence_rel, like refcount(9). If the issue is real that
>> should be looked at.
> To remove the standalone release fence from refcount(9), we would need
> 	atomic_fetchadd_rel_int().
>

To reiterate, if atomic_thread_fence_rel is indeed very expensive on
certain architectures compared to other _rel variants, the kernel is
already heavily pessimized because of the use of said standalone fence in
refcount(9) and a few other places, and this needs to be fixed.

If the cost can't be reduced (e.g., perhaps this can do a no-op store
+ fence somewhere, akin to how the lock xadd 0,(%esp) trickery used to
work), then the atomic APIs do indeed need to be completed w.r.t.
adding all the missing acq/rel variants (that's at least fetchadd and
swap).
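
To illustrate what that buys, here is a minimal sketch of the
refcount(9) release path -- simplified, not the actual sys/refcount.h
code, and atomic_fetchadd_rel_int() is the hypothetical variant that
does not exist today:

/* today: standalone release fence followed by a relaxed fetchadd */
static __inline bool
refcount_release_sketch(volatile u_int *count)
{

	atomic_thread_fence_rel();
	if (atomic_fetchadd_int(count, -1) > 1)
		return (false);
	/* last reference went away; synchronize with the other releases */
	atomic_thread_fence_acq();
	return (true);
}

/* with the missing variant the fence folds into the RMW op itself */
static __inline bool
refcount_release_sketch_rel(volatile u_int *count)
{

	if (atomic_fetchadd_rel_int(count, -1) > 1)
		return (false);
	atomic_thread_fence_acq();
	return (true);
}

On amd64 the two are effectively equivalent (the standalone release
fence is just a compiler barrier there), but on a relaxed arch the
second form could map to a release-ordered atomic add instead of a
separate full barrier.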

We can also extend the API to do the relevant addition/subtraction without
guaranteeing atomicity against concurrent updates, but while still
providing the relevant fences. This would preserve the intended state
achieved with this commit while not interfering with others.
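
A minimal sketch of what such a primitive could look like, assuming a
single writer; the name is made up and the per-arch split is only
illustrative. This shape fits the seqc_write_end() side, where prior
stores must not leak past the increment:

#if defined(__amd64__) || defined(__i386__)
/*
 * Plain stores are already release-ordered on x86; a compiler barrier is
 * enough and the add can compile to a single add-to-memory instruction.
 */
#define	seqc_add_rel(p, v) do {			\
	__compiler_membar();			\
	*(p) += (v);				\
} while (0)
#else
/*
 * Non-atomic read-modify-write, but the store itself carries release
 * ordering (e.g. stlr on armv8), avoiding a standalone fence.
 */
#define	seqc_add_rel(p, v)	atomic_store_rel_int((p), *(p) + (v))
#endif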

I can add the missing amd64 bits if other people cover their
architectures, but otherwise I'm not interested in doing anything
here.

seqc itself, apart from being a minor consumer compared to refcount,
was already using the standalone fence even prior to this change, so
I don't think there is any rush to do anything with regard to this
particular commit.
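
For reference, after this commit the write-side pair reads roughly as
below; this is paraphrased, consult sys/sys/seqc.h for the
authoritative version:

static __inline void
seqc_write_begin(seqc_t *seqcp)
{

	critical_enter();
	MPASS(!seqc_in_modify(*seqcp));
	*seqcp += 1;			/* counter goes odd: write in progress */
	atomic_thread_fence_rel();	/* increment visible before protected stores */
}

static __inline void
seqc_write_end(seqc_t *seqcp)
{

	atomic_thread_fence_rel();	/* protected stores visible before increment */
	*seqcp += 1;			/* counter goes even again */
	MPASS(!seqc_in_modify(*seqcp));
	critical_exit();
}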

>>
>> >>
>> >> the non-problem: the code does not issue a release fence after
>> >> incrementing the counter the second time, meaning that in principle
>> >> the update can become visible arbitrarily late, possibly disturbing
>> >> code which waits for the counter to become even. This is not a problem
>> >> because whoever modifies it is supposed to hold a lock and release it
>> >> shortly after, which also posts a release fence and consequently makes
>> >> sure the counter update is visible. Also note these routines come with
>> >> preemption disablement around the modification -- if the thread gets
>> >> preempted as a result of critical_exit from seqc_write_end, it will
>> >> also publish the updated counter. If the updater never does anything
>> >> which flushes the store buffer, then in the absolute worst case it
>> >> will eventually get preempted, at which point once more we are
>> >> covered. Note that code which wants more deterministic behavior will
>> >> want to use seqc_read_any, test the counter once and, if it is caught
>> >> odd, resort to locking.
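
(For completeness, the reader-side fallback described above would look
roughly like this sketch; obj, OBJ_LOCK() and the field are made up.)

	seqc = seqc_read_any(&obj->seqc);
	if (seqc_in_modify(seqc)) {
		/* writer in progress: don't spin, take the lock instead */
		OBJ_LOCK(obj);
		v = obj->field;
		OBJ_UNLOCK(obj);
		return (v);
	}
	v = obj->field;				/* lockless snapshot */
	if (!seqc_consistent(&obj->seqc, seqc)) {
		/* raced with a writer; redo the read under the lock */
		OBJ_LOCK(obj);
		v = obj->field;
		OBJ_UNLOCK(obj);
	}
	return (v);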
>> >>
>> >> > @@ -84,7 +85,7 @@ seqc_read(const seqc_t *seqcp)
>> >> >  	seqc_t ret;
>> >> >
>> >> >  	for (;;) {
>> >> > -		ret = atomic_load_acq_int(__DECONST(seqc_t *, seqcp));
>> >> > +		ret = seqc_read_any(seqcp);
>> >> >  		if (__predict_false(seqc_in_modify(ret))) {
>> >> >  			cpu_spinwait();
>> >> >  			continue;
>> >>
>> >>
>> >> --
>> >> Mateusz Guzik <mjguzik gmail.com>
>> >
>>
>>
>> --
>> Mateusz Guzik <mjguzik gmail.com>
>


-- 
Mateusz Guzik <mjguzik gmail.com>

