atomic ops
    Attilio Rao 
    attilio at freebsd.org
       
    Wed Oct 29 16:33:38 UTC 2014
    
    
  
On Wed, Oct 29, 2014 at 3:59 PM, John Baldwin <jhb at freebsd.org> wrote:
> On Tuesday, October 28, 2014 4:08:27 pm Attilio Rao wrote:
>> On Tue, Oct 28, 2014 at 6:53 PM, Andrew Turner <andrew at fubar.geek.nz> wrote:
>> > On Tue, 28 Oct 2014 15:33:06 +0100
>> > Attilio Rao <attilio at freebsd.org> wrote:
>> >> On Tue, Oct 28, 2014 at 3:25 PM, Andrew Turner <andrew at fubar.geek.nz>
>> >> wrote:
>> >> > On Tue, 28 Oct 2014 14:18:41 +0100
>> >> > Attilio Rao <attilio at freebsd.org> wrote:
>> >> >
>> >> >> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik at gmail.com>
>> >> >> wrote:
>> >> >> > As was mentioned sometime ago, our situation related to atomic
>> >> >> > ops is not ideal.
>> >> >> >
>> >> >> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
>> >> >> > provide full memory barriers, which is stronger than needed.
>> >> >> >
>> >> >> > Moreover, load is implemented as lock cmpxchg on var address, so
>> >> >> > it is additionally slower, especially when cpus compete.
>> >> >>
>> >> >> I already explained this once privately: full memory barriers are
>> >> >> not stronger than needed.
>> >> >> FreeBSD has different semantics than Linux. We historically
>> >> >> enforce a full barrier on _acq() and _rel() rather than just a
>> >> >> read and write barrier, hence we need a different implementation
>> >> >> than Linux. There is code that relies on this property, like the
>> >> >> locking primitives (release a mutex, for instance).
>> >> >
>> >> > On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
>> >> > there are only full barriers. On both 32 and 64-bit ARMv8, ARM has
>> >> > added support for load-acquire and store-release atomic
>> >> > instructions. For the use in atomic instructions we can assume
>> >> > these only operate on the address passed to them.
>> >> >
>> >> > It is unlikely we will use them in the 32-bit port; however, I would
>> >> > like to know the expected semantics of these atomic functions to
>> >> > make sure we get them correct in the arm64 port. I have been
>> >> > advised by one of the ARM Linux kernel maintainers on the problems
>> >> > they have found using these instructions but have yet to determine
>> >> > what our atomic functions guarantee.
>> >>
>> >> For FreeBSD the "reference doc" is atomic(9).
>> >> It clearly states:
>> >
>> > There may also be a difference between what it states, how they are
>> > implemented, and what developers assume they do. I'm trying to make
>> > sure I get them correct.
>>
>> atomic(9) is our reference so there should be no difference between
>> what it states and what all architectures implement.
>> I can say that x86 follows atomic(9) well. I'm not competent enough to
>> judge if all the !x86 arches follow it completely.
>> I can understand that developers may get confused. The FreeBSD scheme
>> is pretty unique. It comes from the fact that historically the membar
>> support was made to initially support x86. The super-widespread Linux
>> design, instead, tried to catch all architectures in its description.
>> It became very well known and I think it also "pushed" for companies
>> like Intel to invest in improving performance of things like explicit
>> read/write barriers, etc.
>
> Actually, it was designed to support ia64 (and specifically the .acq and
> .rel modifiers on the ld, st, and cmpxchg instructions).  Some of the
> language is wrong (and is my fault) in that they are not "read" and
> "write" barriers.  They truly are "acquire" and "release".  That said,
> x86 has stronger barriers than that, partly because on i386 there wasn't
> a whole lot of options (though atomic_store_rel on even i386 should just
> be a simple store).
>
>> >> The second variant of each operation includes a read memory barrier.
>> >> This barrier ensures that the effects of this operation are completed
>> >> before the effects of any later data accesses.  As a result, the
>> >> operation is said to have acquire semantics as it acquires a
>> >> pseudo-lock requiring further operations to wait until it has
>> >> completed.  To denote this, the suffix ``_acq'' is inserted into the
>> >> function name immediately prior to the ``_<type>'' suffix.  For
>> >> example, to subtract two integers ensuring that any later writes will
>> >> happen after the subtraction is performed, use
>> >> atomic_subtract_acq_int().
>> >
>> > It depends on the point we guarantee the acquire barrier to be. On ARMv8
>> > the function will be a load/modify/write sequence. If we use a
>> > load-acquire operation for atomic_subtract_acq_int, for example, for a
>> > pointer P and value to subtract X:
>> >
>> > loop:
>> >  load-acquire *P to N
>> >  perform N = N - X
>> >  store-exclusive N to *P
>> >  if the store failed goto loop
>> >
>> > where N and X are both registers.
>> >
>> > This will mean no access after this loop will happen before it, but
>> > they may happen within it, e.g. if there was a later access A the
>> > following may be possible:
>> >
>> > Load P
>> > Access A
>> > Store P
>>
>> No, this will be broken in FreeBSD if "Access A" is later.
>> If "Access A" is prior to the membar it doesn't really matter if it gets
>> interleaved with any of the operations in the atomic instruction.
>> Ideally, it could even surpass the Store P itself.
>> But if "Access A" is later (and you want to implement an _acq()
>> barrier) then it absolutely cannot get in the middle of the atomic_*
>> operation.
>
> Eh, that isn't broken.  It is subtle however.  The reason it isn't broken
> is that if any access to P occurs after the 'load P', then the store will
> fail and the load-acquire will be retried.  If A was accessed during the
> atomic op, the load-acquire during the retry will discard that and force A
> to be re-accessed.  If P is not accessed during the atomic op, then it is
> safe to access A during the atomic op itself.
This is specific to armv8, which I know 0 about. Good to know.
From a general point of view the description didn't seem ok.
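
For readers following along, the load/modify/store retry loop discussed above can be approximated in portable C11. This is only a sketch under stated assumptions: sketch_subtract_acq() is a hypothetical name, not FreeBSD's atomic_subtract_acq_int(), and C11's memory_order_acquire is used as a stand-in. Whether that ordering matches what atomic(9) promises is exactly what this thread is debating.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper (assumption, not the FreeBSD implementation):
 * subtract x from *p with C11 acquire ordering, written as an explicit
 * retry loop.  Returns the new value.
 */
unsigned int
sketch_subtract_acq(_Atomic unsigned int *p, unsigned int x)
{
	unsigned int old;

	assert(p != NULL);
	old = atomic_load_explicit(p, memory_order_relaxed);
	/*
	 * The weak compare-exchange may fail spuriously or because *p
	 * changed.  On failure it reloads the current value into 'old'
	 * and we retry, much like a failed store-exclusive sends the
	 * loop back to the load.
	 */
	while (!atomic_compare_exchange_weak_explicit(p, &old, old - x,
	    memory_order_acquire, memory_order_relaxed))
		;
	return (old - x);
}
```

A caller would pass a pointer to an _Atomic unsigned int and the amount to subtract. On targets with load-acquire/store-exclusive instructions a compiler may lower this to a loop of the shape quoted above, but that is up to the compiler and is not guaranteed by this code.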
>> > We know the store will happen because if it fails, e.g. another processor
>> > accesses *P, the store will have failed and the loop will iterate again.
>> >
>> > The other point is we can guarantee any store-release, and therefore
>> > any prior access, has happened before a later load-acquire even if it's
>> > on another processor.
>>
>> No, we can never make guarantees about the visibility of the operations
>> to other CPUs.
>> We just make guarantees on how the operations are posted on the system
>> bus (or how they are locally visible).
>> Keeping in mind that the FreeBSD model comes from x86, you can sense that
>> some things are sized on the x86 model, which doesn't have any rule or
>> ordering on global visibility of the operations.
>
> 1) Again, it's actually based on ia64.
>
> 2) x86 _does_ have rules on ordering of global visibility in that most
>    stores (aside from some SSE special cases) will become visible in
>    program order.  Now, you can't force the _timing_ of when the stores
>    become visible (and this is true in general, in MI code you can't
>    assume that a barrier is equivalent to a cache flush).
Yes, this is what I meant. You can't have a guarantee on the global
timing of the memory accesses.
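
The ordering-versus-timing distinction can be illustrated with a small C11 sketch. The names sketch_publish() and sketch_try_consume() are made up for this example and use C11 <stdatomic.h> rather than the atomic(9) primitives. The release/acquire pair only says that a reader who observes the flag also observes the payload written before it; it says nothing about how soon the flag becomes visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;		/* plain data, published via 'ready' */
static atomic_bool ready;	/* static storage: starts out false */

/* Writer side: fill in the data, then publish it with a release store. */
void
sketch_publish(int value)
{
	payload = value;
	atomic_store_explicit(&ready, true, memory_order_release);
}

/*
 * Reader side: returns true and copies the payload to *out only once
 * the flag has been observed.  Nothing here bounds *when* that happens,
 * so a reader that needs the data has to poll or block.
 */
bool
sketch_try_consume(int *out)
{
	assert(out != NULL);
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return (false);
	*out = payload;
	return (true);
}
```

In a real two-CPU setting one thread would call sketch_publish() while another loops on sketch_try_consume(). The loop is needed precisely because the barrier orders the accesses but does not act as a cache flush or a deadline.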
Attilio
-- 
Peace can only be achieved by understanding - A. Einstein
    
    