Memory barrier

Mon Sep 7 19:55:34 UTC 2015

> > Case 1:
> >    bus_write_1(region_0, ...);
> >    /* barrier here */
> >    DELAY(some_time);
> >
> > Case 2:
> >    bus_write_1(region_0, ...);
> >    /* barrier here */
> >    bus_write_1(region_2, ...);
> >
> > In the first one, I want the write to reach the device before the thread busy-waits.
> >
> > In the second one, I want the write to a device (e.g. power management) to
> >  complete before the write to another starts/completes.
> 
> I believe that the bus_write semantic includes the required serialization.
> E.g., on x86 all CPU write buffers are flushed before the write instruction
> is declared completed, because this is the semantic of the uncacheable
> memory.  For powerpc, the system automatically inserts powerpc_iomb() after
> the write, which is full sync.  I am not aware of other architectures.

I've found the implementation of bus_space_barrier for the ARM architecture (the one I'm interested in):

   void
   generic_bs_barrier(bus_space_tag_t t, bus_space_handle_t bsh, bus_size_t offset,
       bus_size_t len, int flags)
   {

           /*
            * dsb() will drain the L1 write buffer and establish a memory access
            * barrier point on platforms where that has meaning.  On a write we
            * also need to drain the L2 write buffer, because most on-chip memory
            * mapped devices are downstream of the L2 cache.  Note that this needs
            * to be done even for memory mapped as Device type, because while
            * Device memory is not cached, writes to it are still buffered.
            */
           dsb();
           if (flags & BUS_SPACE_BARRIER_WRITE) {
                   cpu_l2cache_drain_writebuf();
           }
   }

The ARM architecture specifies two _data_ barrier instructions: DMB and DSB. DMB orders memory accesses: accesses before the barrier are observed before accesses after it. DSB is stronger: no instruction after it executes until all memory accesses before it have completed. So DSB is the answer to Case 1, and either DMB or DSB is the answer to Case 2.
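
To make the two cases concrete, here is a minimal sketch of where the barrier call would sit. It is a user-space mock, not kernel code: bus_write_1(), bus_barrier() and DELAY() are replaced by stubs that only log the order of operations, struct resource is an empty stand-in, and the stub barrier simply mirrors the dsb() plus L2 drain sequence of generic_bs_barrier quoted above. The offsets, values, delay and flag constants are illustrative assumptions.

```c
/*
 * User-space mock: the stubs below stand in for the kernel's
 * bus_write_1(), bus_barrier() and DELAY(). They do not touch hardware;
 * they only record which operation happened, and in what order.
 */
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values for this mock. */
#define BUS_SPACE_BARRIER_READ  0x01
#define BUS_SPACE_BARRIER_WRITE 0x02

enum op { OP_WRITE, OP_DSB, OP_L2_DRAIN, OP_DELAY };

#define LOG_MAX 16
enum op op_log[LOG_MAX];
size_t op_log_len;

static void
record(enum op o)
{
	assert(op_log_len < LOG_MAX);
	op_log[op_log_len++] = o;
}

/* Hypothetical stand-in for the kernel's struct resource. */
struct resource { int id; };

/* Stub: a one-byte device write. */
void
bus_write_1(struct resource *r, unsigned long off, unsigned char val)
{
	(void)r; (void)off; (void)val;
	record(OP_WRITE);
}

/* Stub: mirrors generic_bs_barrier, dsb() then L2 drain on writes. */
void
bus_barrier(struct resource *r, unsigned long off, unsigned long len,
    int flags)
{
	(void)r; (void)off; (void)len;
	record(OP_DSB);
	if (flags & BUS_SPACE_BARRIER_WRITE)
		record(OP_L2_DRAIN);
}

/* Stub: busy-wait. */
void
DELAY(int usec)
{
	(void)usec;
	record(OP_DELAY);
}

/* Case 1: the write must reach the device before the busy-wait. */
void
case1(struct resource *region_0)
{
	bus_write_1(region_0, 0, 0x01);
	bus_barrier(region_0, 0, 1, BUS_SPACE_BARRIER_WRITE);
	DELAY(10);
}

/* Case 2: the first device write must complete before the second. */
void
case2(struct resource *region_0, struct resource *region_2)
{
	bus_write_1(region_0, 0, 0x01);
	bus_barrier(region_0, 0, 1, BUS_SPACE_BARRIER_WRITE);
	bus_write_1(region_2, 0, 0x01);
}
```

In both cases the write barrier sits between the first write and whatever must not overtake it. The mock only demonstrates the call order; whether the barrier is actually needed on a given platform is the question discussed in this thread.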

The implementation above showed me something I was not aware of: it also drains the L2 write buffer. In older implementations, the "PL310 Store Buffer did not have any automatic draining mechanism." (ARM CoreLink Level 2 Cache Controller (L2C-310 or PL310), r3 releases, Software Developers Errata Notice.) In newer implementations, writes to device memory are "Put in store buffer, not merged, immediately drained to L3." (CoreLink Level 2 Cache Controller L2C-310 Technical Reference Manual, Revision: r3p3.)

Leonardo