memory barriers in bus_dmamap_sync() ?
scottl at samsco.org
Wed Jan 11 16:59:53 UTC 2012
On Jan 11, 2012, at 9:29 AM, Luigi Rizzo wrote:
> On Wed, Jan 11, 2012 at 10:05:28AM -0500, John Baldwin wrote:
>> On Tuesday, January 10, 2012 5:41:00 pm Luigi Rizzo wrote:
>>> On Tue, Jan 10, 2012 at 01:52:49PM -0800, Adrian Chadd wrote:
>>>> On 10 January 2012 13:37, Luigi Rizzo <rizzo at iet.unipi.it> wrote:
>>>>> I was glancing through manpages and implementations of bus_dma(9)
>>>>> and i am a bit unclear on what this API (in particular, bus_dmamap_sync() )
>>>>> does in terms of memory barriers.
>>>>> I see that the x86/amd64 and ia64 code only does the bounce buffers.
>> That is because x86 in general does not need memory barriers. ...
> maybe they are not called memory barriers but for instance
> how do i make sure, even on the x86, that a write to the NIC ring
> is properly flushed before the write to the 'start' register occurs ?
Flushed from where? The CPU's cache, or the device memory and PCI bus? I already told you that x86/64 is fundamentally designed around bus snooping, and John already told you that we map device memory to be uncached. Also, PCI guarantees that reads and writes are retired in order, and that reads are therefore flushing barriers. So let's take two scenarios.

In the first scenario, the NIC descriptors are in device memory, so the driver has to do bus_space accesses to write them:
1. driver writes to the descriptors. These may or may not hang out in the cpu's cache, though they probably won't because we map PCI device memory as uncachable. But let's say for the sake of argument that they are cached.
2. driver writes to the 'go' register on the card. This may or may not be in the cpu's cache, as in step 1.
3. The writes get flushed out of the cpu and onto the host bus. Again, the x86/64 architecture guarantees that these writes won't be reordered.
4. The writes get onto the PCI bus and buffered at the first bridge.
5. PCI ordering rules keep the writes in order, and they eventually make it to the card in the same order that the driver executed them.
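In driver terms, the device-memory scenario looks roughly like the sketch below. This is a compilable userland illustration, not real driver code: the `volatile` array stands in for an uncached mapped BAR, where an actual driver would go through `bus_space_write_4()` on the register tag/handle, and the register offsets are hypothetical.

```c
#include <stdint.h>

/* Stand-in for a BAR mapped uncached; a real driver would use
 * bus_space_write_4(tag, handle, offset, value) here instead. */
static volatile uint32_t dev_mem[64];

#define DESC_BASE 0	/* descriptor ring lives in device memory */
#define REG_GO	  63	/* hypothetical 'go' doorbell register */

static void
post_descriptor(uint32_t addr, uint32_t len)
{
	/* Steps 1-2: write the descriptor, then the doorbell.  On
	 * x86/64, stores to uncached memory reach the host bus in
	 * program order, and PCI ordering rules keep the posted
	 * writes in order all the way to the card, so no explicit
	 * barrier is needed between these stores. */
	dev_mem[DESC_BASE + 0] = addr;
	dev_mem[DESC_BASE + 1] = len;
	dev_mem[REG_GO] = 1;
}
```

The point of the sketch is that the ordering comes from the platform (uncached mapping plus PCI write ordering), not from anything the driver does explicitly.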
In the second scenario, the descriptors live in host memory and the card DMAs them in:
1. driver writes to the descriptors in host memory. This memory is mapped as cache-able, so these writes hang out in the CPU.
2. driver writes to the 'go' register on the card. This may or may not hang out in the cpu's cache, but likely won't as discussed previously.
3. The 'go' write eventually makes its way down to the card, and the card starts its processing.
4. the card masters a PCI read for the descriptor data, and the request goes up the pci bus to the host bridge
5. thanks to the fundamental design guarantees on x86/64, the pci host bridge, memory controller, and cpu all snoop each other. In this case, the cpu sees the read come from the pci host bridge, knows that it's for data that's in its cache, and intercepts and fills the request. Coherency is preserved!
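This second scenario is the usual bus_dma(9) pattern: fill the descriptor in host memory, issue the PREWRITE sync, then ring the doorbell. The sketch below compiles in userland by stubbing out the kernel call; `desc_ring`, `doorbell`, and the stub itself are illustrative stand-ins, where a real driver would call `bus_dmamap_sync(dmat, map, BUS_DMASYNC_PREWRITE)` on its DMA tag and map.

```c
#include <stdint.h>

/* Userland stub for the kernel API, for illustration only. */
#define BUS_DMASYNC_PREWRITE 0x04
static int sync_calls;			/* records that the sync ran */
static void bus_dmamap_sync_stub(int op) { (void)op; sync_calls++; }

struct desc { uint32_t addr; uint32_t len; };

static struct desc desc_ring[8];	/* cacheable host memory */
static volatile uint32_t doorbell;	/* stands in for the 'go' register */

static void
enqueue(uint32_t addr, uint32_t len)
{
	/* Step 1: write the descriptor; it may sit in the CPU cache. */
	desc_ring[0].addr = addr;
	desc_ring[0].len = len;

	/* On x86 this sync does nothing beyond bounce-buffer handling,
	 * because the host bridge snoops the CPU cache.  On other
	 * architectures it may flush caches or insert barriers, which
	 * is why portable drivers must still call it. */
	bus_dmamap_sync_stub(BUS_DMASYNC_PREWRITE);

	/* Step 2: ring the doorbell.  The card then masters a PCI read
	 * for the descriptor, and snooping keeps it coherent. */
	doorbell = 1;
}
```

The sync call is the portability seam: a no-op on snooping platforms, a cache flush or barrier elsewhere.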
Explicit barriers aren't needed in either scenario; everything will retire correctly and in order. The only caveat is the buffering that happens on the PCI bus. A write by the host might take a relatively long and indeterminate time to reach the card thanks to this buffering and the bus being busy. To guarantee that you know when the write has been delivered and retired, you can do a read immediately after the write. On some systems, this might also boost the transaction priority of the write and get it down faster, but that's really not a reliable guarantee. All you'll know is that when the read completes, the write prior to it has also completed.
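The read-after-write trick above can be sketched as follows. Again this is a userland illustration with a `volatile` array standing in for uncached device registers; a real driver would pair `bus_space_write_4()` with `bus_space_read_4()`, and the register names are hypothetical.

```c
#include <stdint.h>

static volatile uint32_t regs[2];	/* stand-in for mapped device registers */
#define REG_GO	   0
#define REG_STATUS 1

/* Write the doorbell, then read any register on the same device.
 * PCI must retire the posted write before completing the read, so
 * when the read returns, the write is known to have reached the
 * card.  What you learn is completion, not speed. */
static uint32_t
ring_and_flush(void)
{
	regs[REG_GO] = 1;		/* posted write: may be buffered */
	return regs[REG_STATUS];	/* read pushes the write ahead of it */
}
```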
Where barriers _are_ needed is in interrupt handlers, and I can discuss that if you're interested.