Partial cacheline flush problems on ARM and MIPS

Thu Aug 23 23:26:26 UTC 2012

On Aug 23, 2012, at 5:09 PM, Ian Lepore wrote:

> On Thu, 2012-08-23 at 15:50 -0600, Warner Losh wrote: 
>> On Aug 23, 2012, at 3:28 PM, Ian Lepore wrote:
>>> A recent innocuous change to the USB driver code caused intermittant
>>> errors in the umass(4) driver on ARM and MIPS platforms, and this
>> 
>> I think the proper solution is to segregate DMA and non-DMA parts of structures so that you don't have both sharing a cache line.
>> 
>> I also wonder why we don't allocate the DMA memory for these structures separately from the non-DMA parts.  This would eliminate the USB_CACHE_BYTES kludge (which is CPU dependent, not arch dependent) and move the knowledge of this junk into busdma layer where it belongs.  From my understanding of the issue, this would completely eliminate the problem forever!
>> 
>> Sharing a cacheline between something that is DMA aware and something that is just begging for trouble.  We're  doing more work than we need to to support this dubious feature and we'd be miles ahead if we could not share at all.
>> 
>> Warner
>> 
> 
> It seems to me that what we have here is a new type of constraint on DMA
> operations, and we need a way to communicate that constraint from the
> part of the platform support code that knows about it to the drivers and
> driver support code that needs to know.  The way we communicate DMA
> constraints is with a busdma tag, but right now that tag only
> communicates constraints that were needed for ISA and PCI busses, namely
> buffer alignment, boundary-crossing restrictions, and exclusion
> regions.  
> 
> Now we have a new type of constraint, I think of it as "granularity".
> In effect, we have a DMA system that can only do DMA in cacheline sized
> chunks.  Even when the IO size -- and thus the number of "bits on the
> wire" -- is less than the cacheline size, at the end of the DMA
> operation (which includes the software-assisted coherency operations)
> the number of bytes in memory that may be modified is the size of a
> cacheline.  This is because "the DMA system" is not just the engine that
> moves bytes around, it's the combination of hardware and software that
> work together to maintain cache coherency.

But this isn't new.  It is an alignment requirement, which carries with it an implicit size requirement.  If you enforce the alignment, and force all 'sub buffers' to have this alignment, you don't need the new thing.

> Ideally we'd find a way to communicate this new constraint using the
> existing mechanism, the busdma tag, and ideally we'd not have to change
> every existing call to bus_dma_tag_create() to add a new parm.  As I
> understand it, parent tags are now passed down through the newbus
> hierarchy consistantly, such that a tag at the nexus level could
> describe a platform requirement such as granularity, and all devices and
> the helper code they use will have access to that constraint via
> inheritance from ancestors' tags.  Maybe we could have a new flavor of
> bus_dma_tag_create() that takes a struct of parms, and existing calls
> wouldn't have to be changed.

Wouldn't a simpler solution be to just make this alignment requirement be part of the global parent tag on these platforms and to make sure all drivers on those platforms use it and don't cop-out and pass NULL?

> Communicating the constraint is only part of the problem; it also has to
> be easy for drivers to work with that constraint, especially drivers
> that are not targeted specifically at platforms with granular DMA.  I
> think we can achieve a huge chunk of that purely within the arm/mips
> implementation of bus_dmamem_alloc(), but even so there would be a new
> conceptual limitation on using that routine: it is specifically for
> allocating DMA buffers, and that means that there will be a new a rule
> that the CPU cannot access any memory within that buffer while an IO
> operation is in progress.

I don't think we should pander to drivers that don't know how to do DMA properly.  We get it almost right in bus_dma now.  However, going from almost right to completely right is hard and we keep uncovering edge cases that bite us.  Wouldn't it be better to eliminate all these weird edge cases?

> I'd also like to say there's a new rule that you cannot subdivide a
> buffer obtained from bus_dmamem_alloc() into multiple buffers, or into a
> combination of DMA and CPU-accessed data.  That would be bad news for
> the USB subsystem, and perhaps other drivers.  If this idea is either
> impossible or particularly contentious, then I guess we'd need some new
> helper routines so that a driver can subdivide the memory in a way that
> doesn't violate any constraints implied by the tag used to allocate the
> buffer.

When the USB subsystem went into the tree, this was one of the criticisms that was ignored.  It has come back to bite us time and time again.  Perhaps it is time to fix it once and for all.

> Not all IO occurs using buffers obtained from bus_dmamem_alloc(), and I
> doubt we can reasonably ever require that it be so.  

True, but the I/O that's not in memory from bus_dmamem_alloc is page aligned.

> I think the only
> hope we have of handling that problem is to bounce the requests that
> don't meet the granularity constraint, just as we'd have to do if the
> caller-supplied buffer fell into an exclusion zone or violated an
> alignment or boundary constraint.  When I've tossed this idea out in the
> past there was instant resistance.  Yeah, bounce buffers are massively
> inefficient, but my experience has been that most of the IO that isn't
> aligned and sized to a multiple of a cacheline is small IO (a few to a
> few dozen bytes).  I've never seen a case of page-sized or larger IO
> requests that required partial-cacheline handling.  I'm sure some
> examples exist, but they're probably more the exception than the rule.
> (And the bad performance you'd get from bouncing and copying massive
> bulk data flow would be a powerful incentive to track down the culprit
> and improve the driver.)

That's also the underlying idea in the bus_dma stuff.  You give the constraints, you get the buffers and if you have a buffer that's outside the constraints it gets bounced.  That's why the sync operations on on DMA items, not on cache line items.  While cache lines are one issue, memory placement can be another.  Floppy drives, for example, couldn't DMA past the first 16MB and if you have a buffer passed in that's outside of that, it bounces.  If this bouncing produces slower code, then the drivers should be updated to have better alignment.

The USB subsystem is making assumptions about the underlying cache mechanisms that aren't really true.  Ideally, we could get it to stop doing that.

Warner