Call for testing and review, busdma changes

Mon Dec 24 20:43:14 UTC 2012

On Sun, 2012-12-23 at 14:22 -1000, Jeff Roberson wrote:
> On Sun, 9 Dec 2012, Ian Lepore wrote:
> 
> > On Sun, 2012-12-09 at 08:48 -1000, Jeff Roberson wrote:
> >>
> >> On Sun, 9 Dec 2012, Ian Lepore wrote:
> >>
> >>> On Sat, 2012-12-08 at 08:51 -1000, Jeff Roberson wrote:
> >>>> Hello,
> >>>>
> >>>> http://people.freebsd.org/~jeff/physbio.diff
> >>>>
 [...]
> >>>
> >>> Testing on armv4/v5 platforms, the patches applied and compiled cleanly,
> >>> but fault-abort with a NULL deref on what appears to be the first call
> >>> to bus_dmamap_load().  I'll dig deeper into debugging that after
> >>> browsing the new code so I can figure out where to start debugging.
> >>
> >> Can you give me the file and line of the fault?
> >>
> >
> > Better than that, I can give you patches (attached).  There were two
> > little glitches... the allocated slist wasn't getting attached to the
> > map, and when building the slist in _bus_dmamap_load_buffer() the slist
> > needs to participate in the coalescing logic near the end of the loop.
> 
> I understand the first problem.  The second, I'm not sure about.  When 
> we're going through bounce pages we use virtual to virtual.  Why do you 
> need to flush the cache lines for the source addresses?  The device will 
> only do io to the bounce pages so they are the only that need other 
> synchronization.
> 

As I indicated in my following paragraph, I didn't think my fix for
constructing the sync list was perfect, it was just good enough to allow
more testing to progress.  In particular I don't think it would be right
if bouncing were required, but it never is on the arm platforms I have
to test with.

The logic in your patch didn't work in the normal non-bounce case.  The
main loop that does the mapping works on a single page at a time.  For
each page it decides whether to use a bounce page instead, then it
either merges the page into the current segment if it can, or starts a
new segment.  Your patch has it create a new sync list entry for every
non-bounced page, but the sync list is sized to maxsegs not maxpages.

So your logic writes outside the bounds of the sync list array, and my
logic includes the virtual ranges of pages that will be bounced in the
sync list (which I think should be harmless anyway, just a bit
inefficient).  The logic is wrong in both cases, but since bouncing
never happens on my hardware I knew my tweak would let me do more
testing.

In the past, I've never paid close attention to the bounce logic, I've
tended to read around it as if it weren't there.  Recently I've begun to
look at it and the more I look the more I think that to the degree that
it works at all, it works by accident.  I think mostly the logic just
never gets used on the arm platforms that people are working with,
because there'd be more problems reported if it were.

For example... there is no handling of the PREREAD case when syncing
bounce pages.  A POSTREAD invalidate is insufficient; if any of the
cache lines are dirty at the time the DMA starts, there are several ways
those lines can get written back to physical memory while the DMA is in
progress, corrupting the data in the IO buffer.

Another example... there is an unordered list of bounce pages which are
substituted for real pages as needed, one page at a time.  No attempt is
made to ensure that a run of contiguous physical pages that need to be
bounced turns into a corresponding single run of contiguous pages in the
bounce zone.  In other words, it seems like over time as the list
becomes less ordered the bounce logic would increasingly create
multi-segment DMA.  The reason that seems like a problem is that on
every system I've checked (arm and i386) getting to multi-user mode
creates about 100 dma tags, and typically only 5 or 6 of those tags
allow more than one segment.  I understand that this is not "incorrect
operation" of the busdma code, but it does seem to be circumstantial
evidence that this logic isn't being exercised much, since it appears
that few drivers can actually handle multi-segment DMA.

While the work I've done so far on arm busdma has concentrated on the
buffer allocation and correct operation of the sync routines in the
non-bounce cases, I think I'll be looking closely at the map load
(currently seems a bit inefficient) and bounce-handling logic soon.

> >
> > I'm not positive the attached quick-fix is perfect, but it allows one of
> > my arm units here to boot and run and pass some basic IO smoke tests.
> > I'll be trying it on other arm platforms later today.
> >
> >>>
> >>> Just from an initial quick browsing of the new code for armv4, I notice
> >>> these things...
> >>>
> >>> The new routine __bus_dmamap_mayblock() does (*flags) |= BUS_DMA_WAITOK;
> >>> which is a no-op because BUS_DMA_WAITOK is zero.  I'm not sure what the
> >>> intent was, but I think any modification of the WAIT-related flags would
> >>> be conceptually wrong because it overrides the caller's instructions on
> >>> whether or not to do a deferred load when resources aren't available.
> >>
> >> The intent of the original load function seems to be that it is always
> >> blocking.  Most architectures force the flag.  The mbuf and uio load
> >> routines don't support blocking at all because the callback lacks the
> >> information needed to restart the operation.
> >>
> >
> > The manpage states that bus_dmamap_load() never blocks for any reason.
> > The mbuf and uio flavors say they are variations of bus_dmapmap_load();
> > I take the implication of that to mean that they also are defined as
> > never blocking.
> >
> > The docs then say that WAITOK vs. NOWAIT control the scheduling of a
> > deferred load, and that NOWAIT is forced in the mbuf and uio cases.
> > Your refactored code already handles that by adding NOWAIT to the
> > caller-specified flags for mbufs and uio.
> >
> > So the code will do the right thing now (with your patches), but only
> > because (*flags) |= BUS_DMA_WAITOK is a no-op.
> 
> This code is problematic.  We can only support WAITOK operations for 
> bus_dmamap_load() because the callbacks from the swi when bounce pages are 
> available can only remember one pointer.  To fix this for real we'd have 
> to remember the original type and call the appropriate load function 
> again.  This could be solved by having a { type, pointer, length } 
> structure that is passed between the front and back-end.  This would solve 
> your mbuf problem as well.  I think I will do this next.

I'm lost here.  I don't understand your point about only being able to
remember one pointer.  We only ever need to remember one pointer,
because a map can only be associated with a single buffer at a time.

The fact that a deferred load can't be easily implemented in the case of
mbufs and uio might be why the manpage says the NOWAIT flag is implied
in those cases.  There's no reason why the busdma code can't handle
deferred loading for mbufs and uio -- as you say, it seems to only
involve storing a bit of extra info for the callback.  I think the hard
part of the problem is elsewhere.  For a transmit mbuf that needs a
deferred mapping, does the driver's interface to the network stack allow
for the concept of "this call has neither suceeded nor failed, but
either outcome is possible, I'll let you know later" (I have no idea
what the answer to that is but I can see how it could quickly become
complex for drivers and other higher layers to deal with).  For a
deferred uio mapping, what if the userland pmap is no longer valid when
the callback happens?  What if userland has freed the memory?  I suspect
the implied NOWAIT requirement exists to wish away the need to handle
such things.

My original point was that the WAITOK flag has a value of zero.  Your
code gives the impression that it forces all callers to accept a
deferred mapping, overriding whatever flag they may have passed, but in
fact it doesn't do anything at all.

Hmm, I just looked at non-arm implementations more closely and I see
that some of them have the same logic as your patches, they OR-in a zero
in an apparent attempt to override the caller's flags.  That's contrary
to what the manpage says and IMO contrary to common sense -- if a caller
has said that it is unable to handle a deferred mapping callback, it's
unlikely that anything is going to work correctly if a deferred callback
happens (in fact, it sounds like a crash or panic waiting to happen).  I
think the only reason the existing code hasn't caused problems is
because the flag value is zero and it actually does nothing except
mislead when you read it.

> By the way I have an mbuf branch that pushes data and mbuf structure into 
> separate cachelines.  It's not only faster on arm and the like but on x86 
> as well.  So someone should pursue this and you'd have way fewer partial 
> line flushes.
> 
[...]
> You can skip the [mbuf] sync because you know the information in the mbuf header 
> is not necessary?
> 

Not exactly.  In the general case, if the buffer isn't aligned and
padded to a cacheline boundary, we don't know anything about the memory
before and after the buffer which partially shares a cacheline we're
about to operate on, so the existing code attempts to preserve the
contents of that unknown memory across the sync operation.  (It cannot
possibly be reliably sucessful in doing so, but that's another story.)

In the case of an mbuf, where the data area following the header is
always misaligned to a cacheline, we do know something about the memory
before the data buffer: we know that it's an mbuf header which will not
be accessed by the cpu while the dma is in progress.  Likewise if the
length isn't a multiple of the line size we know the data after the
mapped length is still part of a buffer which is allocated to a multiple
of the line size and the cpu won't touch that memory during the dma.

Using that knowledge we can handle a PREREAD or PREWRITE by doing a
wbinv on the cacheline (this is no different than the general case), and
then we can handle the POSTREAD as if the buffer were aligned.  We don't
have to attempt to preserve partial cachelines before/after the buffer
because we know the cpu hasn't touched that memory during the dma, so no
cache line load could have happened.

In the changes you mentioned, how do you ensure the mbuf header and data
end up in separate cache lines without compile-time knowledge of cache
line size?

Even if the embedded data buffer in an mbuf becomes cache-aligned, the
special mbuf awareness will still be needed to handle the fact that the
mapped data length may not be a multiple of the cache line size.  We'll
still need the special knowledge that it's an mbuf so that we can sync
the entire last cacheline without attempting to preserve the part beyond
the end of the dma buffer.  (If this idea makes you vaguely uneasy,
you're not alone.  However, I've been assured by Scott and Warner that
it's always safe to treat mbufs this way.)

> >
> > I have patches to fix the mbuf partial sync and other things related to
> > partial sync problems, and my next step will be to try to rework them to
> > fit in with what you've done.  Based on what I've seen so far in your
> > refactoring, I think just passing a flag to say it's an mbuf, from the
> > MI to the MD mapping routine, will provide enough info.
> 
> I think I described a better approach above that will solve mutliple 
> problems.  Let me know what you think.
> 
> Jeff

In the long run we can't just reduce the incidence of partial cacheline
flushes, they have to be eliminated completely.  We can eliminate them
for mbufs because of arbitrary rules for mbufs that are supposedly
universally followed in the existing code already.

We can also eliminate them for any buffers which are allocated by
bus_dmamem_alloc(), basically for reasons similar to mbufs: we can
ensure that allocated buffers are always aligned and padded
appropriately, and establish a rule that says you must not allow any
other access to memory in the allocated buffer during dma.  Most
especially that means you can't allocate a big buffer and sub-divide it
in such a way that there is concurrent cpu and dma access, or multiple
concurrent dma operations, within a single buffer returned by
bus_dmamem_alloc().

Still unresolved is what to do about the remaining cases -- attempts to
do dma in arbitrary buffers not obtained from bus_dmamem_alloc() which
are not aligned and padded appropriately.  There was some discussion a
while back, but no clear resolution.  I decided not to get bogged down
by that fact and to fix the mbuf and allocated-buffer situations that we
know how to deal with for now.

-- Ian