Call for testing and review, busdma changes
jroberson at jroberson.net
Tue Dec 25 00:02:28 UTC 2012
On Mon, 24 Dec 2012, Ian Lepore wrote:
> On Mon, 2012-12-24 at 11:21 -1000, Jeff Roberson wrote:
>> On Mon, 24 Dec 2012, Ian Lepore wrote:
>>> On Sun, 2012-12-23 at 14:22 -1000, Jeff Roberson wrote:
>>>> On Sun, 9 Dec 2012, Ian Lepore wrote:
>>>>> On Sun, 2012-12-09 at 08:48 -1000, Jeff Roberson wrote:
>>>>>> On Sun, 9 Dec 2012, Ian Lepore wrote:
>>>>>>> On Sat, 2012-12-08 at 08:51 -1000, Jeff Roberson wrote:
>>>>>>> Testing on armv4/v5 platforms, the patches applied and compiled cleanly,
>>>>>>> but fault-abort with a NULL deref on what appears to be the first call
>>>>>>> to bus_dmamap_load(). I'll dig deeper into debugging that after
>>>>>>> browsing the new code so I can figure out where to start debugging.
>>>>>> Can you give me the file and line of the fault?
>>>>> Better than that, I can give you patches (attached). There were two
>>>>> little glitches... the allocated slist wasn't getting attached to the
>>>>> map, and when building the slist in _bus_dmamap_load_buffer() the slist
>>>>> needs to participate in the coalescing logic near the end of the loop.
>>>> I understand the first problem. The second, I'm not sure about. When
>>>> we're going through bounce pages we use virtual to virtual. Why do you
>>>> need to flush the cache lines for the source addresses? The device will
>>>> only do io to the bounce pages so they are the only that need other
>>> As I indicated in my following paragraph, I didn't think my fix for
>>> constructing the sync list was perfect, it was just good enough to allow
>>> more testing to progress. In particular I don't think it would be right
>>> if bouncing were required, but it never is on the arm platforms I have
>>> to test with.
>>> The logic in your patch didn't work in the normal non-bounce case. The
>>> main loop that does the mapping works on a single page at a time. For
>>> each page it decides whether to use a bounce page instead, then it
>>> either merges the page into the current segment if it can, or starts a
>>> new segment. Your patch has it create a new sync list entry for every
>>> non-bounced page, but the sync list is sized to maxsegs not maxpages.
>>> So your logic writes outside the bounds of the sync list array, and my
>>> logic includes the virtual ranges of pages that will be bounced in the
>>> sync list (which I think should be harmless anyway, just a bit
>>> inefficient). The logic is wrong in both cases, but since bouncing
>>> never happens on my hardware I knew my tweak would let me do more
>> Ok, so I need to add coalescing and perhaps fix the bounds checking for
>> virtual ranges. Correct? I can do that simply enough. I have some other
>> changes I will make concurrently and then get another build that I would
>> like you to test. Thank you for all the feedback so far.
> Okay, but did you see that Olivier committed my busdma changes recently?
> It makes for a pretty significant by-hand merge with your work. I've
> done that merge -- I think I sent it to you last week -- but what
> Olivier committed was a wee bit different than what I merged with then.
I'm going to move this patch further along before I re-merge. I want to
get buy in on the next set of API changes I'm going to make.
> I think the little patch I sent originally to do the sync list
> coalescing based on segments is probably good enough for now. It will
> result in syncing virtual ranges that are actually being bounced, but
> the arm code has always done that due to bugs (it skips the virtual
> range sync only if the entire buffer fits in one bounce page).
I will look at this next. mips has the same pattern and I want to make
sure it's correct there too.
> I just had a close look at the armv6 implementation of mapping and
> syncing for the first time, and... oh my. It's creating (malloc'ing!) a
> sync list entry for every individual page of every mapped buffer. I
> think it should be doing something more like my quickie-patch that bases
> the sync list entries on the segments since it has to sync l2 cache
> based on physical ranges for some chips.
If you look at my branch I made the two arm busdma implementations
identical in this regard.
>>> In the past, I've never paid close attention to the bounce logic, I've
>>> tended to read around it as if it weren't there. Recently I've begun to
>>> look at it and the more I look the more I think that to the degree that
>>> it works at all, it works by accident. I think mostly the logic just
>>> never gets used on the arm platforms that people are working with,
>>> because there'd be more problems reported if it were.
>> I think it is even rarely used on x86 anymore. The number of poorly
>> behaved DMA engines has been shrinking.
>>> For example... there is no handling of the PREREAD case when syncing
>>> bounce pages. A POSTREAD invalidate is insufficient; if any of the
>>> cache lines are dirty at the time the DMA starts, there are several ways
>>> those lines can get written back to physical memory while the DMA is in
>>> progress, corrupting the data in the IO buffer.
>>> Another example... there is an unordered list of bounce pages which are
>>> substituted for real pages as needed, one page at a time. No attempt is
>>> made to ensure that a run of contiguous physical pages that need to be
>>> bounced turns into a corresponding single run of contiguous pages in the
>>> bounce zone. In other words, it seems like over time as the list
>>> becomes less ordered the bounce logic would increasingly create
>>> multi-segment DMA. The reason that seems like a problem is that on
>>> every system I've checked (arm and i386) getting to multi-user mode
>>> creates about 100 dma tags, and typically only 5 or 6 of those tags
>>> allow more than one segment. I understand that this is not "incorrect
>>> operation" of the busdma code, but it does seem to be circumstantial
>>> evidence that this logic isn't being exercised much, since it appears
>>> that few drivers can actually handle multi-segment DMA.
>>> While the work I've done so far on arm busdma has concentrated on the
>>> buffer allocation and correct operation of the sync routines in the
>>> non-bounce cases, I think I'll be looking closely at the map load
>>> (currently seems a bit inefficient) and bounce-handling logic soon.
>> I'm shocked that we're sending I/O of larger than a page size to devices
>> that only support 1 sg entry. Rather than memcpying and bouncing these
>> drivers should be written to chop up the I/O themselves. For network
>> devices they should not advertise jumbo frame support and for disk devices
>> they can specify the maximum io size as one page already. The original
>> pages that we send down would almost never be contiguous until very
>> recently so this wouldn't have worked even without bounce pages.
>>>>> I'm not positive the attached quick-fix is perfect, but it allows one of
>>>>> my arm units here to boot and run and pass some basic IO smoke tests.
>>>>> I'll be trying it on other arm platforms later today.
>>> The fact that a deferred load can't be easily implemented in the case of
>>> mbufs and uio might be why the manpage says the NOWAIT flag is implied
>>> in those cases. There's no reason why the busdma code can't handle
>>> deferred loading for mbufs and uio -- as you say, it seems to only
>>> involve storing a bit of extra info for the callback. I think the hard
>>> part of the problem is elsewhere. For a transmit mbuf that needs a
>>> deferred mapping, does the driver's interface to the network stack allow
>>> for the concept of "this call has neither suceeded nor failed, but
>>> either outcome is possible, I'll let you know later" (I have no idea
>>> what the answer to that is but I can see how it could quickly become
>>> complex for drivers and other higher layers to deal with). For a
>>> deferred uio mapping, what if the userland pmap is no longer valid when
>>> the callback happens? What if userland has freed the memory? I suspect
>>> the implied NOWAIT requirement exists to wish away the need to handle
>>> such things.
>> Yes for uio the operation will have to be able to work on something other
>> than the current pmap. Many busdma implementations remember the pmap
>> because that could also be true when sync is called. Also actual uio
>> operations to user pages are probably not initiated by anything in tree.
>> Maybe some command interfaces for storage drivers. It is definitely an
>> edge case. I would rather not support it but I guess it does make such
>> things easier to implement and not require a copy.
>> Most storage drivers have the logic that you describe above. They watch
>> for EINPROGRESS. Then the callback can supply an error if the operation
>> fails later. I would like to make this more uniform because cam drivers
>> are about to support multiple different memory descriptors. It would be a
>> shame if they each had different semantics. So I need deferred load of
>> many types to work.
> Yeah, I've done some low-level storage driver stuff myself (mmc/sd) and
> I can see how easy the deferred load solutions are to implement in that
> sort of driver that's already structured to operate asychronously. I'm
> not very familiar with how network hardware drivers interface with the
> rest of the network stack. I have some idea, I'm just not sure of all
> the subtleties involved and whether there are any implications for
> something like a deferred load.
> This is one of those situations where I tend to say to myself... the
> folks who designed this stuff and imposed the "no deferred load"
> restriction on mbufs and uio but not other cases were not stupid or
> lazy, so they must have had some other reason. I'd want to know what
> that was before I went too far with trying to undo it.
In network drivers you just discard the mbuf. Networks are lossy. If you
persistently are out of bounce buffers you'll fail to transmit. The flags
were manually or'd in for these consumers to simplify the implementation.
I intend to change the underlying mechanism to support it but not the mbuf
>>> My original point was that the WAITOK flag has a value of zero. Your
>>> code gives the impression that it forces all callers to accept a
>>> deferred mapping, overriding whatever flag they may have passed, but in
>>> fact it doesn't do anything at all.
>> Yes I realized that. I just made a commit to eliminate that on my branch.
>> Thank you for pointing it out.
>>> Hmm, I just looked at non-arm implementations more closely and I see
>>> that some of them have the same logic as your patches, they OR-in a zero
>>> in an apparent attempt to override the caller's flags. That's contrary
>>> to what the manpage says and IMO contrary to common sense -- if a caller
>>> has said that it is unable to handle a deferred mapping callback, it's
>>> unlikely that anything is going to work correctly if a deferred callback
>>> happens (in fact, it sounds like a crash or panic waiting to happen). I
>>> think the only reason the existing code hasn't caused problems is
>>> because the flag value is zero and it actually does nothing except
>>> mislead when you read it.
>> Yes, they all seem to respect NOWAIT as well. So oring it in did nothing.
>>>> By the way I have an mbuf branch that pushes data and mbuf structure into
>>>> separate cachelines. It's not only faster on arm and the like but on x86
>>>> as well. So someone should pursue this and you'd have way fewer partial
>>>> line flushes.
>>>> You can skip the [mbuf] sync because you know the information in the mbuf header
>>>> is not necessary?
>>> Not exactly. In the general case, if the buffer isn't aligned and
>>> padded to a cacheline boundary, we don't know anything about the memory
>>> before and after the buffer which partially shares a cacheline we're
>>> about to operate on, so the existing code attempts to preserve the
>>> contents of that unknown memory across the sync operation. (It cannot
>>> possibly be reliably sucessful in doing so, but that's another story.)
>>> In the case of an mbuf, where the data area following the header is
>>> always misaligned to a cacheline, we do know something about the memory
>>> before the data buffer: we know that it's an mbuf header which will not
>>> be accessed by the cpu while the dma is in progress. Likewise if the
>>> length isn't a multiple of the line size we know the data after the
>>> mapped length is still part of a buffer which is allocated to a multiple
>>> of the line size and the cpu won't touch that memory during the dma.
>>> Using that knowledge we can handle a PREREAD or PREWRITE by doing a
>>> wbinv on the cacheline (this is no different than the general case), and
>>> then we can handle the POSTREAD as if the buffer were aligned. We don't
>>> have to attempt to preserve partial cachelines before/after the buffer
>>> because we know the cpu hasn't touched that memory during the dma, so no
>>> cache line load could have happened.
>>> In the changes you mentioned, how do you ensure the mbuf header and data
>>> end up in separate cache lines without compile-time knowledge of cache
>>> line size?
>>> Even if the embedded data buffer in an mbuf becomes cache-aligned, the
>>> special mbuf awareness will still be needed to handle the fact that the
>>> mapped data length may not be a multiple of the cache line size. We'll
>>> still need the special knowledge that it's an mbuf so that we can sync
>>> the entire last cacheline without attempting to preserve the part beyond
>>> the end of the dma buffer. (If this idea makes you vaguely uneasy,
>>> you're not alone. However, I've been assured by Scott and Warner that
>>> it's always safe to treat mbufs this way.)
>> I understand now. It is safe in the usual cases but not in the general
>> case. The mbuf API allows for 'ext bufs' which have reference to
>> arbitrary external memory. There are no guarantees about the alignment of
>> this memory or what comes before or after it. It could really be
>> anything. In practice I think you'll be safe with all code that is in
>> tree but I'm not 100% sure.
> The ext bufs are exactly the part that makes me vaguely uneasy, but I
> figured I'd run with what I was told until it was proven unworkable.
>>> Still unresolved is what to do about the remaining cases -- attempts to
>>> do dma in arbitrary buffers not obtained from bus_dmamem_alloc() which
>>> are not aligned and padded appropriately. There was some discussion a
>>> while back, but no clear resolution. I decided not to get bogged down
>>> by that fact and to fix the mbuf and allocated-buffer situations that we
>>> know how to deal with for now.
>> It sounds like you're hitting the common cases. It seems ok to keep the
>> partial flush logic for the less common cases and deal with the
>> performance penalty.
> I've got to keep stressing... it's not a performance penalty thing, it's
> a correct operation thing. Partial cacheline flush logic cannot be
> written in a way that's guaranteed to always give correct results,
> because the software can't prevent the hardware from doing an eviction
> and writing back stale data even while the code that attempts to
> preserve the data is running. Some day when all the other code is
> right, the place where we currently detect a partial flush and try to
> preserve the surrounding data will be a panic call. Until then we'll
> keep relying on the partial flush usually working right by accident.
I understand now. If you're sharing the line with someone who is active
you can't do it safely just by flushing and cleaning before the DMA.
They could re-dirty the page. I was just thinking about the sync
operation and existing dirty data. Technically you should bounce in this
case as well since it can't be made safe. Not bouncing for mbufs is a
nice optimization I see.
> But yeah, taking care of the common cases first was my priority.
> -- Ian
More information about the freebsd-sparc64