busdma dflt_lock on amd64 > 4 GB

Scott Long scottl at samsco.org
Wed Oct 26 09:29:30 PDT 2005

Jacques Caron wrote:

> Hi Scott,
> Thanks for the input. I'm utterly lost in unknown terrain, but I'm 
> trying to understand...
> At 16:09 26/10/2005, Scott Long wrote:
>> So, the panic is doing exactly what it is supposed to do.  It's guarding
>> against bugs in the driver.  The workaround for this is to use the 
> NOWAIT flag in all instances of bus_dmamap_load() where deferrals can
>> happen.
> As pointed out by Soren, this is not documented in man bus_dma :-/ It 
> says bus_dmamap_load flags are supposed to be 0, and BUS_DMA_ALLOCNOW 
> should be set at tag creation to avoid EINPROGRESS. I'm not sure the two 
> would actually be equivalent, either.

They are not.  The point of the ALLOCNOW flag is to allocate bounce
resources up front, at tag creation, so that they don't have to be
malloced later when maps are created.  It's not the solution to any
problem, just a shortcut.  It's really only useful for drivers that
allocate maps on the fly instead of pre-allocating them.

> And from what I understand, even a 
> call to bus_dma_tag_create with BUS_DMA_ALLOCNOW can be successful but 
> not actually allocate what will be needed later (see below).

Like I replied to Soeren, each bounce zone is guaranteed to have enough
pages for one transaction by one consumer.  Allocating more maps can
increase the size of the pool, but allocating more tags cannot.  Again,
this is to guard against over-allocation.  busdma doesn't know whether
a tag will be used for static buffers or dynamic buffers, and static 
buffers tend to be large and not require a map or bouncing.  It used to
be that bus_dma_tag_create() would always increase the page allocation 
in the zone, but then we got into problems with drivers wanting large
static allocations and fooling busdma into exhausting physical memory by
allocating too many bounce pages, none of which were needed.

Another approach that I've been considering is adding a
BUS_DMA_STATICMAP flag to bus_dma_tag_create() that tells it not to
allocate bounce pages and not to allow deferrals.  Then the bounce page
limit heuristics can be removed from tags that don't have that flag,
and the code will be simpler and more predictable.  But, since any time
you touch busdma you have to consider many dozens of drivers, it's not
something that I'm ready to do without more thought.

>>   This, however, means that using bounce pages still remains fragile 
>> and that the driver is still likely to return ENOMEM to the upper 
>> layers.  C'est la vie, I guess.  At one time I had patches that
>> made ATA use the busdma API correctly (it is one of the few remaining
>> that does not), but they rotted over time.
> So what would be the "correct" way? Move the part that's after the DMA 
> setup in the callback? I suppose there are limitations as to what can 
> happen in the callback, though, so it would complicate things quite a bit.
> Obviously, a lockfunc would be needed in this situation, right?

I sent a long email on this on Dec 14, 2004.  I'll pull it up and 
forward it out.  What I really should do is publish a definitive article
on the whole topic.

As for 'limitations as to what can happen in the callback', there are
none if you use the correct code structure.

> Also, I believe many other drivers just have lots of BUS_DMA_ALLOCNOW or 
> BUS_DMA_NOWAIT all over the place, I'm not sure that's the "correct" 
> way, is it?

Most network drivers use these because they prefer to handle the ENOMEM
case rather than handle the possibility of out-of-order packets caused
by deferrals (though this is really not possible; busdma guards against
it).  The network stack is designed to handle loss both on the
transmitting end as well as the receiving end, unlike the storage layer.
Keep in mind that this discussion started with talking about ATA =-)

>> No.  Some tags specifically should not permit deferals.
> How do they do that? Setting BUS_DMA_ALLOCNOW in the tag, or 
> BUS_DMA_NOWAIT in the map_load, or both, or something else?

They set it by using NULL as the lockfunc.

> What should 
> make one decide when deferrals should not be permitted?

Static allocations should never require bouncing, and thus should
never have a deferral.  The assertion is there to make sure that a
driver doesn't accidentally try to use a tag created for static
buffers for dynamic buffers.

> It is my 
> impression that quite a few drivers happily decide they don't like 
> deferrals at all whatever happens...

Again, these are mostly network drivers, and the network stack is
designed to reliably handle this.  The storage stack tries a little
bit to handle it, but it's not reliable.  Nor should it have to
handle it; direct I/O _must_always_succeed_.  What if you're out of
RAM and the VM system tries to write some pages to swap in order to
free up RAM, but those writes fail with ENOMEM?  Again, FreeBSD has
shown excellent handling of high memory pressure situations over the
years where other OSes die horribly.  This is one of the reasons why.

>> Just about every other modern driver honors the API correctly.
> Depends what you mean by "correctly". I'm not sure using BUS_DMA_NOWAIT 
> is the right way to go as it fails if there is contention for bounce 
> buffers.
>> Bounce pages cannot be reclaimed to the system, so overallocating just
>> wastes memory.
> I'm not talking about over-allocating, but rather allocating what is 
> needed: I don't understand why bus_dma_tag_create limits the total 
> number of bounce pages in a bounce zone to maxsize if BUS_DMA_ALLOCNOW 
> is set (which prevents bus_dmamap_create from allocating any further 
> bounce pages as long as there's only one map per tag, which seems pretty 
> common), while bus_dmamap_create will allocate maxsize additional pages 
> if BUS_DMA_ALLOCNOW was not set.

Actually, one map per tag is not common.  If the ATA driver supported
tagged queuing (which I assume that it will someday for SATAII, yes?)
then it would have multiple maps.  Just about every other modern block
driver supports multiple concurrent transactions and thus multiple maps.

> The end result is that the ata driver is limited to 32 bounce pages 
> whatever the number of instances (I guess that's channels, or disks?), 
> while other drivers get hundreds of bounce pages which they hardly use. 
> Maybe this is intended and it's just the way the ata driver uses tags 
> and maps that is wrong, maybe it's the busdma logic that is wrong, I 
> don't know...

If a map is being created for every drive in the system, and the result
is that not enough bounce pages are being reserved for all three drives
to operate concurrently, then there might be a bug in busdma.  We should
discuss this offline.

>>   The whole point of the deferral mechanism is to allow
>> you to allocate enough pages for a normal load while also being able to
>> handle sporadic spikes in load (like when the syncer runs) without
>> trapping memory.
> In this case 32 bounce pages (out of 8 GB RAM) for 6 disks seems like a 
> very tight bottleneck to me.

If that's all that is needed to saturate non-tagged ATA, then there is
nothing wrong with that.  But once tagged queuing comes into the
picture, more resources will need to be reserved of course.  This should
all just work, since it works for other drivers, but I'm happy to help
investigate bugs.


More information about the freebsd-amd64 mailing list