Getting/Forcing Greater than 4KB Buffer Allocations

David Christensen davidch at broadcom.com
Thu Jul 19 17:36:04 UTC 2007


> I'm trying to catch up on this thread, but I'm utterly confused as to
> what you're looking for.  Let's try talking through a few scenarios
> here:

My goal is simple.  I've modified my driver to support up to 8 segments
in an mbuf and I want to verify that it works correctly.  It's simple to
test when every mbuf has the same number of segments, but I want to make
sure my code is robust enough to support cases where one mbuf is made of
3 segments while the next is made of 5 segments.  The best case would be
to get a distribution of sizes from the min to the max (i.e., 1 to 8).
I'm not trying to test for performance, only for proper operation under
a worst case load.
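
Something like the untested sketch below is the kind of helper I have in
mind for forcing a particular segment count; it simply chains plain mbufs
together, so each segment has to fit in MLEN (MHLEN for the first mbuf),
and the names are mine, not from the driver:

    /*
     * Untested sketch: build an mbuf chain with exactly "nsegs" separate
     * buffers so the multi-segment TX path gets exercised.  Uses the plain
     * mbuf allocator, so "seglen" must fit in a plain mbuf.
     */
    static struct mbuf *
    make_test_chain(int nsegs, int seglen)
    {
            struct mbuf *m, *tail, *top;
            int i;

            top = m_gethdr(M_DONTWAIT, MT_DATA);
            if (top == NULL)
                    return (NULL);
            top->m_len = seglen;
            top->m_pkthdr.len = seglen;
            tail = top;

            for (i = 1; i < nsegs; i++) {
                    m = m_get(M_DONTWAIT, MT_DATA);
                    if (m == NULL) {
                            m_freem(top);
                            return (NULL);
                    }
                    m->m_len = seglen;
                    tail->m_next = m;
                    tail = m;
                    top->m_pkthdr.len += seglen;
            }
            return (top);
    }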

> 
> 1. Your hardware has slots for 3 SG elements, and all three MUST be
> filled without exception.  Therefore, you want segments that are 4k,
> 4k, and 1k (or some slight variation of that if the buffer is
> misaligned).  To do this, set the maxsegs to 3 and the maxsegsize to
> 4k.  This will ensure that busdma does no coalescing (more on this
> topic later) and will always give you 3 segments for 9k of contiguous
> buffers.  If the actual buffer winds up being <= 8k, busdma won't
> guarantee that you'll get 3 segments, and you'll have to fake
> something up in your driver.  If the buffer winds up being a
> fragmented mbuf chain, it also won't guarantee that you'll get 3
> segments either, but that's already handled now via m_defrag().
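
(If I understand correctly, that would amount to a tag created roughly
like the sketch below, with nsegments = 3 and maxsegsz = 4k; the parent
tag and the rx_mbuf_tag softc field are just placeholders.)

    error = bus_dma_tag_create(
        bus_get_dma_tag(dev),       /* parent tag (placeholder) */
        1, 0,                       /* alignment, boundary */
        BUS_SPACE_MAXADDR,          /* lowaddr */
        BUS_SPACE_MAXADDR,          /* highaddr */
        NULL, NULL,                 /* filter, filterarg */
        MJUM9BYTES,                 /* maxsize: 9k jumbo frame */
        3,                          /* nsegments */
        PAGE_SIZE,                  /* maxsegsz: 4k */
        0,                          /* flags */
        NULL, NULL,                 /* lockfunc, lockarg */
        &sc->rx_mbuf_tag);          /* placeholder softc field */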

My hardware supports multiples of 255 buffer descriptors (255, 510,
765, etc.).  If all mbufs have 1 segment (common for a 1500 MTU) then
I can handle multiples of 255 mbufs.  If all mbufs have 3 segments
(common for a 9000 MTU) then I can handle multiples of 85 mbufs.  If
the mbufs have a varying number of segments (anywhere from 1 to 8)
then a varying number of mbufs can be buffered.  This last case is
the most complicated to handle and I want to make sure my code is
robust enough to handle it.  I've found that reducing the system
memory from 8GB to 2GB allows me to see both 2-segment and 3-segment
mbufs (the former I assume occurs because of coalescing), but I
haven't been able to load the system in a way that causes any other
number of segments to occur.
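
One way to see the actual distribution is to tally the segment count in
the TX path along the lines of the untested sketch below (tx_nseg_hist
and MAX_SEGS are placeholder names) and dump the counters through a
sysctl:

    bus_dma_segment_t segs[MAX_SEGS];   /* MAX_SEGS: placeholder limit */
    int error, nsegs;

    error = bus_dmamap_load_mbuf_sg(sc->tx_mbuf_tag, map, m,
        segs, &nsegs, BUS_DMA_NOWAIT);
    if (error == 0 && nsegs >= 1 && nsegs <= MAX_SEGS)
            sc->tx_nseg_hist[nsegs - 1]++;  /* hypothetical counter array */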

> 
> 2. Your hardware can only handle 4k segments, but is less restrictive
> on the min/max number of segments.  The solution is the same as above.

No practical limit on the segment size.  Anything between 1 byte and 
9KB is fine.

> 
> 3. Your hardware has slots for 8 SG elements, and all 8 MUST be filled
> without exception.  There's no easy solution for this, as it's a
> fairly bizarre situation.  I'll only discuss it further if you confirm
> that it's actually the case here.

The number of SG elements used by an mbuf can vary anywhere from 1 to 8.
If the first mbuf uses 2 slots then there's no problem with the second
mbuf using 8 slots, and the third using 4 slots.  The only difficulty
comes in keeping the ring full, since the number of slots used won't
always match the number of slots available.  I think I can handle this
correctly, but it's difficult to test since right now every mbuf maps to
the same number of slots (which also happens to divide evenly into the
total number of slots available in the ring).
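
The bookkeeping I'm describing amounts to something like the untested
sketch below; the field and helper names (tx_free, tx_prod, fill_tx_bd())
are just placeholders for the driver's own ring state:

    int i;

    /* Refuse the chain if the ring can't hold all of its descriptors. */
    if (nsegs > sc->tx_free) {
            bus_dmamap_unload(sc->tx_mbuf_tag, map);
            return (ENOBUFS);
    }
    for (i = 0; i < nsegs; i++) {
            fill_tx_bd(sc, sc->tx_prod, &segs[i]);  /* hypothetical helper */
            sc->tx_prod = (sc->tx_prod + 1) % sc->tx_ring_size;
    }
    sc->tx_free -= nsegs;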

> 
> As for coalescing segments, I'm considering a new busdma back-end that
> greatly streamlines loads by eliminating cycle-consuming tasks like
> segment coalescing.  The original justification for coalescing was that
> DMA engines operated faster with fewer segments.  That might still be
> true, but the extra host CPU cycles and cache-line misses probably
> result in a net loss.  I'm also going to axe bounce-buffer support
> since it bloats the I-cache.  The target for this new back-end is
> drivers that support hardware that doesn't need these services and
> that are also sensitive to the number of host CPU cycles being
> consumed, i.e. modern 1Gb and 10Gb adapters.  The question I have is
> whether this new back-end should be accessible directly through yet
> another bus_dmamap_load_foo variant that the drivers need to know
> specifically about, or indirectly and automatically via the existing
> bus_dmamap_load_foo variants.  The tradeoff is further API pollution
> vs. the opportunity for even more efficiency through no indirect
> function calls and no cache misses from accessing the busdma tag.  I
> don't like API pollution since it makes it harder to maintain code,
> but the opportunity for the best performance possible is also
> appealing.

Others have reported that single, larger segments provide better
performance than multiple, smaller segments.  (Kip Macy recently
forwarded me a patch to test which shows a performance improvement
on the cxgb adapter when this is used.)  I haven't done enough
performance testing on bce to know whether this helps, hurts, or makes
no overall difference.  One thing I am interested in is finding a way
to allocate receive mbufs such that I can split the header into a
single buffer and then place the data into one or more page-aligned
buffers, similar to what a transmit mbuf looks like.  Any way to
support that in the current architecture?
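
Roughly, I'd like each receive buffer to end up looking like the
untested sketch below (m_getjcl() with MJUMPAGESIZE for a page-aligned
payload cluster); the open question is how to get the receive path to
fill it that way:

    struct mbuf *m_hdr, *m_pay;

    /* Small mbuf for the protocol headers. */
    m_hdr = m_gethdr(M_DONTWAIT, MT_DATA);
    /* Page-sized cluster for the payload; MJUMPAGESIZE is page aligned. */
    m_pay = m_getjcl(M_DONTWAIT, MT_DATA, 0, MJUMPAGESIZE);
    if (m_hdr == NULL || m_pay == NULL) {
            if (m_hdr != NULL)
                    m_freem(m_hdr);
            if (m_pay != NULL)
                    m_freem(m_pay);
            return (ENOBUFS);
    }
    m_hdr->m_next = m_pay;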

Dave


