realtek performance (was Re: good ATI chipset results)

Thu Oct 13 13:23:07 PDT 2005

On Thursday 13 October 2005 03:46 pm, Scott Long wrote:
> John Baldwin wrote:
> > On Thursday 13 October 2005 01:07 pm, Sean McNeil wrote:
> >>On Thu, 2005-10-13 at 11:49 -0400, John Baldwin wrote:
> >>>On Thursday 13 October 2005 11:13 am, Sean McNeil wrote:
> >>>>On Thu, 2005-10-13 at 09:17 -0400, Mike Tancsa wrote:
> >>>>>Haven't really seen anyone else use this board, but I have had good
> >>>>>luck with it so far.
> >>>>>
> >>>>>http://www.ecs.com.tw/ECSWeb/Products/ProductsDetail.aspx?DetailID=506&MenuID=90&LanID=0
> >>>>>
> >>>>>It's a micro ATX form factor with built-in video, and the onboard NIC
> >>>>>is a Realtek.  (Although it's not the fastest NIC, its driver is stable
> >>>>>and mature, especially compared to the headaches people seem to have
> >>>>>with the NVIDIA NICs.)
> >>>>
> >>>>Is this the RealTek 8169S Single-chip Gigabit Ethernet?
> >>>>
> >>>>For those interested, here are some changes I always use to increase
> >>>>the performance of the above NIC.  With these mods, I can stream over
> >>>>20 MBps video multicast and do other stuff over the network without
> >>>>issues. Without the changes, xmit is horrible with severe UDP packet
> >>>>loss.
> >>>
> >>>So, I see two changes.  One is to up the number of descriptors from 32
> >>> rx and 64 tx to 64 rx and 64 tx on some models and 1024 rx and 1024 tx
> >>> on other models.  The other thing is that you seem to pessimize TX
> >>> performance by always forcing the send packets to be coalesced into one
> >>> mbuf (which requires doing an alloc and then copying all of the data)
> >>> instead of making use of scatter/gather for sending packets.  Do you
> >>> need both changes, or do the higher descriptor counts alone make the
> >>> difference?
> >>
> >>Actually, I've found that the higher descriptor counts do not make a
> >>noticeable difference.  The only thing that mattered was to eliminate
> >>the scatter/gather of sending packets.  I can't remember why I left the
> >>descriptor increase in there.  I think it was to get the best use out of
> >>the hardware.
> >
> > Hmm, odd.  Scott, do you have any ideas why m_defrag() plus one
> > descriptor would be faster than s/g dma for re(4)?
>
> There are two things that I would consider.  First is that
> bus_dmamap_load_mbuf_sg() should be used, as that cuts out some
> indirection (and thus latency) in the code.  Second is that not all DMA
> engines are created equal, and I honestly wouldn't expect a whole lot
> out of Realtek given the price point of this chip.  It might be
> optimized for operating on only a single S/G element, for example.
> Maybe it's really slow at pre-fetching s/g elements, or maybe it has
> some sort of a stall after each DMA segment transfer while it restarts
> a state machine.  I've seen evidence in other hardware that only one
> S/G element should be used even though there are slots for 2 (or 3 in
> the case of 9k jumbo frames).
>
> One thing to keep in mind is the difference in the driver models
> between Windows and BSD that Bill Paul talked about the other day.  In
> the Windows world, the driver owns the network packet memory, whereas
> in BSD the stack owns it (in the form of mbufs).  This means that a
> Windows driver can pre-allocate a contiguous slab and populate the
> descriptor rings with it without ever having to worry about s/g
> fragmentation, while in BSD fragmentation is a fact of life.  So it's
> likely yet another case of hardware being optimized for certain
> characteristics of Windows at the expense of other operating systems.

Ok.  Sean, do you think you can trim the patch down to just the m_defrag() 
changes and test that to make sure that is all that is needed?
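
For anyone following along, here is a rough userspace sketch of the
trade-off being discussed.  It is not the re(4) code and does not use
the real mbuf or bus_dma APIs; struct frag and the frag_*() helpers are
made-up names standing in for an mbuf chain.  The scatter/gather path
costs one descriptor per fragment, while the coalescing path (what
m_defrag() amounts to for this purpose) costs one allocation plus a
copy of the whole packet and then needs only a single descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an mbuf chain: one packet in several pieces. */
struct frag {
	const unsigned char *data;	/* payload bytes of this piece */
	size_t len;			/* number of valid bytes in data */
	struct frag *next;		/* next piece, or NULL at the end */
};

/* Scatter/gather path: one DMA descriptor is needed per fragment. */
size_t
frag_count(const struct frag *f)
{
	size_t n = 0;

	for (; f != NULL; f = f->next)
		n++;
	return (n);
}

/* Total payload length of the chain. */
size_t
frag_total_len(const struct frag *f)
{
	size_t total = 0;

	for (; f != NULL; f = f->next)
		total += f->len;
	return (total);
}

/*
 * Coalescing path: allocate one contiguous buffer and copy every
 * fragment into it, so the hardware sees a single segment.  Returns
 * NULL for an empty chain or on allocation failure; the caller frees
 * the result.
 */
unsigned char *
frag_coalesce(const struct frag *f, size_t *out_len)
{
	size_t total = frag_total_len(f);
	size_t off = 0;
	unsigned char *buf;

	if (total == 0)
		return (NULL);
	buf = malloc(total);
	if (buf == NULL)
		return (NULL);
	for (; f != NULL; f = f->next) {
		memcpy(buf + off, f->data, f->len);
		off += f->len;
	}
	assert(off == total);
	*out_len = total;
	return (buf);
}
```

On a three-fragment chain, frag_count() reports three descriptors for
the s/g path, while frag_coalesce() hands back one buffer at the price
of the malloc and memcpy.  Which is cheaper on a given chip is exactly
the open question here.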

-- 
John Baldwin <jhb at FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org