Polling for ath driver

Sam Leffler sam at errno.com
Mon Feb 6 09:15:12 PST 2006


Nate Nielsen wrote:
> Sam Leffler wrote:
> 
>>Nate Nielsen wrote:
>>
>>>Adding polling to this driver does increase performance on embedded
>>>systems. With my current patch (on a 233MHz system), the throughput (in
>>>this case a simple TCP stream) goes up by ~6Mbit/s, from 18Mbit/s to
>>>24Mbit/s.
>>
>>I routinely get >20 Mb/s for a single client running upstream TCP
>>netperf through a soekris 4511.  If you are seeing 6Mb/s you have
>>something else wrong.
> 
> 
> Note I was talking about an *increase* of 6Mb/s.

Sorry I missed that.  OTOH I'm already routinely getting close to your 
cited figure on a platform with ~half the cpu power, so I still wonder 
what's going on.

> 
> In addition my TCP stream ends on the box in question, which obviously
> increases the load on the system, which brings the numbers we are both
> seeing roughly in the same ballpark.

Yes; the only interesting test in my opinion is when the box is acting 
as an AP or otherwise forwarding the packets.  This has a major impact on 
the performance characteristics but more clearly reflects real-life use.
> 
> 
>>I've not seen livelock in any situations though there are some issues
>>with the priority of the taskqueue thread.
> 
> 
> With a small number of simple self-regulating packet streams (such as
> TCP) livelock is not really an issue, as the streams will slow their
> transmit rate when the box gets near livelock and packets start dropping.
> 
> However on more complex links where traffic is not (or not completely)
> self-regulating (i.e. VoIP, other datagram streams, a high number of TCP
> streams, asymmetric routing), livelock because of interrupt overhead is
> a common occurrence.

Not in my experience, but as I said there ARE priority issues with the 
taskqueue setup in CVS.
> 
> 
>>Polling is not a panacea;
>>you are potentially increasing latency which has ramifications.
> 
> 
> Correct, whenever polling is in use (on Ethernet as well) we see the
> latency go up. In addition, the system is under higher load when there is
> no traffic. These are tradeoffs that one accepts when using polling.
> Obviously DEVICE_POLLING is not configured by default.
> 
> 
>>>However, it should be noted that the default behaviour (in the 6.0
>>>release) seems to be that the hardware generates around 2,000 interrupts
>>>per second at roughly 15-18 Mbit/s of throughput.
>>
>>You need to identify what kind of interrupts there are and what type of
>>ath hardware you are using.  
> 
> 
> The interrupts are the RX and TX interrupts. My polling additions don't
> mask out any other interrupts.

I see no statistics; are you sure you are not being pounded by phy 
errors?  You haven't even answered my question about what ath devices 
you're using.
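
You can check this directly.  Assuming you have the athstats tool from 
tools/tools/ath in the source tree built (exact options may differ 
between versions), something along these lines will show whether the 
interrupt load is real rx/tx work or phy errors:

	# per-device interrupt counts; sample twice a few seconds apart
	vmstat -i | grep ath
	# running ath statistics, including phy error counts
	athstats -i ath0 1
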

> 
> 
>>You can trivially reduce the tx interrupt
>>load by turning off interrupts on EOL and just using the periodic
>>interrupts generated every N tx descriptors.  
> 
> 
> Thanks for the tip. I'm sure that would help. I'll look into that.

If you are polling then doing that is irrelevant.
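
To be concrete about what I meant: the idea is to set the interrupt 
request bit on only every Nth tx descriptor instead of asking for an 
interrupt on EOL/every frame.  Roughly like the following, done where 
the tx descriptor flags are assembled before ath_hal_setuptxdesc(); the 
field and flag names here are approximate and may not match your tree 
exactly:

	/*
	 * Request a tx completion interrupt only once every
	 * sc_txintrperiod frames rather than on every descriptor.
	 */
	flags = HAL_TXDESC_CLRDMASK;
	if (++txq->axq_intrcnt >= sc->sc_txintrperiod) {
		flags |= HAL_TXDESC_INTREQ;	/* interrupt on this frame */
		txq->axq_intrcnt = 0;
	}
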
> 
> 
>>But if you profile I
>>suspect you will find the interrupt overhead is not significant relative
>>to other costs.
> 
> 
> The very act of the CPU servicing the interrupt (i.e. saving registers,
> switching stacks, etc.) causes overhead. In addition, interrupt
> handling does not fall under the domain of the scheduler.
> 
> This isn't just theory; it has a tested, real-life impact. As I noted
> earlier, for a simple TCP stream over a wireless link throughput went up
> by 6Mb/s.
> 
> In my real-world case: on a 233MHz net4826 running GIF encapsulation,
> IPsec encryption, and wireless backhaul, throughput went from being
> livelocked at 3.5Mb/s (and userland barely functioning) to over 10Mb/s
> (with userland scheduled properly). This is with polling used in the sis
> ethernet driver, the hifn crypto card driver, and the ath driver.
> 
> Instead of each of these devices generating interrupts, polling (in my
> case at 256 Hz) allows the system to function smoothly. Yes, there is
> latency of up to 20ms, but that's a small tradeoff for the throughput
> and stability increases.

You've changed a bunch of stuff at once and can make no specific claims 
about the polling change to the ath driver.  For all you know, changing 
the way the crypto driver works is what bought you all your performance. 
Your config is so cpu-bound and i/o-bound that buying back any cycles 
is going to be a big win.

> 
> 
>>I'm not convinced polling is worthwhile w/o a major restructuring of the
>>driver.  OTOH this shouldn't stop you from pushing forward...
> 
> 
> As you noted, there are other performance enhancements that could be
> made, and while I'd love to see them implemented, I fear they may be
> beyond my scope and available time. This polling addition is a simple
> performance enhancement that helps my clients, and perhaps others would
> also be interested.
> 
> I'll attach the rough patch to give an idea of the direction I'm
> working in. Note that this patch is incomplete. It locks up after a
> while, which is probably due to the way I call the taskqueue callbacks.
> I'll continue to work on this.

Your patch changes the way interrupts are handled by calling the tx+rx 
processing directly from the polling routine w/o any locks.  This breaks 
locking assumptions in the driver (as you noted in your comments).  To 
make this work correctly you probably need to restructure the driver 
along the lines of all the other drivers to use a single lock.  I've 
considered doing this because future additions for radar and apsd 
support will probably require immediate processing of the rx descriptors 
in the interrupt handler.  The existing setup was an experiment to see 
if we could make the tx+rx paths run in parallel to get increased 
concurrency.  It's worked out ok but has noticeable overhead on some 
platforms due to the additional locking.
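
For what it's worth, a restructured polling entry point under a single 
driver lock would look roughly like the sketch below.  The 
ath_poll_rx/ath_poll_tx helpers are placeholders for the existing rx/tx 
taskqueue handlers; treat this as a sketch of the locking model, not 
working code:

	/*
	 * Polling handler under a single driver lock, following the
	 * usual poll_handler_t conventions.  ath_poll_rx/ath_poll_tx
	 * stand in for the existing rx/tx processing.
	 */
	static void
	ath_poll(struct ifnet *ifp, enum poll_cmd cmd, int count)
	{
		struct ath_softc *sc = ifp->if_softc;

		if (cmd == POLL_DEREGISTER)
			return;
		ATH_LOCK(sc);			/* one lock covers tx+rx state */
		ath_poll_rx(sc, count);		/* drain up to 'count' rx descriptors */
		ath_poll_tx(sc);		/* reap completed tx descriptors */
		ATH_UNLOCK(sc);
	}
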

> 
> Please understand that by posting this patch I'm not pressuring you to
> help. Thanks again for your advice so far.

I appreciate your working on this stuff.  I'm mostly trying to prod you 
into measuring effects directly rather than inferring stuff based on 
indirect measurements.  Your system starts off overloaded, so reducing 
load can result in noticeable improvements, but possibly not for the 
reasons you think.
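
Concretely, I'd toggle one change at a time and watch the numbers 
directly while the test runs; the exact knobs depend on the branch 
(polling is per interface via ifconfig on some, the global 
kern.polling.enable sysctl on others), but something like:

	# interrupt rate per device, before and after each change
	vmstat -i
	# cpu time spent in interrupt/taskqueue context during the run
	top -S
	# the throughput test itself
	netperf -H <host> -t TCP_STREAM
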

FWIW my polling changes are in the sam_wifi p4 branch; you can view it 
at http://perforce.freebsd.org.

	Sam

