dummynet dropping too many packets

Robert Watson rwatson at FreeBSD.org
Wed Oct 7 13:40:20 UTC 2009


On Wed, 7 Oct 2009, rihad wrote:

>> Suggestions like increasing timer resolution are intended to spread out the 
>> injection of packets by dummynet to attempt to reduce the peaks of 
>> burstiness that occur when multiple queues inject packets in a burst that 
>> exceeds the queue depth supported by combined hardware descriptor rings and 
>> software transmit queue.
>
> Raising HZ from 1000 to 2000 has helped. There are now 200-300 global 
> drops/s, as opposed to 300-1000 with HZ=1000. Or maybe changing net.isr.direct 
> from 1 to 0 helped. Or maybe hash_size from 64 to 256. Or maybe...

Or maybe other random factors such as traffic load corresponding to major 
sports events, etc. :-)

It's also possible that combining multiple changes cancels out the effect of 
one or another change.  Given the rather large number of possible 
combinations of things to try, I'd suggest being fairly strategic in how you 
try them.  Starting with just an original config + significant HZ increase is 
probably the best starting point.  Changing hash_size is really about reducing 
CPU use, so if on the whole you're not getting close to the capacity of a core 
for any given thread involved in the work, it may not make much difference 
(tuning these data structures is a bit of a black art).
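
To keep that comparison clean, it's worth putting the other knobs back to 
their defaults before changing HZ.  Assuming your release lets you set them at 
runtime, something like:

         sysctl net.isr.direct=1
         sysctl net.inet.ip.dummynet.hash_size=64

(64 is the usual hash_size default; adjust if your baseline differed.)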

>> The two solutions, then are (a) to increase the timer resolution 
>> significantly so that packets are injected in smaller bursts
>
> But isn't it bad that raising it too far can actually make things worse?  From 
> /sys/conf/NOTES:
>
> # The granularity of operation is controlled by the kernel option HZ whose
> # default value (1000 on most architectures) means a granularity of 1ms
> # (1s/HZ).  Historically, the default was 100, but finer granularity is
> # required for DUMMYNET and other systems on modern hardware.  There are
> # reasonable arguments that HZ should, in fact, be 100 still; consider,
> # that reducing the granularity too much might cause excessive overhead in
> # clock interrupt processing, potentially causing ticks to be missed and thus
> # actually reducing the accuracy of operation.

Right: we fire the timer on every CPU every 1/HZ seconds, which means quite a 
lot of work being done.  On systems where timers are proportionally more 
expensive -- when using hardware virtualization, for example -- we do 
recommend tuning the timers down.  And our boot loader will actually do it for 
you: we auto-detect vmware, parallels, kqemu, virtualbox, etc., and adjust the 
timer rate from 1000 to 100 during boot.

That said, in your configuration I see little argument for a lower timer rate: 
you need to burst packets at frequent intervals or risk overfilling queues, 
and the overhead of additional timer ticks on your system shouldn't be too 
bad, as you have both very fast hardware and a lot of idle time.

I would suggest making just the HZ -> 4000 change for now and see how it goes.
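
For what it's worth, HZ doesn't require editing the kernel config: it can be 
set as a loader tunable, e.g. in /boot/loader.conf:

         kern.hz="4000"

(A reboot is still required, as you noted.)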

>> and (b) increase the queue capacities.  The hardware queue limits likely 
>> can't be raised w/o new hardware, but the ifnet transmit queue sizes can be 
>> increased.
>
> Can someone please say how to increase the "ifnet transmit queue sizes"?

Unfortunately, I fear that this is driver-specific, and in the case of bce 
requires a recompile.  In the driver init code in if_bce, the following code 
appears:

         ifp->if_snd.ifq_drv_maxlen = USABLE_TX_BD;
         IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
         IFQ_SET_READY(&ifp->if_snd);

USABLE_TX_BD evaluates to an architecture-specific value due to varying page 
size.  You might just try forcing it to 1024, as sketched below.
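
A minimal sketch of that change -- untested, and only the first line differs 
from the stock driver:

         ifp->if_snd.ifq_drv_maxlen = 1024;     /* was USABLE_TX_BD */
         IFQ_SET_MAXLEN(&ifp->if_snd, ifp->if_snd.ifq_drv_maxlen);
         IFQ_SET_READY(&ifp->if_snd);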

>> Timer resolution going up is almost certainly not a bad idea in your 
>> configuration, although does require a reboot as you have observed.
>
> OK, I'll try HZ=4000, but there are some required services (flowtools, 
> radius, mysql, a perl app) also running on the box.

That should be fine.

>> On a side note: one other possible interpretation of that statistic is that 
>> you're seeing fragmentation problems.  Usually in forwarding scenarios this 
>> is unlikely.  However, it wouldn't hurt to make sure you have LRO turned 
>> off on the network interfaces you're using, assuming it's supported by the 
>> driver.
>> 
> I don't think fragments are the problem. The numbers are too small ;-)
> $ netstat -s|fgrep fragment
>        5318 fragments received
>        147 fragments dropped (dup or out of space)
>        5157 fragments dropped after timeout
>        4088 output datagrams fragmented
>        8180 fragments created
>        0 datagrams that can't be fragmented
>
> There's no such option as LRO shown, so I guess it's off: 
> options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>

That probably rules that out as a source of problems then.
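
For completeness: on drivers that do advertise the LRO capability, it can be 
toggled at runtime, e.g.:

         ifconfig bce0 -lro

But since your interface doesn't list it among its options, there's nothing to 
turn off.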

Robert

