bge(4) sysctl tuneables -- a blast from the past.

Sepherosa Ziehau sepherosa at gmail.com
Thu Apr 18 14:47:22 UTC 2013


On Wed, Apr 17, 2013 at 7:49 PM, Bruce Evans <brde at optusnet.com.au> wrote:
> On Tue, 16 Apr 2013, Sepherosa Ziehau wrote:
>
>> On Tue, Apr 16, 2013 at 1:56 PM, Bruce Evans <brde at optusnet.com.au> wrote:
>>>
>>>
>>> Technical bugs include:
>>> - wrong defaults are claimed for *coal_ticks.  The defaults are 150, but
>>>   are claimed to be 150 milliseconds.  These values are dimensionless,
>>>   but since ticks take 1 microsecond each, 150 gives 150 microseconds,
>>>   not 150 milliseconds.
>>
>>
>> The real effect of TX coalesce ticks is confusing to me; the TX
>> interrupt does not come at the rate you have specified, at least not
>> on the several PCI-e bge(4) NICs I have tested.  However, RX coalesce
>> ticks work as expected.
>
>
> It works for me on a 5701 (PCI-X) on a PCI-33 bus.
>
> Perhaps you are just seeing rx interrupts mixed with tx interrupts.  At

Nah, the testing switch only has the two tested NICs connected (the
receiving side is an Intel 82571).  The ssh connections are on other
NICs, and there is no other traffic between the tested NICs.  If TX
coalesce BDs is set (e.g. to 128), the receiving side can sink the
1.488Mpps generated by the BCM5720, so the receiving side should be OK
too.

> least the FreeBSD driver doesn't determine the interrupt type, so it
> always processes tx activity when it gets an rx interrupt.
>
> I had to do the following to avoid getting rx interrupts (without this
> the interrupt rate increased by a factor of 3-4 with tx_coal_ticks = 150,
> from ~6.7 kHz to 19-24 kHz):
> - I use ttcp for testing, so on the receiving system use ttcp -u -r so
>   that it doesn't echo anything (otherwise it would "echo" with icmp
>   port-unreachable unless firewalled).
> - Use an old receiving system that doesn't support flow control.  The
>   system can't keep up, and drops about half of the packets, so if it
>   did flow control then there would be a lot of rx interrupts.
>
>
>> Here is how the tests were conducted:
>> - Send-only test, no RX
>> - Each packet consumes only one BD; UDP datagrams, using hardware
>> checksum offloading
>> - TX coalesce BDs is set to 0, so only TX coalesce ticks have effect
>>
>> The interrupt rate I got seemed to be related to packet size?!  I
>> tested two TX coalesce ticks settings:
>> (the results I recorded were from a BCM5720)
>
>
> This might be due to larger packet size causing less rx activity.

No RX activity :)

>
>
>> The first setting was 1023us; the first col is UDP data size, the
>> second col is rough interrupt rate
>> 18B    667/s
>> 64B    611/s
>
>
> Oops, this doesn't look like rx activity.  We expect a rate of 977 Hz,
> possibly increased significantly by tx activity.
>
> I get 996-1004 here (1023us is actually 1000?).
>
>
>> 128B    538/s
>> 256B    432/s
>> 512B    311/s
>> 1024B    194/s
>> 1472B    146/s
>
>
> I get 996-1004 for all of these.

That's the result I wanted to get from the PCI-e bge(4) NICs that I
have tested.

>
> Now I remember another problem that I work around using huge ifqueues (10k
> or 20k entries) and/or busy-waiting in the send() in ttcp.  It is too easy
> for the tx to stop because there is nothing on the ifqueue to refill it.
> Then it won't restart until the application starts sending again.  It
> is normal for all the queues to fill up.  Then send returns ENOBUFS and
> there is no good way for the application to handle this, since select()
> on the queues not being full is broken (never supported).  Bad ways include:
> - sleep for a while in the application.  It is hard to know when to wake up,
>   and impossible to wake up soon enough if timeout granularity is large.
> - use huge ifqueues, so that long delays in the application work
> - spin trying send().

Using a user space packet generator (netmap is an exception) does
suffer from the problem you have described; however, my sending tests
are conducted using a home-made in-kernel packet generator (no ALTQ
involved):
- Upon test start, 3/4 of the ifqueue is filled; then if_start is
called.  For the BCM5720, 384 packets are put onto the hardware queue
at once.
- In bge_txeof(), for each packet mbuf freed, a new packet is added to
the ifqueue.
- if_start is called at the end of bge_txeof().

Except for the startup, the rest (the major part) is TXEOF driven.
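
For illustration only, here is a rough C sketch of that flow (names
like pktgen_alloc_udp_mbuf() are hypothetical and locking is omitted;
this is not the actual generator code):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

static struct mbuf *pktgen_alloc_udp_mbuf(void);	/* hypothetical */

/* Test start: fill ~3/4 of the ifqueue, then kick the transmitter. */
static void
pktgen_start(struct ifnet *ifp)
{
	struct mbuf *m;
	int i, n;

	n = ifp->if_snd.ifq_maxlen * 3 / 4;
	for (i = 0; i < n; i++) {
		if ((m = pktgen_alloc_udp_mbuf()) == NULL)
			break;
		IF_ENQUEUE(&ifp->if_snd, m);
	}
	(*ifp->if_start)(ifp);
}

/*
 * Called from bge_txeof() for each TX mbuf it frees, so the ifqueue
 * never drains; bge_txeof() then calls if_start again at its end,
 * which is what makes everything after startup TXEOF driven.
 */
static void
pktgen_txeof_refill(struct ifnet *ifp)
{
	struct mbuf *m;

	if ((m = pktgen_alloc_udp_mbuf()) != NULL)
		IF_ENQUEUE(&ifp->if_snd, m);
}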

If TX coalesce BDs is set to 128 (TX coalesce ticks is set to 1023
here) on the BCM5720, I get ~11700 interrupts/s and the sending rate is
~1.488Mpps (18B UDP datagrams); obviously the TX coalesce BDs setting
is taking effect here (128 * 11700 ~= 1.488Mpps).  So I think the
packet generator mechanism works as expected (well, it works with
other types of NICs too, e.g. em, jme).
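
Writing that arithmetic down as a quick user space check (plain C,
nothing driver specific; the numbers are the ones above):

#include <stdio.h>

int
main(void)
{
	double pps = 1.488e6;		/* observed send rate on the BCM5720 */
	double coal_bds = 128.0;	/* TX coalesce BDs */
	double coal_ticks = 1023.0;	/* TX coalesce ticks, i.e. microseconds */

	/* If BD-based coalescing dominates: one interrupt per coal_bds packets. */
	printf("BD-limited:   ~%.0f intr/s\n", pps / coal_bds);

	/* If tick-based coalescing dominates: one interrupt per coal_ticks us. */
	printf("tick-limited: ~%.0f intr/s\n", 1e6 / coal_ticks);

	return (0);
}

The BD-limited figure (~11625/s) is close to the ~11700/s observed,
while the tick-limited figure is the ~977Hz you expected for the
1023us setting.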

>
>
>> The second setting was 128us; the first col is UDP data size, the
>> second col is rough interrupt rate
>> 18B    1647/s
>> 64B    1338/s
>
>
> Now you should be getting much higher interrupt rates, unless something
> can't keep up.  I get 7904-7967 and 7906-7971.
>
>
>> 128B    1030/s
>> 256B    700/s
>> 512B    430/s
>> 1024B    235/s
>> 1472B    169/s
>
>
> I get little dependency on the packet size.  At 1472B, the packet rate is
> ~58900.  Everything on the tx side can keep up with that though not much
> more, so no drop is expected.
>
>
>> Well, to be frank, it does not make too much sense to me.
>
>
> I found timestamps and counters for bge_*xeof() good for understanding
> the flow of control.  It is easy to generate too much data, so I keep
> the tx and rx statistics separate and try to understand tx and rx activity
> separately.  Some for tx with tx_coal_ticks = 1023 and packet size 18:
>
> @  976 1366197879.094951 454  25 349 1366197879.094976 105
> @  971 1366197879.095947 455  26 351 1366197879.095973 104
> @  972 1366197879.096945 455  25 355 1366197879.096970 100
> @  975 1366197879.097945 451  24 351 1366197879.097969 100
> @  974 1366197879.098943 443  24 337 1366197879.098967 106
>
> The large numbers are absolute timestamps for bge_txeof() entry and exit.
>
> The entries are separated by almost exactly 1000 us (not 1023 us as
> expected).
>
> The first numeric column gives the time in us between the previous exit and
> this entry.  Not very relevant here.
>
> The fourth numeric column gives the time in us between this entry and exit.
> Not very relevant here.
>
> The third and final numeric columns give the ring indexes on entry and
> exit, and the 5th numeric column gives the difference of these.  These
> are relevant here.  Ideally the ring would be almost but not quite
> full whenever we start, and the difference would be almost 512, but
> ttcp apparently can't generate data fast enough to keep it full, so
> it has an average of 350+ entries and the packet rate is 350+kpps.  We
> don't want the ring to be completely full when we start, since that
> means that we are not interrupting enough to keep up with the generator
> and probably also with the hardware.  This system can do 640+kpps when
> ideally configured, using tx_coal_ticks = 1000000 and tx_coal_bds =
> 384.  With tx_coal_ticks = 1023 (1000) and tx_coal_bds = 0, it couldn't
> do more than 512kpps.  Its current non-ideal configuration includes
> firewalling, sharing the bge interrupt with rl, and not overclocking.
> In this configuration, the above 2 tx_coal_* settings are almost equally
> good (tx_coal_ticks = 1023 reduces latency for reaping descriptors,
> but latency doesn't matter; tx_coal_ticks = 1000000 reduces interrupts
> when not under load, but when not under load interrupt overhead isn't
> a problem).

Thank you for the measurement and analysis!  I have not yet made any
detailed function call timing or packet count measurements :)
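
A minimal sketch of that kind of entry/exit instrumentation
(hypothetical names; it assumes microtime() for the absolute
timestamps and takes the TX consumer index at bge_txeof() entry and
exit) could look like:

#include <sys/param.h>
#include <sys/time.h>

#define TXEOF_LOG_SIZE	1024

/* One sample per bge_txeof() call, roughly matching the columns above. */
struct txeof_sample {
	struct timeval	enter;		/* absolute timestamp on entry */
	struct timeval	leave;		/* absolute timestamp on exit */
	uint16_t	idx_enter;	/* TX consumer index on entry */
	uint16_t	idx_leave;	/* TX consumer index on exit */
};

static struct txeof_sample txeof_log[TXEOF_LOG_SIZE];
static u_int txeof_log_idx;

/* Call at the top of bge_txeof(), before the reclaim loop. */
static struct txeof_sample *
txeof_sample_enter(uint16_t tx_cons)
{
	struct txeof_sample *sp;

	sp = &txeof_log[txeof_log_idx++ % TXEOF_LOG_SIZE];
	microtime(&sp->enter);
	sp->idx_enter = tx_cons;
	return (sp);
}

/*
 * Call at the bottom of bge_txeof(); idx_leave - idx_enter (mod ring
 * size) is the number of BDs reclaimed by this invocation.
 */
static void
txeof_sample_leave(struct txeof_sample *sp, uint16_t tx_cons)
{
	microtime(&sp->leave);
	sp->idx_leave = tx_cons;
}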

Best Regards,
sephe

-- 
Tomorrow Will Never Die
