strange ping response times...

Thu Apr 12 03:19:18 UTC 2012

On Wed, 11 Apr 2012, Luigi Rizzo wrote:

> On Wed, Apr 11, 2012 at 02:16:49PM +0200, Andre Oppermann wrote:
>> On 11.04.2012 13:00, Luigi Rizzo wrote:
>>> On Wed, Apr 11, 2012 at 12:35:10PM +0200, Andre Oppermann wrote:
>>>> On 11.04.2012 01:32, Luigi Rizzo wrote:
>>>>
>>>> Things going through loopback go through a NETISR and may
>>>> end up queued to avoid LOR situations.  In addition per-cpu
>>>> queues with hash-distribution for affinity may cause your
>>>> packet to be processed by a different core.  Hence the additional
>>>> delay.
>>>
>>> so you suggest that the (de)scheduling is costing several microseconds ?
>>
>> Not directly.  I'm just trying to explain what's going on to
>> get a better idea where it may go wrong.
>>
>> There may be a poor ISR/scheduler interaction that prevents that
>> causes the packet to be processed only on the next tick or something
>> like that.  I don't have a better explanation for this.

It's certainly abysmally slow.  Just the extra context switching made
in FreeBSD-5 made the RTT for pinging localhost 3-4 times slower than
in FreeBSD-3 in old tests (I compared with FreeBSD-3 instead of
FreeBSD-4 since general bloat had already made FreeBSD-4 significantly
slower, although not 3-4 times).  Direct dispatch of netisrs never did
anything good in old tests, and the situation doesn't seem to have
improved -- you now need an i7 2600 (SMP?) to get the same speed as
my Athlon64 2000 (UP) in the best cases for both (2-3 usec RTT).  SMP
and multiple cores give more chances for scheduler pessimizations.

> ok, some final remarks just for archival purposes
> (still related to the loopback ping)
>
> ping takes a timestamp in userspace before trying to transmit
> the packet, and then the timestamp for the received packet
> is recorded in the kernel (in the interrupt or netisr thread
> i believe -- anyways, not in userspace).

No, all timestamps recorded used by ping are recorded in userland.
IIRC, there is no kernel timestamping at all for ping packets, unless
ping is invoked with "-M time" to make it use ICMP_TSTAMP, and
ICMP_TSTAMP gives at best milliseconds resolution so it is useless
for measuring RTTs in the 2-999 usec range.
    (ICMP_TSTAMP uses iptime(), and the protocol only supports
    milliseconds resolution, which was good enough for 1 Mbps ethernet.
    iptime() is more broken than that (except in my version), since it
    uses getmicrotime() instead of microtime().  getmicrotime() gives
    at best 1/HZ resolution, so it is not even good enough for 1 Mbps
    ethernet when HZ is small, and now it may give extra inaccuracies
    from stopping the 1/HZ clock while in sleep states.)

This reminds me that slow timecounters make measuring small differences
in times difficult.  It can take longer to read the timecounter than
the entire RTT.  I tested this by pessimizing kern.timecounter.hardware
from TSC to i8254.  On my test system, clock_gettime() with
CLOCK_MONOTONIC takes an average of 273 nsec with the TSC timecounter
and 4695 nsec with the i8254 timecounter.  ping uses gettimeofday()
which is slightly slower and more broken (since it uses CLOCK_REALTIME).
My normal ping -fq localhost RTT is 2-3 usec
     (closer to 3; another bug in this area is that the timestamps only
     have microseconds resolution so you can't see if 3 is actually
     more like 2.5.  I was thinking of changing the resolution to
     nanoseconds 8-10 years ago, before the FreeBSD-5 pessimizations
     and CPU speeds hitting a wall made this not really necessary),
but the kernel I'm testing with uses ipfw which bloats the RTT to 8-9
usec.  Then kern.timecounter.hardware=i8254 bloates the RTT to 24-25!
That's 16 usec extra, enough for the extra overhead of 4
gettimeofday() calls.  Timecounter statistics confirm that there are
many more than 2 timecounter calls per packet:
     - 7 binuptime calls per packet.  That's the hardware part that is
       very slow with an i8254 timecounter.  It apparently takes more
       like 3000 nsec than 4695 nsec (to fit 7 in 24-25 usec).
     - 3 bintime calls per packet.  bintime calls binuptime, so this
       accounts for 3 of the above 7.  The other 4 are apparently for
       context switching.  There are 2 context switches per packet :-(.
       I can't explain why there are apparently 2 timestamps per
       context switch.
 	  (Note that -current uses the inferior cputicker mechanism
 	  instead of timecounters for timestamping context switches.
 	  It does this because some timecounters are very slow.  But
 	  when the timecounter is the TSC, it binuptime() only takes
 	  a few cycles more than cpu_ticks().  (The above time of
 	  273 nsec for reading the TSC timecounter is from userland.
 	  The kernel part takes only about 30 nsec, while cpu_ticks()
 	  might take 15 nsec.)  So -current wouldn't be pessimized for
 	  this part by changing the timecounter to i8254, but without
 	  the pessimization it would be only a few nsec faster than
 	  old kernels provided the timecounter for old kernels is the
 	  TSC.)
     - 2 microtime calls per packet.  These are for the 2 gettimeofday()s.
       microtime calls bintime, so this accounts for 2 of the above 3.
       The other 1 is unaccounted for.
     - 1 getmicrouptime call per packet.  This doesn't involve the hardware.
       This is for the select() timeout.  If we fixed the resolution bugs
       in select(), then we would have to be careful not to pessimize it
       too much by replacing this by microuptime.
     - 0 calls to iptime() per packet.

Benchmarks like this show what a bad idea the get*time() timecounter
APIs are.  When you actually need to read the time a lot, you need high
resolution and thus the full functions, and optimizing for when only
low resolution is needed only optimizes the part that doesn't take very
long anyway (except for broken programs).

> A thing that seems to make
> the difference is the choice of idle states:
> 	machdep.idle = {acpi, hlt, mwait, spin}
> 	hw.acpi.cpu.cx_lowest={C1, C2, C3}
> results in 10-15% of samples with significantly different values
> (see table below, all values in microseconds)
>
> 	config		normal		top 10-15%
> 	-----------------------------------------
> 	acpi+c3		13		85
> 	acpi+c2		10		40
> 	acpi+c1		7		10
>
> 	spin		7		8
> 	hlt,mwait		6..10 (no clear distinction)
> 	ping -f			5-6
>
> This suggests that sometimes the responding thread
> may be started on a sleeping core and so takes
> longer to wake up.
>
> The high values are not periodic but the ratio is close to
> the 128/1000 which relates profhz and tick, so that might be
> something to look at.

I guess this is to be expected on a multi-core system.  The scheduler
should probably try to use a different core for netisrs, but this would
cause both cores to block waiting for each other, so both might sleep.
Any batching of netisrs would give bursts with longer sleeps in between.

I didn't see any large differences between ping -f and ping -i 0.01
on a UP system.  My HZ is 100, so -i 0.01 is not really possible and
gives more like 0.015 (average) or perhaps 0.02 or even 0.03 after
synchronization, as we discussed previously.  -i 0.1 is closer to
possible and gave a small increase in RTT (-i 1.0 gives the default
and a much larger increase).  The increases are absolute and not so
large relatively starting with the slowness given by ipfw.

Note that ping -f doesn't actually flood, since it waits for 1/100
seconds for something to come back, and 1/100 was only a short time
with 1 Mbps ethernet.  This can be worked around to some extent using
ping -i to give a wait time of less than 1/100 seconds.  It is still
impossible to flood, since wait times of less than 1/HZ are impossible,
and the resolution of ping -i is gratuitously limited to 1 msec.  You
could try setting the -i arg to 1/HZ to see the effects of this, or
hack the hard-coded -f timeout to 1 usec.

With hardware, there are many more ways for the RTT to vary.  One way is
that although ping -f doesn't flood, injecting an extra packet on some
1/100 second boundaries can cause queue lengths to build up.  This might
be only with my version of ping (it keeps track of the boundaries an
might send an extra packet when it shouldn't).  When the queue lengths
increase for any reason (due to injecting packets or when there is
a delay for other system activity), packets are sent and received in
bursts; throughput goes up almost proporitionally to the queue lengths,
and RTT goes up.

Bruce