Micro-benchmark for various time syscalls...

Bruce Evans brde at optusnet.com.au
Mon Jun 2 17:11:09 UTC 2008


On Sun, 1 Jun 2008, Sean Chittenden wrote:

> I wrote a small micro-benchmark utility[1] to test various time syscalls and 
> the results were a bit surprising to me.  The results were from a UP machine 
> and I believe that the difference between gettimeofday(2) and 
> clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and 
> performance would've degraded further with each additional core.

I wouldn't expect SMP to make much difference between CLOCK_REALTIME and
CLOCK_REALTIME_FAST.  The only difference is that the former calls
nanotime() while the latter calls getnanotime().  nanotime() always does
more, but it doesn't have any extra SMP overheads in most cases (in rare
cases, like i386 using the i8254 timecounter, it needs to lock accesses to
the timecounter hardware).  gettimeofday() always does more than
CLOCK_REALTIME, but again nothing extra for SMP.
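To make that concrete, here is a self-contained sketch of the two code
paths.  The struct and names only imitate sys/kern/kern_tc.c (they are
illustrative, not the kernel source): getnanotime() copies the timestamp
saved at the last tc_windup(), while nanotime() additionally reads the
hardware counter and scales the ticks accumulated since then.

%%%
#include <stdint.h>
#include <stdio.h>

static uint32_t fake_counter;

static uint32_t
read_hardware(void)
{
	return (fake_counter += 1000);	/* stands in for a TSC/ACPI/i8254 read */
}

struct timehands_sketch {
	uint32_t th_offset_count;	/* counter value at the last tc_windup() */
	uint64_t th_scale;		/* 2^32 * nsec per counter tick */
	int64_t	 th_sec;		/* seconds at the last tc_windup() */
	long	 th_nsec;		/* nanoseconds at the last tc_windup() */
};

static struct timehands_sketch th0 = { 0, (uint64_t)1 << 32, 1212400000, 0 };

/* getnanotime()-style read: copy what tc_windup() saved, no hardware access. */
static void
getnanotime_sketch(int64_t *sec, long *nsec)
{
	*sec = th0.th_sec;
	*nsec = th0.th_nsec;
}

/* nanotime()-style read: also read the hardware and scale the elapsed ticks. */
static void
nanotime_sketch(int64_t *sec, long *nsec)
{
	uint32_t delta = read_hardware() - th0.th_offset_count;

	*sec = th0.th_sec;
	*nsec = th0.th_nsec + (long)((th0.th_scale * delta) >> 32);
	while (*nsec >= 1000000000L) {
		*nsec -= 1000000000L;
		(*sec)++;
	}
}

int
main(void)
{
	int64_t sec;
	long nsec;

	getnanotime_sketch(&sec, &nsec);
	printf("getnanotime: %lld.%09ld\n", (long long)sec, nsec);
	nanotime_sketch(&sec, &nsec);
	printf("nanotime:    %lld.%09ld\n", (long long)sec, nsec);
	return (0);
}
%%%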

> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most 
> authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of 
> CLOCK_REALTIME's value[2]).  In fact, I'd assume that CLOCK_REALTIME_FAST is 
> just as accurate as Linux's gettimeofday(2) (a statement I can't back up, but 
> believe is likely to be correct) and therefore there isn't much harm (if any) 
> in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use 
> vs. gettimeofday(2).  FYI.  -sc

The existence of most of CLOCK_* is a bug.  I wouldn't use CLOCK_REALTIME_FAST
for anything, if only because it doesn't exist in most kernels that I
run.  I switched from using gettimeofday() to CLOCK_REALTIME many years
ago, when syscalls started taking less than 1 usec, and I still occasionally
have problems from this when running old kernels, because old i386 kernels
don't support CLOCK_REALTIME and old amd64 kernels have a broken
CLOCK_REALTIME in 32-bit mode.
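Something like the following (a minimal sketch; CLOCK_REALTIME_FAST is
FreeBSD-specific and old kernels may reject it at run time) lets code
prefer the cheap clock where it is available:

%%%
#include <stdio.h>
#include <time.h>

static int
get_realtime(struct timespec *tsp)
{
#ifdef CLOCK_REALTIME_FAST
	/* Old kernels can reject new clock ids at run time; fall back. */
	if (clock_gettime(CLOCK_REALTIME_FAST, tsp) == 0)
		return (0);
#endif
	return (clock_gettime(CLOCK_REALTIME, tsp));
}

int
main(void)
{
	struct timespec ts;

	if (get_realtime(&ts) == 0)
		printf("%ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
	return (0);
}
%%%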

> PS  Is there a reason that time(3) can't be implemented in terms of 
> clock_gettime(CLOCK_SECOND)?  10ms seems precise enough compared to time_t's 
> whole second resolution.

I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the
low-accuracy timers provided by the get*time() family are accurate enough
to give the time in seconds.  Unfortunately, they are still broken --
they are all incoherent relative to nanotime() and some are incoherent
relative to each other.  CLOCK_SECOND can lag the time in seconds given
by nanotime() by up to tc_tick/HZ seconds.  This is because CLOCK_SECOND
returns the time in seconds at the last tc_windup(), so it misses seeing
rollovers of the second in the interval between the rollover and the next
tc_windup(), while nanotime() doesn't miss seeing these rollovers; the
result is incoherent times, with nanotime()/CLOCK_REALTIME being correct
and time_second/CLOCK_SECOND broken.

vfs_timestamp() already defaults to using time_second, so it gives times
incoherent with time(), since the latter still uses gettimeofday().  Some
file system test programs see this incoherency, and I run them with
vfs.timestamp.precision=3 (nanotime()) to avoid it.  File systems were
micro-optimized to use time_second (now not so micro-optimized, using
vfs_timestamp(), which defaults to time_second), but micro-pessimizing
them to use nanotime() makes no significant difference.  This is because
most file system timestamp updates are cached (delayed until the next
sync or disk write), and in cases where the updates are written to disk,
the time to read the clock is in the noise relative to the time for the
disk write.
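The lag is easy to observe from userland.  A minimal test (FreeBSD-specific,
since CLOCK_SECOND is a FreeBSD extension) reads both clocks back to back
and counts how often CLOCK_SECOND's tv_sec is behind CLOCK_REALTIME's:

%%%
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec fine, coarse;
	long i, lags = 0;

	for (i = 0; i < 10000000; i++) {
		clock_gettime(CLOCK_REALTIME, &fine);
		clock_gettime(CLOCK_SECOND, &coarse);
		/*
		 * coarse was read after fine, so a smaller tv_sec means
		 * CLOCK_SECOND missed a rollover of the second.
		 */
		if (coarse.tv_sec < fine.tv_sec)
			lags++;
	}
	printf("CLOCK_SECOND lagged on %ld of 10000000 reads\n", lags);
	return (0);
}
%%%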

>
> % ./bench_time 9079882 | sort -rnk1
> Timing micro-benchmark.  9079882 syscall iterations.
> Avg. us/call    Elapsed     Name
> 9.322484    84.647053       gettimeofday(2)
> 8.955324    81.313291       time(3)
> 8.648315    78.525684       clock_gettime(2/CLOCK_REALTIME)
> 8.598495    78.073325       clock_gettime(2/CLOCK_MONOTONIC)
> 0.674194    6.121600        clock_gettime(2/CLOCK_PROF)
> 0.648083    5.884515        clock_gettime(2/CLOCK_VIRTUAL)
> 0.330556    3.001412        clock_gettime(2/CLOCK_REALTIME_FAST)
> 0.306514    2.783111        clock_gettime(2/CLOCK_SECOND)
> 0.262788    2.386085        clock_gettime(2/CLOCK_MONOTONIC_FAST)

These are very slow.  Are they on a 486? :-)  I get about 262 ns for
CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
The syscall overhead is about 200 nsec (170 nsec for a simpler syscall
and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading
the TSC timecounter adds another 60 nsec, including a whole 6 nsec for
the hardware part of the read (perhaps more like 30 nsec than 60 for the
whole read).  The TSC doesn't work on all machines (never for SMP), but
this will hopefully change.  (Phenom is supposed to have TSCs that are
coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+
to implement this :-(.  Core2 already has a 40+ cycles rdtsc, but AFAIK
it doesn't have coherent TSCs.)  Other timecounters are much slower than
the TSC, but I haven't seen one take 8000 nsec since 486 days.
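To reproduce per-call numbers like these, one rough method (assuming x86
with a usable TSC; CPU_HZ below is an assumed clock rate and must be set
for the machine being measured) is to time a large batch of calls with
rdtsc and divide:

%%%
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>			/* __rdtsc() on gcc/clang */

#define	ITERS	1000000
#define	CPU_HZ	2.205e9			/* assumed clock rate; set for your CPU */

int
main(void)
{
	struct timespec ts;
	uint64_t start, end;
	int i;

	start = __rdtsc();
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_REALTIME, &ts);
	end = __rdtsc();
	printf("%.1f nsec/call\n", (end - start) / CPU_HZ / ITERS * 1e9);
	return (0);
}
%%%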

Some of my benchmark results:

2.205GHz A64 in 32-bit mode, VIA motherboard:
%%%
2008/01/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 240, max 77658, mean 242.171787, std 65.655259

2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317

2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400

2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440

2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301

2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
%%%

These times are for CLOCK_REALTIME.

This system has fairly fast ACPI and i8254 timecounters.  800-1500
nsec is more typical for ACPI-fast, and 4000-5000 is more typical
for i8254.  ACPI-fast should be named ACPI-not-very-slow.  ACPI-safe
is very slow, perhaps slower than i8254.  i8254 could be made about
twice as fast if anyone cared.

133MHz P1:
%%%
1996/07/12:
min 3, max 472, mean 3.320346, std 0.694846

1998/02/21 pre-phk:
min 3, max 595, mean 3.443382, std 0.767383

1998/02/21 post-phk:
min 4, max 99, mean 4.614527, std 0.710407

1999/12/04:
min 4, max 120, mean 4.630231, std 0.777733

2000/09/29:
min 5, max 203, mean 5.376130, std 1.912127

2001/05/19:
min 6, max 1715, mean 6.783378, std 2.015211

2001/09/02:
min 5, max 482, mean 5.474384, std 2.683939
%%%

These times are for gettimeofday().  Note that these are now in usec.
The timecounter is always the TSC (post-phk) or uses the TSC more
directly (pre-phk).  These times serve mainly to document time bloat
due to timecounters and SMPng.  The P1 has limited caching and suffers
more from longer code paths than newer CPUs.

66MHz 486DX2:
%%%
1995/11/03:
min 13, max 171, mean 14.286634, std 1.836667

2000/11/15:
min 20, max 542, mean 21.843003, std 8.003137
%%%

Here the timecounter is always the i8254.  These times serve mainly
as a reminder of how slow old machines were.  The i8254 timecounter
hardware didn't take any longer back then (it was probably faster,
since old machines didn't have PCI bridges, and they had tunable ISA
wait states which I tuned), but a simple syscall took 7.2 usec and
gettimeofday() took much longer.  The bloat between 1995 and 2000 was
relatively similar to that on the P1 system.

Other implementation bugs (all in clock_getres()):
- all of the clock ids that use getnanotime() claim a resolution of 1
   nsec, but that is bogus.  The actual resolution is more like tc_tick/HZ.
   The extra resolution in a struct timespec is only used to return
   garbage related to the incoherency of the clocks.  (If it could be
   arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
   the clocks would be coherent and the times would always be a multiple
   of tc_tick/HZ, with no garbage in low bits.)
- CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
   bogus.  The actual resolution is more like 1/stathz, or perhaps 1
   microsecond.  hz is irrelevant here since statclock ticks are used.
   statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
   correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
   calcru() calculates the time to a resolution of 1 usec; it is just
   very inaccurate at that resolution.
"Resolution" is a poor term for the functionality needed here.  I think
a hint about the accuracy is more important.  In simple implementations
using interrupts and ticks, the accuracy would be about the same as
the resolution, but FreeBSD is more complicated.
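The claimed resolutions can be inspected from userland with
clock_getres(); a small dump like the following shows them (the
non-POSIX clock ids are guarded, since they are FreeBSD-specific):

%%%
#include <stdio.h>
#include <time.h>

static void
show(const char *name, clockid_t id)
{
	struct timespec res;

	if (clock_getres(id, &res) == 0)
		printf("%-24s %ld.%09ld sec\n", name, (long)res.tv_sec,
		    res.tv_nsec);
	else
		perror(name);
}

int
main(void)
{
	show("CLOCK_REALTIME", CLOCK_REALTIME);
	show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
#ifdef CLOCK_VIRTUAL
	show("CLOCK_VIRTUAL", CLOCK_VIRTUAL);
#endif
#ifdef CLOCK_PROF
	show("CLOCK_PROF", CLOCK_PROF);
#endif
#ifdef CLOCK_REALTIME_FAST
	show("CLOCK_REALTIME_FAST", CLOCK_REALTIME_FAST);
#endif
#ifdef CLOCK_SECOND
	show("CLOCK_SECOND", CLOCK_SECOND);
#endif
	return (0);
}
%%%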

Bruce

