[RFC] BPF timestamping

Thu Jun 10 17:03:09 UTC 2010

On Thu, 10 Jun 2010, Alexander Sack wrote:

> On Thu, Jun 10, 2010 at 5:45 AM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Wed, 9 Jun 2010, Jung-uk Kim wrote:
>>
>>> bpf(4) can only timestamp packets with microtime(9).  I want to expand
>>> it to be able to use different format and resolution.  The patch is
>>> here:
>>>
>>> http://people.freebsd.org/~jkim/bpf_tstamp.diff
>>>
>>> With this patch, we can select different format and resolution of the
>>> timestamps.  It is done via ioctl(2) with BIOCSTSTAMP command.
>>> Similarly, you can get the current format and resolution with
>>> BIOCGTSTAMP command.  Currently, the following functions are
>>> available:
>>>
>>>        BPF_T_MICROTIME         microtime(9)

[more deleted]

>> This has too many timestamp types, yet not one timestamp type which
>> is any good except possibly BPF_T_NONE, and not one monotonic timestamp
>> type.  Only external uses and compatibility require use of CLOCK_REALTIME.
>
> None of these issues are bpf(4) related though...you are blaming the
> clock source (rightly so).
>
> bpf(4) is just a consumer, it's not its job to validate clocks.  The
> kernel offers it so bpf(4) uses it.

My main point is that no one knows how to choose the best clock, partly
because there isn't one, so bpf and its consumers won't know either,
or shouldn't, because such knowledge should be in the kernel.

> Again, I agree that bpf(4) has really no way to validate whether the
> clock it's using has the precision to generate a valid timestamp.

Not quite.  bpf or userland in conjunction with bpf could call
clock_getres(CLOCK_XXX), where XXX is REALTIME[_PRECISE] corresponding
to BPF_T_NANOTIME and REALTIME_FAST corresponding to
BPF_T_NANOTIME_FAST, and unavailable for the other bpf ioctl values,
and if clock_getres() worked as well as possible, then bpf could tell
if the clock has enough precision.  clock_getres() works reasonably
well for CLOCK_REALTIME_PRECISE (it gives the same "resolution" as the
hardware clock, although the time delivered is not usually a multiple
of that), but for CLOCK_REALTIME_FAST it returns the same as for
CLOCK_REALTIME_PRECISE and is thus useless (it should return the get*
update interval tc_tick / hz).  POSIX is as confused as anyone about
the difference between precision, resolution and accuracy so its
clock_getres() is misnamed and/or underspecified, but to be useful
it has to return a value related to the clock's update interval and
not one related to the number of trailing zeros (or multiple of a
non-power of 10) in the nanoseconds field of the interface.
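
As a rough illustration of the clock_getres() check described above,
here is a minimal userland sketch.  clock_res_ns() and
clock_fine_enough() are hypothetical helper names, and min_gap_ns is an
assumed caller-supplied estimate of the smallest inter-packet gap.  The
FreeBSD-specific clock IDs are only touched when the headers define
them.

```c
#define _DEFAULT_SOURCE	/* glibc: expose the POSIX clock API under -std=c11 */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Resolution reported by clock_getres() in nanoseconds, or -1 on error. */
long long
clock_res_ns(clockid_t id)
{
	struct timespec ts;

	if (clock_getres(id, &ts) != 0)
		return (-1);
	return ((long long)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}

/*
 * Hypothetical policy check: is the reported resolution no coarser than
 * the smallest gap the caller expects between two packets?
 */
int
clock_fine_enough(clockid_t id, long long min_gap_ns)
{
	long long res;

	assert(min_gap_ns >= 0);
	res = clock_res_ns(id);
	return (res > 0 && res <= min_gap_ns);
}

/* Print what the system reports for the clocks discussed above. */
void
print_clock_resolutions(void)
{
	printf("CLOCK_REALTIME:         %lld ns\n",
	    clock_res_ns(CLOCK_REALTIME));
#ifdef CLOCK_REALTIME_PRECISE
	printf("CLOCK_REALTIME_PRECISE: %lld ns\n",
	    clock_res_ns(CLOCK_REALTIME_PRECISE));
#endif
#ifdef CLOCK_REALTIME_FAST
	printf("CLOCK_REALTIME_FAST:    %lld ns\n",
	    clock_res_ns(CLOCK_REALTIME_FAST));
#endif
}
```

The check is only as good as what clock_getres() returns, so for
CLOCK_REALTIME_FAST it would pass when it should not, for the reason
given above.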

> I ran into this exact problem trying to use the ACPI-fast timecounter in the
> bge(4) driver as an experiment to timestamp packets directly (despite
> the many many issues with this idea), one of the main problems was
> EXACTLY what you describe, ACPI-fast takes too long to access and the
> get* variants just don't update quick enough

I still mostly run UP systems where the TSC works.  Then microtime()
works very well for timestamping groups of packets at interrupt time.
Raw TSC reads would have worked a bit faster, but using microtime()
gave correctness (except I should have used microuptime()) and a
convenient conversion at little cost.

> HOWEVER, JK's changes STILL make sense in my opinion:
>
> - Many new NIC chipsets will timestamp packets directly in the driver.
> (Intel has one already that timestamps all packets, not just
> PTP/IEEE1588 ones).  So using bpf_gettime() vs just blindly calling
> microtime() no matter what IS the right idea (including the mbuf
> tagging mechanism to tell bpf(4) you got a time, use it).

I agree with that.  vfs_timestamp() does the same things for file systems,
but has the bugs that it is system-wide (should be per file system) and
not all file systems use it.
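
The "use the driver's time if it gave you one" idea can be sketched in
userland as follows.  This is not the code from the patch: struct pkt,
its fields and pkt_gettime() are hypothetical stand-ins for the mbuf
tag and bpf_gettime(), and CLOCK_REALTIME stands in for whatever
software clock was selected.

```c
#define _DEFAULT_SOURCE	/* glibc: expose clock_gettime() under -std=c11 */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical packet descriptor; a real driver would tag the mbuf. */
struct pkt {
	int		has_hw_stamp;	/* nonzero if the NIC stamped it */
	struct timespec	hw_stamp;	/* valid only if has_hw_stamp */
};

/*
 * Prefer the hardware timestamp when the driver supplied one, and
 * fall back to reading a software clock otherwise.
 */
void
pkt_gettime(const struct pkt *p, struct timespec *out)
{
	assert(p != NULL && out != NULL);
	if (p->has_hw_stamp)
		*out = p->hw_stamp;
	else
		clock_gettime(CLOCK_REALTIME, out);
}
```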

> - I don't see the use of the get* variants for anything above 100Mbps
> (as you say, they don't update fast enough), however there may be uses
> I am not thinking of (I think Guy mentioned he uses it).

Above 1Mbps for me :-).  It's still surely not useful to have nanosecond
and weeniesecond precisions, and would be more useful to have a bpf-specific
version that combines a timestamp (maybe in seconds resolution) with a
generation count.  phk wouldn't want a generation count on the get*time()
functions for the same reasons that he didn't want to fix them to be
coherent with the non-get* ones -- this would require locking on every
*time() read.  A bpf-specific version might be able to avoid additional
locking for the generation count or at least localize it better.
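
One possible reading of "timestamp plus generation count", as a
minimal sketch: the counter orders packets that share the same coarse
timestamp.  All names here (pkt_stamp, stamp_state, stamp_next(),
stamp_cmp()) are hypothetical, and the state is assumed to be private
to one bpf descriptor so that the sketch needs no locking of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stamp: a coarse time plus an ordering counter. */
struct pkt_stamp {
	int64_t		sec;	/* coarse timestamp, seconds resolution */
	uint32_t	gen;	/* position among stamps sharing 'sec' */
};

/* Hypothetical per-descriptor state (assumed not shared). */
struct stamp_state {
	int64_t		last_sec;
	uint32_t	next_gen;
};

/* Issue the next stamp; the counter restarts when the coarse time changes. */
struct pkt_stamp
stamp_next(struct stamp_state *st, int64_t coarse_sec)
{
	struct pkt_stamp ps;

	assert(st != NULL);
	if (coarse_sec != st->last_sec) {
		st->last_sec = coarse_sec;
		st->next_gen = 0;
	}
	ps.sec = st->last_sec;
	ps.gen = st->next_gen++;
	return (ps);
}

/* Order two stamps: by coarse time first, then by generation. */
int
stamp_cmp(const struct pkt_stamp *a, const struct pkt_stamp *b)
{
	if (a->sec != b->sec)
		return (a->sec < b->sec ? -1 : 1);
	if (a->gen != b->gen)
		return (a->gen < b->gen ? -1 : 1);
	return (0);
}
```

Two packets stamped in the same second then compare as distinct and
ordered even though the clock itself did not advance.  The sketch does
not try to handle the coarse clock stepping backwards.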

> - I am told by Intel that going forward the TSC will now not only be
> P-state invariant BUT ALSO synced across packages on Nehalem based
> platforms and higher.  I was going to start a new thread about this
> and sort of socialize this fact (they claim that all packages of ALL
> cores on the same motherboard will be clocked to the same oscillator
> and all drift equally provided you don't write TSC MSR directly -
> software resets have no effect on the TSC, once the BIOS performs a
> hard reset, the TSCs are all synchronized).

Is it already synced across cores within a package on pre-Nehalem?  I
was going to mention in my original reply that it seems to be synced
on FreeBSD cluster machines which are Xeon 0x6f7 whatever that is.

> *** That is what I am told by Intel.  ***
>
> I realize this is much different than what everybody is used to.  I
> am in the process of setting up a box to play with the TSC a bit (how
> does one test this, I was going to just read the tsc constantly across
> randomized cores and verify it's at least monotonic?).  Apparently

My tests don't do the randomization properly and SCHED_ULE does too good
a job of keeping threads on the same CPU.  However, with 64 threads
on 8 CPUs, threads migrate occasionally and this doesn't cause any
noticeable glitches in a TSC calibration program.
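
A minimal sketch of such a monotonicity test, not the calibration
program mentioned above: several threads read a counter and each read
is compared against the largest value any thread has already
published.  Everything here is an assumption made for illustration.
read_monotonic_ns() uses CLOCK_MONOTONIC as a portable stand-in, and a
raw TSC read would have to be plugged in as the counter source to test
the TSC itself.  Nothing pins or migrates the threads, so this relies
on the scheduler to spread them over CPUs.

```c
#define _DEFAULT_SOURCE	/* glibc: expose clock_gettime() under -std=c11 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define	MAX_THREADS	64

typedef uint64_t (*counter_fn)(void);

/* Stand-in counter source: CLOCK_MONOTONIC in nanoseconds. */
uint64_t
read_monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

/* Deliberately broken source (single-threaded use only) to check the checker. */
uint64_t
read_backwards(void)
{
	static uint64_t v = 1000;

	return (v--);
}

struct warp_test {
	counter_fn		src;		/* counter under test */
	long			iterations;	/* reads per thread */
	_Atomic uint64_t	last;		/* largest value published */
	atomic_long		warps;		/* reads found behind 'last' */
};

static void *
warp_worker(void *arg)
{
	struct warp_test *wt = arg;

	for (long i = 0; i < wt->iterations; i++) {
		/* Load 'last' before reading, so the read is the later event. */
		uint64_t seen = atomic_load(&wt->last);
		uint64_t now = wt->src();

		if (now < seen)
			atomic_fetch_add(&wt->warps, 1);
		/* Publish max(last, now); a failed CAS reloads 'seen'. */
		while (seen < now &&
		    !atomic_compare_exchange_weak(&wt->last, &seen, now))
			;
	}
	return (NULL);
}

/* Run the test on nthreads threads; return the number of backward steps seen. */
long
count_warps(counter_fn src, int nthreads, long iterations)
{
	struct warp_test wt;
	pthread_t tids[MAX_THREADS];
	int started = 0;

	assert(src != NULL);
	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;
	wt.src = src;
	wt.iterations = iterations;
	atomic_init(&wt.last, 0);
	atomic_init(&wt.warps, 0);
	for (int i = 0; i < nthreads; i++)
		if (pthread_create(&tids[started], NULL, warp_worker, &wt) == 0)
			started++;
	for (int i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return (atomic_load(&wt.warps));
}
```

A call such as count_warps(read_monotonic_ns, 8, 1000000) should
return 0 on a system whose clock is coherent across CPUs.  With an
unserialized TSC read as the source, a nonzero count could also come
from the CPU reordering the counter read ahead of the load of 'last',
not only from unsynchronized TSCs.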

> Intel submitted some patches for Linux that skips its warp/sync test
> on boot up due to these new changes (i.e. I think by default on
> Nehalem based processors where the TSC is marked reliable (via CPUID

Shouldn't it still do the test?  At least it has one.

> bits), gettimeofday() calls use the TSC because it's now reliable again
> - can somebody else confirm this?).  Provided one COULD use the TSC
> which is much much faster than ACPI-fast, I believe the above changes
> make even more sense!

Nah, with a working TSC timecounter almost everything can just use
microtime() :-).

There is still the problem that rdtsc() is not serializing and you
don't want it to be.  I have never noticed this causing time warps
though there must have been many for timestamping of function calls
in high resolution kernel profiling.  The extra code in microtime()
may prevent time warps for timestamping at an even lower level, and
packet timestamping is at a higher level and is associated with
i/o which may imply sufficient serialization.

Bruce