[RFC] BPF timestamping

Alexander Sack pisymbol at gmail.com
Thu Jun 10 14:16:01 UTC 2010


On Thu, Jun 10, 2010 at 5:45 AM, Bruce Evans <brde at optusnet.com.au> wrote:
> On Wed, 9 Jun 2010, Jung-uk Kim wrote:
>
>> bpf(4) can only timestamp packets with microtime(9).  I want to expand
>> it to be able to use different format and resolution.  The patch is
>> here:
>>
>> http://people.freebsd.org/~jkim/bpf_tstamp.diff
>>
>> With this patch, we can select different format and resolution of the
>> timestamps.  It is done via ioctl(2) with BIOCSTSTAMP command.
>> Similarly, you can get the current format and resolution with
>> BIOCGTSTAMP command.  Currently, the following functions are
>> available:
>>
>>        BPF_T_MICROTIME         microtime(9)
>>        BPF_T_NANOTIME          nanotime(9)
>>        BPF_T_BINTIME           bintime(9)
>>        BPF_T_MICROTIME_FAST    getmicrotime(9)
>>        BPF_T_NANOTIME_FAST     getnanotime(9)
>>        BPF_T_BINTIME_FAST      getbintime(9)
>>        BPF_T_NONE              ignore time stamps
>
> This has too many timestamp types, yet not one timestamp type which
> is any good except possibly BPF_T_NONE, and not one monotonic timestamp
> type.  Only external uses and compatibility require use of CLOCK_REALTIME.

None of these issues is really bpf(4)-related, though... you are blaming
the clock source (rightly so).

bpf(4) is just a consumer; it's not its job to validate clocks.  The
kernel offers a clock, so bpf(4) uses it.

> I recently tried looking at timeout resolution on FreeBSD cluster
> machines using ktrace, and found ktrace unusable for this.  At
> first I blamed the slowness of the default misconfigured timecounter
> ACPI-fast, but the main problem was that I forgot my home directory
> was on nfs, and nfs makes writing ktrace records take hundreds of
> times longer than on local file systems.  ACPI-fast seemed to be
> taking nearly 1000 uS, but it was nfs taking that long.
>
> Anyway, ACPI-fast takes nearly 1000 nS, which is many times too long
> to be good for timestamping individual syscalls or packets, and makes
> sub-microsecond resolution useless.  The above non-get *time()
> interfaces still use the primary timecounter, and this might be slow
> even if it is not misconfigured.  The above get*time() interfaces are
> fast only at the cost of being broken.  Among other bugs, their times
> only change at relatively large intervals which should become infinity
> with tickless kernels.  (BTW, icmp timestamps are still broken on
> systems with hz < 100.  Someone changed microtime() to getmicrotime(),
> but getmicrotime() cannot deliver the resolution of 1 mS supported by
> icmp timestamps unless these intervals are <= 1 mS.)

Again, I agree that bpf(4) really has no way to validate whether the
clock it's using has the precision to generate a valid timestamp.

I ran into this exact problem trying to use the ACPI-fast timecounter
in the bge(4) driver as an experiment to timestamp packets directly
(despite the many, many issues with that idea).  One of the main
problems was EXACTLY what you describe: ACPI-fast takes too long to
access, and the get* variants just don't update quickly enough.

HOWEVER, JK's changes STILL make sense in my opinion:

- Many new NIC chipsets will timestamp packets directly in hardware.
(Intel already has one that timestamps all packets, not just
PTP/IEEE 1588 ones.)  So using bpf_gettime() instead of blindly calling
microtime() no matter what IS the right idea (including the mbuf
tagging mechanism that tells bpf(4) a timestamp already exists, so use it).

- I don't see a use for the get* variants at anything above 100 Mbps
(as you say, they don't update fast enough), but there may be uses
I am not thinking of (I think Guy mentioned he uses them).

- I am told by Intel that going forward the TSC will not only be
P-state invariant BUT ALSO synced across packages on Nehalem-based
platforms and later.  I was going to start a new thread to socialize
this fact: they claim that all cores on all packages on the same
motherboard are clocked from the same oscillator and drift equally,
provided you don't write the TSC MSR directly.  Software resets have
no effect on the TSC; once the BIOS performs a hard reset, the TSCs
are all synchronized.

*** That is what I am told by Intel.  ***

I realize this is much different from what everybody is used to.  I
am in the process of setting up a box to play with the TSC a bit (how
does one test this?  I was going to just read the TSC constantly
across randomized cores and verify it is at least monotonic).
Apparently Intel submitted patches for Linux that skip its warp/sync
test on boot due to these new changes (i.e., I think that by default
on Nehalem-based processors where the TSC is marked reliable via CPUID
bits, gettimeofday() calls use the TSC because it is now reliable
again; can somebody else confirm this?).  Provided one COULD use the
TSC, which is much, much faster than ACPI-fast, I believe the above
changes make even more sense!

I still like the changes!  :)

-aps


More information about the freebsd-net mailing list