[RFC] BPF timestamping

Fri Jun 11 16:38:42 UTC 2010

On Friday 11 June 2010 09:08 am, Bruce Evans wrote:
> On Thu, 10 Jun 2010, Jung-uk Kim wrote:
> > On Thursday 10 June 2010 05:45 am, Bruce Evans wrote:
> >> On Wed, 9 Jun 2010, Jung-uk Kim wrote:
> >>> bpf(4) can only timestamp packets with microtime(9).  I want to
> >>> expand it to be able to use different formats and resolutions.
> >>> The ...
> >>
> >> This has too many timestamp types, yet not one timestamp type
> >> which is any good except possibly BPF_T_NONE, and not one
> >> monotonic timestamp type.  Only external uses and compatibility
> >> require use of CLOCK_REALTIME.
> >> ...
> >
> > Please note that I am not trying to solve timecounter issues
> > here. The current BPF timestamping is not very good for two
> > main reasons: 1) it is too slow with some timecounter hardware, as
> > you have noted, and 2) we have no API to change timestamp
> > resolution, accuracy, format, offset, or whatever *at all*.
> >
> > The most common trick for the first problem is using
> > getmicrotime(9) instead of microtime() if the users don't care
> > much about its accuracy.  For those people who want to collect as
> > many packets as possible without spending fortunes, it works
> > pretty well.  However, suppose you have multiple interfaces.  You
> > want good timestamps from a slower controller (LAN side) and less
> > accurate timestamps from a super fast controller (WAN side), but
> > you can't.  My patch solves this problem by assigning a
> > timestamping function per descriptor.  So, you can use the same
> > resolution but different accuracies, for example.
>
> I now think you should provide exactly the same timestamping
> features as provided to userland by clock_gettime(2),
> clock_getres(2) and clock_getaccprecres(2missing), using
> essentially the same interface and code.  The userland interface
> involves clock ids of type clockid_t with names like CLOCK_REALTIME
> instead of bpf-specific names and types. Unfortunately it only
> supports the timespec format.

I thought about using them, but struct timespec isn't good enough.  It
has exactly the same problem as struct timeval does, i.e.,
sizeof(time_t) and sizeof(long) vary depending on the arch.  Note
that struct bpf_xhdr uses int64_t and uint64_t to work around the
problem.  At least in theory, it should be good enough until we have
to support a 16-byte aligned arch. :-)
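
To make the size problem concrete, here is a minimal sketch.  The
struct name and its fields are made up for illustration and are not
the layout from the patch; the only point is that a pair of fixed-width
64-bit fields has the same size on every ABI, while struct timeval
does not.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Hypothetical fixed-width timestamp (illustrative names only, not
 * the struct from the patch).  Both members are 8 bytes everywhere.
 */
struct ts64_sketch {
	int64_t		sec;	/* seconds */
	uint64_t	frac;	/* sub-second part */
};

/* Size of the native struct timeval; this varies with the ABI. */
static size_t
native_timeval_size(void)
{
	return (sizeof(struct timeval));
}

/* Size of the fixed-width variant; 16 bytes regardless of the ABI. */
static size_t
fixed_timestamp_size(void)
{
	return (sizeof(struct ts64_sketch));
}
```

A capture file written with the fixed-width layout can be read back on
a machine with a different sizeof(long) without any compat shim, which
is not true of a header that embeds struct timeval.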

> > The second problem is a little bit harder for us to solve without
> > breaking libpcap and its consumers, as it expects struct timeval
> > and nothing else.  That's why I had to introduce a new header
> > format with compat shims.  In fact, struct bpf_hdr (and struct
> > pcap_sf_pkthdr) is really obsolete and people have been talking
> > about pcap NG for many years, which can store timestamps in
> > variable resolutions and offsets.
>
> Does it prefer or support bintimes?

It supports bintime.  It does not prefer anything although the default 
resolution is 1 usec for backward compatibility with old pcap format.
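
For reference, scaling a bintime-style timestamp down to the default
1 usec resolution (or to 1 nsec) is cheap.  The sketch below assumes
the fractional part is a 64-bit binary fraction of a second, as in
struct bintime; the function names are made up, and the arithmetic is
modelled on the conversion helpers in sys/time.h rather than copied
from the patch.

```c
#include <stdint.h>

/*
 * Scale a 64-bit binary fraction of a second (frac / 2^64 seconds) to
 * microseconds.  Only the upper 32 bits of the fraction are used, so
 * the multiplication cannot overflow 64 bits.
 */
static uint32_t
frac_to_usec_sketch(uint64_t frac)
{
	return ((uint32_t)(((uint64_t)1000000 * (uint32_t)(frac >> 32)) >> 32));
}

/* Same idea, scaled to nanoseconds. */
static uint32_t
frac_to_nsec_sketch(uint64_t frac)
{
	return ((uint32_t)(((uint64_t)1000000000 * (uint32_t)(frac >> 32)) >> 32));
}
```

For example, a fraction of 2^63 (half a second) scales to 500000 usec
or 500000000 nsec.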

> > However, we can only use the default resolution even if libpcap
> > gets the new format because we are stuck with struct bpf_hdr[1].
> >
> > BTW, I updated my patch, which includes monotonic clocks now.
> >
> > 	BPF_T_MICROTIME_MONOTONIC	microuptime(9)
> > 	BPF_T_NANOTIME_MONOTONIC	nanouptime(9)
> > 	BPF_T_BINTIME_MONOTONIC		binuptime(9)
> > 	BPF_T_MICROTIME_MONOTONIC_FAST	getmicrouptime(9)
> > 	BPF_T_NANOTIME_MONOTONIC_FAST	getnanouptime(9)
> > 	BPF_T_BINTIME_MONOTONIC_FAST	getbinuptime(9)
> >
> > http://people.freebsd.org/~jkim/bpf_tstamp2.diff
> >
> > Thanks for the hint, Bruce, although you may say there are more
> > bogus clock types now. ;-)
>
> Yes, there are far too many, but many are still missing:
> - aliases BPF_T_*TIME_PRECISE for BPF_T_*TIME, matching the
>    corresponding aliases for clockid_t's.  This gives 18 clock ids
>    per timecounter instead of only 12.  clock_gettime() only
>    supports 6 of these (it doesn't support the micro or bin time
>    formats).
> - aliases BPF_T_UPTIME* for BPF_T_*TIME_MONOTONIC.  This gives 27
>    clock ids per timecounter instead of only 18.  clock_gettime()
>    only supports 9 of these.
> - BPF_T_SECOND corresponding to CLOCK_SECOND.  clock_gettime()
>    supports this.
> - BPF_T_THREAD_CPUTIME corresponding to CLOCK_THREAD_CPUTIME_ID,
>    but without the bogus _ID suffix.  The latter gives the runtime
>    of the current thread in nanoseconds.  This might be almost
>    useful for bpf if all the packets are stamped by the same kernel
>    or user thread.  Then it would function as a packet id with
>    extra info about the time spent processing packets.
> - BPF_T_VIRTUAL and BPF_T_PROF corresponding to CLOCK_VIRTUAL and
>    CLOCK_PROF.  The latter give user and user+sys times for
>    processes.  They would be about as useful as
>    BPF_T_THREAD_CPUTIME for bpf.
> - the total is now 31 for bpf (19 missing) and 13 for
>    clock_gettime().
> - multiply this by the number of timecounters.  Non-primary
>    timecounters should be available iff something has a use for
>    them.
> - raw cputicker timestamps.  CLOCK_THREAD_CPUTIME_ID's timer uses
>    these.  These are not available in userland.  They are easily
>    available in the kernel, by calling cpu_tick().  Scaling them is
>    nontrivial.
> - raw timecounter reads.  These are already available in userland
>    via sysctlbyname("kern.timecounter.tc.<name>.counter", ...).
>    Strangely, they are hard to call from the kernel.

That's really far too many for my taste. :-( It'll significantly
increase the number of special cases in the switch statement, but I
cannot avoid it (please see below).  I added _MONOTONIC because it was
relatively cheap to implement and important.  I may add some aliases
for _REALTIME, _PRECISE, and _UPTIME if you insist, though.

> By using normal clock ids and calling kern_clock_gettime(), you can
> avoid lots of duplication (including documentation of the bpf clock
> ids) and automatically support new normal clock ids.  However, I
> can't see how to implement the following features as efficiently:
> - direct scaling to the final precision (kern_clock_gettime() only
>    returns timespecs -- see above)
> - delayed scaling to the final precision (bpf seems to make
> timestamps as binuptimes and scale them later)
> - avoiding going through layers and switches.  bpf goes through
> several layers and switches now, but perhaps it can go directly to
> the *time() function in kern_tc.c via a single function pointer,
> where kern_clock_gettime() and delayed scaling have to use a switch
> or an indexed function pointer since their clock id is highly
> variable.

As I said, we cannot use kern_clock_gettime() and clockid_t.  The code
duplication is also a necessary evil because multiple descriptors may
be attached to a single interface, unless you are effectively asking
me to revert the following commit:

http://docs.freebsd.org/cgi/mid.cgi?200607241542.k6OFg5ck098374
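
To illustrate the constraint, here is a rough userland sketch of
several descriptors on one interface, each with its own timestamping
function.  All names are hypothetical, and clock_gettime(2) with
CLOCK_REALTIME and CLOCK_MONOTONIC merely stands in for the kernel's
*time(9) functions; this is not the code from the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Per-descriptor timestamping function (hypothetical type name). */
typedef int (*stamp_fn_sketch)(struct timespec *);

static int
stamp_realtime_sketch(struct timespec *ts)
{
	return (clock_gettime(CLOCK_REALTIME, ts));
}

static int
stamp_monotonic_sketch(struct timespec *ts)
{
	return (clock_gettime(CLOCK_MONOTONIC, ts));
}

/* A descriptor carries its own clock choice. */
struct desc_sketch {
	stamp_fn_sketch	stamp;
};

/*
 * One packet arrives on an interface with nd descriptors attached.
 * Each descriptor stamps it with its own function, so a single
 * timestamp taken once per interface would not be enough.
 * Returns 0 on success, -1 if any clock read fails.
 */
static int
deliver_sketch(const struct desc_sketch *d, size_t nd, struct timespec *out)
{
	size_t i;

	for (i = 0; i < nd; i++) {
		if (d[i].stamp(&out[i]) != 0)
			return (-1);
	}
	return (0);
}
```

With one descriptor using stamp_realtime_sketch and another using
stamp_monotonic_sketch on the same interface, the same packet ends up
with two unrelated timestamps, which is why the choice has to live in
the descriptor.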

Cheers,

Jung-uk Kim