Fast gettimeofday(2) and clock_gettime(2)
Konstantin Belousov
kostikbel at gmail.com
Thu Jun 7 09:12:53 UTC 2012
On Thu, Jun 07, 2012 at 01:00:34PM +1000, Bruce Evans wrote:
> On Wed, 6 Jun 2012, Konstantin Belousov wrote:
>
> >On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote:
> >>In general this looks good but I see a few nits / races:
> >>
> >>1) You don't follow the model of clearing tk_current to 0 while you
> >> are updating the structure that the in-kernel timecounter code
> >> uses. This also means you have to avoid using a tk_current of 0
> >> and that userland has to keep spinning as long as tk_current is 0.
> >> Without this I believe userland can read a partially updated
> >> structure.
> >I changed the code to be much more similar to kern_tc.c. I (re)added
> >the generation field, which is set to 0 while the kernel is updating
> >the timehands.
>
> Seems necessary.
>
> >I think this can only happen if tc_windup()s occur quite close in
> >succession, or a usermode thread is suspended for long enough. BTW,
> >even the generation could loop back to its previous value if the
> >thread is stopped.
>
> tc_windup()s close in succession are bugs, since they cycle the timehands
> faster than they were designed to. We already have too many of these
> bugs (e.g., tc_setclock() calls tc_windup(); I didn't notice this
> particular problem with it before). Now I will point out that version
> 2 of your patch adds more of these calls, apparently to get changes to
> happen sooner. But in sysctl_kern_timecounter_hardware(), such a call
> was intentionally left out since it is not needed. Note that tc_tick
> prevents calls to tc_windup() more often than about once per msec if
> hz > 1000.
No, I did not add more tc_windup() calls. I added a recalculation
of the shared-page content on a timecounter change, which is not
the same as a tc_windup() call. This is exactly to handle disabling
usermode rdtsc use when the kernel timecounter hardware changes.
>
> The generation count makes tc_windup()s close in succession harmless,
> except they increase race possibilities by reducing the time-domain
> locking. The generation count is 32 bits, so it can only loop back to
> a previous value after 2**32 tc_windup() calls. This "can't happen".
> What can happen is for the timehands to cycle after something is
> preempted for 10-100 msec. Then the generation count allows detection
> of the cycling. It only has an effect in this case. Otherwise, a
> thread can be preempted for 10-100 seconds and start up using a
> timehands pointer that it read into a register that long ago, and
> safely use the old pointer unless its generation has changed. Even
> switching the timecounter works in that case. This depends on the
> hardware part of the timecounter not going away and the software
> keeping most state per-timehands.
I reinstantiated the generation counter for rev. 3.
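For reference, the generation protocol discussed above (the kernel zeroes
the generation while rewriting the structure, readers retry until they see
the same nonzero value before and after their reads) can be sketched in
portable C11. This is a userland simulation only; the names vdso_th,
th_gen, and th_scale are placeholders, not the actual shared-page layout,
and the explicit fences stand in for the ordering x86 gives for free:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder layout, not the real shared page. */
struct vdso_th {
	_Atomic uint32_t th_gen;	/* 0 while the kernel updates */
	uint64_t	 th_scale;	/* stand-in for the payload */
};

/* Writer, mimicking tc_windup(): invalidate, rewrite, revalidate. */
static void
vdso_th_update(struct vdso_th *th, uint64_t scale)
{
	uint32_t ogen;

	ogen = atomic_load_explicit(&th->th_gen, memory_order_relaxed);
	atomic_store_explicit(&th->th_gen, 0, memory_order_relaxed);
	/* Make gen == 0 visible before any payload write. */
	atomic_thread_fence(memory_order_seq_cst);
	th->th_scale = scale;
	if (++ogen == 0)		/* never publish generation 0 */
		ogen = 1;
	/* Release: payload writes become visible before the new gen. */
	atomic_store_explicit(&th->th_gen, ogen, memory_order_release);
}

/* Reader: spin until a consistent snapshot is observed. */
static uint64_t
vdso_th_read(struct vdso_th *th)
{
	uint32_t gen;
	uint64_t scale;

	do {
		gen = atomic_load_explicit(&th->th_gen,
		    memory_order_acquire);
		scale = th->th_scale;
		/* Order the payload read before the re-check of gen. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 || gen != atomic_load_explicit(&th->th_gen,
	    memory_order_relaxed));
	return (scale);
}
```

A reader that is preempted mid-loop simply retries when it resumes and
finds the generation changed, which is the detection Bruce describes.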
>
> >There was apparently another issue with version 2. The bcopy() is not
> >atomic, so potentially libc could read a wrong tk_current. I redid
> >the interface to write to the shared page to allow use of real atomics.
>
> Timecounter code is supposed to be lock-free except for some time-domain
> locking. I only see 1 problem with this: where tc_windup() writes the
> generation count and other things without asking for these writes to
> be ordered. In most cases, the time-domain locking prevents problems.
In fact, on x86 the store ordering is strong enough that no barriers
are needed, which is why the problem has gone unnoticed so far.
> E.g., when the timehands pointer is read, it remains valid for 9+
> generations of cycling timehands (9+ to 90+ msec). It is only when a
> thread sleeps for this long while holding and planning to use the old
> pointer that it needs the generation count to actually work. Another
> case is if writes are out of order (can't happen on x86), so:
>
> /*
> * The write to th_generation fails to protect users of th
> * via 10-100 msec old pointers if it becomes visible unordered
> * after any of the writes done by the bcopy(). Very rare to
> * lose here, but th_generation's point is to not lose here.
> */
> th->th_generation = 0;
> bcopy(tho, th, offsetof(struct timehands, th_generation));
>
> // finish writing th except for th_generation
> th->th_generation = ogen;
> /*
> * The previous write to th_generation fails to protect users
> * of th via old pointers if it becomes visible unordered before
> * all of the other writes (users see the generation change
> * via the old pointer, and now since it has become nonzero
> * they use the incompletely written data). Again, only a problem
> * after 10-100 msec.
> */
>
> timehands = th;
> /*
> * Now users can grab th via timehands. If timehands became visible
> * unordered before all of the other writes except th_generation,
> * then users use the incompletely written data. Now the time
> * domain locking doesn't help.
> */
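The ordering Bruce's comments call for can be rendered portably with C11
atomics. The sketch below keeps kern_tc.c's names (timehands,
th_generation, th_next) but reduces the ring to two hands and the payload
to a single counter; the fences are shown explicitly even though x86
store ordering makes them no-ops there, and this is not the real
implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Drastically reduced stand-in for struct timehands. */
struct timehands {
	uint64_t	 th_offset_count;	/* stand-in payload */
	_Atomic uint32_t th_generation;
	struct timehands *th_next;
};

static struct timehands ths[2] = {
	{ 0, 1, &ths[1] },
	{ 0, 1, &ths[0] },
};
static _Atomic(struct timehands *) timehands = &ths[0];

static void
tc_windup_sketch(uint64_t count)
{
	struct timehands *th, *tho;
	uint32_t ogen;

	tho = atomic_load_explicit(&timehands, memory_order_relaxed);
	th = tho->th_next;
	ogen = atomic_load_explicit(&th->th_generation,
	    memory_order_relaxed);
	/* Invalidate th, and make that visible before touching it. */
	atomic_store_explicit(&th->th_generation, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	th->th_offset_count = count;		/* rewrite the payload */
	if (++ogen == 0)
		ogen = 1;
	/* Release: payload writes become visible before the new gen. */
	atomic_store_explicit(&th->th_generation, ogen,
	    memory_order_release);
	/* Release: th is fully written before the pointer is published. */
	atomic_store_explicit(&timehands, th, memory_order_release);
}
```

The two release stores close exactly the two windows the comments above
describe: a reader holding an old pointer cannot see the new generation
before the payload, and a reader loading timehands cannot see the new
pointer before the hands it points at are complete.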
>
> >>2) You read tk->tk_boottime without the tk_current protection in your
> >> non-uptime routines. This is racy as the kernel alters the
> >> boottime when it skews time for large adjustments from ntp, etc.
> >> To be really safe you need to read the boottime inside the loop
> >> into a local variable and perhaps use a boolean parameter to decide
> >> if you should add it to the computed uptime.
> >I moved the boottime from timekeep to timehands, thank you for the
> >clarification.
>
> This isn't bug-for-bug compatible with the kernel. The kernel has a
> global boottimebin which affects uses of old timehands the instant
> it is changed (even before tc_windup() is called).
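John's point about the boot time can be shown in a reduced sketch: the
boot time lives in the generation-protected structure and is sampled
inside the retry loop, so a concurrent kernel adjustment cannot produce a
torn sum. The names th_snap, th_uptime, and th_boottime are illustrative,
not the real layout:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative layout: boottime inside the protected structure. */
struct th_snap {
	_Atomic uint32_t th_gen;
	uint64_t	 th_uptime;	/* ns since boot */
	uint64_t	 th_boottime;	/* boot time, ns since the epoch */
};

/*
 * Return the uptime (clock_realtime == 0, i.e. CLOCK_UPTIME-style) or
 * uptime + boottime (CLOCK_REALTIME-style).  The boottime is read under
 * the generation loop, per John's suggestion.
 */
static uint64_t
th_gettime(struct th_snap *th, int clock_realtime)
{
	uint32_t gen;
	uint64_t t;

	do {
		gen = atomic_load_explicit(&th->th_gen,
		    memory_order_acquire);
		t = th->th_uptime;
		if (clock_realtime)
			t += th->th_boottime;	/* sampled inside the loop */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 || gen != atomic_load_explicit(&th->th_gen,
	    memory_order_relaxed));
	return (t);
}
```

If the kernel skews the boot time, it bumps the generation, and any
reader that sampled a half-updated pair simply retries.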
>
> >Updated patch is at
> >http://people.freebsd.org/~kib/misc/moronix.3.patch
>
> I had better not be awed by looking at it :-).
I will test this with your test code when I return home.