Fast gettimeofday(2) and clock_gettime(2)
Konstantin Belousov
kostikbel at gmail.com
Thu Jun 7 09:12:53 UTC 2012
On Thu, Jun 07, 2012 at 01:00:34PM +1000, Bruce Evans wrote:
> On Wed, 6 Jun 2012, Konstantin Belousov wrote:
>
> >On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote:
> >>In general this looks good but I see a few nits / races:
> >>
> >>1) You don't follow the model of clearing tk_current to 0 while you
> >> are updating the structure that the in-kernel timecounter code
> >> uses. This also means you have to avoid using a tk_current of 0
> >> and that userland has to keep spinning as long as tk_current is 0.
> >> Without this I believe userland can read a partially updated
> >> structure.
> >I changed the code to be much more similar to kern_tc.c. I (re)added
> >the generation field, which is set to 0 while the kernel is updating
> >the timehands.
>
> Seems necessary.
>
> >I think this can only happen if tc_windup()s occur quite close in
> >succession, or a usermode thread is suspended for long enough. BTW,
> >even the generation could loop back to its previous value if the
> >thread is stopped.
>
> tc_windup()s close in succession are bugs, since they cycle the timehands
> faster than they were designed to. We already have too many of these
> bugs (e.g., tc_setclock() calls tc_windup(); I didn't notice this
> particular problem with it before). Now I will point out that version
> 2 of your patch adds more of these calls, apparently to get changes to
> happen sooner. But in sysctl_kern_timecounter_hardware(), such a call
> was intentionally left out since it is not needed. Note that tc_tick
> prevents calls to tc_windup() more often than about once per msec if
> hz > 1000.
No, I did not add more tc_windup() calls. I added a recalculation
of the shared-page content on a timecounter change, which is not
the same as a tc_windup() call. This is exactly to handle disabling
usermode rdtsc use when the kernel timecounter hardware changes.
>
> The generation count makes tc_windup()s close in succession harmless,
> except they increase race possibilities by reducing the time-domain
> locking. The generation count is 32 bits, so it can only loop back to
> a previous value after 2**32 tc_windup() calls. This "can't happen".
> What can happen is for the timehands to cycle after something is
> preempted for 10-100 msec. Then the generation count allows detection
> of the cycling. It only has an effect in this case. Otherwise, a
> thread can be preempted for 10-100 seconds and start up using a
> timehands pointer that it read into a register that long ago, and
> safely use the old pointer unless its generation has changed. Even
> switching the timecounter works in that case. This depends on the
> hardware part of the timecounter not going away and the software
> keeping most state per-timehands.
I reinstantiated the generation counter for rev. 3.
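For reference, the generation protocol discussed above (the kernel zeroes
the generation while rewriting the structure, readers retry until they see
the same nonzero value before and after their reads) can be sketched in
portable C11. This is a userland simulation only; the names vdso_th,
th_gen, and th_scale are placeholders, not the actual shared-page layout,
and the explicit fences stand in for the ordering x86 gives for free:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Placeholder layout, not the real shared page. */
struct vdso_th {
	_Atomic uint32_t th_gen;	/* 0 while the kernel updates */
	uint64_t	 th_scale;	/* stand-in for the payload */
};

/* Writer, mimicking tc_windup(): invalidate, rewrite, revalidate. */
static void
vdso_th_update(struct vdso_th *th, uint64_t scale)
{
	uint32_t ogen;

	ogen = atomic_load_explicit(&th->th_gen, memory_order_relaxed);
	atomic_store_explicit(&th->th_gen, 0, memory_order_relaxed);
	/* Make gen == 0 visible before any payload write. */
	atomic_thread_fence(memory_order_seq_cst);
	th->th_scale = scale;
	if (++ogen == 0)		/* never publish generation 0 */
		ogen = 1;
	/* Release: payload writes become visible before the new gen. */
	atomic_store_explicit(&th->th_gen, ogen, memory_order_release);
}

/* Reader: spin until a consistent snapshot is observed. */
static uint64_t
vdso_th_read(struct vdso_th *th)
{
	uint32_t gen;
	uint64_t scale;

	do {
		gen = atomic_load_explicit(&th->th_gen,
		    memory_order_acquire);
		scale = th->th_scale;
		/* Order the payload read before the re-check of gen. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 || gen != atomic_load_explicit(&th->th_gen,
	    memory_order_relaxed));
	return (scale);
}
```

A reader that is preempted mid-loop simply retries when it resumes and
finds the generation changed, which is the detection Bruce describes.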
>
> >There was apparently another issue with version 2. The bcopy() is not
> >atomic, so potentially libc could read a wrong tk_current. I redid
> >the interface to write to the shared page to allow use of real atomics.
>
> Timecounter code is supposed to be lock-free except for some time-domain
> locking. I only see 1 problem with this: where tc_windup() writes the
> generation count and other things without asking for these writes to
> be ordered. In most cases, the time-domain locking prevents problems.
In fact, on x86 the store ordering is strong enough that no barriers
are needed, which is why the problem has gone unnoticed so far.
> E.g., when the timehands pointer is read, it remains valid for 9+
> generations of cycling timehands (9+ to 90+ msec). It is only when a
> thread sleeps for this long while holding and planning to use the old
> pointer that it needs the generation count to actually work. Another
> case is if writes are out of order (can't happen on x86), so:
>
> /*
> * The write to th_generation fails to protect users of th
> * via 10-100 msec old pointers if it becomes visible unordered
> * after any of the writes done by the bcopy(). Very rare to
> * lose here, but th_generation's point is to not lose here.
> */
> th->th_generation = 0;
> bcopy(tho, th, offsetof(struct timehands, th_generation));
>
> // finish writing th except for th_generation
> th->th_generation = ogen;
> /*
> * The previous write to th_generation fails to protect users
> * of th via old pointers if it becomes visible unordered before
> * all of the other writes (users see the generation change
> * via the old pointer, and now since it has become nonzero
> * they use the incompletely written data). Again, only a problem
> * after 10-100 msec.
> */
>
> timehands = th;
> /*
> * Now users can grab th via timehands. If timehands became visible
> * unordered before all of the other writes except th_generation,
> * then users use the incompletely written data. Now the time
> * domain locking doesn't help.
> */
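The ordering Bruce's comments call for can be rendered portably with C11
atomics. The sketch below keeps kern_tc.c's names (timehands,
th_generation, th_next) but reduces the ring to two hands and the payload
to a single counter; the fences are shown explicitly even though x86
store ordering makes them no-ops there, and this is not the real
implementation:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Drastically reduced stand-in for struct timehands. */
struct timehands {
	uint64_t	 th_offset_count;	/* stand-in payload */
	_Atomic uint32_t th_generation;
	struct timehands *th_next;
};

static struct timehands ths[2] = {
	{ 0, 1, &ths[1] },
	{ 0, 1, &ths[0] },
};
static _Atomic(struct timehands *) timehands = &ths[0];

static void
tc_windup_sketch(uint64_t count)
{
	struct timehands *th, *tho;
	uint32_t ogen;

	tho = atomic_load_explicit(&timehands, memory_order_relaxed);
	th = tho->th_next;
	ogen = atomic_load_explicit(&th->th_generation,
	    memory_order_relaxed);
	/* Invalidate th, and make that visible before touching it. */
	atomic_store_explicit(&th->th_generation, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	th->th_offset_count = count;		/* rewrite the payload */
	if (++ogen == 0)
		ogen = 1;
	/* Release: payload writes become visible before the new gen. */
	atomic_store_explicit(&th->th_generation, ogen,
	    memory_order_release);
	/* Release: th is fully written before the pointer is published. */
	atomic_store_explicit(&timehands, th, memory_order_release);
}
```

The two release stores close exactly the two windows the comments above
describe: a reader holding an old pointer cannot see the new generation
before the payload, and a reader loading timehands cannot see the new
pointer before the hands it points at are complete.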
>
> >>2) You read tk->tk_boottime without the tk_current protection in your
> >> non-uptime routines. This is racy as the kernel alters the
> >> boottime when it skews time for large adjustments from ntp, etc.
> >> To be really safe you need to read the boottime inside the loop
> >> into a local variable and perhaps use a boolean parameter to decide
> >> if you should add it to the computed uptime.
> >I moved the boottime from timekeep to timehands, thank you for the
> >clarification.
>
> This isn't bug-for-bug compatible with the kernel. The kernel has a
> global boottimebin which affects uses of old timehands the instant
> it is changed (even before tc_windup() is called).
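John's point about the boot time can be shown in a reduced sketch: the
boot time lives in the generation-protected structure and is sampled
inside the retry loop, so a concurrent kernel adjustment cannot produce a
torn sum. The names th_snap, th_uptime, and th_boottime are illustrative,
not the real layout:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative layout: boottime inside the protected structure. */
struct th_snap {
	_Atomic uint32_t th_gen;
	uint64_t	 th_uptime;	/* ns since boot */
	uint64_t	 th_boottime;	/* boot time, ns since the epoch */
};

/*
 * Return the uptime (clock_realtime == 0, i.e. CLOCK_UPTIME-style) or
 * uptime + boottime (CLOCK_REALTIME-style).  The boottime is read under
 * the generation loop, per John's suggestion.
 */
static uint64_t
th_gettime(struct th_snap *th, int clock_realtime)
{
	uint32_t gen;
	uint64_t t;

	do {
		gen = atomic_load_explicit(&th->th_gen,
		    memory_order_acquire);
		t = th->th_uptime;
		if (clock_realtime)
			t += th->th_boottime;	/* sampled inside the loop */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 || gen != atomic_load_explicit(&th->th_gen,
	    memory_order_relaxed));
	return (t);
}
```

If the kernel skews the boot time, it bumps the generation, and any
reader that sampled a half-updated pair simply retries.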
>
> >Updated patch is at
> >http://people.freebsd.org/~kib/misc/moronix.3.patch
>
> I had better not be awed by looking at it :-).
I will test this with your test code when I return home.