svn commit: r284178 - head/sys/kern

Bruce Evans brde at optusnet.com.au
Tue Jun 9 15:53:10 UTC 2015


On Tue, 9 Jun 2015, Konstantin Belousov wrote:

> Log:
>  When updating/accessing the timehands, barriers are needed to ensure
>  that:
>  - th_generation update is visible after the parameters update is
>    visible;
>  - the read of parameters is not reordered before initial read of
>    th_generation.

This adds pessimizations that were intentionally left out.

>  On UP kernels, compiler barriers are enough.  For SMP machines, CPU
>  barriers must be used too, as was confirmed by submitter by testing on
>  the Freescale T4240 platform with 24 PowerPC processors.

Some SMP machines need something.  Hopefully not most, and never as
heavyweight as this.

> Modified: head/sys/kern/kern_tc.c
> ==============================================================================
> --- head/sys/kern/kern_tc.c	Tue Jun  9 11:41:37 2015	(r284177)
> +++ head/sys/kern/kern_tc.c	Tue Jun  9 11:49:56 2015	(r284178)
> @@ -34,6 +34,7 @@ __FBSDID("$FreeBSD$");
> #include <sys/timetc.h>
> #include <sys/timex.h>
> #include <sys/vdso.h>
> +#include <machine/atomic.h>

Style bug.  <machine/atomic.h> is standard pollution in <sys/param.h>,
and this is relied on in some of the headers already included, e.g.,
in <sys/mutex.h>.

Timecounters have much larger races than this.  E.g., kern_ntptime.c is
supposed to be locked by splhigh(), but is actually locked by Giant in
its less critical parts and by nothing except the null splhigh() in its
most critical parts.  It is called from fast interrupt handlers with
whatever locking there is.  Ntptime knows nothing of this locking, so
it doesn't work.  The most usual call is ntp_update_second from the
hardclock fast interrupt handler.  There are also calls from arbitrary
drivers with pps support.  These call into kern_tc.c with whatever locking
the driver has.  These have broken locking using the generation count to
protect the timecounter parts (see below), and null locking to protect
the ntptime parts (mainly hardpps()).

> @@ -189,6 +190,33 @@ tc_delta(struct timehands *th)
> 	    tc->tc_counter_mask);
> }
>
> +static u_int
> +tc_getgen(struct timehands *th)
> +{
> +
> +#ifdef SMP
> +	return (atomic_load_acq_int(&th->th_generation));
> +#else
> +	u_int gen;
> +
> +	gen = th->th_generation;
> +	__compiler_membar();
> +	return (gen);
> +#endif
> +}

This is an enormous pessimization unless it is automatically inlined.
Since it is called more than once, only clang and gcc -O3 automatically
inline it.  A simple load operation from a pointer becomes a function
call and sometimes more.

> +
> +static void
> +tc_setgen(struct timehands *th, u_int newgen)
> +{
> +
> +#ifdef SMP
> +	atomic_store_rel_int(&th->th_generation, newgen);
> +#else
> +	__compiler_membar();
> +	th->th_generation = newgen;
> +#endif
> +}

Same pessimization for simple stores.

These pessimizations are especially good since the functions are called in
loops (though usually just twice).  Calling them in loops is an excuse for
clang to inline them.  But clang's inlining makes little difference except
to break debugging in most places in the kernel.  I avoid it and other
misoptimizations using gcc -O1 -fno-inline-functions-called-once.
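
For what it's worth, forcing the accessors inline would at least keep the
UP case a single load or store plus a compiler barrier, even at -O1 with
inlining of once-called functions disabled.  A minimal sketch (untested;
I'm assuming the __always_inline spelling from <sys/cdefs.h>, and not
claiming this is acceptable style here):

	/*
	 * Sketch: force inlining so the accessor compiles to a plain
	 * (possibly acquiring) load instead of a function call.
	 */
	static __inline __always_inline u_int
	tc_getgen(struct timehands *th)
	{

	#ifdef SMP
		return (atomic_load_acq_int(&th->th_generation));
	#else
		u_int gen;

		gen = th->th_generation;
		__compiler_membar();
		return (gen);
	#endif
	}

tc_setgen() would get the same treatment.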

> @@ -1651,13 +1679,13 @@ pps_capture(struct pps_state *pps)
>
> 	KASSERT(pps != NULL, ("NULL pps pointer in pps_capture"));
> 	th = timehands;
> -	pps->capgen = th->th_generation;
> +	pps->capgen = tc_getgen(th);

This is part of the broken locking for pps.  The generation count is fragile
and only works if you check it before committing to a change and retry if
it changed.  But the pps API doesn't support retrying.

> 	pps->capth = th;
> #ifdef FFCLOCK
> 	pps->capffth = fftimehands;
> #endif
> 	pps->capcount = th->th_counter->tc_get_timecount(th->th_counter);
> -	if (pps->capgen != th->th_generation)
> +	if (pps->capgen != tc_getgen(th))
> 		pps->capgen = 0;

This check is in the middle of pps activity.  Looping to retry would
be easy enough, but isn't done.  Anyway, the race window is much larger
later.
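
Retrying would look something like this (hypothetical and untested; it
just re-runs the capture until the generation is stable, in the same
style as the timecounter read functions):

	/* Hypothetical pps_capture() that retries instead of zeroing capgen. */
	void
	pps_capture(struct pps_state *pps)
	{
		struct timehands *th;

		KASSERT(pps != NULL, ("NULL pps pointer in pps_capture"));
		do {
			th = timehands;
			pps->capgen = tc_getgen(th);
			pps->capth = th;
	#ifdef FFCLOCK
			pps->capffth = fftimehands;
	#endif
			pps->capcount =
			    th->th_counter->tc_get_timecount(th->th_counter);
		} while (pps->capgen == 0 || pps->capgen != tc_getgen(th));
	}

pps_event() would still have to bail out, since by the time it runs the
captured timehands may have been recycled anyway.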

> }
>
> @@ -1677,7 +1705,7 @@ pps_event(struct pps_state *pps, int eve
>
> 	KASSERT(pps != NULL, ("NULL pps pointer in pps_event"));
> 	/* If the timecounter was wound up underneath us, bail out. */
> -	if (pps->capgen == 0 || pps->capgen != pps->capth->th_generation)
> +	if (pps->capgen == 0 || pps->capgen != tc_getgen(pps->capth))
> 		return;

Normal use of pps is pps_capture() followed by pps_event().  The above
is broken locking for the latter...

>
> 	/* Things would be easier with arrays. */
> @@ -1727,7 +1755,7 @@ pps_event(struct pps_state *pps, int eve
> 	bintime2timespec(&bt, &ts);
>
> 	/* If the timecounter was wound up underneath us, bail out. */
> -	if (pps->capgen != pps->capth->th_generation)
> +	if (pps->capgen != tc_getgen(pps->capth))
> 		return;

We also check in the middle, and give up instead of retrying.  Not too
bad (I think ntp can handle rare missed pps's).

>
> 	*pcount = pps->capcount;

But then we continue without further locking.

Perhaps we did enough to stabilize the timecounter while accessing it,
but we don't do anything extra to stabilize *pps.  pps stuff is basically
locked by the generation count.

This locking (except for the ntptime parts) is more complicated and less
broken than might first appear.  Note the comment:

 	/* If the timecounter was wound up underneath us, bail out. */

I think this is correct and it takes a full timecounter windup for the
generation count to change.  That is, cycling through all 10 timehands.
With HZ = 100, that is 100 msec, and with HZ = 1000 it is 10 msec, which
is still a long time.  It takes weird scheduling or stopping in a debugger
(which is a type of weird scheduling) for the timehands to change more
than once under the thread that has "captured" a generation using the
generation count.  For pps_capture(), the thread doing the capturing
should be at least a slow interrupt handler, so it shouldn't be delayed
for long.  Hmm, actually it is easy to think of a not very weird
configuration where it is delayed for that long:
- a slow UP system
- misconfigured with HZ = 1000 so the full windup cycle takes only 10 msec
- pps capture called from a slow interrupt handler (like mine for the
   RTC)
- then the pps interrupt handler can be preempted by other interrupts,
   and it is easy for them to sometimes take more than 10 msec to complete.
Reading of the timecounter uses essentially a passive form of capture.
Since it is usually done in syscalls, the caller can easily be preempted
for 1 full timecounter windup time.  The readers handle this by retrying.
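
The retry loop in the readers looks roughly like this (quoting loosely
from memory, using the new accessors; binuptime() is the simplest case):

	void
	binuptime(struct bintime *bt)
	{
		struct timehands *th;
		u_int gen;

		do {
			th = timehands;
			gen = tc_getgen(th);
			*bt = th->th_offset;
			bintime_addx(bt, th->th_scale * tc_delta(th));
		} while (gen == 0 || gen != tc_getgen(th));
	}

A reader that is preempted for a full windup cycle just goes around again.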

The large interval of 10 timehands cycles is supposed to make atomic
ops unnecessary, and even memory ordering unnecessary in practice.
It is only necessary for changes to the generation to not become visible
before changes to the data and not too many generations after changes
to the data (even 8 or 9 generations late would work, but reduces the
effective number of generations to 2 or 1).  The latter happens automatically
with locking on return from hardclock() if not before.  Perhaps this is
intentional.  But tc_windup() makes no attempt to make the data visible
before the generation count changes.  It basically does:

 	th = oldest timehands	// normally nothing using this
 	th->th_generation = 0	// missing ordering
 	start hacking on th 	// missing ordering

This loses if something is using th, AND the memory ordering is not program
order, AND that something sees the changed parts of th before seeing the
th_generation change.

         ... finish hacking on th
         th->th_generation = next generation  // missing ordering

This loses in the opposite way if the change to the generation becomes
visible before the data, AND something is using the oldest th, etc.

         timehands = th		// missing ordering

This loses for the new current generation, if the change to timehands
becomes visible before the changes to *th (except if the visible change
to the generation is to 0 but not to the new generation).  This race
is more likely to be lost than the others, since it doesn't take the
unlikely setup with the oldest timehands still active.  If there is
a race window at all (due to out of order memory visibility for
critical parts of the changed timehands), then any thread that starts
and finishes reading the timecounter while its memory is disordered will
lose the race.  This race window is just fairly short, since tc_windup()
soon returns and the interrupt handler does some atomic ops which sync
the order on most or all arches.  Without that, the order would probably
be synced by unrelated memory activity before 10 timehands cycles, but
the current timehands is supposed to be synced already.
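
If the windup side were ordered, it would have to look something like this
(a hypothetical sketch; atomic_thread_fence_rel() is a C11-style spelling
for a release fence, not necessarily a primitive that exists in the tree):

	th->th_generation = 0;
	atomic_thread_fence_rel();	/* gen == 0 visible before the new data */
	... finish hacking on th ...
	atomic_store_rel_int(&th->th_generation, ogen);
					/* new data visible before the next generation */
	atomic_thread_fence_rel();
	timehands = th;			/* everything above visible before the switch */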

The change in this commit is sort of backwards.  It pessimizes all
timecounter reads to work around the missing ordering in timecounter
windups.  It should be arranged that even if a timecounter read sees
a stale timehands, that timehands remains valid for long enough.  Surely
this is possible.  Timecounter windups of course sync everything.  Then
when a timecounter read checks the generation count, I think that on
some arches the sync in windup means that the reader sees the up to
date value.  Otherwise, the sync just has to occur within 10 cycles
of timehands updates.  At worst, this can be done by an ordered read
of all timecounter memory once every 9 cycles of timehands updates.
Except, there would be problems if this read were delayed.  So tc_windup()
should tell other CPUs to sync, and wait for them.
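
"Tell other CPUs to sync and wait for them" could be as blunt as an
occasional rendezvous (a hypothetical sketch only: the function names are
invented, mb() is the arch-dependent full barrier, and this glosses over
whether smp_rendezvous() may be called from tc_windup()'s context):

	/*
	 * Hypothetical: instead of acquire barriers in every reader, have
	 * the windup side occasionally force every CPU to sync its view of
	 * the timehands, well within the 10-generation window.
	 */
	static void
	tc_sync_cpu(void *arg __unused)
	{

		mb();
	}

	static void
	tc_windup_sync(void)
	{
		static u_int windups;

		if (++windups % 9 == 0)
			smp_rendezvous(NULL, tc_sync_cpu, NULL, NULL);
	}

smp_rendezvous() is synchronous, which covers the "wait for them" part.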

The other gross bugs in timecounter locking are mainly calls to
tc_windup() from clock_settime(CLOCK_REALTIME) and sysctl
kern.timecounter.hardware.  These race with calls from hardclock().
hardclock()'s locking defends against interference from hardclock()
only (except in UP when hardclock() is a fast interrupt handler).  It
isn't clear what happens, but at best tc_windup() will cycle through
the timehands at an unsupported rate.  This is not a large problem since
it takes foot-shooting to activate.  Ntp prefers to do micro-adjustments
and almost never calls clock_settime().  clock_settime() is rate-limited
to once per second at securelevel >= 2, but we now know that securelevel
was a mistake and there are better methods, so this configuration is
probably rarer now than ever.

Bruce

