svn commit: r302252 - head/sys/kern
Konstantin Belousov
kostikbel at gmail.com
Thu Jun 30 18:01:16 UTC 2016
On Fri, Jul 01, 2016 at 03:30:41AM +1000, Bruce Evans wrote:
> On Thu, 30 Jun 2016, Konstantin Belousov wrote:
>
> > On Thu, Jun 30, 2016 at 05:47:39AM +1000, Bruce Evans wrote:
> >> On Wed, 29 Jun 2016, Konstantin Belousov wrote:
> >>
> >>> On Wed, Jun 29, 2016 at 05:54:43PM +0300, Konstantin Belousov wrote:
> >>>> As an additional, but probably excessive measure, invocation of
> >>>> tc_windup() from hardclock could ask the setclock() to re-execute
> >>>> tc_windup() after its call is done.
> >>
> >> I don't quite understand this. hardclock -> tc_windup() normally doesn't
> >> have interference from tc_setclock().
> > In what sense does it not have interference? tc_windup() may be executed
> > from hardclock/hardclock_cnt (low half) and concurrently from setclock()
> > (top half). And this is exactly what we want to avoid, isn't it?
>
> Yes, but you seemed to be saying that hardclock should schedule or call
> tc_setclock(). That makes no sense. So I thought that you meant
> tc_setclock() calling or scheduling tc_windup() in a different way than
> it does now (it just calls it now).
More precisely, I mean that hardclock, when it is not able to execute
tc_windup() because the top half is currently executing tc_windup(), may ask
the top half to re-execute tc_windup() (and not tc_setclock()) once the
current invocation finishes.
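For illustration, the scheme can be modeled in user space with C11 atomics
roughly like this (hypothetical names, not the actual kern_tc.c code; the
kernel would use the atomic(9) primitives instead of stdatomic):

```c
/*
 * Sketch: the bottom half (hardclock) never blocks on tc_windup(). If it
 * loses the race against the top half, it sets a flag asking the top half
 * to run tc_windup() once more after the current invocation finishes.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int windup_lock;		/* 0 = free, 1 = held by top half */
static atomic_bool windup_rerun;	/* set when hardclock lost the race */
static int windup_runs;			/* stands in for the timehands update */

static void
tc_windup_locked(void)
{
	windup_runs++;
}

/* bottom half: try once, never wait */
static void
hardclock_windup(void)
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&windup_lock, &expected, 1)) {
		tc_windup_locked();
		atomic_store(&windup_lock, 0);
	} else {
		/* top half is in tc_windup(); ask it to redo the work */
		atomic_store(&windup_rerun, true);
	}
}

/* top half: re-runs tc_windup() while hardclock keeps asking for it */
static void
setclock_windup(void)
{
	int expected;

	for (;;) {
		expected = 0;
		if (atomic_compare_exchange_strong(&windup_lock, &expected, 1)) {
			atomic_store(&windup_rerun, false);
			tc_windup_locked();
			atomic_store(&windup_lock, 0);
			if (!atomic_load(&windup_rerun))
				break;
			/* hardclock raced with us; run once more */
		}
	}
}
```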
>
> >>> ...
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 0f015b3..a5776d2 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> ...
> >>> @@ -1237,6 +1238,7 @@ tc_setclock(struct timespec *ts)
> >>> struct timespec tbef, taft;
> >>> struct bintime bt, bt2;
> >>>
> >>> + critical_enter();
> >>> cpu_tick_calibrate(1);
> >>> nanotime(&tbef);
> >>> timespec2bintime(ts, &bt);
> >>> @@ -1247,8 +1249,10 @@ tc_setclock(struct timespec *ts)
> >>> bintime2timeval(&bt, &boottime);
> >>
> >> boottime is one of the critical variables. Before this, there are 2
> >> accesses to boottimebin. The Giant locking that you just removed gave
> >> only some protection here. It prevented concurrent access by
> >> malicious/stress-testing code doing settime(). It didn't prevent
> >> nanotime() and friends seeing boottime in an in-between state. Oops,
> >> mutex locking won't fix that either. I think the update of these
> >> variables needs to be moved into tc_windup() where it will hopefully
> >> automatically be protected by the generation count, like the timehands
> >> offsets already are.
> > I do not understand how that would help. Do you mean to move the vars
> > into the current timehands ?
>
> I hoped that it would just work without putting it in timehands, but
> now I don't see a better way than putting it in timehands. The space
> wastage is not too bad, and we might need it in timehands for other
> reasons. One is that the actual boot time is invariant, but FreeBSD
> varies it; a global is needed for the actual boot time and then it is
> natural to put the fudged boot time in timehands (it is not a boot
> time, but an offset which is sometimes close to the boot time).
>
> Note that boottimebin is also updated in tc_windup() (for leap seconds).
> The new mutex doesn't protect it there. This is the next of other
> reasons to put it in timehands.
Yes.
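For reference, the generation-count protection Bruce refers to can be
sketched in user space like this (hypothetical struct and names; the real
pattern lives in the timehands handling in sys/kern/kern_tc.c):

```c
/*
 * Sketch: a reader retries until it observes the same nonzero generation
 * before and after copying the fields, so it never uses a half-updated
 * boottimebin. The writer zeroes the generation while updating.
 */
#include <stdatomic.h>

struct my_timehands {
	atomic_uint	gen;		/* 0 while an update is in progress */
	long long	boottimebin;	/* stands in for struct bintime */
};

static struct my_timehands th;

/* writer side: tc_windup()-style update */
static void
th_update(long long newboot)
{
	atomic_store_explicit(&th.gen, 0, memory_order_release);
	th.boottimebin = newboot;
	atomic_store_explicit(&th.gen, 1, memory_order_release);
}

/* reader side: nanotime()-style stable snapshot */
static long long
th_read(void)
{
	unsigned gen;
	long long bt;

	do {
		gen = atomic_load_explicit(&th.gen, memory_order_acquire);
		bt = th.boottimebin;
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th.gen, memory_order_acquire));
	return (bt);
}
```

The real code additionally cycles through several timehands and increments
the generation rather than pinning it at 1, but the read-retry idea is the
same.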
>
> > ...
> >>> + for (;;) {
> >>> + if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
> >>> + atomic_thread_fence_acq();
> >>> + tc_windup_locked();
> >>> + atomic_store_rel_int(&tc_windup_lock, 0);
> >>> + break;
> >>> + } else if (!top_call) {
> >>> + break;
> >>> + }
> >>> + }
> >>> }
> >>
> >> I like to write optimized locking like that, but don't see why you
> >> want it here.
> Because I do not want the fast clock interrupt handler to wait on a lock,
> which necessarily becomes a spinlock. The ntp_lock is bad enough already.
> What I coded above is not a real lock. In the contention case it bails out,
> so there is no blocking. In particular, witness is not relevant to it.
>
> I think this is too complicated. Spinlocks aren't very expensive, and
> hardclock() already has some. I think it has several. It has at least
> one silly one -- in hardclock_cpu(), a thread locking to lock a single
> flags update. hardclock() and tc_windup() just aren't called enough
> for the locking overhead or contention to matter, even at too-large hz.
Spinlocks are quite expensive. They delay interrupt delivery.
I do not want the fast interrupt handler to be delayed due to the top half
of the kernel executing settimeofday(2) in a loop.
>
> > ...
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 0f015b3..7bb767c 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> >...
> > +static struct mtx tc_setclock_mtx;
> > +MTX_SYSINIT(tc_setclock_init, &tc_setclock_mtx, "tcsetc", MTX_DEF);
>
> I thought it would be a spinlock. The sleep lock is better if you can
> make it work, but this is complicated.
For mutual exclusion of invocations of tc_setclock(), I do not see why a
sleepable lock would not suffice.
>
> > +
> > /*
> > * Step our concept of UTC. This is done by modifying our estimate of
> > * when we booted.
> > - * XXX: not locked.
> > */
>
> Now "not properly locked" :-).
>
> > void
> > tc_setclock(struct timespec *ts)
> > @@ -1237,6 +1240,8 @@ tc_setclock(struct timespec *ts)
> > struct timespec tbef, taft;
> > struct bintime bt, bt2;
> >
> > + mtx_lock(&tc_setclock_mtx);
> > + critical_enter();
>
> I thought at first that this critical section was redundant, since a
> spinlock would already have it.
It is not a spinlock.
>
> > cpu_tick_calibrate(1);
> > nanotime(&tbef);
> > timespec2bintime(ts, &bt);
> > @@ -1247,8 +1252,10 @@ tc_setclock(struct timespec *ts)
> > bintime2timeval(&bt, &boottime);
>
> The leap second update to boottimebin in tc_windup() races with the
> update to boottimebin here. This can be fixed partially by using the
> same locking there. Then the locking would have to be a spin mutex
> (or special). This still doesn't fix non-atomic accesses to boottimebin
> in nanotime() and friends. Putting it in timehands seems to be best for
> fixing that.
Yes, timehands for boottimebin should be the solution, but it is not in the
scope of this patch. I will work on this right after the current changeset
lands in svn.
> boottime is easier to fix (and doesn't need to be in timehands). It
> is only used to copy it out in a historical sysctl. The locking for
> that is buggy. We convert boottimebin to boottime here (now with locking)
> but in the sysctl we copy out boottime non-atomically.
Do we need boottime at all? Can't it be calculated from boottimebin
on the sysctl invocation?
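Something along these lines, perhaps (a user-space sketch with hypothetical
names; the conversion mirrors the bintime2timeval() macro from sys/time.h,
and the handler would need a stable snapshot of boottimebin):

```c
/*
 * Sketch: drop the boottime global and derive the timeval from
 * boottimebin at sysctl read time, so there is no second copy of the
 * boot time to keep coherent.
 */
#include <stdint.h>

struct bintime {
	int64_t		sec;
	uint64_t	frac;	/* fraction of a second, 2^-64 units */
};

struct timeval {
	int64_t		tv_sec;
	int64_t		tv_usec;
};

/* example value: frac = 2^63 is half a second */
static struct bintime boottimebin = { 1467309676, (uint64_t)1 << 63 };

/* same arithmetic as the bintime2timeval() macro in sys/time.h */
static void
my_bintime2timeval(const struct bintime *bt, struct timeval *tv)
{
	tv->tv_sec = bt->sec;
	tv->tv_usec = (int64_t)(((uint64_t)1000000 *
	    (uint32_t)(bt->frac >> 32)) >> 32);
}

/* sysctl handler computes the value at read time, no stale global */
static struct timeval
sysctl_boottime(void)
{
	struct timeval tv;

	/* a stable snapshot of boottimebin would be taken here */
	my_bintime2timeval(&boottimebin, &tv);
	return (tv);
}
```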