svn commit: r302252 - head/sys/kern

Konstantin Belousov kostikbel at gmail.com
Wed Jun 29 14:54:53 UTC 2016


This is a reply to two mails in one, covering both the r302251 and r302252 threads.

On Wed, Jun 29, 2016 at 05:58:05PM +1000, Bruce Evans wrote:
> On Tue, 28 Jun 2016, Konstantin Belousov wrote:
> 
> > Log:
> >  Do not use Giant to prevent parallel calls to CLOCK_SETTIME().  Use
> >  private mtx in resettodr(), no implementation of CLOCK_SETTIME() is
> >  allowed to sleep.
> 
> Thanks.
> 
> As you know, clock locking is still quite broken.  The most obvious
> bugs near here are now:
> - settime() still begins with splclock().  This gave fewer bugs in
>    clock locking when it last worked in FreeBSD-4.  Now it is just
>    a hint that the clock locking has still not been converted to use
>    mutexes.
> - bitrot near the end of settime() removed the splx() that should
>    be paired with the splclock().
Yes, I am aware of both the splclock() and the Giant braces in setclock().
I did not want to even look at tc_setclock() for that change.
See below.

> - the locking should be stronger than mutexes (or splclock()) to
>    minimise the delay between deciding to set the time and actually
>    setting it.
...
This is a separate issue.  A critical section around the code might be
the right approximation for reducing the delays there; see the sketch
below.
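
Roughly what I have in mind, as a sketch only (it deliberately ignores
the question of whether everything under tc_setclock() is safe to run
with preemption disabled), at the tail of settime():

	critical_enter();	/* no preemption between deciding and stepping */
	tc_setclock(&ts);
	critical_exit();
	resettodr();
	return (0);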

> > +	mtx_lock(&resettodr_lock);
> 
> This could be later (only around the hardware parts) since we are sloppy
> with the delays.
Done.

> 
> > 	getnanotime(&ts);
> 
> This doesn't even try to get nanoseconds accuracy.
> 
> > 	timespecadd(&ts, &clock_adj);
> > 	ts.tv_sec -= utc_offset();
> 
> Who knows what utc_offset() is locked by?  Certainly not this new locking,
> since it is set in another file but the new locking is only in this file.
> It still seems to be most under Giant :-(.
utc_offset() uses several variables; updates to each of them are atomic.
There is not even an attempt to provide userspace with an interface that
would make updates to the components used by utc_offset() consistent.
Thus I do not see a need for any locking there.
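
For reference, utc_offset() boils down to roughly the following
combination of the three ints (from memory, not pasted from the tree),
so each read is an independent word-sized access:

int
utc_offset(void)
{

	return (tz_minuteswest * 60 + (wall_cmos_clock ? adjkerntz : 0));
}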

> 
> The change in the other part of the diff is related to this, and doesn't
> seem quite right.  utc_offset() uses the variables tz_minuteswest,
> wall_cmos_clock and adjkerntz:
> - tz_minuteswest is set with no locking in kern_settimeofday(), just
>    after the settime() call.  This last had a chance of being correct
>    when the kernel was UP and not preemptible.  But it is broken in
>    a more fundamental way back to at least FreeBSD-1: resettodr() is
>    called _before_ updating the timezone, so it uses an out-of-date
>    tz_minuteswest.
First, the man page is not correct in saying that 'this information is
kept outside the kernel'.

Second, why does settime(), logically, need the updated timezone?
I understand that resettodr() is called on a successful update of
the wall clock.  But shouldn't the need to reset a non-GMT RTC in
kern_settimeofday() be satisfied by yet another call to resettodr(),
if tzp != NULL, after the tz_* variables are updated?  A sketch of
that follows.
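
Something along these lines in kern_settimeofday() (a sketch; the
wall_cmos_clock check is my assumption about when the extra call is
worth doing):

	if (tzp != NULL && error == 0) {
		tz_minuteswest = tzp->tz_minuteswest;
		tz_dsttime = tzp->tz_dsttime;
		/* Re-program the RTC now that the offset is up to date. */
		if (wall_cmos_clock)
			resettodr();
	}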

> - wall_cmos_clock is still under a simple (not MPSAFE) sysctl
> - adjkerntz is under a PROC sysctl, and you just changed this sysctl to
>    be MPSAFE.  The sysctl calls here, so we don't want all of it
>    giant-locked.  But it now does unlocked accesses to the adjkerntz
>    variable, and the access to the disable_rtc_set variable is unlocked
>    because we do it before acquiring the lock in this function (this
>    variable is still under a simple (not MPSAFE) sysctl).
> 
> Most of the variables are ints, so accesses to them are atomic, but without
> locking the correct order only occurs accidentally and is hard to
> understand.
I think there is no global consistency at all, mostly because separate
interfaces exist to control each of them.  IMO it is not worth trying
to ensure any.

> 
> > 	/* XXX: We should really set all registered RTCs */
> > -	if ((error = CLOCK_SETTIME(clock_dev, &ts)) != 0)
> > +	error = CLOCK_SETTIME(clock_dev, &ts);
> > +	mtx_unlock(&resettodr_lock);
> > +	if (error != 0)
> > 		printf("warning: clock_settime failed (%d), time-of-day clock "
> > 		    "not adjusted to system time\n", error);
> > }
> 
> The lock needs to cover all high-level accesses to the rtc and related
> variables to avoid races, and a bit more to minimise delays.  In particular,
> inittodr() should be under this lock, and there is a problem with
> "show rtc" in ddb.
I do not think that inittodr() actually needs that locking.  This code
is executed at the initialization stage or at the resume stage.

Do you mean that CLOCK_GETTIME() hardware access should be made exclusive
with CLOCK_SETTIME()?  Is there hardware that relies on such exclusivity
but does not enforce it internally, the way the x86 RTC does?

DDB access to the RTC is a separate problem.

On Wed, Jun 29, 2016 at 07:39:45PM +1000, Bruce Evans wrote:
> On Tue, 28 Jun 2016, Konstantin Belousov wrote:
> > static int64_t time_adjtime;		/* correction from adjtime(2) (usec) */
> >
> > +static struct mtx ntpadj_lock;
> > +MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj",
> > +#ifdef PPS_SYNC
> > +    MTX_SPIN
> > +#else
> > +    MTX_DEF
> > +#endif
> > +);
> 
> This needs to be a spinlock in all cases, since ntp_update_second() needs
> to be locked and ntp_update_second() is called from tc_windup() is usually
> called from a fast interrupt handler.
Oh, I see.

> 
> ntp_update_second() is still unlocked.  It accesses lots of variables, so
> it obviously needs locking.  E.g., its very first access is
> time_maxerror += MAXFREQ / 1000.  This has the usual non-atomicity for
> a read-modify-write.  This obviously races with the locked access on the
> 28th line of sys_ntp_adjtime().  Only the latter is locked now.
> 
> > +
> > +/*
> > + * When PPS_SYNC is defined, hardpps() function is provided which can
> > + * be legitimately called from interrupt filters.  Due to this, use
> > + * spinlock for ntptime state protection, otherwise sleepable mutex is
> > + * adequate.
> > + */
> > +#ifdef PPS_SYNC
> > +#define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
> > +#define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
> > +#else
> > +#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
> > +#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
> > +#endif
> > +#define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)
> 
> *ADJ* and *adj* are not good names, since much more than ntp_adj*() needs
> to be locked.
Do you have a suggestion for a better name?

> 
> Probably the locking should be shared with kern_tc.c.  tc_windup() still
> uses only fancy time-domain/atomic-op locking.  This mostly works, but
> is quite broken by calling tc_windup() from places other than hardclock().
> The most problematic other place is from tc_setclock().  That is unlocked,
> except accidentally by callers with only Giant locking, so it races with
> the fast interrupt handler.
> 
> tc_windup() doesn't need the fancy locking for efficiency -- it only needs
> it to be consistent with binuptime() and friends.  So it may as well use
> the same lock as ntp.  Its lock must be around the whole function to
> protect it from tc_setclock().  Then the same lock works for
> ntp_update_second() and pps.  Any lock that is only acquired once per
> second is not worth optimizing.  Hopefully the same for ntp's other locks.
> ntp syscalls should only be frequent under unusual/malicious loads, so
> they shouldn't cause much contention with the tc_windup() lock.
So I had postponed looking at tc_windup(), and it seems that this was
useful.  After your note above, it is mostly clear that we can skip
tc_windup() calls which conflict with a currently active windup call.
E.g., the hardclock update that happens simultaneously with setclock()
could be just skipped.  But we must not skip the tc_windup() call from
setclock().

As an additional (but probably excessive) measure, an invocation of
tc_windup() from hardclock could ask setclock() to re-execute
tc_windup() after its own call is done; a sketch follows.
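
Sketched on top of the patch below (the tc_windup_pending flag is
hypothetical and not part of the patch; this would replace the
tc_windup() wrapper added there):

static volatile int tc_windup_pending;

static void
tc_windup(bool top_call)
{

	for (;;) {
		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
			atomic_thread_fence_acq();
			do {
				tc_windup_pending = 0;
				tc_windup_locked();
			} while (tc_windup_pending != 0);
			atomic_store_rel_int(&tc_windup_lock, 0);
			break;
		} else if (!top_call) {
			/* Ask the current holder to redo the windup. */
			tc_windup_pending = 1;
			break;
		}
	}
}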

> > -static int
> > -ntp_is_time_error(void)
> > +static bool
> > +ntp_is_time_error(int tsl)
> > {
> > +
> 
> Is it worth changing this to make a single variable less volatile?
Yes, to rely on a single-variable read being atomic and to avoid locking
in the callout.  The intended pattern is sketched below.
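
I.e., the caller takes one snapshot of time_status and hands it to the
check; a minimal sketch ('state' is just a placeholder for whatever the
caller reports):

	int tsl;

	tsl = time_status;	/* a single aligned int read is atomic */
	state = ntp_is_time_error(tsl) ? TIME_ERROR : time_state;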

[Skipped the S64/OPAQUE stuff]

Patch does the following:
- change the ntpadj_lock to spinlock.
- acquire the ntpadj_lock around ntp_update_second().
- prevent parallel tc_windup() invocations, while not allowing calls
  coming from the kernel top half to be skipped.
- move resettodr_lock acquisition in resettodr() later.

diff --git a/sys/kern/kern_ntptime.c b/sys/kern/kern_ntptime.c
index d352ee7..34b29d9 100644
--- a/sys/kern/kern_ntptime.c
+++ b/sys/kern/kern_ntptime.c
@@ -163,27 +163,10 @@ static l_fp time_adj;			/* tick adjust (ns/s) */
 static int64_t time_adjtime;		/* correction from adjtime(2) (usec) */
 
 static struct mtx ntpadj_lock;
-MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj",
-#ifdef PPS_SYNC
-    MTX_SPIN
-#else
-    MTX_DEF
-#endif
-);
+MTX_SYSINIT(ntpadj, &ntpadj_lock, "ntpadj", MTX_SPIN);
 
-/*
- * When PPS_SYNC is defined, hardpps() function is provided which can
- * be legitimately called from interrupt filters.  Due to this, use
- * spinlock for ntptime state protection, otherwise sleepable mutex is
- * adequate.
- */
-#ifdef PPS_SYNC
 #define	NTPADJ_LOCK()		mtx_lock_spin(&ntpadj_lock)
 #define	NTPADJ_UNLOCK()		mtx_unlock_spin(&ntpadj_lock)
-#else
-#define	NTPADJ_LOCK()		mtx_lock(&ntpadj_lock)
-#define	NTPADJ_UNLOCK()		mtx_unlock(&ntpadj_lock)
-#endif
 #define	NTPADJ_ASSERT_LOCKED()	mtx_assert(&ntpadj_lock, MA_OWNED)
 
 #ifdef PPS_SYNC
@@ -506,6 +489,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec)
 	int tickrate;
 	l_fp ftemp;		/* 32/64-bit temporary */
 
+	NTPADJ_LOCK();
+
 	/*
 	 * On rollover of the second both the nanosecond and microsecond
 	 * clocks are updated and the state machine cranked as
@@ -627,6 +612,8 @@ ntp_update_second(int64_t *adjustment, time_t *newsec)
 	else
 		time_status &= ~STA_PPSSIGNAL;
 #endif /* PPS_SYNC */
+
+	NTPADJ_UNLOCK();
 }
 
 /*
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 0f015b3..b738083 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -135,7 +135,8 @@ SYSCTL_PROC(_kern_timecounter, OID_AUTO, alloweddeviation,
 
 static int tc_chosen;	/* Non-zero if a specific tc was chosen via sysctl. */
 
-static void tc_windup(void);
+static void tc_windup(bool top_call);
+static void tc_windup_locked(void);
 static void cpu_tick_calibrate(int);
 
 void dtrace_getnanotime(struct timespec *tsp);
@@ -1247,7 +1248,7 @@ tc_setclock(struct timespec *ts)
 	bintime2timeval(&bt, &boottime);
 
 	/* XXX fiddle all the little crinkly bits around the fiords... */
-	tc_windup();
+	tc_windup(true);
 	nanotime(&taft);
 	if (timestepwarnings) {
 		log(LOG_INFO,
@@ -1259,13 +1260,30 @@ tc_setclock(struct timespec *ts)
 	cpu_tick_calibrate(1);
 }
 
+static volatile int tc_windup_lock;
+static void
+tc_windup(bool top_call)
+{
+
+	for (;;) {
+		if (atomic_cmpset_int(&tc_windup_lock, 0, 1)) {
+			atomic_thread_fence_acq();
+			tc_windup_locked();
+			atomic_store_rel_int(&tc_windup_lock, 0);
+			break;
+		} else if (!top_call) {
+			break;
+		}
+	}
+}
+
 /*
  * Initialize the next struct timehands in the ring and make
  * it the active timehands.  Along the way we might switch to a different
  * timecounter and/or do seconds processing in NTP.  Slightly magic.
  */
 static void
-tc_windup(void)
+tc_windup_locked(void)
 {
 	struct bintime bt;
 	struct timehands *th, *tho;
@@ -1846,7 +1864,7 @@ tc_ticktock(int cnt)
 	if (count < tc_tick)
 		return;
 	count = 0;
-	tc_windup();
+	tc_windup(false);
 }
 
 static void __inline
@@ -1921,7 +1939,7 @@ inittimecounter(void *dummy)
 	/* warm up new timecounter (again) and get rolling. */
 	(void)timecounter->tc_get_timecount(timecounter);
 	(void)timecounter->tc_get_timecount(timecounter);
-	tc_windup();
+	tc_windup(true);
 }
 
 SYSINIT(timecounter, SI_SUB_CLOCKS, SI_ORDER_SECOND, inittimecounter, NULL);
diff --git a/sys/kern/kern_time.c b/sys/kern/kern_time.c
index 148da2b..82710f7 100644
--- a/sys/kern/kern_time.c
+++ b/sys/kern/kern_time.c
@@ -115,9 +115,7 @@ settime(struct thread *td, struct timeval *tv)
 	struct timeval delta, tv1, tv2;
 	static struct timeval maxtime, laststep;
 	struct timespec ts;
-	int s;
 
-	s = splclock();
 	microtime(&tv1);
 	delta = *tv;
 	timevalsub(&delta, &tv1);
@@ -147,10 +145,8 @@ settime(struct thread *td, struct timeval *tv)
 				printf("Time adjustment clamped to -1 second\n");
 			}
 		} else {
-			if (tv1.tv_sec == laststep.tv_sec) {
-				splx(s);
+			if (tv1.tv_sec == laststep.tv_sec)
 				return (EPERM);
-			}
 			if (delta.tv_sec > 1) {
 				tv->tv_sec = tv1.tv_sec + 1;
 				printf("Time adjustment clamped to +1 second\n");
@@ -161,10 +157,8 @@ settime(struct thread *td, struct timeval *tv)
 
 	ts.tv_sec = tv->tv_sec;
 	ts.tv_nsec = tv->tv_usec * 1000;
-	mtx_lock(&Giant);
 	tc_setclock(&ts);
 	resettodr();
-	mtx_unlock(&Giant);
 	return (0);
 }
 
diff --git a/sys/kern/subr_rtc.c b/sys/kern/subr_rtc.c
index dbad36d..4bac324 100644
--- a/sys/kern/subr_rtc.c
+++ b/sys/kern/subr_rtc.c
@@ -172,11 +172,11 @@ resettodr(void)
 	if (disable_rtc_set || clock_dev == NULL)
 		return;
 
-	mtx_lock(&resettodr_lock);
 	getnanotime(&ts);
 	timespecadd(&ts, &clock_adj);
 	ts.tv_sec -= utc_offset();
 	/* XXX: We should really set all registered RTCs */
+	mtx_lock(&resettodr_lock);
 	error = CLOCK_SETTIME(clock_dev, &ts);
 	mtx_unlock(&resettodr_lock);
 	if (error != 0)

