svn commit: r302252 - head/sys/kern

Sat Jul 2 08:04:48 UTC 2016

On Fri, 1 Jul 2016, Konstantin Belousov wrote:

> On Fri, Jul 01, 2016 at 08:39:48PM +1000, Bruce Evans wrote:
>> It seems simple and clean enough, but is too much during a re freeze.
>>
>> I will only make some minor comments about style.
> Well, it is not only about style.  If you have no more comments, I will
> ask for testing.  The patch is about fixing bugs, although in somewhat
> extended scope, so I think it is still fine as the things do not explode.

What about FFCLOCK?  That is hard to test.

> I added the stats to the patch, it is not that intrusive actually.
>
> I still do not see why/do not want to use spinlock for the tc_windup()
> exclusion.  Patch is at the end of the message.

It subverts the mutex/witness method for no good reason.  You can use
mtx_trylock() for the conditional locking.  Not so good reasons for
doing this are to micro-optimize and to avoid hard-disabling interrupts
on the current CPU).  But here optimization is not important and
hard-disabling interrupts is a feature.  Mutexes are only slightly
slower, except with debugging options they are much slower but give
more features.

>>> diff --git a/sys/compat/linprocfs/linprocfs.c b/sys/compat/linprocfs/linprocfs.c
>>> index 56b2ade..a0dce47 100644
>>> --- a/sys/compat/linprocfs/linprocfs.c
>>> +++ b/sys/compat/linprocfs/linprocfs.c
>>> @@ -447,9 +447,11 @@ linprocfs_dostat(PFS_FILL_ARGS)
>>> 	struct pcpu *pcpu;
>>> 	long cp_time[CPUSTATES];
>>> 	long *cp;
>>> +	struct timeval boottime;
>>> 	int i;
>>>
>>> 	read_cpu_time(cp_time);
>>> +	getboottime(&boottime);
>>
>> This is used surprisingly often by too many subsystems.  With the value still
>> broken so that locking it doesn't help much, I would leave it as a global.
> I prefer to keep the KPI consistent.

Not changing the KPI means keeping boottime as a global.  getboottime() is
a good KPI for accessing/converting a volatile implementation detail, but
a correctly implemented boottime wouldn't be volatile and most uses of
boottime are apparently wrong.  Most uses are apparently to convert from
monotic time to real time.  For that, the KPI should be a conversion function.

>...
>> So maybe use a new general function that returns boottimebin for the
>> non-uptime functions only.  Possibly it can add the offset directly.
> I changed getbintime() and getboottimebin() to use the fenced magic and
> fetch boottime inside the loop. This removed the need for binuptime1().

Good.

>>> diff --git a/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c b/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
>>> index 1d07943..0879299 100644
>>> --- a/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
>>> +++ b/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
>>> @@ -504,11 +504,13 @@ svc_rpc_gss_find_client(struct svc_rpc_gss_clientid *id)
>>> {
>>> 	struct svc_rpc_gss_client *client;
>>> 	struct svc_rpc_gss_client_list *list;
>>> +	struct timeval boottime;
>>> 	unsigned long hostid;
>>>
>>> 	rpc_gss_log_debug("in svc_rpc_gss_find_client(%d)", id->ci_id);
>>>
>>> 	getcredhostid(curthread->td_ucred, &hostid);
>>> +	getboottime(&boottime);
>>> 	if (id->ci_hostid != hostid || id->ci_boottime != boottime.tv_sec)
>>> 		return (NULL);
>>
>> Here it is hopefully just a magic id, with the user being a remote system.
>> Any time that doesn't go backwards or forwards so far that it is in the
>> lieftime of an old or new boot instance works well for identifying the
>> boot instance.
> It does not work for leap seconds in the same way as is does not work after
> setclock().  So I just leave this conversion as is.

We need to understand what such conversions are trying to do.

Currently, adding boottime mostly does work for converting monotonic time to
real time, since it is the same as what nanotime() and friends do (except
with more races).  E.g., the leap seconds adjustment normally subtracts
1 second from boottime so that when the monotonic time advances by 1 second
the sum doesn't advance and thus gives the real time with POSIX's broken
encoding.

So the conversion is currently correct if the result is supposed to be
the real time with POSIX's encoding.  It is unclear what this is useful
for.  It is non-monotic, and you can't even depend on reversing the
conversion by subtracting boottime since boottime isn't invariant.

On Fri, 1 Jul 2016, Konstantin Belousov wrote:

> On Fri, Jul 01, 2016 at 08:39:48PM +1000, Bruce Evans wrote:
>> It seems simple and clean enough, but is too much during a re freeze.
>>
>> I will only make some minor comments about style.
> Well, it is not only about style.  If you have no more comments, I will
> ask for testing.  The patch is about fixing bugs, although in somewhat
> extended scope, so I think it is still fine as the things do not explode.

What about FFCLOCK?  That is hard to test.

> I added the stats to the patch, it is not that intrusive actually.
>
> I still do not see why/do not want to use spinlock for the tc_windup()
> exclusion.  Patch is at the end of the message.

It subverts the mutex/witness method for no good reason.  You can use
mtx_trylock() for the conditional locking.  Not so good reasons for
doing this are to micro-optimize and to avoid hard-disabling interrupts
on the current CPU).  But here optimization is not important and
hard-disabling interrupts is a feature.  Mutexes are only slightly
slower, except with debugging options they are much slower but give
more features.

>>> diff --git a/sys/compat/linprocfs/linprocfs.c b/sys/compat/linprocfs/linprocfs.c
>>> index 56b2ade..a0dce47 100644
>>> --- a/sys/compat/linprocfs/linprocfs.c
>>> +++ b/sys/compat/linprocfs/linprocfs.c
>>> @@ -447,9 +447,11 @@ linprocfs_dostat(PFS_FILL_ARGS)
>>> 	struct pcpu *pcpu;
>>> 	long cp_time[CPUSTATES];
>>> 	long *cp;
>>> +	struct timeval boottime;
>>> 	int i;
>>>
>>> 	read_cpu_time(cp_time);
>>> +	getboottime(&boottime);
>>
>> This is used surprisingly often by too many subsystems.  With the value still
>> broken so that locking it doesn't help much, I would leave it as a global.
> I prefer to keep the KPI consistent.

Not changing the KPI means keeping boottime as a global.  getboottime() is
a good KPI for accessing/converting a volatile implementation detail, but
a correctly implemented boottime wouldn't be volatile and most uses of
boottime are apparently wrong.  Most uses are apparently to convert from
monotic time to real time.  For that, the KPI should be a conversion function.

>...
>> So maybe use a new general function that returns boottimebin for the
>> non-uptime functions only.  Possibly it can add the offset directly.
> I changed getbintime() and getboottimebin() to use the fenced magic and
> fetch boottime inside the loop. This removed the need for binuptime1().

Good.

>>> diff --git a/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c b/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
>>> index 1d07943..0879299 100644
>>> --- a/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
>>> +++ b/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
>>> @@ -504,11 +504,13 @@ svc_rpc_gss_find_client(struct svc_rpc_gss_clientid *id)
>>> {
>>> 	struct svc_rpc_gss_client *client;
>>> 	struct svc_rpc_gss_client_list *list;
>>> +	struct timeval boottime;
>>> 	unsigned long hostid;
>>>
>>> 	rpc_gss_log_debug("in svc_rpc_gss_find_client(%d)", id->ci_id);
>>>
>>> 	getcredhostid(curthread->td_ucred, &hostid);
>>> +	getboottime(&boottime);
>>> 	if (id->ci_hostid != hostid || id->ci_boottime != boottime.tv_sec)
>>> 		return (NULL);
>>
>> Here it is hopefully just a magic id, with the user being a remote system.
>> Any time that doesn't go backwards or forwards so far that it is in the
>> lieftime of an old or new boot instance works well for identifying the
>> boot instance.
> It does not work for leap seconds in the same way as is does not work after
> setclock().  So I just leave this conversion as is.

We need to understand what such conversions are trying to do.

Currently, adding boottime mostly does work for converting monotonic time to
real time, since it is the same as what nanotime() and friends do (except
with more races).  E.g., the leap seconds adjustment normally subtracts
1 second from boottime so that when the monotonic time advances by 1 second
the sum doesn't advance and thus gives the real time with POSIX's broken
encoding.

So the conversion is currently correct if the result is supposed to be
the real time with POSIX's encoding.  It is unclear what this is useful
for.  It is non-monotic, and you can't even depend on reversing the
conversion by subtracting boottime since boottime isn't invariant.

> diff --git a/sys/compat/linprocfs/linprocfs.c b/sys/compat/linprocfs/linprocfs.c
> index 56b2ade..a0dce47 100644
> --- a/sys/compat/linprocfs/linprocfs.c
> +++ b/sys/compat/linprocfs/linprocfs.c
> @@ -447,9 +447,11 @@ linprocfs_dostat(PFS_FILL_ARGS)
>  	struct pcpu *pcpu;
>  	long cp_time[CPUSTATES];
>  	long *cp;
> +	struct timeval boottime;
>  	int i;
> 
>  	read_cpu_time(cp_time);
> +	getboottime(&boottime);
>  	sbuf_printf(sb, "cpu %ld %ld %ld %ld\n",
>  	    T2J(cp_time[CP_USER]),
>  	    T2J(cp_time[CP_NICE]),

linprocfs's use of boottime is easy to understand.  One use is quite broken.

linprocfs prints the boot time.  This use is OK.

linprocfs wants to print ki_start.  ki_start is a real time.  ps just
prints it.  But linprocfs subtracts the boot time from it before printing
it.  This make no sense.

Kernel times are kept in p->p_stats->p_start.  These are monotonic times.
ki_start is this monotonic time misconverted by adding the current boottime. 
This would be correct if boottime were invariant.  Currently it adds the
current leap seconds adjustment and larger non-invariance.

p_start is not used for much except to print it in ps and friends, so using
monotonic time for it seems to be wrong.  In FreeBSD-4, it was the real
time.  des changed it in 2003 to "This fixes a potential problem in the
accounting code, which would compute the elapsed time incorrectly if the
Unix time was stepped during the lifetime of the process."  But it
obviously causes much larger non-potential problems -- any step to fix
drift or stopping (suspension) of the clock should not affect times in
the past like the start time, but now does.

Oops, there is a problem in acct_process().  This needs both a real
time and a monotonic time for the start time.  ac_btime is the real
time of the start.  This is calculated by adding boottime to p_start.
ac_etime is the elapsed time.  This is calculated by subtracing the
current monotonic time from the monotonic p_start.  des's change fixes
this.  Applications would still be confused if they add ac_btime to
ac_etime and get a time in the future.

des maintains linprocfs, and perhaps the difference is supposed to be
the elapsed time there too, but its keyword is "starttime".

> diff --git a/sys/fs/devfs/devfs_vnops.c b/sys/fs/devfs/devfs_vnops.c
> index 7cc0f9e..afa3da4 100644
> --- a/sys/fs/devfs/devfs_vnops.c
> +++ b/sys/fs/devfs/devfs_vnops.c
> @@ -707,10 +707,11 @@ devfs_getattr(struct vop_getattr_args *ap)
>  {
>  	struct vnode *vp = ap->a_vp;
>  	struct vattr *vap = ap->a_vap;
> -	int error;
>  	struct devfs_dirent *de;
>  	struct devfs_mount *dmp;
>  	struct cdev *dev;
> +	struct timeval boottime;
> +	int error;
> 
>  	error = devfs_populate_vp(vp);
>  	if (error != 0)

devfs's use of boottime is also easy to understand, and we (you :-) get
to maintain devfs.

devfs has problems initializing times because it creates device files before
the current (real) time is fully initialized (I think before it is
initialized at all).  For early initialization, it is natural to set all
times to the Epoch (0).  This can be used as a flag for uninitialized.
devfs actually uses the condition tv_sec <= 3600 as a condition for
uninitialized.  (3600 is a strange value.  Why not 86400 so as to handle
all possible timezone misadjustments?).  When the todr is on wall time,
this fixup only works East of Greenwich.  It never works here.  Device
times here are initialized to a time 10-11 hours in the future if first
access to the device is before adjkerntz runs to fix up the time.  The
future times are not considered to be uninitialized, so the fixup is
not applied.

same problem may affect other file systems, but mostly only affects
ones with buggy setting of access times for read-only mounts.  This bug
was fixed in ffs a couple of years ago.)

devfs doesn't support birth times, or the problem would be even more
obvious.  Later accesses tend to clobber the future times back to the
current time, but the birth time would be invariant.

This problem is much like the one of getting an invariant boot time.
Real times are not really known until after adjkerntz adjusts for the
timezone offset in the todr and ntp adjusts for the error in the todr.

> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 0f015b3..c9676fc 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> @@ -70,31 +70,22 @@ struct timehands {
> ...
>  static struct timehands th0 = {
> -	&dummy_timecounter,
> -	0,
> -	(uint64_t)-1 / 1000000,
> -	0,
> -	{1, 0},
> -	{0, 0},
> -	{0, 0},
> -	1,
> -	&th1
> +	.th_counter = &dummy_timecounter,
> +	.th_scale = (uint64_t)-1 / 1000000,
> +	.th_offset = {1, 0},

Is there a syntax for avoiding the explicit 0 in a nested initializer?
Something like th_offset.tv_sec = 1.

> @@ -378,8 +384,18 @@ microuptime(struct timeval *tvp)
>  void
>  bintime(struct bintime *bt)
>  {
> +	struct bintime boottimebin;
> +	struct timehands *th;
> +	u_int gen;
> 
> -	binuptime(bt);
> +	do {
> +		th = timehands;
> +		gen = atomic_load_acq_int(&th->th_generation);
> +		*bt = th->th_offset;
> +		bintime_addx(bt, th->th_scale * tc_delta(th));
> +		boottimebin = th->th_boottime;
> +		atomic_thread_fence_acq();
> +	} while (gen == 0 || gen != th->th_generation);
>  	bintime_add(bt, &boottimebin);
>  }

Better add th_boottime in the loop (and not use a local variable).  This
saves copying it in the usual case where the loop is only iterated once.

Note that th_offset is already copied to the caller's variable and not
to a local variable.  This is not so good for adding the boot time to
it.  It might be better to go the other way and copy everything to
local variables, but I fear that register pressure and memory clobbers
will prevent generating best code then.  Best code is to copy everything
to registers, then check the generation count, then combine the registers
outside the loop.

> ...
> diff --git a/sys/sys/time.h b/sys/sys/time.h
> index 395e888..659f8e0 100644
> --- a/sys/sys/time.h
> +++ b/sys/sys/time.h
> @@ -372,8 +372,6 @@ void	resettodr(void);
> 
>  extern volatile time_t	time_second;
>  extern volatile time_t	time_uptime;
> -extern struct bintime boottimebin;
> -extern struct timeval boottime;
>  extern struct bintime tc_tick_bt;
>  extern sbintime_t tc_tick_sbt;
>  extern struct bintime tick_bt;

Can we fix more style bugs in this file?  Here the variables were unsorted
and misindented.

> @@ -440,6 +438,9 @@ void	getbintime(struct bintime *bt);
>  void	getnanotime(struct timespec *tsp);
>  void	getmicrotime(struct timeval *tvp);
> 
> +void	getboottime(struct timeval *boottime);
> +void	getboottimebin(struct bintime *boottimebin);
> +
>  /* Other functions */
>  int	itimerdecr(struct itimerval *itp, int usec);
>  int	itimerfix(struct timeval *tv);
>

The new functions probably belong in a new section, but they are only in
a subsection of the previous section.  They are "Other", not ones described
in the comment for the previous section.  The comment also doesn't describe
the sbintime functions in the section.  The placement of these functions
also breaks the style of putting inline functions earlier.

Other style bugs visible here are verbose variable names and missing 'p'
in pointer names mostly for newer functions.  The worse names were copied
from the function implementations.

Bruce