[RFC] Kernel shared variables

Bruce Evans brde at optusnet.com.au
Fri Jun 1 21:21:55 UTC 2012


On Fri, 1 Jun 2012, Giovanni Trematerra wrote:

> I'd like to discuss a way to provide a mechanism to share some read-only
> data between kernel and user space programs avoiding syscall overhead,
> implementing some of them, such as gettimeofday(3) and time(3), as ordinary
> user space routines.

This is particularly unsuitable for implementing gettimeofday(), since for
it to work you would need to use approximately 1 CPU spinning in the
kernel to update the time every microsecond.  For time(3), it only needs
a relatively slow update.  For clock_gettime() with nanoseconds precision,
it is even more unsuitable.  For clock_gettime() with precisions between
1 second and 1 microsecond, it is intermediately unsuitable.

It also requires some complications for locking/atomicity and coherency
(much the same as in the kernel).  Not just for times.  For times, the
kernel handles locking/atomicity fairly well, and coherency fairly badly.

> The patch at
> http://www.trematerra.net/patches/ksvar_experimental.patch
>
> is in a very experimental stage. It's just a proof-of-concept.
> Only works for an AMD64 kernel and only for 64-bit applications.
> The idea is to have all the variables that we want to share between kernel
> and user space into one or more consecutive pages of memory that will be
> mapped read-only into every running process. At the start of the first
> shared page
> there'll be a table with as many entries as the number of the shared variables.
> Each entry is a 32-bit value that is the offset between the start of the shared
> page and the start of the variable in the page. The user space processes need
> to find out the map address of the shared page and use the table to access the
> shared variables.

On amd64, 2 32-bit values or 64-bit values with most bits 0 or 1 can be
packed/encoded into 1 64-bit value to give a certain atomicity without
locking.  The corresponding i386 packing into 1 32-bit value doesn't work
so well.

> ...

> Just as proof of concept I re-implemented gettimeofday(3) in user space.
> First of all, I didn't remove the entry in syscall.master, just renamed it to
> sys_gettimeofday. I need it for the fallback path.
> In the kernel I introduced a struct wall_clock.
>
> +struct wall_clock
> +{
> +       struct timeval  tv;
> +       struct timezone tz;
> +};

This is much larger than 64 bits.  struct timezone is relatively unimportant.
struct timeval is bloated on amd64 (128 bits), but can be packed into 64
bits (works for a few hundred years).  On i386, it could be packed into
20 bits for tv_usec and 12 bits for an offset for tv_sec.
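
To make the packing concrete, something like the following fits a
timeval into one 64-bit word that readers can load with a single
(atomic on amd64) access.  The 20/44-bit split is only my choice for
this sketch, not anything in the patch:

#include <stdint.h>
#include <sys/time.h>

/*
 * Sketch only: tv_usec (0..999999) fits in the low 20 bits; the
 * remaining 44 bits hold tv_sec, which is far more than the few
 * hundred years needed.
 */
static inline uint64_t
tv_pack(const struct timeval *tv)
{

	return ((uint64_t)tv->tv_sec << 20 | (tv->tv_usec & 0xfffff));
}

static inline void
tv_unpack(uint64_t v, struct timeval *tv)
{

	tv->tv_sec = v >> 20;
	tv->tv_usec = v & 0xfffff;
}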

> Now we defined a shared variable at index WALL_CLOCK_INDEX of type
> struct wall_clock and named wall_clock.
> Inside handleevents I update the info exported by wall_clock.
>
> +       struct timeval tv;
> +
> +       /* update time for userspace gettimeofday */
> +       microtime(&tv);

This is supposed to have no races (microtime() uses a generation counter
to recover from them), and gives the necessary microseconds precision,
provided it is called every microsecond.  This doesn't quite require
a CPU spinning to update.  You could also unmap the variable 1
microsecond after it is accessed and take a pagefault to update it
for the next access.

> +       wall_clock.tv = tv;

This has races (when userland accesses it in the middle of the kernel
update).

> +       wall_clock.tz.tz_minuteswest = tz_minuteswest;
> +       wall_clock.tz.tz_dsttime = tz_dsttime;

This has races, but rarely changes.

> +int
> +gettimeofday(struct timeval *tp, struct timezone *tzp)
> +{
> +
> +       /* fallback to syscall if kernel doesn't export ksvar */
> +       if (!KSVAR_IS_ACTIVE())
> +               return (sys_gettimeofday(tp, tzp));
> +
> +       if (tp != NULL)
> +               *tp = KSVAR(wall_clock)->tv;

Races.  These can be fixed using a generation counter as in the kernel,
but it is much harder since we are not in control of the update times
and the updates must occur much more frequently.  In the kernel, the
critical updates occur only every 1-10 msec (except when timers are
stopped).  This gives an offset, and a difference is added to it,
with atomicity for the difference given either naturally by the hardware
or by locking the hardware.  The generation count then changes only every
1-10 msec, and is itself updated only by implicit serialization
instructions.  No ordering is enforced for stores and loads, but the
implicit serializations have 1-10 msec to flush the stores and loads,
in any order provided that they happen during this time, for the
generation count to work.
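
For concreteness, the reader side of such a generation counter would
have to look roughly like the seqlock-style loop below.  The struct
layout, the odd-generation convention and the C11 atomics are
assumptions made for this sketch; the real problem, as above, is that
the kernel writer would have to bump the generation around every
update, i.e. roughly every microsecond:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of the kernel-updated record in the shared page. */
struct shared_time {
	_Atomic uint32_t gen;		/* odd while an update is in progress */
	_Atomic uint64_t packed_tv;	/* packed timeval as sketched earlier */
};

/*
 * Retry whenever the generation is odd (update in progress) or changed
 * between the two reads of it.
 */
static uint64_t
read_shared_time(struct shared_time *st)
{
	uint32_t gen;
	uint64_t tv;

	do {
		gen = atomic_load_explicit(&st->gen, memory_order_acquire);
		tv = atomic_load_explicit(&st->packed_tv,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while ((gen & 1) != 0 ||
	    gen != atomic_load_explicit(&st->gen, memory_order_relaxed));
	return (tv);
}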

> +       if (tzp != NULL)
> +               *tzp = KSVAR(wall_clock)->tz;

Minor races.

> +       return (0);
> +}

I don't see how to implement gettimeofday() in userland without duplicating
most of the kernel's timecounter code.  The kernel can keep an offset,
updated every 1-10 msec.  Userland calls the hardware timecounter and
scales it delicately as in the kernel, but with even more complications
to synchronize with ntpd micro-adjustments and other threads and
timecounter hardware.
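
Concretely, the kernel would have to export something like a timehands
snapshot and userland would have to redo the scaling itself.  The
struct, the names and the TSC-only assumption below are all invented
for this sketch (amd64-only, like the patch), and it ignores the ntpd
adjustments and timecounter changes just mentioned:

#include <stdint.h>
#include <x86intrin.h>			/* __rdtsc() */

/*
 * Hypothetical kernel-exported snapshot, refreshed every 1-10 msec:
 * the snapshot time as a bintime-like (sec, 2^-64 fraction) pair, the
 * TSC reading taken at the same instant, and a scale factor of
 * 2^64 / TSC frequency.
 */
struct tc_snapshot {
	uint64_t	sec;		/* seconds at snapshot time */
	uint64_t	frac;		/* 2^-64 sec fraction at snapshot time */
	uint64_t	tsc_base;	/* TSC reading at snapshot time */
	uint64_t	scale;		/* 2^64 / TSC frequency */
};

/*
 * Userland analogue of the kernel's binuptime(): read the TSC, scale
 * the delta with a 64x64->128 multiply and add it to the snapshot,
 * carrying from the fraction into the seconds.
 */
static void
snapshot_bintime(const struct tc_snapshot *s, uint64_t *sec, uint64_t *frac)
{
	unsigned __int128 p;
	uint64_t delta;

	delta = __rdtsc() - s->tsc_base;
	p = (unsigned __int128)delta * s->scale;
	*frac = s->frac + (uint64_t)p;		/* low 64 bits: fraction */
	*sec = s->sec + (uint64_t)(p >> 64) +	/* high 64 bits: whole seconds */
	    (*frac < s->frac);			/* carry from the fraction */
}

Converting the result to a timeval for gettimeofday() would then need
the same fraction-to-usec scaling that the kernel's microtime() does,
plus the generation-count dance above to get a consistent snapshot.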

> Now when a process calls gettimeofday, it will actually call that function.
> If the process makes a lot of calls to gettimeofday, we will see a
> performance boost.

Without the above problems being solved, we would see it not working
precisely when it is used a lot.  It would report time differences of
0 (or occasionally garbage from races) for events separated by hundreds
of thousands of nanoseconds.  gettimeofday()'s precision and accuracy
are well established, so there must be many programs that depend on
them.  The newer clock_gettime() interfaces give more choice and must
be used if sub-microsecond precision is wanted and sub-microsecond
accuracy is hoped for (with normal X86 hardware, an accuracy of ~10
nanoseconds is nearly possible for small differences in relative times
but not for absolute times).  But again, the low precision clock ids
aren't very useful, since if you actually use them enough to notice
their slowness, then they won't distinguish different events.  These
ids are especially bogus when clock_gettime() is implemented as a
syscall, since the syscall overhead is about 90% of the total overhead
provided the timecounter hardware is efficient
(clock_gettime(CLOCK_MONOTONIC, ...) might take 300 nsec, while
clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) takes 330 nsec.  Then
you may as well use the precise version).  But if
clock_gettime(CLOCK_MONOTONIC, ...) is a syscall as it needs to be for
good precision, while
clock_gettime(CLOCK_MONOTONIC_FAST_N_BROKEN, ...) is memory mapped, then
the latter becomes almost useful.
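
Those overheads are easy to measure with a trivial loop like the one
below (my sketch; the FAST id is FreeBSD-specific and the absolute
numbers depend on the kernel and the timecounter hardware):

#include <stdio.h>
#include <time.h>

/*
 * Rough overhead measurement: time n back-to-back clock_gettime()
 * calls, using the clock under test for the start/end timestamps.
 */
static double
overhead_ns(clockid_t id, int n)
{
	struct timespec t0, t1, ts;
	int i;

	clock_gettime(id, &t0);
	for (i = 0; i < n; i++)
		clock_gettime(id, &ts);
	clock_gettime(id, &t1);
	return (((t1.tv_sec - t0.tv_sec) * 1e9 +
	    (t1.tv_nsec - t0.tv_nsec)) / n);
}

int
main(void)
{

	printf("CLOCK_MONOTONIC:      %.0f nsec/call\n",
	    overhead_ns(CLOCK_MONOTONIC, 1000000));
#ifdef CLOCK_MONOTONIC_FAST
	printf("CLOCK_MONOTONIC_FAST: %.0f nsec/call\n",
	    overhead_ns(CLOCK_MONOTONIC_FAST, 1000000));
#endif
	return (0);
}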

BTW, the FAST_N_BROKEN clock ids advertise bogus precision in
clock_getres(), so you can't tell (except from their names) how much
worse their precision is than that of the unbroken versions.
   (clock_getres() is bogusly named (POSIX standard), since it returns
   the precision (the resolution is 1 nsec for all timespec
   interfaces).  POSIX clearly doesn't intend for clock_getres() to
   actually return the resolution, since that is known at compile time
   and parts of the specification of clock_getres() require precision
   semantics.  Many parts are confused in other ways, at least in 2001
   draft7, and are not implemented in FreeBSD.  E.g., clock_settime()
   is specified to truncate the time to a multiple of the "resolution"
   returned by clock_getres() with the same id.  This more or less
   assumes that the clock is in hardware ticks, with no
   micro-adjustments.  FreeBSD does micro-adjustments and delicate
   scaling of hardware clocks, with a low-level resolution of 2**-64
   seconds, and doesn't truncate the time in clock_settime().  File
   times are more interesting, since they often have to be truncated
   to fit on disks; POSIX only started specifying the details of this
   about 4 years ago.  In this mail, I try to use "precision"
   consistently instead of "resolution", except where describing POSIX
   bugs.)
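
The advertised values are easy to inspect; a throwaway program like
the following (my sketch; the non-POSIX ids are FreeBSD-specific)
prints what clock_getres() claims for a few ids:

#include <stdio.h>
#include <time.h>

/* Print what clock_getres() advertises for one clock id. */
static void
show_res(const char *name, clockid_t id)
{
	struct timespec ts;

	if (clock_getres(id, &ts) == 0)
		printf("%-22s %ld.%09ld\n", name, (long)ts.tv_sec, ts.tv_nsec);
}

int
main(void)
{

	show_res("CLOCK_REALTIME", CLOCK_REALTIME);
	show_res("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
#ifdef CLOCK_MONOTONIC_FAST
	show_res("CLOCK_MONOTONIC_FAST", CLOCK_MONOTONIC_FAST);
#endif
#ifdef CLOCK_VIRTUAL
	show_res("CLOCK_VIRTUAL", CLOCK_VIRTUAL);
#endif
#ifdef CLOCK_PROF
	show_res("CLOCK_PROF", CLOCK_PROF);
#endif
	return (0);
}
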
For old clock ids, FreeBSD returns something reasonable (the hardware
or virtual clock update period, rounded up), except for CLOCK_PROF and
CLOCK_VIRTUAL, for which it returns the period of the unrelated hz
clock (the actual clock is an impossible-to-express combination of the
stathz clock for sampling the user+sys decomposition, the cputick clock
for the total time, and these limited by the resolution of the
getrusage()/calcru() timeval-based API; 1 usec for the timeval
resolution is probably best on fast machines, but I use 1/stathz
seconds.)  For newer clock ids, FreeBSD returns the precision of the
precise clock for all the non-precise clocks (the latter have a virtual
clock update period of tc_tick/hz seconds, so they should advertise
that as their precision).  CLOCK_THREAD_CPUTIME_ID uses the cputicker
as its basic clock, but rounds to usec, so its precision is an integral
multiple of 1000 nsec.  The scaling for this is quite different from,
and more broken than, that for the old clock ids.  It rounds down
instead of up, and then rounds up to 1000 nsec iff the previous value
was 0.  The first scaling step has the nonsense factor of 1000000
instead of 1000000000, and the result is accidentally correct in the
usual case where the cputicker frequency is > 1000000, else garbage
(not a multiple of 1000, and usually too small):

% int
% kern_clock_getres(struct thread *td, clockid_t clock_id, struct timespec *ts)
% {
% 
% 	ts->tv_sec = 0;
% 	switch (clock_id) {
% 	case CLOCK_REALTIME:
% 	case CLOCK_REALTIME_FAST:
% 	case CLOCK_REALTIME_PRECISE:
% 	case CLOCK_MONOTONIC:
% 	case CLOCK_MONOTONIC_FAST:
% 	case CLOCK_MONOTONIC_PRECISE:
% 	case CLOCK_UPTIME:
% 	case CLOCK_UPTIME_FAST:
% 	case CLOCK_UPTIME_PRECISE:

Most of these shouldn't be here (or anywhere).

% 		/*
% 		 * Round up the result of the division cheaply by adding 1.
% 		 * Rounding up is especially important if rounding down
% 		 * would give 0.  Perfect rounding is unimportant.
% 		 */
% 		ts->tv_nsec = 1000000000 / tc_getfrequency() + 1;

Correct value for CLOCK_REALTIME and CLOCK_MONOTONIC.  Also for unportable
aliases of these, but those shouldn't exist.

% 		break;
% 	case CLOCK_VIRTUAL:
% 	case CLOCK_PROF:
% 		/* Accurately round up here because we can do so cheaply. */
% 		ts->tv_nsec = (1000000000 + hz - 1) / hz;

Should use stathz, or maybe just the resolution of a timeval (except
when the precision is limited more by the cputicker than by timevals;
then use the cputicker limit).

% 		break;
% 	case CLOCK_SECOND:
% 		ts->tv_sec = 1;
% 		ts->tv_nsec = 0;
% 		break;

Correct.  Unlike most cases, clock_gettime() with this returns an integral
multiple of the precision.  For hardware clock ids with precise hardware,
we don't really know the precision and don't want to round to it (it
will be a small number of nsec, perhaps 0, plus a fraction of a nanosec,
and we don't want to round everything based on these fractions).  But for
software clock ids, the precision will be about 1-10 msec and rounding to
a multiple of 1 or 10 msec would be good if that is the update frequency
(it will often be 1/hz with hz a nice multiple of 10, except for inaccuracies
in the interrupt frequency).

% 	case CLOCK_THREAD_CPUTIME_ID:
% 		/* sync with cputick2usec */
% 		ts->tv_nsec = 1000000 / cpu_tickrate();
% 		if (ts->tv_nsec == 0)
% 			ts->tv_nsec = 1000;
% 		break;

Nonsense scaling.  When cpu_tickrate() <= 1000000, the result is too
small by a factor of 1000 and often invalid since it is not a multiple
of 1000.  But on most or all supported arches, cpu_tickrate() is >
1000000 so the result of the division is 0 and this is fixed up to
the correct value of 1000.

The correct scaling is more like the above:
- use the same scale factor as above.  Here we are scaling to nsec, so
   it is almost irrelevant that cputick2usec() scales to usec.  We should
   just use 1000 from cputick2usec()'s limit on the precision in the
   (usual) case that that limit is stricter (higher) than the one obtained
   by correct scaling here
- when scaling to nsec, round up as above, not down as this does now
- when scaling to nsec, round up efficiently as above, not with branching
   logic as this does now.  But we probably need the branching logic to
   clamp up to 1000.
Note that clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) scales to usec
just to use old KPIs which only support timevals, although
clock_gettime() uses timespecs.
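
Putting those points together, the case would read something like the
following (an untested sketch of the correction, not a committed fix):

	case CLOCK_THREAD_CPUTIME_ID:
		/*
		 * Scale to nsec (factor 1000000000, not 1000000) and round
		 * up cheaply, as for the timecounter clock ids above, then
		 * clamp up to the 1000 nsec limit that cputick2usec()'s
		 * rounding to usec imposes anyway.
		 */
		ts->tv_nsec = 1000000000 / cpu_tickrate() + 1;
		if (ts->tv_nsec < 1000)
			ts->tv_nsec = 1000;
		break;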

% 	default:
% 		return (EINVAL);
% 	}
% 	return (0);
% }

Bruce

