KVM Clock

Julian Stecklina jsteckli at os.inf.tu-dresden.de
Wed Jan 15 16:59:25 UTC 2014


On Mi, 2014-01-15 at 17:04 +0100, Roger Pau Monné wrote:
> On 15/01/14 14:45, Julian Stecklina wrote:
> > On Mi, 2014-01-15 at 14:08 +0100, Roger Pau Monné wrote:
> >> On 15/01/14 13:05, Julian Stecklina wrote:
> >>> On 01/14/2014 05:13 PM, Peter Grehan wrote:
> >>>> Hi Julian,
> >>>>
> >>>>> is anyone working on KVM clock support for FreeBSD? If not, I
> >>>>> might take a shot at it.
> >>>>
> >>>> None I know of: go for it :)
> >>>
> >>> Works for me so far: 
> >>> https://github.com/blitz/freebsd/commit/cdc5f872b3e48cc0dda031fc7d6bdedc65c3148f
> >>
> >> Looking
> >>>
> >> at the code it seems some common routines could be shared
> >> between the KVM PV clock and the Xen PV clock
> >> (sys/dev/xen/timer/timer.c). The data passed from the hypervisor to
> >> the guest has exactly the same format (see struct vcpu_time_info in
> >> Xen public headers).
> > 
> > Yes, I know. Didn't get around to making it pretty yesterday evening. ;)
> > I'll post an updated patch, when I have some time. Any suggestions where
> > to put the two common functions?
> > 
> >> At a first sight the KVM clock can benefit from using scale_delta
> >> (which is going to be faster than the C version implemented in
> >> kvmclock_get_timecount),
> > 
> > At least somewhat on amd64. 32-bit looks pretty identical.
> > 
> >>  and xen_fetch_vcpu_tinfo is exactly the same
> >> as kvmclock_fetch.
> > 
> > I think xen_fetch_vcpu_tinfo has a subtle bug:
> >   217         do {
> >   218                 dst->version = src->version;
> >   219                 rmb();
> >   220                 dst->tsc_timestamp = src->tsc_timestamp;
> >   221                 dst->system_time = src->system_time;
> >   222                 dst->tsc_to_system_mul = src->tsc_to_system_mul;
> >   223                 dst->tsc_shift = src->tsc_shift;
> >   224                 rmb();
> >   225         } while ((src->version & 1) | (dst->version ^
> > src->version));
> > 
> > In line 225 src->version is potentially read twice. If you consider the
> > following schedule:
> > 
> > host starts updating data, sets src->version to 1
> > guest reads src->version (1) and stores it into dst->version.
> > guest copies inconsistent data
> > guest reads src->version (1) and computes xor with dst->version.
> > host finishes updating data and sets src->version to 2
> > guest reads src->version (2) and checks whether lower bit is not set.
> > while loop exits with inconsistent data!
> > 
> > I think the C standard (at least C11) permits this as it does not
> > specify in which order the two reads in line 225 need to happen.
> 
> According to the operator precedence and associativity in C
> (http://en.wikipedia.org/wiki/Operators_in_C_and_C%2B%2B#Operator_precedence),
> if I'm reading it right, the condition in the while line will be
> evaluated as follows (because of the left-to-right associativity of the
> | operator):
> 
> 1. (src->version & 1)
> 2. (dst->version ^ src->version)
> 3. result of 1 | result of 2
> 
> So AFAICT the flow that you describe could never happen, because
> (src->version & 1) is always done before (dst->version ^ src->version).

Operator precedence doesn't matter. Consider it written like this:

or(and(src->version, 1), xor(dst->version, src->version))

C gives you no guarantees whether the and or the xor will be executed
first. There is no sequence point in between. And even with a sequence
point, the C11 memory model gives you no guarantees about how the reads
are ordered.

This discussion is somewhat theoretical, because
 a) the hypervisor will probably update the data structure in the
current vCPU context (making memory consistency issues go away).
 b) the compiler will likely not be an asshole. ;)

Pseudocode for a better fetch could be:

dst->version = src->version;
rmb();
... read data ...
rmb();
version_post = src->version;
rmb(); <- keeps the compiler from reading src->version multiple times

and then doing the test with version_post and dst->version.
Unfortunately, rmb() expands into lfence, even if there is no need for
that here. 

Regards,
Julian



More information about the freebsd-virtualization mailing list