svn commit: r297039 - head/sys/x86/x86

Bruce Evans brde at optusnet.com.au
Thu Mar 24 00:27:10 UTC 2016


On Wed, 23 Mar 2016, John Baldwin wrote:

> On Wednesday, March 23, 2016 09:58:42 AM Konstantin Belousov wrote:
>> On Mon, Mar 21, 2016 at 11:12:57AM -0700, John Baldwin wrote:
>>> On Saturday, March 19, 2016 05:22:16 AM Konstantin Belousov wrote:
>>>> On Fri, Mar 18, 2016 at 07:48:49PM +0000, John Baldwin wrote:
>>>>>
>>>>> -	for (x = 0; x < delay; x += 5) {
>>>>> +	for (x = 0; x < delay; x++) {
>>>>>  		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
>>>>>  		    APIC_DELSTAT_IDLE)
>>>>>  			return (1);
>>>>> -		DELAY(5);
>>>>> +		DELAY(1);
>>>>>  	}
>>>>>  	return (0);
>>>>>  }
>>>>
>>>> Ideally we would structure the loop differently. I think it is more
>>>> efficient WRT latency to only block execution by ia32_pause(), and
>>>> compare the getbinuptime() results to calculate the time spent, on each
>>>> loop step.
>>>
>>> Yes.  I've thought about using the TSC directly to do that, but folks are
>>> worried about the TSC being unstable due to vcpus in a VM migrating
>>> across physical CPUs.  DELAY() does seem to DTRT in that case assuming the
>>> hypervisor doesn't advertise an invariant TSC via cpuid.  We'd have to
>>> essentially duplicate DELAY() (really delay_tc()) inline.
>>
>> If the TSC has the behaviour you described, i.e. suddenly jumping by
>> random steps on a single CPU from the kernel's point of view, then the
>> system is seriously misbehaving.  The timekeeping stuff would be badly
>> broken regardless of the ipi_wait().  I do not see why we should worry
>> about that in ipi_wait().
>>
>> I proposed a slightly different thing, i.e. using the timekeeping code to
>> indirect to the TSC if it is configured so.  Below is the proof-of-concept
>> patch; the use of nanouptime() may be too naive, and binuptime() would have
>> a tiny bit less overhead, but I do not want to think about the arithmetic.
>
> As you noted, the issue is if a timecounter needs locks (e.g. i8254) though
> outside of that I think the patch is great. :-/  Of course, if the TSC
> isn't advertised as invariant, DELAY() is talking to the timecounter
> directly as well.

The i8254 locks work better in practice than in theory.  Timecounter code
is called from very low levels (fast interrupt handlers) and must work
from there.  And the i8254 timecounter does work in fast interrupt handlers.
The above loop is slightly (?) lower level, so it must be more careful.

DELAY() talks directly to the i8254 if the TSC is not invariant and
the timecounter uses the i8254.  Then the timecounter is slow and
otherwise unusable for DELAY(), since it would deadlock in ddb, so the
old i8254 DELAY(), which is faster and more careful, is used.  The same
(fudged recursive) locking would work here.  But you don't want to use
the i8254 or any other slow timecounter hardware or software.  They
all have a large latency of ~1 usec minimum.
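
For concreteness, a minimal sketch of the kind of timecounter-bounded
wait being proposed (not Kostik's posted patch; lapic_read_icr_lo(),
APIC_DELSTAT_MASK, APIC_DELSTAT_IDLE and ia32_pause() are as in the
quoted diff, nanouptime() and timespeccmp() come from <sys/time.h>, and
the timespec arithmetic is open-coded):

static int
lapic_ipi_wait_tc(int delay_us)
{
        struct timespec end, now;

        /* Compute an absolute deadline delay_us microseconds from now. */
        nanouptime(&end);
        end.tv_sec += delay_us / 1000000;
        end.tv_nsec += (delay_us % 1000000) * 1000;
        if (end.tv_nsec >= 1000000000L) {
                end.tv_sec++;
                end.tv_nsec -= 1000000000L;
        }
        for (;;) {
                if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
                    APIC_DELSTAT_IDLE)
                        return (1);
                nanouptime(&now);
                if (timespeccmp(&now, &end, >=))
                        return (0);
                ia32_pause();
        }
}

The per-iteration nanouptime() here is exactly the overhead Kostik
mentions; batching the polls, as in the sketch near the end of this
mail, is one way to reduce it.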

> However, I think we probably can use the TSC.  The only specific note I got
> from Ryan (cc'd) was about the TSC being unstable as a timecounter under KVM.
> That doesn't mean that the TSC is non-monotonic on a single vCPU.  In fact,

It also doesn't need to be invariant provided it is usually monotonic
and doesn't jump ahead by a lot.  Or you can just use a calibrated
loop.  The calibration gets complicated if the CPU is throttled or
otherwise has a variable frequency.  One case is a loop with
ia32_pause() in it.  The pause length can be calibrated for most cases
and is probably longer than the rest of the loop, but it is hard to
be sure that the CPU didn't change it without telling you.  Long loops
can easily recalibrate themselves by checking an external timer not
very often, but that doesn't work for short loops (ones shorter than
the timer access time).
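
A rough sketch of the calibration I mean (the variable name and the
iteration count are invented for illustration, not existing kernel
symbols, and a real version would have to cope with frequency changes):

static u_int pause_loops_per_us = 1;

static void
calibrate_pause_loop(void)
{
        struct timespec t0, t1;
        int64_t ns;
        u_int i;

        /* Time a fixed number of ia32_pause()s against the timecounter. */
        nanouptime(&t0);
        for (i = 0; i < 100000; i++)
                ia32_pause();
        nanouptime(&t1);
        ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000 +
            (t1.tv_nsec - t0.tv_nsec);
        if (ns > 0)
                pause_loops_per_us = (u_int)(100000LL * 1000 / ns);
        if (pause_loops_per_us == 0)
                pause_loops_per_us = 1;
}

A short wait of n usec is then roughly a loop of n * pause_loops_per_us
ia32_pause()'s, with a recheck against an external timer only for long
waits.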

> thinking about this more I have a different theory to explain how the TSC
> can be out of whack on different vCPUs even if the hardware TSC is in sync
> in the physical CPUs underneath.
>
> One of the things present in the VMCS on Intel CPUs using VT-x is a TSC
> adjustment.  The hypervisor can alter this TSC adjustment during a VM-exit to
> alter the offset between the TSC in the guest and the "real" TSC value in the
> physical CPU itself.  One way a hypervisor might use this is to try to
> "pause" the TSC during a VM-exit by taking TSC timestamps at the start and
> end of a VM-exit and adding that delta to the TSC offset just before each
> VM-entry.  However, if you have two vCPUs, one of which is running in the
> guest and one of which is handling a VM-exit in the hypervisor, the TSC on
> the first vCPU will run while the effective TSC of the second vCPU is paused.
> When the second vCPU resumes after a VM-entry, its TSC will now "unpause",
> but it will lag the first vCPU by however long it took to handle its VM-exit.
>
> It wouldn't surprise me if KVM was doing this.  bhyve does not do this to my
> knowledge (so the TSC is actually still usable as a timecounter under bhyve
> for some value of "usable").  However, even with this TSC pausing/unpausing,
> the TSC would still increase monotonically on a single vCPU.  For the purposes
> of DELAY() (and other spin loops on a pinned thread such as in
> lapic_ipi_wait()), that is all you need.

Is monotonic really enough?  Suppose you want to wait at least 1 usec.  Then
you can't trust the timer if it does a combination of jumps that add up to
a significant fraction of 1 usec.

To minimise latency, I would try a tight loop with occasional checks.  E.g.,
10-1000 lapic reads separated by ia32_pause()'s, then check the time.  It
isn't clear how to minimise power use for loops like this.  I couldn't
find anything better than mwait for cooling in loops in ddb i/o.
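
The batched version might look roughly like this (only the inner burst
differs from the timecounter sketch earlier in this mail; the burst
size of 32 is arbitrary):

static int
lapic_ipi_wait_batched(int delay_us)
{
        struct timespec end, now;
        int i;

        /* Same absolute deadline setup as in the earlier sketch. */
        nanouptime(&end);
        end.tv_sec += delay_us / 1000000;
        end.tv_nsec += (delay_us % 1000000) * 1000;
        if (end.tv_nsec >= 1000000000L) {
                end.tv_sec++;
                end.tv_nsec -= 1000000000L;
        }
        for (;;) {
                /* A burst of cheap ICR polls, then one slow time check. */
                for (i = 0; i < 32; i++) {
                        if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
                            APIC_DELSTAT_IDLE)
                                return (1);
                        ia32_pause();
                }
                nanouptime(&now);
                if (timespeccmp(&now, &end, >=))
                        return (0);
        }
}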

Bruce

