svn commit: r297039 - head/sys/x86/x86

Bruce Evans brde at optusnet.com.au
Thu Mar 24 20:41:05 UTC 2016


On Thu, 24 Mar 2016, Konstantin Belousov wrote:

> On Fri, Mar 25, 2016 at 01:57:32AM +1100, Bruce Evans wrote:
>>> In fact, if we can use TSC with the only requirement of being monotonic,
>>> I do not see why do we need TSC at all. We can return to pre-r278325
>>> loop, but calibrate the number of loop iterations for known delay in
>>> 1us, once on boot.  Do you agree with this ?
>>
>> As a comment in the previous version says, the old method is highly
>> bogus since it is sensitive to CPU clock speed.
> We do not care if we time-out in the 5usec or 50usec.  We can even wait
> for 1 second.  The only thing that must be avoided is infinite wait to
> prevent a situation when CPU spins with disabled interrupts infinitely.

That's good then.  However, 1 second is really too long.  Related timeouts
in kern_mutex.c and subr_smp.c are quite broken since they use either a
hard-coded timeout or a miscalibrated DELAY():

kern_mutex.c:
X 	while (!_mtx_obtain_lock(m, tid)) {
X 
X 		/* Give interrupts a chance while we spin. */
X 		spinlock_exit();
X 		while (m->mtx_lock != MTX_UNOWNED) {
X 			if (i++ < 10000000) {
X 				cpu_spinwait();
X 				continue;
X 			}

The loop does less (just cpu_spinwait()) while i is small, but "small"
i here means up to 10 million iterations.  On slow systems this alone
might take many seconds, though perhaps not as many as 60.

X 			if (i < 60000000 || kdb_active || panicstr != NULL)
X 				DELAY(1);
X 			else
X 				_mtx_lock_spin_failed(m);
X 			cpu_spinwait();

This is supposed to give a timeout of 50 seconds longer than the timeout
for small i.  But DELAY(1) is not very accurate.  In old x86 kernels
where DELAY(1) always uses the i8254, DELAY(1) took about 5 usec on
most systems (it takes 30+ usec on i486, but i486 doesn't support SMP
so this code is not used there; DELAY(1) is still very inaccurate
elsewhere).

So this often gave a timeout of ~300 seconds (50 million iterations of
DELAY(1) at ~5 usec each is ~250 seconds, on top of the initial spin).
Even 60 is too long to wait.

I use the following fixes (a rough sketch of the result follows the
snippet):
- reduce 10 million to 1 million
- reduce 50 seconds to 10 seconds plus the timeout for small i
- change DELAY(1) to DELAY(100) and scale the limit to match.

X 		}
X 		spinlock_enter();
X 	}
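
A rough sketch of the inner loop with those changes applied (the
constants are illustrative, from memory: 1 million plain spins, then
100000 * DELAY(100) for ~10 seconds of extra wait independent of the
CPU clock; the rest is unchanged from the snippet above):

		while (m->mtx_lock != MTX_UNOWNED) {
			if (i++ < 1000000) {		/* was 10000000 */
				cpu_spinwait();
				continue;
			}
			/* 100000 more iterations * DELAY(100) ~= 10 seconds. */
			if (i < 1100000 || kdb_active || panicstr != NULL)
				DELAY(100);		/* was DELAY(1) */
			else
				_mtx_lock_spin_failed(m);
			cpu_spinwait();
		}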

subr_smp.c:
Y 	while (!CPU_SUBSET(cpus, &map)) {
Y 		/* spin */
Y 		cpu_spinwait();
Y 		i++;
Y 		if (i == 100000000) {
Y 			printf("timeout stopping cpus\n");
Y 			break;
Y 		}
Y 	}

This uses a hard-coded 100 million iterations with no DELAY() to
calibrate it.  FreeBSD-7 uses only 100 thousand here.  Neither scales
with CPU speed.  My version uses a variable timeout with a default of
10 million or 100 million (sketched below).  I just remembered that I
am getting lots of these timeouts when entering ddb on a Haswell CPU
with certain kernels between FreeBSD-7 and -current.  ddb tends to
deadlock soon after.  That might be just the small timeout; I had
forgotten that I changed it.
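
Roughly what I mean by a variable timeout, as a sketch (the variable
and sysctl names are made up here, not the exact patch):

	static int stop_cpus_timeout = 100000000;	/* iterations */
	SYSCTL_INT(_debug, OID_AUTO, stop_cpus_timeout, CTLFLAG_RWTUN,
	    &stop_cpus_timeout, 0, "spin iterations to wait for CPUs to stop");

		while (!CPU_SUBSET(cpus, &map)) {
			/* spin */
			cpu_spinwait();
			i++;
			if (i == stop_cpus_timeout) {
				printf("timeout stopping cpus\n");
				break;
			}
		}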

>> My systems allow speed variations of about 4000:800 = 5:1 for one CPU and
>> about 50:1 for different CPUs.  So the old method gave a variation of up
>> to 50:1.  This can be reduced to only 5:1 using the boot-time calibration.
> What do you mean by 'for different CPUs' ?  I understand that modern ESS
> can give us CPU frequency between 800-4200MHz, which is what you mean
> by 'for one CPU'.  We definitely do not care if 5usec timeout becomes
> 25usecs, since we practically never time-out there at all.

Yes, I actually get 4400:800 on an i7-4790K.

The ratio is even larger than that with a hard-coded limit because old
CPUs are much slower than an i7-4790K.  I sometimes run a 367 MHz (P2
class) CPU.  It is several times slower than a new CPU at the same
clock frequency, and any throttling would make it even slower.

50 times slower means that a reasonable emergency timeout of 60 seconds
becomes 3000 seconds.  Local users would get tired of waiting and reset,
and remote users might have to wait.

I see that the largest lapic wait parameter is only 50000 usec.  100
times longer than that wouldn't be too bad.  Timeouts in subr_smp.c
are already very long.  Actually there is a problem with the hard-coded
100 million iteration timeout there: its real-time length is unknown
and unrelated to the length of the low-level timeout.  It needs to be
much longer, to give the other CPUs time to stop.

> As I understand the original report, the LAPIC becomes ready for next IPI
> command much faster than 1usec, but not that fast to report readiness on
> the next register read.
>
> The timeout is more practical for APs startup, where microcode must
> initialize and test core, which does take time, and sometimes other
> issues do prevent APs from starting. But this happens during early boot,
> when ESS and throttling cannot happen, so initial calibration is more or
> less accurate.

There is another thread about early DELAY() using the i8254 not working
to calibrate the TSC.  That might be just because DELAY() is interrupted.
DELAY() never bothered to disable interrupts.  Its early use for calibrating
the TSC depends on interrupts mostly not happening then.  (My version is
a bit more careful, but it still doesn't disable interrupts.  It
establishes error bounds provided interrupts are shorter than the i8254
wrap period.)  If the i8254 is virtual, then even disabling interrupts
on the target wouldn't help, since the disabling would only be virtual.
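
The error-bounding idea, as a sketch (not the actual DELAY() or
calibration code; rdtsc() is the usual cpufunc.h inline, and the
result variable is made up).  An interrupt can only lengthen a trial,
so the shortest of a few trials is closest to the truth, provided no
interrupt is longer than the i8254 wrap period that DELAY() itself
relies on:

	uint64_t best, delta, t0;
	int trial;

	best = UINT64_MAX;
	for (trial = 0; trial < 5; trial++) {
		t0 = rdtsc();
		DELAY(1000);		/* nominally 1000 usec via the i8254 */
		delta = rdtsc() - t0;
		if (delta < best)
			best = delta;	/* least-disturbed trial */
	}
	tsc_est_freq = best * 1000;	/* ~TSC ticks per second */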

Bruce

