svn commit: r219646 - head/sys/x86/isa

Tue Mar 15 18:13:59 UTC 2011

On Tue, 15 Mar 2011, Jung-uk Kim wrote:

> On Monday 14 March 2011 10:31 pm, Bruce Evans wrote:
>> On Mon, 14 Mar 2011, Jung-uk Kim wrote:
>>> Log:
>>>  When TSC is unavailable, broken or disabled and the current
>>> timecounter has better quality than i8254 timer, use it for
>>> DELAY(9).
>>
>> You cannot use a random timecounter for DELAY().  DELAY() is
>> carefully written to not use any locks, so that it doesn't deadlock
>> when called in the "any" context from a bad console driver like
>> syscons when syscons is called from ddb.  (Bad console drivers may
>> have their own locks which deadlock in ddb, but that is another
>> problem.)  Even the i8254 timer, which is the only timer that
>> should be used by DELAY(), normally use locks that used to cause
>> deadlock, but DELAY() was unbroken to bypass the locks when it is
>> called from ddb.
>
> I disagree.  I think we should keep away from i8254 as much as
> possible.

It is adequate for DELAY(), and is the only timer that is available on
all x86.  You need large complications in DELAY() to make it work as
well as the old code using the i8254, for no gain until there is an
x86 without an i8254.

> Actually, i8254 is the only timer that requires locking in
> the first place because of its braindead design.  All other x86
> timers do not require any locking at all.  We even sacrificed upper
> 32 bits of HPET in favor of i386. :-(

Did we?  Timecounters use only 32 bits as an optimization.  Scaling of
64-bit values is slow even on 64-bit machines since it needs 128-bit
intermediate values for the multiplication/shift method to work.

>> Cpufreq and other calibration code should use a normal timecounter
>> via nanouptime() in all cases and never go near low level
>> timecounters or DELAY().  There are (hopefully minor) problems
>> getting the nanotime() initialized before it used.  The dummy
>> timecounter is just what you don't want to use.
>
> That's exactly the problem.

Your fix won't make much difference then.  DELAY() will start with a
dummy timecounter and probably also a zero tsc_freq, so it will use
the i8254 for various things, probably including initial calibration
of the TSC/cpufreq.  Later it may see a better timecounter and use
that to get a better calibration of the TSC/cpufreq.  But its caller
can just as easily check for a better timecounter and not use DELAY()
if it can use a timecounter, without messing with DELAY()'s internals
and having to guard against reentrancy from ddb.  delay_timecounter()
already has all the code for this, except the check for a better
timecounter is external.

>> Even in this patch, it isn't clear that the low level timecounters
>> are initialized before they are used.
>
> The timecounter is always initialized first, then set.

It is still unclear whether they are initialized enough before they
are used since their setting is neither locked nor atomic.  Consider
the TSC initialization.  This is init_TSC() from startrtclock() and
init_TSC_tc() from cpu_initclocks().  The first should make the low-
level timecounter useable, but this is unclear.  The second attaches
the low-level timecounter to the high-level timecounter data structures.
There is no locking for this except possibly Giant.  tsc_freq is
initialized early, so the tests of it will succeed long before the TSC
can become the high-level timecounter.  It is probably usable at a low
level, but this is unclear.  tsc_freq is 64 bits, and the accesses to
it are non-atomic, so especially on 32-bit machines there are races
accessing it.  In fact, I can now see a reproducible bug in DELAY():
- find an x86 that can run at 4GHz
- run it in 386 mode
- start it so that tsc_freq is 0x10000000ULL
- use the machdep.tsc_freq sysctl to tune tsc_freq to 0xFFFFFFFF
- trace through this sysctl using gdb, and using a low quality console
   driver like syscons which calls DELAY().
- the 2 words in tsc_freq will normally be written from the low word
   up.  It will be seen to have value 0x1FFFFFFFFULL after only the first
   word is written.  Not too bad -- just of by a factor of 2.
- I can't quite see how to make it have a really broken value.  At first
   I tried and example with it starting with a value if 0xFFFFFFFF and
   changing it to 0x100000000ULL.  Then after changing only its low word,
   it is 0 so DELAY() will not use the TSC.
Related non-reproducible bugs are also easy to see:
- on one CPU, spin calling the sysctl to toggly tsc_freq between 4G-1 and
   4G
- on another CPU, spin doing something that calls DELAY().  Since there
   is no locking or atomic accesses for the writes to tsc_freq in the sysctl
   or for the reads of it in DELAY(), not to mention elsewhwere, tsc_freq
   can read as zero when it is being changed between 2 nonzero values.  Not
   too bad if it reads as zero -- then it won't be used.  But uses of it
   following it being read as nonzero may find it with changed values that
   are very bad, for example zero.
These bugs are in the old code in DELAY() that uses the TSC, and perhaps
in all code of the form "if (tsc_freq != 0) use(<tsc>)".  Fixes for them
should probably use a generation count.

init_TSC_tc() finishes with tc_init(&tsc_timecounter) which does most of
the work for attaching to the higher-level timecounter structures.
tc_init() has no locking or atomic operations, which I think gives it
races in some contexts, but not in ones involving single stepping it
using ddb, since then there are no other CPUs to race with.  The
races are the potentially-unordered writes of all the variables set
by tc_init().

>>> Modified: head/sys/x86/isa/clock.c
>>> =================================================================
>>> ============= --- head/sys/x86/isa/clock.c	Mon Mar 14 19:31:43
>>> 2011	(r219645) +++ head/sys/x86/isa/clock.c	Mon Mar 14 22:05:59
>>> 2011	(r219646) @@ -245,6 +245,42 @@ getit(void)
>>> 	return ((high << 8) | low);
>>> }
>>>
>>> +static __inline void
>>> +delay_tsc(int n)
>>> +{
>>> +	uint64_t start, end, now;
>>> +
>>> +	sched_pin();
>>> +	start = rdtsc();
>>> +	end = start + (tsc_freq * n) / 1000000;
>>> +	do {
>>> +		cpu_spinwait();
>>> +		now = rdtsc();
>>> +	} while (now < end || (now > start && end < start));
>>> +	sched_unpin();
>>> +}
>>
>> You cannot call random scheduling code from DELAY(), since the
>> scheduling code is not designed to be called in the "any" context.
>> As it happens, sched_pin() and sched_unpin() are safe, and were
>> already used in the TSC case.
>
> Actually, I really like to get rid of this code.  It isn't safe for
> many reasons, e.g., its frequency may change while in the loop and
> end result is unpredictable if TSC is not invariant.  We may limit
> its use for invariant case, though.

Me too.  It can also be used of the TSC is the timecounter, since it it
is good enough for a timecounter than it is good enough for this.  But
it is a lot of work for no benefit to make all cases work.

>>> +static __inline void
>>> +delay_timecounter(struct timecounter *tc, int n)
>>> +{
>>> +	uint64_t end, now;
>>> +	u_int last, mask, u;
>>> +
>>> +	mask = tc->tc_counter_mask;
>>> +	last = tc->tc_get_timecount(tc) & mask;
>>> +	end = tc->tc_frequency * n / 1000000;
>>
>> This depends on the delicate timecounter locking to be safe.  I
>> think it it is safe except possibly in tc_get_timecount(), for the
>> same reasons that nanouptime() can run safely with no visible
>> locking.  tc must be the current timecounter/timehands, which is
>> guaranteed to not be in an in-between state or become so while we
>> are using it, despite there being no visible locking.  Actually
>> there are some minor races:
>> - in nanouptime(), the timecounter generation is checked to avoid
>> using a new generation if necessary, but updates of the generation
>> count and the fileds that it protects are not properly atomic.
>> - here, the check of the generation is missing.  So this works
>> without races when called from ddb (since ddb stops other CPUs),
>> but has minor races in general.
>
> Window of the races is very narrow here and it only happens on i386
> where reading uint64_t is not atomic if my understanding is correct.

I can easily make race windows very wide by single stepping them, and
usually do for testing code like this :-).

> It cannot be any worse than tsc_freq anyway. ;-)

Yes, but it extends old mistakes.

>> Optimizing DELAY() was bogus.  Now the loop for this is
>> duplicatied.
>
> "Bogus" optimization is not exactly duplicated here if you meant "(now
> < end || (now > start && end < start))" is bogus.  Its bogusness
> comes from the fact that TSC is 64-bit.  However, timecounter is
> effectively 32-bit because of its mask.  Now overflow is carefully
> accounted for and it is (almost) correct.

I meant not using the i8254.  Even DELAY()'s API limits it to 1 uS
resolution.  Who cares if the hardware read extends this to 5 uS?

Duplication can be avoided by replacing the rdstc() by conditional
code like (using_tsc ? rdtsc() : (tc->tc_get_timecount(tc) & mask)
in the loop.  We have a 1 uS resolution and a pause in the loop,
so we don't care that the loop takes a little longer due to the
conditional in it.

My userland code for this does a full sysctl to read an emulated
timecounter for this.  Its loop is:

% 	binuptime(&start_bt);
% 	start_tsc = rdtsc();
% 	do {
% 		binuptime(&sample_bt);
% 		tsc = rdtsc();
% 		binuptime(&bt);
% 		bintime_sub(&bt, &sample_bt);
% 		if (bintimecmp(&bt, &best_bt, <)) {
% 			best_bt = bt;
% 			best_sample_bt = sample_bt;
% 			best_tsc = tsc;
% 		}
% 		bt = sample_bt;
% 		bintime_sub(&bt, &start_bt);
% 	} while (++nsamples < min_nsamples ||
% 	    bintimecmp(&best_bt, min_error_btp, >));

binuptime() is emulated so that it has the same API as the kernel.  It
is actually clock_gettime(CLOCK_MONOTONIC, &ts) followed by error handling
and conversion to bintime.  The loop is missing the equivalent of
sched_pin().  I don't know exactly how to do that in userland.  Is
there a single syscall that does it?  I used cpuset(1).

Bruce