svn commit: r219646 - head/sys/x86/isa

Tue Mar 15 19:26:31 UTC 2011

On Tuesday 15 March 2011 02:13 pm, Bruce Evans wrote:
> On Tue, 15 Mar 2011, Jung-uk Kim wrote:
> > On Monday 14 March 2011 10:31 pm, Bruce Evans wrote:
> >> On Mon, 14 Mar 2011, Jung-uk Kim wrote:
> >>> Log:
> >>>  When TSC is unavailable, broken or disabled and the current
> >>> timecounter has better quality than i8254 timer, use it for
> >>> DELAY(9).
> >>
> >> You cannot use a random timecounter for DELAY().  DELAY() is
> >> carefully written to not use any locks, so that it doesn't
> >> deadlock when called in the "any" context from a bad console
> >> driver like syscons when syscons is called from ddb.  (Bad
> >> console drivers may have their own locks which deadlock in ddb,
> >> but that is another problem.)  Even the i8254 timer, which is
> >> the only timer that should be used by DELAY(), normally use
> >> locks that used to cause deadlock, but DELAY() was unbroken to
> >> bypass the locks when it is called from ddb.
> >
> > I disagree.  I think we should keep away from i8254 as much as
> > possible.
>
> It is adequate for DELAY(), and is the only timer that is available
> on all x86.  You need large complications in DELAY() to make it
> work as well as the old code using the i8254, for no gain until
> there is an x86 without an i8254.

Intel started killing off i8254 (and other "legacy" ISA devices), 
starting from "Mobile Internet Device" platforms.  Even for other 
so-called "legacy-free" platforms, it isn't adequate any more because 
of unpredictable SMI# interference.

> > Actually, i8254 is the only timer that requires locking in
> > the first place because of its braindead design.  All other x86
> > timers do not require any locking at all.  We even sacrificed
> > upper 32 bits of HPET in favor of i386. :-(
>
> Did we?  Timecounters use only 32 bits as an optimization.  Scaling
> of 64-bit values is slow even on 64-bit machines since it needs
> 128-bit intermediate values for the multiplication/shift method to
> work.

No, I am talking about reading raw value of HPET.  For amd64, it can 
be used as a good alternative to TSC if we detect it earlier for 
legacy-free platforms.  Unfortunately, it is only useful for 
timecounter ATM.

> >> Cpufreq and other calibration code should use a normal
> >> timecounter via nanouptime() in all cases and never go near low
> >> level timecounters or DELAY().  There are (hopefully minor)
> >> problems getting the nanotime() initialized before it used.  The
> >> dummy timecounter is just what you don't want to use.
> >
> > That's exactly the problem.
>
> Your fix won't make much difference then.  DELAY() will start with
> a dummy timecounter and probably also a zero tsc_freq, so it will
> use the i8254 for various things, probably including initial
> calibration of the TSC/cpufreq.  Later it may see a better
> timecounter and use that to get a better calibration of the
> TSC/cpufreq.  But its caller can just as easily check for a better
> timecounter and not use DELAY() if it can use a timecounter,
> without messing with DELAY()'s internals and having to guard
> against reentrancy from ddb.  delay_timecounter() already has all
> the code for this, except the check for a better timecounter is
> external.

Yes, it doesn't make much difference for that ATM.  However, it DOES 
make differences when DELAY() is used for device drivers.  What I 
really want is DELAY() does the right thing, e.g., holding no spin 
lock, not pinned on a CPU, no tsc_freq hackery, and no side-effect of 
any kind.

> >> Even in this patch, it isn't clear that the low level
> >> timecounters are initialized before they are used.
> >
> > The timecounter is always initialized first, then set.
>
> It is still unclear whether they are initialized enough before they
> are used since their setting is neither locked nor atomic. 
> Consider the TSC initialization.  This is init_TSC() from
> startrtclock() and init_TSC_tc() from cpu_initclocks().  The first
> should make the low- level timecounter useable, but this is
> unclear.  The second attaches the low-level timecounter to the
> high-level timecounter data structures. There is no locking for
> this except possibly Giant.  tsc_freq is initialized early, so the
> tests of it will succeed long before the TSC can become the
> high-level timecounter.  It is probably usable at a low level, but
> this is unclear.  tsc_freq is 64 bits, and the accesses to it are
> non-atomic, so especially on 32-bit machines there are races
> accessing it.  In fact, I can now see a reproducible bug in
> DELAY(): - find an x86 that can run at 4GHz
> - run it in 386 mode
> - start it so that tsc_freq is 0x10000000ULL
> - use the machdep.tsc_freq sysctl to tune tsc_freq to 0xFFFFFFFF

machdep.tsc_freq should never have modified tsc_freq in the first 
place.  I am going to remove that soon.

> - trace through this sysctl using gdb, and using a low quality
> console driver like syscons which calls DELAY().
> - the 2 words in tsc_freq will normally be written from the low
> word up.  It will be seen to have value 0x1FFFFFFFFULL after only
> the first word is written.  Not too bad -- just of by a factor of
> 2. - I can't quite see how to make it have a really broken value. 
> At first I tried and example with it starting with a value if
> 0xFFFFFFFF and changing it to 0x100000000ULL.  Then after changing
> only its low word, it is 0 so DELAY() will not use the TSC.
> Related non-reproducible bugs are also easy to see:
> - on one CPU, spin calling the sysctl to toggly tsc_freq between
> 4G-1 and 4G
> - on another CPU, spin doing something that calls DELAY().  Since
> there is no locking or atomic accesses for the writes to tsc_freq
> in the sysctl or for the reads of it in DELAY(), not to mention
> elsewhwere, tsc_freq can read as zero when it is being changed
> between 2 nonzero values.  Not too bad if it reads as zero -- then
> it won't be used.  But uses of it following it being read as
> nonzero may find it with changed values that are very bad, for
> example zero.
> These bugs are in the old code in DELAY() that uses the TSC, and
> perhaps in all code of the form "if (tsc_freq != 0) use(<tsc>)". 
> Fixes for them should probably use a generation count.
>
> init_TSC_tc() finishes with tc_init(&tsc_timecounter) which does
> most of the work for attaching to the higher-level timecounter
> structures. tc_init() has no locking or atomic operations, which I
> think gives it races in some contexts, but not in ones involving
> single stepping it using ddb, since then there are no other CPUs to
> race with.  The races are the potentially-unordered writes of all
> the variables set by tc_init().

Now don't you think we should really kill delay by TSC? ;-)

> >>> Modified: head/sys/x86/isa/clock.c
> >>> ===============================================================
> >>>== ============= --- head/sys/x86/isa/clock.c	Mon Mar 14
> >>> 19:31:43 2011	(r219645) +++ head/sys/x86/isa/clock.c	Mon Mar 14
> >>> 22:05:59 2011	(r219646) @@ -245,6 +245,42 @@ getit(void)
> >>> 	return ((high << 8) | low);
> >>> }
> >>>
> >>> +static __inline void
> >>> +delay_tsc(int n)
> >>> +{
> >>> +	uint64_t start, end, now;
> >>> +
> >>> +	sched_pin();
> >>> +	start = rdtsc();
> >>> +	end = start + (tsc_freq * n) / 1000000;
> >>> +	do {
> >>> +		cpu_spinwait();
> >>> +		now = rdtsc();
> >>> +	} while (now < end || (now > start && end < start));
> >>> +	sched_unpin();
> >>> +}
> >>
> >> You cannot call random scheduling code from DELAY(), since the
> >> scheduling code is not designed to be called in the "any"
> >> context. As it happens, sched_pin() and sched_unpin() are safe,
> >> and were already used in the TSC case.
> >
> > Actually, I really like to get rid of this code.  It isn't safe
> > for many reasons, e.g., its frequency may change while in the
> > loop and end result is unpredictable if TSC is not invariant.  We
> > may limit its use for invariant case, though.
>
> Me too.  It can also be used of the TSC is the timecounter, since
> it it is good enough for a timecounter than it is good enough for
> this.  But it is a lot of work for no benefit to make all cases
> work.
>
> >>> +static __inline void
> >>> +delay_timecounter(struct timecounter *tc, int n)
> >>> +{
> >>> +	uint64_t end, now;
> >>> +	u_int last, mask, u;
> >>> +
> >>> +	mask = tc->tc_counter_mask;
> >>> +	last = tc->tc_get_timecount(tc) & mask;
> >>> +	end = tc->tc_frequency * n / 1000000;
> >>
> >> This depends on the delicate timecounter locking to be safe.  I
> >> think it it is safe except possibly in tc_get_timecount(), for
> >> the same reasons that nanouptime() can run safely with no
> >> visible locking.  tc must be the current timecounter/timehands,
> >> which is guaranteed to not be in an in-between state or become
> >> so while we are using it, despite there being no visible
> >> locking.  Actually there are some minor races:
> >> - in nanouptime(), the timecounter generation is checked to
> >> avoid using a new generation if necessary, but updates of the
> >> generation count and the fileds that it protects are not
> >> properly atomic. - here, the check of the generation is missing.
> >>  So this works without races when called from ddb (since ddb
> >> stops other CPUs), but has minor races in general.
> >
> > Window of the races is very narrow here and it only happens on
> > i386 where reading uint64_t is not atomic if my understanding is
> > correct.
>
> I can easily make race windows very wide by single stepping them,
> and usually do for testing code like this :-).
>
> > It cannot be any worse than tsc_freq anyway. ;-)
>
> Yes, but it extends old mistakes.
>
> >> Optimizing DELAY() was bogus.  Now the loop for this is
> >> duplicatied.
> >
> > "Bogus" optimization is not exactly duplicated here if you meant
> > "(now < end || (now > start && end < start))" is bogus.  Its
> > bogusness comes from the fact that TSC is 64-bit.  However,
> > timecounter is effectively 32-bit because of its mask.  Now
> > overflow is carefully accounted for and it is (almost) correct.
>
> I meant not using the i8254.  Even DELAY()'s API limits it to 1 uS
> resolution.  Who cares if the hardware read extends this to 5 uS?

Ah, I see.

> Duplication can be avoided by replacing the rdstc() by conditional
> code like (using_tsc ? rdtsc() : (tc->tc_get_timecount(tc) & mask)
> in the loop.  We have a 1 uS resolution and a pause in the loop,
> so we don't care that the loop takes a little longer due to the
> conditional in it.

Agreed.

> My userland code for this does a full sysctl to read an emulated
> timecounter for this.  Its loop is:
>
> % 	binuptime(&start_bt);
> % 	start_tsc = rdtsc();
> % 	do {
> % 		binuptime(&sample_bt);
> % 		tsc = rdtsc();
> % 		binuptime(&bt);
> % 		bintime_sub(&bt, &sample_bt);
> % 		if (bintimecmp(&bt, &best_bt, <)) {
> % 			best_bt = bt;
> % 			best_sample_bt = sample_bt;
> % 			best_tsc = tsc;
> % 		}
> % 		bt = sample_bt;
> % 		bintime_sub(&bt, &start_bt);
> % 	} while (++nsamples < min_nsamples ||
> % 	    bintimecmp(&best_bt, min_error_btp, >));
>
> binuptime() is emulated so that it has the same API as the kernel. 
> It is actually clock_gettime(CLOCK_MONOTONIC, &ts) followed by
> error handling and conversion to bintime.  The loop is missing the
> equivalent of sched_pin().  I don't know exactly how to do that in
> userland.  Is there a single syscall that does it?  I used
> cpuset(1).

Not that I know of.

Jung-uk Kim