svn commit: r219646 - head/sys/x86/isa
Jung-uk Kim
jkim at FreeBSD.org
Tue Mar 15 19:26:31 UTC 2011
On Tuesday 15 March 2011 02:13 pm, Bruce Evans wrote:
> On Tue, 15 Mar 2011, Jung-uk Kim wrote:
> > On Monday 14 March 2011 10:31 pm, Bruce Evans wrote:
> >> On Mon, 14 Mar 2011, Jung-uk Kim wrote:
> >>> Log:
> >>> When TSC is unavailable, broken or disabled and the current
> >>> timecounter has better quality than i8254 timer, use it for
> >>> DELAY(9).
> >>
> >> You cannot use a random timecounter for DELAY(). DELAY() is
> >> carefully written to not use any locks, so that it doesn't
> >> deadlock when called in the "any" context from a bad console
> >> driver like syscons when syscons is called from ddb. (Bad
> >> console drivers may have their own locks which deadlock in ddb,
> >> but that is another problem.) Even the i8254 timer, which is
> >> the only timer that should be used by DELAY(), normally use
> >> locks that used to cause deadlock, but DELAY() was unbroken to
> >> bypass the locks when it is called from ddb.
> >
> > I disagree. I think we should keep away from i8254 as much as
> > possible.
>
> It is adequate for DELAY(), and is the only timer that is available
> on all x86. You need large complications in DELAY() to make it
> work as well as the old code using the i8254, for no gain until
> there is an x86 without an i8254.
Intel started killing off i8254 (and other "legacy" ISA devices),
starting from "Mobile Internet Device" platforms. Even for other
so-called "legacy-free" platforms, it isn't adequate any more because
of unpredictable SMI# interference.
> > Actually, i8254 is the only timer that requires locking in
> > the first place because of its braindead design. All other x86
> > timers do not require any locking at all. We even sacrificed
> > upper 32 bits of HPET in favor of i386. :-(
>
> Did we? Timecounters use only 32 bits as an optimization. Scaling
> of 64-bit values is slow even on 64-bit machines since it needs
> 128-bit intermediate values for the multiplication/shift method to
> work.
No, I am talking about reading raw value of HPET. For amd64, it can
be used as a good alternative to TSC if we detect it earlier for
legacy-free platforms. Unfortunately, it is only useful for
timecounter ATM.
> >> Cpufreq and other calibration code should use a normal
> >> timecounter via nanouptime() in all cases and never go near low
> >> level timecounters or DELAY(). There are (hopefully minor)
> >> problems getting the nanotime() initialized before it used. The
> >> dummy timecounter is just what you don't want to use.
> >
> > That's exactly the problem.
>
> Your fix won't make much difference then. DELAY() will start with
> a dummy timecounter and probably also a zero tsc_freq, so it will
> use the i8254 for various things, probably including initial
> calibration of the TSC/cpufreq. Later it may see a better
> timecounter and use that to get a better calibration of the
> TSC/cpufreq. But its caller can just as easily check for a better
> timecounter and not use DELAY() if it can use a timecounter,
> without messing with DELAY()'s internals and having to guard
> against reentrancy from ddb. delay_timecounter() already has all
> the code for this, except the check for a better timecounter is
> external.
Yes, it doesn't make much difference for that ATM. However, it DOES
make differences when DELAY() is used for device drivers. What I
really want is DELAY() does the right thing, e.g., holding no spin
lock, not pinned on a CPU, no tsc_freq hackery, and no side-effect of
any kind.
> >> Even in this patch, it isn't clear that the low level
> >> timecounters are initialized before they are used.
> >
> > The timecounter is always initialized first, then set.
>
> It is still unclear whether they are initialized enough before they
> are used since their setting is neither locked nor atomic.
> Consider the TSC initialization. This is init_TSC() from
> startrtclock() and init_TSC_tc() from cpu_initclocks(). The first
> should make the low- level timecounter useable, but this is
> unclear. The second attaches the low-level timecounter to the
> high-level timecounter data structures. There is no locking for
> this except possibly Giant. tsc_freq is initialized early, so the
> tests of it will succeed long before the TSC can become the
> high-level timecounter. It is probably usable at a low level, but
> this is unclear. tsc_freq is 64 bits, and the accesses to it are
> non-atomic, so especially on 32-bit machines there are races
> accessing it. In fact, I can now see a reproducible bug in
> DELAY(): - find an x86 that can run at 4GHz
> - run it in 386 mode
> - start it so that tsc_freq is 0x10000000ULL
> - use the machdep.tsc_freq sysctl to tune tsc_freq to 0xFFFFFFFF
machdep.tsc_freq should never have modified tsc_freq in the first
place. I am going to remove that soon.
> - trace through this sysctl using gdb, and using a low quality
> console driver like syscons which calls DELAY().
> - the 2 words in tsc_freq will normally be written from the low
> word up. It will be seen to have value 0x1FFFFFFFFULL after only
> the first word is written. Not too bad -- just of by a factor of
> 2. - I can't quite see how to make it have a really broken value.
> At first I tried and example with it starting with a value if
> 0xFFFFFFFF and changing it to 0x100000000ULL. Then after changing
> only its low word, it is 0 so DELAY() will not use the TSC.
> Related non-reproducible bugs are also easy to see:
> - on one CPU, spin calling the sysctl to toggly tsc_freq between
> 4G-1 and 4G
> - on another CPU, spin doing something that calls DELAY(). Since
> there is no locking or atomic accesses for the writes to tsc_freq
> in the sysctl or for the reads of it in DELAY(), not to mention
> elsewhwere, tsc_freq can read as zero when it is being changed
> between 2 nonzero values. Not too bad if it reads as zero -- then
> it won't be used. But uses of it following it being read as
> nonzero may find it with changed values that are very bad, for
> example zero.
> These bugs are in the old code in DELAY() that uses the TSC, and
> perhaps in all code of the form "if (tsc_freq != 0) use(<tsc>)".
> Fixes for them should probably use a generation count.
>
> init_TSC_tc() finishes with tc_init(&tsc_timecounter) which does
> most of the work for attaching to the higher-level timecounter
> structures. tc_init() has no locking or atomic operations, which I
> think gives it races in some contexts, but not in ones involving
> single stepping it using ddb, since then there are no other CPUs to
> race with. The races are the potentially-unordered writes of all
> the variables set by tc_init().
Now don't you think we should really kill delay by TSC? ;-)
> >>> Modified: head/sys/x86/isa/clock.c
> >>> ===============================================================
> >>>== ============= --- head/sys/x86/isa/clock.c Mon Mar 14
> >>> 19:31:43 2011 (r219645) +++ head/sys/x86/isa/clock.c Mon Mar 14
> >>> 22:05:59 2011 (r219646) @@ -245,6 +245,42 @@ getit(void)
> >>> return ((high << 8) | low);
> >>> }
> >>>
> >>> +static __inline void
> >>> +delay_tsc(int n)
> >>> +{
> >>> + uint64_t start, end, now;
> >>> +
> >>> + sched_pin();
> >>> + start = rdtsc();
> >>> + end = start + (tsc_freq * n) / 1000000;
> >>> + do {
> >>> + cpu_spinwait();
> >>> + now = rdtsc();
> >>> + } while (now < end || (now > start && end < start));
> >>> + sched_unpin();
> >>> +}
> >>
> >> You cannot call random scheduling code from DELAY(), since the
> >> scheduling code is not designed to be called in the "any"
> >> context. As it happens, sched_pin() and sched_unpin() are safe,
> >> and were already used in the TSC case.
> >
> > Actually, I really like to get rid of this code. It isn't safe
> > for many reasons, e.g., its frequency may change while in the
> > loop and end result is unpredictable if TSC is not invariant. We
> > may limit its use for invariant case, though.
>
> Me too. It can also be used of the TSC is the timecounter, since
> it it is good enough for a timecounter than it is good enough for
> this. But it is a lot of work for no benefit to make all cases
> work.
>
> >>> +static __inline void
> >>> +delay_timecounter(struct timecounter *tc, int n)
> >>> +{
> >>> + uint64_t end, now;
> >>> + u_int last, mask, u;
> >>> +
> >>> + mask = tc->tc_counter_mask;
> >>> + last = tc->tc_get_timecount(tc) & mask;
> >>> + end = tc->tc_frequency * n / 1000000;
> >>
> >> This depends on the delicate timecounter locking to be safe. I
> >> think it it is safe except possibly in tc_get_timecount(), for
> >> the same reasons that nanouptime() can run safely with no
> >> visible locking. tc must be the current timecounter/timehands,
> >> which is guaranteed to not be in an in-between state or become
> >> so while we are using it, despite there being no visible
> >> locking. Actually there are some minor races:
> >> - in nanouptime(), the timecounter generation is checked to
> >> avoid using a new generation if necessary, but updates of the
> >> generation count and the fileds that it protects are not
> >> properly atomic. - here, the check of the generation is missing.
> >> So this works without races when called from ddb (since ddb
> >> stops other CPUs), but has minor races in general.
> >
> > Window of the races is very narrow here and it only happens on
> > i386 where reading uint64_t is not atomic if my understanding is
> > correct.
>
> I can easily make race windows very wide by single stepping them,
> and usually do for testing code like this :-).
>
> > It cannot be any worse than tsc_freq anyway. ;-)
>
> Yes, but it extends old mistakes.
>
> >> Optimizing DELAY() was bogus. Now the loop for this is
> >> duplicatied.
> >
> > "Bogus" optimization is not exactly duplicated here if you meant
> > "(now < end || (now > start && end < start))" is bogus. Its
> > bogusness comes from the fact that TSC is 64-bit. However,
> > timecounter is effectively 32-bit because of its mask. Now
> > overflow is carefully accounted for and it is (almost) correct.
>
> I meant not using the i8254. Even DELAY()'s API limits it to 1 uS
> resolution. Who cares if the hardware read extends this to 5 uS?
Ah, I see.
> Duplication can be avoided by replacing the rdstc() by conditional
> code like (using_tsc ? rdtsc() : (tc->tc_get_timecount(tc) & mask)
> in the loop. We have a 1 uS resolution and a pause in the loop,
> so we don't care that the loop takes a little longer due to the
> conditional in it.
Agreed.
> My userland code for this does a full sysctl to read an emulated
> timecounter for this. Its loop is:
>
> % binuptime(&start_bt);
> % start_tsc = rdtsc();
> % do {
> % binuptime(&sample_bt);
> % tsc = rdtsc();
> % binuptime(&bt);
> % bintime_sub(&bt, &sample_bt);
> % if (bintimecmp(&bt, &best_bt, <)) {
> % best_bt = bt;
> % best_sample_bt = sample_bt;
> % best_tsc = tsc;
> % }
> % bt = sample_bt;
> % bintime_sub(&bt, &start_bt);
> % } while (++nsamples < min_nsamples ||
> % bintimecmp(&best_bt, min_error_btp, >));
>
> binuptime() is emulated so that it has the same API as the kernel.
> It is actually clock_gettime(CLOCK_MONOTONIC, &ts) followed by
> error handling and conversion to bintime. The loop is missing the
> equivalent of sched_pin(). I don't know exactly how to do that in
> userland. Is there a single syscall that does it? I used
> cpuset(1).
Not that I know of.
Jung-uk Kim
More information about the svn-src-all
mailing list