svn commit: r265472 - head/bin/dd

Bruce Evans brde at optusnet.com.au
Thu May 8 07:17:32 UTC 2014


On Wed, 7 May 2014, Alan Somers wrote:

> On Tue, May 6, 2014 at 9:47 PM, Bruce Evans <brde at optusnet.com.au> wrote:
>> On Tue, 6 May 2014, Alan Somers wrote:

This is a followup on some minor details that I didn't reply to earlier.

>>> +       if (clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv))
>>> +               err(EX_OSERR, "clock_gettime");
>>> +       if (clock_getres(CLOCK_MONOTONIC_PRECISE, &tv_res))
>>> +               err(EX_OSERR, "clock_getres");
>>
>>
>> clock_getres() is almost useless, and is useless here.  It is broken
>> as designed, since the precision may be less than 1 nanosecond but
>> 1 nanosecond is the smallest positive representable value, but that
>> is not a problem here since clock_gettime() also can't distinguish
>> differences smaller than 1 nanosecond.
>
> Since it's reporting the clock resolution and not precision, and since
> clock_gettime() only reports with 1ns resolution, I don't think it's a
> problem for clock_getres to report with 1ns resolution too.

I got most of the backwardness backwards.  The syscall is clock_getres(),
not clock_getprec(), and the variable name matches this.  But what it
returns is the precision.  The resolution is just that of a timespec (1
nanosecond).  No API is needed to report this.  APIs are needed to report:
- the precision.  The API is misnamed clock_getres().
- the granularity.  This is the minimum time between successive measurements.
   It can be determined by actually doing some measurements.
- the accuracy.  No API is available.
For clocks based on timecounters, we use the timecounter clock period
rounded up to nanoseconds for the precision.  With a TSC, this is always
1 nanosecond for CPU speeds above 1GHz.

dd needs something closer to the granularity than the precision, but it
doesn't really matter, since the runtime must be much larger than the
granularity for the statistics to be accurate, and it usually is.
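
The granularity can be estimated directly.  Here is a minimal sketch (not
dd's code; the loop bound is arbitrary) that reads CLOCK_MONOTONIC
repeatedly until it advances, so the smallest nonzero difference seen
approximates the minimum time between measurements:

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec t0, t1;
	long delta_ns;
	int i;

	for (i = 0; i < 10; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		do {
			/* Spin until the clock reading actually changes. */
			clock_gettime(CLOCK_MONOTONIC, &t1);
			delta_ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
			    (t1.tv_nsec - t0.tv_nsec);
		} while (delta_ns == 0);
		printf("granularity sample: %ld ns\n", delta_ns);
	}
	return (0);
}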

>> The fixup is now only reachable in 3 cases that can't happen:
>> - when the monotonic time goes backwards due to a kernel bug
>> - when the monotonic time doesn't increase, so that the difference is 0.
>>   Oops, this can happen for timecounters with very low "precision".
>>   You don't need to know the "precision" to check for this.
>
> On my Xeon E5504 systems, I can see adjacent calls to clock_gettime
> return equal values when using one of the _FAST clocks.  It can't be
> proven that this case will never happen with any other clock either,
> so the program needs to handle it.

Hrmph.  This is either from the design error of the existence of the
_FAST clocks, or from the design error of the existence of TSC-low.

First, the _FAST clocks are only supposed to have a resolution of 1/hz.
clock_getres() is quite broken here.  It returns the timecounter
precision for the _FAST clocks too.  Also, if it returned 1/hz, then
it would be inconsistent with the libc implementation of
clock_gettime().  The latter gives the timecounter precision.
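
This is easy to check with a couple of clock_getres() calls (a sketch;
CLOCK_MONOTONIC_FAST is FreeBSD-specific and error checking is omitted).
Both calls are expected to report the timecounter precision, even though
the _FAST clock only advances every 1/hz:

#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec res;

	/* Expected: the same ~1 ns value for both, with a fast TSC. */
	clock_getres(CLOCK_MONOTONIC, &res);
	printf("CLOCK_MONOTONIC:      %ld ns\n", res.tv_nsec);
	clock_getres(CLOCK_MONOTONIC_FAST, &res);
	printf("CLOCK_MONOTONIC_FAST: %ld ns\n", res.tv_nsec);
	return (0);
}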

TSC-low intentionally destroys the hardware TSC precision by right shifting,
due to FUD and to handle a minor problem above 4GHz.  The shift used to
be excessive in most cases.  On freefall it used to be about 7, so the
precision of ~1/2.67 nsec was reduced to ~48 nsec.  This was easy to
see in test programs for such things.  Now the shift is just 1.
Since 1<<1 is less than 2.67, the loss of precision from the shift is
less than the loss of precision from converting from bintimes to
timespecs.  The shift is still a pessimization.

sysctl has a read-only tunable kern.timecounter.tsc_shift.  Use of this
seems to be quite broken.  The shift count is determined dynamically,
and the tunable barely affects this.  The active shift count is not
written back to the tunable, so you can't easily see what it is.  However,
the shift count is now always 1 except in exceptional cases.  The tunable
defaults to 1.  This is for CPU speeds between 2GHz and 4GHz to implement
the support for the FUD at these speeds.  Above 4GHz, the shift is increased
to 2 without changing the tunable.  Above 8GHz, the shift is increased to
3.  That can't happen yet, but you can tune higher to get a higher
shift count at lower speeds.  You can also tune to 0 to avoid the shift
up to 4GHz.
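
For reference, the tunable can be read with sysctl(8), or programmatically
as in this sketch (note this shows the tuned value only, not the active
shift count, per the above):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int shift;
	size_t len = sizeof(shift);

	/* Read-only sysctl backed by a loader tunable; this is the tuned
	 * value, not necessarily the shift count actually in use. */
	if (sysctlbyname("kern.timecounter.tsc_shift", &shift, &len,
	    NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("kern.timecounter.tsc_shift: %d\n", shift);
	return (0);
}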

The shift, together with some fencing pessimizations that are not even done
in the kernel version (only in libc), is due to FUD.  rdtsc is not a serializing
instruction, so its direct use may give surprising results.  I think it
is serialized with respect to itself on the same CPU.  It is obviously
not serialized with respect to other instructions on the same CPU.  So
it doesn't work properly in code like "rdtsc; <save results>; v++; rdtsc;
<compare results>" even with quite a bit more than v++ between the rdtsc's.
Normally there is much more than v++ between rdtsc's so code like this
works well in practice.  When the rdtsc's are on separate CPUs, it is
just a bug to depend on their order unless there are synchronization
instructions for more than the rdtsc's.
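
To make this concrete, here is a sketch (x86 only, GCC/Clang inline asm;
the function names are mine) of a raw rdtsc read next to one preceded by
an lfence, the kind of fence discussed further below.  The fence keeps the
read from being reordered ahead of earlier loads, but it does not turn
rdtsc into a serializing instruction:

#include <stdint.h>
#include <stdio.h>

static inline uint64_t
rdtsc_plain(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
}

static inline uint64_t
rdtsc_fenced(void)
{
	uint32_t lo, hi;

	/* lfence orders the read after prior loads; AMD CPUs have
	 * historically needed mfence instead. */
	__asm__ __volatile__("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
}

int
main(void)
{
	uint64_t t1 = rdtsc_plain(), t2 = rdtsc_plain();
	uint64_t t3 = rdtsc_fenced(), t4 = rdtsc_fenced();

	printf("plain delta:  %ju\n", (uintmax_t)(t2 - t1));
	printf("fenced delta: %ju\n", (uintmax_t)(t4 - t3));
	return (0);
}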

The old kernel code is sloppy about such things.  It tries to do
everything without atomic locking or mutexes.  This mostly works, but
I think it depends on the slowness of syscalls and locking in unrelated
code for some corner cases.  Syscalls put hundreds or thousands of
instructions between successive timecounter hardware reads, so even
if these reads are done on different CPUs the first one has had plenty
of time to complete.  Also, one CPU's TSC is acausal with respect to
another's; the difference is hopefully a backwards step of at most a
couple of cycles.  This would be lost in the noise of the hundreds or
thousands of cycles for the slow syscalls.  Also, any context switch
will do lots of locking operations that may synchronize the rdtscs.

There is official FUD about some of these problems.  An early "fix"
was to shift the TSC count.  I think this "works" just by breaking
the precision of the counter enough for backwards steps to be
invisible in most cases.  A large shift count of 7 reduces the precision
to 128 cycles.  That should hide most problems.  But I think it only
works in about 127 of 128 problem cases if the problem is an acausality
of 2 cycles.  Suppose CPU1 reads the TSC at time 128 and sees 128, and
CPU2 reads the TSC at time 129 and sees 127.  CPU2 does the read later
but sees an earlier time.  I chose the times near a multiple of 128
so that even rounding to a multiple of 128 doesn't fix the problem.
The current normal shift count of 1 doesn't hide so many problem cases
by rounding, but it can probably still hide more than a few cycles of
acausality, since the shift instruction itself is so slow (several cycles).
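
The arithmetic in that example, spelled out (a toy computation, nothing
more):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t cpu1 = 128;	/* read at (true) time 128 */
	uint64_t cpu2 = 127;	/* read at (true) time 129, 2 cycles acausal */

	/* With a shift of 7, the raw values straddle a multiple of 128,
	 * so the backwards step survives the rounding: 1 vs 0. */
	printf("cpu1 >> 7 = %ju\n", (uintmax_t)(cpu1 >> 7));
	printf("cpu2 >> 7 = %ju\n", (uintmax_t)(cpu2 >> 7));
	return (0);
}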

libc worries about the locking problems more than the kernel, and uses
some fence instructions.  Oops, these are in the kernel now.  They
are easier to see in the kernel too (they are spelled as *fence in asm
there, but as rmb() in libc).  Fence instructions don't serialize
rdtsc, but may be needed for something.  The rmb()'s in libc are
replacements for atomic ops.  Such locking operations are intentionally
left out of the software parts of the kernel since the algorithm is
supposed to work without them (it is only clearly correct for UP in-order).
However, the kernel now gets locking operations (mfence or lfence)
in some TSC read functions, depending on the CPU (fences are mostly
selected according to whether the CPU supports SSE2; lfence is preferred,
but mfence is used on AMD CPUs for some reason).  There is lots of
bloat to support this.  libc only has the shifting pessimization.

>> - when the monotonic time does increase, but by an amount smaller than
>>   the "precision".  This indicates that the "precision" is wrong.

We have the reverse bug: the reported precision is too small for the
_FAST clock case.

The precision is adjusted to match the shifts.

>> In the second case, fixing up to the "precision" may give a large
>> estimate.  The fixup might as well be to a nominal value like 1
>> nanosecond or 1 second.  CLOCK_MONOTONIC can't have a very low
>> precision, and the timing for runs that don't take as long as a
>> large multiple of the precision is inaccurate.  We could also
>> report the result as <indeterminate> in this case.
>
> The second case is the one I'm most concerned about.  Assuming that
> the precision is correct, clock_getres() seems like the best value for
> the fixup.  Anything less than the reported precision would be
> unnecessarily small and give unnecessarily inaccurate results.
> Anything greater would make an implicit and unportable assumption
> about the speed of the hardware.  Do you really think it's a problem
> to fixup to clock_getres() ?

And that is the case that is broken for the _FAST clocks (except that this
case shouldn't exist, and dd doesn't use those clocks).  In that case the
time only changes every 1/hz seconds, and the fixup converts differences of
nearly 1/hz seconds (measured as 0 due to the granularity) to 1 nanosecond
(for x86 with TSC).  With hz = 1000, the error is a factor of 1000000.

I would just use an arbitrary fixup.  I think I pointed out that ping(8)
doesn't worry about this.  It just assumes that the precision of
gettimeofday() is the same as its resolution (1 usec) and that no times
of interest below 1 usec occur (not quite true, since ping latency is
in the microseconds range and you can do very short tests using
ping -fq -c1).
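
An arbitrary fixup could look like this minimal sketch (the 1-nanosecond
floor is just a nominal value, not what was committed to dd):

#include <time.h>

static double
elapsed_seconds(const struct timespec *start, const struct timespec *end)
{
	double secs;

	secs = (end->tv_sec - start->tv_sec) +
	    (end->tv_nsec - start->tv_nsec) * 1e-9;
	if (secs <= 0.0)
		secs = 1e-9;	/* arbitrary nominal value, avoids div by 0 */
	return (secs);
}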

>>> @@ -77,7 +83,7 @@ summary(void)
>>>                      st.trunc, (st.trunc == 1) ? "block" : "blocks");
>>>         if (!(ddflags & C_NOXFER)) {
>>>                 (void)fprintf(stderr,
>>> -                   "%ju bytes transferred in %.6f secs (%.0f
>>> bytes/sec)\n",
>>> +                   "%ju bytes transferred in %.9f secs (%.0f
>>> bytes/sec)\n",
>>
>>
>> nanoseconds resolution is excessive here, and changes the output format.
>> The only use of it is to debug cases where the output is garbage due
>> to the interval being about 1 nanosecond.  Printing nanoseconds resolution
>> is also inconsistent with the fussy "precision" adjustment above.
>
> The higher resolution printf doesn't conflict with the resolution
> adjustment above.  Freefall actually reports 1ns resolution.  But I
> can buy that it's not useful to the user.  Would you like me to change
> it back to %.6 ?

Yes, just change back.  %.6f is probably excessive too.  4.4BSD uses just
seconds and %u.

> Even if nanosecond resolution isn't useful, monotonicity is.  Nobody
> should be using a nonmonotonic clock just to measure durations.  I
> started an audit of all of FreeBSD to look for other programs that use
> gettimeofday to measure durations.  I haven't finished, but I've
> already found a lot, including xz, ping, hastd, fetch, systat, powerd,
> and others.  I don't have time to fix them, though.  Would you be
> interested, or do you know anyone else who would?

There are indeed a lot.  Too many for me to fix :-).

The problem is limited, since for short runs the realtime isn't stepped,
and for long runs the real time may be more appropriate.

Hmm, cron uses CLOCK_REALTIME, sleep(1 or 60) and nanosleep(at most 600),
while crontab uses gettimeofday() and sleep(1).  It has real problems
that are hopefully mostly avoided by using short sleeps and special
handling for minutes rollovers.  Realtime is appropriate for it.
It is unclear what time even sleep() gives.  It should sleep on
monotonic time that is not broken by suspension, but sleep() is too old
for POSIX to say anything about that.  POSIX mentions the old alarm()
implementation.  FreeBSD now implements it using nanosleep(), but
nanosleep() is specified to sleep on CLOCK_REALTIME.

Oops, I found some POSIX words that may allow not-so-bizarre behaviour
for nanosleep(): from an old draft:

% 6688 CS           If the value of the CLOCK_REALTIME clock is set via clock_settime( ), the new value of the clock
% 6689              shall be used to determine the time at which the system shall awaken a thread blocked on an
% 6690           absolute clock_nanosleep( ) call based upon the CLOCK_REALTIME clock. If the absolute time
% 6691           requested at the invocation of such a time service is before the new value of the clock, the call
% 6692           shall return immediately as if the clock had reached the requested time normally.
% 6693           Setting the value of the CLOCK_REALTIME clock via clock_settime( ) shall have no effect on any
% 6694           thread that is blocked on a relative clock_nanosleep( ) call. Consequently, the call shall return
% 6695           when the requested relative interval elapses, independently of the new or old value of the clock.

So for relative clock_nanosleep(), even when the clock id is CLOCK_REALTIME,
stepping the clock doesn't affect the interval.  But it is now unclear on
which clock the interval is measured.  And what happens for leap seconds,
where the clock is stepped by a non-POSIX method?

For nanosleep():

% 26874              system. But, except for the case of being interrupted by a signal, the suspension time shall not be
% 26875              less than the time specified by rqtp, as measured by the system clock, CLOCK_REALTIME.
% 26876              The use of the nanosleep( ) function has no effect on the action or blockage of any signal.

Here there is no mention of stepping the time, and no option to measure the
time by a clock other than CLOCK_REALTIME.  Does CLOCK_REALTIME "measure"
the time across steps?

Bruce

