svn commit: r267029 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs

Bruce Evans brde at optusnet.com.au
Wed Jun 4 13:13:09 UTC 2014


On Wed, 4 Jun 2014, Bryan Drewery wrote:

> On 6/3/2014 4:06 PM, Alexander Motin wrote:
>> Author: mav
>> Date: Tue Jun  3 21:06:03 2014
>> New Revision: 267029
>> URL: http://svnweb.freebsd.org/changeset/base/267029
>>
>> Log:
>>   Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue
>>   selection.  gethrtime() in our port updated with HZ rate, so unusable for
>>   this specific purpose, completely draining benefit of multiple taskqueues.
>>
>>   MFC after:	2 weeks
>>
>> Modified:
>>   head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
>>
>> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c
>> ==============================================================================
>> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	Tue Jun  3 21:02:19 2014	(r267028)
>> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c	Tue Jun  3 21:06:03 2014	(r267029)
>> @@ -953,7 +953,7 @@ spa_taskq_dispatch_ent(spa_t *spa, zio_t
>>  	if (tqs->stqs_count == 1) {
>>  		tq = tqs->stqs_taskq[0];
>>  	} else {
>> -		tq = tqs->stqs_taskq[gethrtime() % tqs->stqs_count];
>> +		tq = tqs->stqs_taskq[cpu_ticks() % tqs->stqs_count];
>>  	}
>>
>>  	taskq_dispatch_ent(tq, func, arg, flags, ent);

Bugs remaining in this include:
- the comment before this function still says that gethrtime() is used:

/*
  * Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
  * Note that a type may have multiple discrete taskqs to avoid lock contention
  * on the taskq itself. In that case we choose which taskq at random by using
  * the low bits of gethrtime().
  */
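The fix for that one is mechanical; something like the following (wording is mine, not committed anywhere):

```c
/*
 * Dispatch a task to the appropriate taskq for the ZFS I/O type and priority.
 * Note that a type may have multiple discrete taskqs to avoid lock contention
 * on the taskq itself.  In that case we choose which taskq at random by using
 * the low bits of cpu_ticks().
 */
```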

- timing functions are a bogus source of random bits.  The get_cyclecount()
   function should be used for this, though its existence is a bug and its
   name and implementation are worse.

On some arches (actually just i386, but it should be arm too), cpu_ticks()
is the same as get_cyclecount().  cpu_ticks() has to be correct for
timing, but get_cyclecount() doesn't, so get_cyclecount() can be more
optimized, but on some arches get_cyclecount() is correct for timing
too.  The implementation of this varies gratuitously with the arch.
Only arm is very bad:

- arm: get_cyclecount() is binuptime() with bogus scaling.  This shows
   another bug in the above change -- depending on the details of the
   ticker and its scaling, there may be no random low bits at all.
   stqs_count is dynamically initialized and it is not clear whether it is
   a power of 2 or a prime...  The bogus scaling is:

         binuptime(&bt);
         return ((uint64_t)bt.sec << 56 | bt.frac >> 8);

   The 8 lower bits may or may not be random (or even nonzero) but they are
   always discarded.  The remaining lower bits may or may not be random
   (or even zero).  56 top bits of the seconds part are always discarded.
   Old versions were better.  They xor'ed the seconds and fraction part.
   This is inconsistent with the bad function name, but better for
   randomness.

   It is still an error to use only the low bits.  Using all the bits modulo
   a prime would work well:

 	tq = tqs->stqs_taskq[cpu_ticks() % BIGPRIME % tqs->stqs_count];

   The magic BIGPRIME only needs to be relatively prime to the infinite-
   precision maxof(typeof(get_cyclecount())) + 1, so it doesn't depend
   on the details of get_cyclecount().  For general use, BIGPRIME should
   be 64 bits, but 32 bits should be enough for anyone, and the
   64-bit division needed for that excessive generality would be slow.  The
   existing 64-bit division is already excessively general and gratuitously
   slow.  Both cpu_ticks() and gethrtime() are 64 bits.  The division should
   have been reduced to 32 bits using (uint32_t)getmumbletime() % count.

   When ntpd is active, the lower bits in the fraction normally contain
   significant randomness from micro-adjustments, so they should not be
   discarded.  Otherwise, there is really no randomness in lower bits,
   but they may change in step with high bits depending on the details
   of the clock frequency and scaling.

- i386: get_cyclecount() is cpu_ticks().  This is different from amd64
   because rdtsc() is not always available and cpu_ticks() handles the
   details of sometimes using a timecounter.  This is unnecessarily
   slow when cpu_ticks() doesn't use rdtsc().  cpu_ticks() isn't inline,
   and it uses lots of function pointers.  It even uses a function
   pointer for rdtsc().  cpu_ticks() is much more important than
   get_cyclecount().  It should be optimized.  arm should use it even
   if it is not optimized.

get_cyclecount() is abused as a timestamp in:
- netinet/sctp_os_bsd.h.  microtime() with an x86 TSC timecounter is
   plenty fast and accurate enough for timestamping 1 Gbps ethernet.
   (I sometimes use nanotime(), but find that actually using the
   nanoseconds resolution usually just makes the output too verbose
   with noise in the last 3 decimal digits, so the output formatting
   routine needs to reduce to microseconds.)
- kern/ktr_time.c.  The undocumented unsupported option KTR_TIME defaults
   to being get_cyclecount().  KTR needs to timestamp low-level functions,
   so the timestamp function needs to be lower-level and nanotime() might
   not be low enough.  However, i386 sometimes uses the timecounter,
   and arm always does, so nanotime() must be low enough.  nanotime() also
   gives nicely biased and scaled timestamps.  KTR has null support for
   scaling the timestamps that it makes.  In the kernel, it just prints
   raw values in %10.10lld format.  ktrdump(1) is further from knowing how
   to scale them.  It just prints them or differences of them in %16ju
   format.  16 is excessively wide, yet not wide enough to print
   UINT64_MAX (that takes 20 decimal digits, but only 16 hex digits).  At
   least it doesn't use the long long abomination.
- kern/subr_bus.c.  get_cyclecount() is not abused, but its result is
   assigned to a variable named attachtime.  This is fed to random_harvest()
   as the variable named *entropy, so it is clearly random and not a time.

The existence of get_cyclecount() is a bug, since all uses are satisfied
with a timestamp and it is better to optimize the timestamp functions
than to have special versions of them just for randomness.  It was created
before cpu_ticks() existed.  cpu_ticks() is a fast enough timestamp function
for all random uses (it basically uses a cycle counter if available,
else falls back to using the timecounter).

However, bugs pointed out above show that it is confusing to use
functions whose names are related to timers as a source of randomness.
A name with "random" in it would be better.  Also, some arches may
have a random register (in the CPU so it is fast to read) but no
x86-style cycle counter or fast timecounter.  These arches should use
the random register for the random function.  The random register just
doesn't need to be very wide or crypto quality.

Some of the bugs are documented in get_cyclecount(9).  It is documented
as monotonically increasing, although the xor version on arm was far
from that and the current version wraps every 256 seconds.

Bruce

