Dynamic reads without locking.

Terry Lambert tlambert2 at mindspring.com
Thu Oct 9 02:19:50 PDT 2003


Harti Brandt wrote:
> You need to lock when reading if you insist on consistent data. Even a
> simple read may be non-atomic (this should be the case for 64bit
> operations on all our platforms). So you need to do
> 
> mtx_lock(&foo_mtx);
> bar = foo;
> mtx_unlock(&foo_mtx);
> 
> if foo is a datatype that is not guaranteed to be read atomically. For
> 8-bit data you should be safe without the lock on any architecture. I'm
> not sure for 16 and 32 bit, but for 64-bit you need the lock for all
> our architectures, I think.
> 
> If you don't care about occasionally reading false data (for statistics or
> such stuff) you can go without the lock.

For a read-before-write type read, this is always true.

For an interlock read, this is also true if you intend to use
the data you read as a timing-dependent one-shot (e.g. for a
condition variable).
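
To make the read-before-write case concrete, here's a minimal
sketch (reusing the hypothetical foo/bar/foo_mtx from Harti's
example): the lock has to cover the whole read-modify-write, not
just the read, or two CPUs can both read the old value and one
update is lost.

mtx_lock(&foo_mtx);
bar = foo;		/* read ...				*/
foo = bar + 1;		/* ... then write based on what we read	*/
mtx_unlock(&foo_mtx);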

For certain uses, however, it's safe not to lock before the
read *on Intel architectures*.  This can go out the window on
SPARC or PPC, or on any architecture that allows speculative
or out-of-order execution unless you synchronize explicitly.
For instance, on a PPC, a mutex is implemented with the help
of two synchronization instructions, "isync" and "sync", one
issued when the lock is taken and the other when it is
released.  Using these, it's a lot less expensive to implement
a mutex, and you are guaranteed that after you take a mutex,
there are no speculative-execution coherency issues left
outstanding.

For things with an explicit periodicity, and no guarantee of
order of operation, reading without a lock will do no damage.
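
A minimal sketch of that kind of read, using FreeBSD's atomic(9)
acquire load (the variable names here are made up): on IA32 a plain
aligned 32-bit load is already atomic, but the acquire variant also
keeps a weakly-ordered CPU from speculating later loads ahead of it.
A statistics counter that is only read periodically can stay a plain
load, since an occasionally stale value does no damage.

#include <sys/types.h>
#include <machine/atomic.h>

/* Hypothetical flag another CPU sets when it posts work for us. */
static volatile u_int	work_posted;

int
work_is_posted(void)
{
	/* No lock, but the acquire barrier still orders later loads. */
	return (atomic_load_acq_int(&work_posted) != 0);
}

/* Hypothetical statistics counter; staleness is harmless here. */
static volatile u_int	pkt_count;

u_int
pkt_count_snapshot(void)
{
	return (pkt_count);	/* plain aligned 32-bit load */
}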

For example, consider the case of a push-model migration of
processes from one CPU to another, and a (in the general case)
lockless implementation of a per-CPU scheduler queue with
explicit migration requirements.  This avoids the FreeBSD issue
of having to take a lock each time you enter the scheduler,
avoids the IPI that results, and implies implicit CPU affinity.
Here's how it works:

Each CPU has 3 values associated with it:

	int load	ptr migration queue	ptr run queue
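
As a sketch (the names are made up, and this is not FreeBSD's
struct pcpu), the per-CPU state might look like:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* Hypothetical queue entry; stands in for whatever gets scheduled. */
struct sched_item {
	struct sched_item	*si_next;	/* run/migration queue link */
	int			 si_nomigrate;	/* "don't migrate" bit      */
};

/*
 * Hypothetical per-CPU scheduler state.  "load" and "runq" are only
 * ever written by the owning CPU; "migq" is the one field other CPUs
 * write, and only with "migq_mtx" held.
 */
struct cpu_sched {
	volatile u_int			 load;		/* read lockless by others */
	struct sched_item * volatile	 migq;		/* migration queue (LIFO)  */
	struct mtx			 migq_mtx;	/* protects migq only      */
	struct sched_item		*runq;		/* private run queue       */
};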

When it's time to schedule a process on a given CPU, here is the
order of operation (a rough C sketch of the whole pass follows
the list):

1)	Compare migration queue ptr to NULL; this is a
	lockless operation.

2)	If the ptr is non-NULL,
	a)	take a lock on the migration queue.
	b)	move items on it to the per CPU run queue
		-- LIFO; writing the per CPU run queue is
		a lockless operation, since no other CPU
		is permitted access to it (no "pull" of
		work for migration).
	c)	NULL the migration queue head of the now
		empty migration queue.
	d)	drop the migration queue lock.

3)	Take the entry off the top of my run queue; this is
	what I'm going to run next.

4)	Compare my 'load' value against that of other CPUs;
	since a CPU is the only one permitted to write its
	own load value, this is a lockless operation for the
	read of every other CPU's 'load' value, so long as that
	value is accessible in a single bus/memory cycle.

5)	If my load is significantly higher than the lowest load
	of all other CPUs, consider migrating work to the least
	loaded CPU:
	a)	Locate a candidate for migration; this is
		the current top of my own run queue *after*
		the removal of the next thing I'm going to run,
		*without* the "don't migrate" bit set.
	b)	If a candidate exists,
		i)	Remove the item from the local run
			queue (lockless).
		ii)	Take the lock on the target CPU
			migration queue.
		iii)	Put the item on the target migration
			queue.
		iv)	Drop the migration queue lock.

6)	Recalculate my local 'load' value for the benefit of
	other CPUs.

7)	Start executing the item obtained in step #3.
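
A rough C sketch of the whole pass, using the cpu_sched and
sched_item structures sketched above (MAXCPU aside, every name here
is made up; sched_drain_migq() is sketched after the LIFO discussion
below, and sched_compute_load() is left as a stub):

#define	MIGRATE_THRESHOLD	2	/* hypothetical hysteresis value */

static struct cpu_sched	sched_cpu[MAXCPU];

static void	sched_drain_migq(struct cpu_sched *);	/* sketched below */
static u_int	sched_compute_load(struct cpu_sched *);	/* hypothetical   */

static struct sched_item *
sched_next(int mycpu)
{
	struct cpu_sched *me = &sched_cpu[mycpu], *low;
	struct sched_item *next, *cand, **prevp;
	int cpu;

	/* 1+2: lockless NULL check; take the lock only if work arrived. */
	if (me->migq != NULL)
		sched_drain_migq(me);

	/* 3: pop the head of my private run queue (lockless). */
	if ((next = me->runq) != NULL)
		me->runq = next->si_next;

	/* 4: lockless read of every other CPU's 'load' value. */
	low = me;
	for (cpu = 0; cpu < MAXCPU; cpu++)
		if (sched_cpu[cpu].load < low->load)
			low = &sched_cpu[cpu];

	/* 5: push work at the least loaded CPU if we're far above it. */
	if (low != me && me->load > low->load + MIGRATE_THRESHOLD) {
		/* 5a: first migratable item left on my run queue. */
		for (prevp = &me->runq; (cand = *prevp) != NULL;
		    prevp = &cand->si_next)
			if (!cand->si_nomigrate)
				break;
		if (cand != NULL) {
			*prevp = cand->si_next;		/* 5b.i: lockless unlink */
			mtx_lock(&low->migq_mtx);	/* 5b.ii                 */
			cand->si_next = low->migq;	/* 5b.iii: LIFO push     */
			low->migq = cand;
			mtx_unlock(&low->migq_mtx);	/* 5b.iv                 */
		}
	}

	/* 6: republish my load; only this CPU ever writes it. */
	me->load = sched_compute_load(me);

	/* 7: the caller runs what we popped in step 3. */
	return (next);
}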

Because the operation occurs periodically, and because you skip
items that are marked non-migrate (the search could be bounded in
depth, if so desired), you get the ability to implement explicit
CPU affinity, and you also get implicit CPU affinity (since there
is hysteresis in the act of deciding to "push" work off to another
CPU).

The purpose of the LIFO ordering, and of examining the migration
queue before the local run queue, is that the work the other CPU
pushed to you was at the top of its run queue; draining it onto
the head of your run queue avoids penalizing migrated objects
with an extra scheduler cycle of latency.
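
Concretely, the sched_drain_migq() referenced in the sketch above
might look like this (again hypothetical): each entry popped off
the migration queue goes onto the head of the private run queue,
so work that was at the top of the origin's run queue is what step
#3 picks up in the same pass.

static void
sched_drain_migq(struct cpu_sched *me)
{
	struct sched_item *si;

	mtx_lock(&me->migq_mtx);		/* 2a                        */
	while ((si = me->migq) != NULL) {	/* 2b/2c: drain, leaving the */
		me->migq = si->si_next;		/*   queue head NULL ...     */
		si->si_next = me->runq;		/* ... pushing each entry    */
		me->runq = si;			/*   onto the run queue head */
	}
	mtx_unlock(&me->migq_mtx);		/* 2d                        */
}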

Note that all of this works, even in a decoherent system, on the
migration queue examination, since the worst case is that you delay
a migration for up to two scheduling cycles (one for the 'load'
value on the origin, one for the migration queue examination on the
target), and the common case is a 1.5 cycle delay (since the
origin's grab of the migration queue mutex forces a coherency
cycle).  And you are guaranteed to be delaying the migrated work
until the next scheduling cycle on the origin CPU anyway.

So there *are* cases where it's safe, as long as you have either:

1)	Multiple readers, single fixed writer, OR

2)	Multiple locked writers, single fixed reader, AND

3)	Repetition with implicit hysteresis, AND

4)	Insensitivity to order of operation

I'm pretty sure that the cases Jeffrey was thinking about all
fall into that description bucket.

If they don't, then he's probably not worried about SPARC and PPC,
which would make sense, in terms of the model that DragonFly BSD
is using -- though I expect that they will fail to support NUMA
or clustered systems properly, if that's the case (which I doubt).

-- Terry

