svn commit: r252032 - head/sys/amd64/include

Bruce Evans brde at optusnet.com.au
Sun Jun 23 09:58:00 UTC 2013


On Sun, 23 Jun 2013, Konstantin Belousov wrote:

> On Sat, Jun 22, 2013 at 06:58:15PM +1000, Bruce Evans wrote:
>> So the i386 version can be simply "addl; adcl" to memory.  Each store in
>> this is atomic at the per-CPU level.  If there is no carry, then the
>> separate stores are equivalent to adding separate nonnegative values and
>> the counter value is valid at all times.  If there is carry, then the
>> separate stores are equivalent to adding a negative value followed by
>> a larger positive value.  The counter transiently goes backwards, but
>> no information is lost and the counter is eventually set to the correct
>> forward-going value.
>
> This is quite an interesting idea, but I still have not decided if it
> is acceptable.  The issue is that we could add the carry to the other
> processor's counter, if preemption kicks in at the right time between
> the two instructions.  I have not found any argument why it would be
> wrong; the races for the fetch operation seem to be the same with
> either local update method.

I thought about exploiting not-so-strong memory ordering to fix the fetches.
On at least x86, stores are not reordered, so if the storer does a simple
"addl, adcl", then the order is not much different from

 	load low 32 bits
 	store low 32 bits
 	load high 32 bits
 	store high 32 bits

The loads can be in any order, but must be before the stores to the
same location.  I think the fetching code can depend on this (maybe
only on x86, but the fetching code needs to be MD anyway).  Thus it
can detect most races by simply rereading the counters in almost any
order until they don't change.  Without this, the fetching code can
easily read high and low bits from unrelated stores (when it is
preempted, the race window is large), and the sophisticated cmpxchg8b
code to make each store atomic is almost useless.
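
To illustrate, the rereading loop might look something like the
following (a sketch only, with hypothetical names; it assumes x86
load/store ordering and that p points at one CPU's pair of 32-bit
halves):

 	#include <stdint.h>

 	/*
 	 * Reread until the high word is stable across the low-word read.
 	 * This catches most races; it cannot help when the storer itself
 	 * is suspended between its two stores (discussed below).
 	 */
 	static uint64_t
 	fetch_one(volatile uint32_t *p)		/* p[0] = low, p[1] = high */
 	{
 		uint32_t hi, lo;

 		do {
 			hi = p[1];
 			lo = p[0];
 		} while (hi != p[1]);
 		return ((uint64_t)hi << 32 | lo);
 	}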

The case that can't be fixed by rereading the counters is when fetching
code runs in between the stores.  If the stores are on another CPU
that is currently executing them, then we can keep checking that the
counters don't change for the maximum time that any pair of stores
takes to complete.  This time is quite long and difficult to determine
(but it can be bounded by hard-disabling interrupts in the storer, and
limited by prefetching).  If the stores are on the same CPU, then we
have no good way to detect that another thread is in the middle of
them, or to switch back to that thread.  So we currently prevent this
case using slow methods.

> On the other hand, for debugging, the state of the counters array
> would be quite confusing to observe.

I thought of lots of variations, but couldn't find one that works perfectly.
One idea (that goes with the sign check on the low 32 bits) is to use a
misaligned add to memory to copy the 31st bit as a carry bit to the
high word.  The value of the counter is supposed to be

    [unsigned value of low 32 bits] + ([unsigned value of next 24 bits] << 31)
    (high 8 bits not used)

at all times, with the 31st bit zero at most times so that the carry
operation is rarely executed.  This format is only slightly confusing for
debugging (it basically has two bits of weight 2^31, with the one in the
low 32 bits rarely set).  This format can be updated using something like:

 	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
 	jns	8f
 	addl	$0x80,%%fs:3(%0)	# misaligned memory access
8: ;
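
For concreteness, decoding this format would go something like the
following (my sketch, not anything from the commit).  Note that the
carry step (clear bit 31 of the low word, add 1 at bit 32) leaves the
decoded value unchanged, since bit 32 also has weight 2^31 in this
format:

 	#include <stdint.h>

 	/* Low 32 bits plus the next 24 bits scaled by 2^31. */
 	static uint64_t
 	decode(uint64_t raw)
 	{
 		uint64_t lo = raw & 0xffffffffULL;
 		uint64_t hi24 = (raw >> 32) & 0xffffffULL;

 		return (lo + (hi24 << 31));
 	}

E.g., raw 0x0000000180000000 decodes to 2^31 + (1 << 31) = 2^32, which
is also what raw 0x0000000200000000 decodes to after the carry has been
applied.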

The update fragment is no problem if it is not preempted.  The idea of the misaligned memory
access is that the CPU needs to make this atomic enough even though it
crosses a 32-bit boundary.  BTW, is anything guaranteed if the access
also crosses a cache line or page boundary?  The OS would be involved
for page boundaries if there is a page fault.  Better not step on bugs
in that.  What about for misaligned 64-bit or 128-bit stores done by
cmpxchg8b or SSE?  Counters are 64-bit aligned, so I think that at least
CPUs that are basically 64-bit must handle this like I want.

The above doesn't work if it is preempted -- then it might do the
carry operation more than once.  But it is easy to lock, using cmpxchg
or by disabling interrupts.  Disabling interrupts requires only a little code:

 	addl	%1,%%fs:(%0)		# only small 32-bit increments supported
 	jns	8f
 	pushfl
 	cli
 	testb	$0x80,%%fs:3(%0)
 	je	1f			# carry already done by preemption
 	addl	$0x80,%%fs:3(%0)	# misaligned memory access
1:
 	popfl
8: ;
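
For reference, the fragment might be wrapped in a GCC inline asm
statement along these lines (a sketch with guessed operand bindings,
not the committed code: %0 holds the counter's offset in the %fs-based
per-CPU region and %1 a small 32-bit increment):

 	#include <stdint.h>

 	static inline void
 	counter_add_small(uint32_t off, uint32_t inc)	/* hypothetical name */
 	{
 		__asm __volatile(
 		    "addl	%1,%%fs:(%0)\n\t"
 		    "jns	8f\n\t"
 		    "pushfl\n\t"
 		    "cli\n\t"
 		    "testb	$0x80,%%fs:3(%0)\n\t"
 		    "je	1f\n\t"
 		    "addl	$0x80,%%fs:3(%0)\n"
 		    "1:\n\t"
 		    "popfl\n"
 		    "8:"
 		    : : "r" (off), "ri" (inc) : "memory", "cc");
 	}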

Another idea is to use the high bits of the high word to encode state.
They can be set atomically enough using addl/andl/orl/xorl since they
are per-CPU.  I couldn't quite get this to work.  You could increment
them to indicate an update in progress.  This only requires 1 extra
add to memory if the adcl is always done (e.g., first add 0x01000000
to the high word; then addl to the low word; then adcl of (inc >> 32)
- 0x01000000 to the high word).  Increments must be limited to not
change the state bits, of course.  The fetching code can easily examine
these bits to learn that there is a problem, but it has no good way
of handling the problem.  The code that set the bits might have been
preempted by the fetching code.  It takes something like mutex priority
propagation to get back to the code that finishes the update and clears
the state.  But we are intentionally not using full mutexes.
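
For reference, the always-done-adcl sequence just described might look
like this (my sketch of the idea, with the same hypothetical %0/%1
bindings as the earlier fragments and %2 precomputed as
(inc >> 32) - 0x01000000; as just noted, the fetch side still has no
good way out):

 	addl	$0x01000000,%%fs:4(%0)	# bump state bits: update in progress
 	addl	%1,%%fs:(%0)		# add low 32 bits of the increment
 	adcl	%2,%%fs:4(%0)		# add high bits plus carry; the
 					# -0x01000000 in %2 clears the state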

Bruce

