Possible instruction pipelining problem between HT's on the same die ?

Sat Jun 4 02:29:06 GMT 2005

On Fri, 2005-06-03 at 18:47, Matthew Dillon wrote:
> :This is normal behaviour.
> :Take a look at IA-32 Intel Developers ... Vol 3,  
> :Section: 7.2.2 for details + solutions.
> :
> :Stephan
> 
>     Ok.. that section seems to indicate that speculative reads 
>     can pass writes, but it also says that the pipeline sniffs the address
>     within the processor and ensures proper ordering.  The latter part
>     makes sense within the context of a single cpu, but the big question is: 
>     Is that supposed to hold true for interactions with HT cpus (that share
>     the pipeline) as well?  Or not ?  It seems not.

Memory ordering for logical (HT) CPUs is the same as for physical CPUs
(see 7.6.1.9).

> 
>     Speculative reads creating out of order situations seems to be the
>     biggest issue.  The AMD manual (Programmers manual volume 3 page
>     186, MFENCE instruction) says this:
> 
>     "The MFENCE instruction is weakly-ordered with respect to data and
>     instruction prefetches.  Speculative loads initiated by the processor,
>     or specified explicitly using cache-prefetch instructions, can be 
>     reordered around an MFENCE".

Speculative loads can pass an MFENCE, but they cannot pass load
operations issued before the MFENCE.

>     This seems to be different than what the Intel manual says, and doesn't
>     make much sense.  What's the point of having a fence instruction if it
>     can't guarantee read/write ordering?  Is the AMD manual simply wrong?

Not wrong - just confusing.

	READ A
	MFENCE
	READ B

can cause

	READ A
	Speculative READ B
	MFENCE

but NOT

	Speculative READ B
	READ A
	MFENCE
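
For illustration, here is a minimal C sketch of the READ A / MFENCE / READ B
sequence. It assumes GCC or Clang inline assembly on x86 with SSE2; on other
targets it falls back to the compiler's generic full fence. The names
full_fence and read_a_then_b are made up for this example and do not come
from either manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full memory fence: MFENCE on x86 (needs SSE2), compiler builtin elsewhere. */
static inline void full_fence(void)
{
#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
    __asm__ __volatile__("mfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/*
 * READ A; MFENCE; READ B.
 * Returns the value read from a and stores the value read from b in *b_out.
 * The "memory" clobber also stops the compiler from moving the load of b
 * above the fence.
 */
static inline uint32_t read_a_then_b(const volatile uint32_t *a,
                                     const volatile uint32_t *b,
                                     uint32_t *b_out)
{
    assert(a != NULL && b != NULL && b_out != NULL);

    uint32_t va = *a;   /* READ A */
    full_fence();       /* MFENCE */
    *b_out = *b;        /* READ B */
    return va;
}
```

The sketch only pins down program order around the fence. Whether the
hardware starts the load of B early is exactly the point discussed above.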

>     Other than that, the Intel manual does indicate that speculative reads
>     will not pass locked bus cycle instructions (the AMD manual says nothing
>     about that that I can see). 

See AMD Volume 1, section 3.9.2.

>  So, presumably, doing a dummy locked bus 
>     cycle operation on e.g. the top of the stack, such as Linux does, would
>     be sufficient to ensure read ordering.  Would you concur with that
>     assessment?

Yes
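
A rough sketch of such a dummy locked operation, assuming GCC or Clang inline
assembly. The "lock; addl $0" on the top of the stack is modelled on what
Linux does, as mentioned above; the function names locked_barrier and
read_index_then_data are hypothetical. Adding 0 leaves the stack contents
unchanged, so only the locked bus cycle matters.

```c
#include <assert.h>
#include <stddef.h>

/* Dummy locked read-modify-write on the top of the stack. */
static inline void locked_barrier(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; addl $0,0(%%esp)" ::: "memory", "cc");
#else
    /* Not x86: fall back to the compiler's generic full fence. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/*
 * Reader side of the index/data case: read the index (B) first, then
 * issue the locked operation, then read the data (A).
 */
static inline int read_index_then_data(const volatile int *index,
                                       const volatile int *data,
                                       int *data_out)
{
    assert(index != NULL && data != NULL && data_out != NULL);

    int idx = *index;   /* read B */
    locked_barrier();   /* the read of A should not pass this */
    *data_out = *data;  /* read A */
    return idx;
}
```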

>     What's really horrible here is that the 'old' value of the data being
>     used is modified at location A something like 30 instructions prior to 
>     the instruction that updates the index (B).   I think this is a 
>     situation that can only occur in an HT configuration, and then only if
>     the speculative read issued by the HT cpu is being held across
>     30 instructions executed by the primary cpu before the HT cpu issues the
>     read of B.
> 
>     cpu #0 			cpu #1 (HT cpu on same die as cpu #0)
> 
> 				speculatively read A
>     write A			(stalled)
>     [30 instructions]		(stalled x 30)
>     write B			(stalled)
> 				read B
> 				see that B has been updated
> 				read A (get old value for A instead of new)
> 
>     Is that even possible ?  Not only the 30 instruction latency, but also
>     the fact that even with the shared pipeline you have a speculative read
>     on the HT cpu surviving 30 instructions running on cpu #0 (but only one
>     or two on the HT cpu)... even though they share the same pipeline.

Take a look at store buffers.
Reads have a higher priority than writes on some CPUs, and data may
even sit in a store buffer indefinitely (where it cannot be observed
by other CPUs).
Reading some of the Intel and AMD errata gives you a good picture.
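
If you would rather not depend on the details of any one CPU, the
write A / write B versus read B / read A pattern can also be expressed with
explicit ordering on both sides. The following is only a sketch, assuming a
compiler with C11 <stdatomic.h>; struct mailbox, publish and consume are
hypothetical names, not code from any kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct mailbox {
    int        data;   /* "A": the payload */
    atomic_int index;  /* "B": bumped after the payload is written */
};

/* Writer: write A, then update B with release ordering. */
static inline void publish(struct mailbox *m, int value)
{
    assert(m != NULL);
    m->data = value;
    atomic_fetch_add_explicit(&m->index, 1, memory_order_release);
}

/*
 * Reader: read B with acquire ordering, then read A.
 * Returns 1 and fills *out if the index moved past last_seen, else 0.
 */
static inline int consume(struct mailbox *m, int last_seen, int *out)
{
    assert(m != NULL && out != NULL);
    int idx = atomic_load_explicit(&m->index, memory_order_acquire);
    if (idx == last_seen)
        return 0;
    *out = m->data;
    return 1;
}
```

With this pairing, a reader that sees the new index is also guaranteed by
the C11 memory model to see the data written before it. The sketch assumes
a single writer.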

Stephan