[PATCH] Maintaining turnstile aligned to 128 bytes in i386 CPUs
Matthew Dillon
dillon at apollo.backplane.com
Wed Jan 17 20:47:42 UTC 2007
The cost of using the FPU can simply be thought of in terms of how
many bytes you have to copy for it to become worth using the FPU
over a far less complex integer copy loop.
This is really easy to find out, and it is also fairly easy to instrument
a sysctl to set the value used in the comparison and run benchmarks to
determine at what point using the FP unit becomes the better choice.
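
As a rough illustration of the single tunable described above, here is a
minimal user-space sketch in C. The names copy_fpu_threshold and
choose_copy_path are hypothetical, and the default of 2048 is only a
placeholder, not a measured value. In a kernel the variable would be
exported through a sysctl so it can be changed between benchmark runs.

```c
#include <stddef.h>

/*
 * Hypothetical tunable.  In a kernel this would be exported through a
 * sysctl; the default below is a placeholder, not a benchmark result.
 */
static size_t copy_fpu_threshold = 2048;

enum copy_path { COPY_INTEGER, COPY_FPU };

/* Pick the copy loop based only on the length of the copy. */
static enum copy_path
choose_copy_path(size_t len)
{
	return (len >= copy_fpu_threshold) ? COPY_FPU : COPY_INTEGER;
}
```

Benchmarking then amounts to sweeping copy_fpu_threshold and timing
copies of various sizes on each side of it.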
* Saving the FP state. The kernel doesn't have to save or restore
anything if userland was not using the floating point unit. In
fact, the kernel doesn't even need to FNINIT! All the kernel needs
to do is CLTS and FNCLEX to make the FP unit usable for media copy
instructions, then set CR0_TS when it is finished.
Gee, that's nice! But if on the other hand userland is using the
floating point unit in between every system call, then having the
kernel try to use it does require calling fxsave and clearing
npxthread, which means serious inefficiencies if userland is using
the FP unit heavily. Or, alternatively, it can fxsave AND restore the
state when it is done at a total cost of around 70ns plus write
bandwidth cruft.
In fact, I would say that if userland is not using the FP unit,
that is npxthread == NULL or npxthread != curthread, you should
*DEFINITELY* use the FP unit. Hands down, no question about it.
* First, raw memory bandwidth is governed by RAS cycles. The fewer RAS
cycles you have, the higher the bandwidth.
This means that the more data you can load into the cpu on the 'read'
side of the copy before transitioning to the 'write' side, the better.
With XMM you can load 128 *BYTES* a shot (eight 128 bit registers). For
large copies, nothing beats it.
* Modern cpu hardware uses a 128 bit data path for 128 bit media
instructions and can optimize the 128 bit operation all the way through
to a cache line or to main memory. It can't be beat.
Alignment is critical. If the data is not aligned, don't bother. 128
bits means 16 byte alignment.
* No extraneous memory writes, no uncached extraneous memory reads.
If you do any writes to memory other than to the copy destination
in your copy loop you screw up the cpu's write fifo and destroy
performance.
Systems are so sensitive to this that it is even better to spend the
time linearly mapping large copy spaces into KVM and do a single
block copy than to have an inner per-PAGE loop.
* Use of prefetch or use of movntdq instead of movdqa is highly
problematic. It is possible to use these to optimize very particular
cases but the problem is they tend to nerf all OTHER cases.
I've given up trying to use either mechanism. Instead, I prefer
copying as large a block as possible to remove these variables from
the cpu pipeline entirely. The cpu has a write fifo anyway, you
don't need prefetch instructions if you can use instructions to write
to memory faster than available L2 cache bandwidth. On some cpus
this mandates the use of 64 or 128 bit media instructions or the
cpu can't keep the write FIFO full and starts interleaving reads
and writes on the wrong boundaries (creating more RAS cycles, which
is what kills copy bandwidth).
* RAS transitions also have to be aligned or you get boundary cases
when the memory address transitions a RAS line. This again mandates
maximal alignment (even more than 16 bytes, frankly, which is why
being able to do 128 byte blocks with XMM registers is so nice).
Even though reads and writes are reblocked to the cache line size
by the cpu, your inner loop can still transition a RAS boundary in
the middle of a large block read if it isn't aligned.
But at this point the alignment requirements start to get kinda
silly. 128 byte alignment requirement? I don't think so. I
do a 16-byte alignment check in DragonFly as a pre-req for using
XMM and that's it.
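
To make the points above concrete, here is a user-space sketch in C of a
16-byte alignment check and a 128-byte-per-iteration copy using SSE2
intrinsics from <emmintrin.h>. This is not the DragonFly code. The
function names xmm_copy_ok and xmm_copy_blocks are made up for
illustration, the requirement that the length be a multiple of 128 is a
simplification of this sketch, and a C compiler is free to reorder the
loads and stores, so a real kernel routine would more likely be
hand-written assembly using movdqa.

```c
#include <emmintrin.h>	/* SSE2: _mm_load_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/*
 * Nonzero if both pointers are 16-byte aligned and len is a multiple
 * of 128 bytes (the length rule is an assumption of this sketch).
 */
static int
xmm_copy_ok(const void *dst, const void *src, size_t len)
{
	return ((((uintptr_t)dst | (uintptr_t)src) & 15) == 0 &&
	    (len % 128) == 0);
}

/*
 * Copy len bytes in 128-byte blocks: eight aligned 16-byte loads,
 * then eight aligned 16-byte stores.  The caller must have checked
 * xmm_copy_ok() first; aligned loads fault on unaligned addresses.
 */
static void
xmm_copy_blocks(void *dst, const void *src, size_t len)
{
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	size_t n;

	for (n = len / 128; n > 0; n--) {
		/* 'read' side: fill eight XMM-sized temporaries */
		__m128i r0 = _mm_load_si128(s + 0);
		__m128i r1 = _mm_load_si128(s + 1);
		__m128i r2 = _mm_load_si128(s + 2);
		__m128i r3 = _mm_load_si128(s + 3);
		__m128i r4 = _mm_load_si128(s + 4);
		__m128i r5 = _mm_load_si128(s + 5);
		__m128i r6 = _mm_load_si128(s + 6);
		__m128i r7 = _mm_load_si128(s + 7);

		/* 'write' side: store all eight to the destination */
		_mm_store_si128(d + 0, r0);
		_mm_store_si128(d + 1, r1);
		_mm_store_si128(d + 2, r2);
		_mm_store_si128(d + 3, r3);
		_mm_store_si128(d + 4, r4);
		_mm_store_si128(d + 5, r5);
		_mm_store_si128(d + 6, r6);
		_mm_store_si128(d + 7, r7);

		s += 8;
		d += 8;
	}
}
```

The loop writes only to the destination, with no prefetch and no
non-temporal stores, in keeping with the points above.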
But, as I said in the beginning... all you need is just one variable.
Copying data below that threshold is faster without the FP unit, copying
data above that threshold is faster with the FP unit. Implement it,
test it, and see how you fare. If you are paranoid about having to
save the FP state, then only use the FP unit when npxthread == NULL
(no save required) or npxthread != curthread (save on behalf of a
different thread required, which is ok)... It's that simple.
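
The paranoid variant of that test might be sketched in C as follows. The
struct and both function names are hypothetical stand-ins: npx_owner
plays the role of npxthread and cur plays the role of curthread. This
models only the decision, not the CLTS/FNCLEX or fxsave work itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's thread structure. */
struct thread {
	int td_id;
};

/*
 * Nonzero when the current thread's FP state is not live in the FPU:
 * either nobody owns the FPU (no save required) or a different
 * thread owns it (save on behalf of that thread).
 */
static int
fpu_copy_is_cheap(const struct thread *npx_owner, const struct thread *cur)
{
	return (npx_owner == NULL || npx_owner != cur);
}

/* Combine the size threshold with the FPU ownership test. */
static int
use_fpu_copy(size_t len, size_t threshold,
    const struct thread *npx_owner, const struct thread *cur)
{
	return (len >= threshold && fpu_copy_is_cheap(npx_owner, cur));
}
```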
Pinning is an issue with FreeBSD, one whose effect I cannot comment on.
I don't know about AMD64. You only have 64 bit general registers in 64
bit mode so you may not be able to keep the write pipeline full. But
you do have 8 of them so you are roughly equivalent to MMX (but not
XMM's 8 128 bit registers).
-Matt