Fwd: 5-STABLE kernel build with icc broken
Matthew Dillon
dillon at apollo.backplane.com
Thu Mar 31 19:15:21 PST 2005
All I really did was implement a comment that DG had made many years
ago in the PCB structure about making the FPU save area a pointer rather
than hardwiring it into the PCB.
This greatly reduces the complexity of work required to allow
the kernel to 'borrow' the FPU. It basically allows the kernel
to 'stack' save contexts rather than swap-out save contexts. The
result is that the cross-over point for the copy size where the FPU
becomes economical is a much lower value (~2K rather than ~4-8K). The
FPU overhead differences between DFly and FreeBSD for bcopy only matter
for buffers between 2K and 16K in size. After that the copy itself
overshadows the FPU setup overhead.
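
To make the pointer idea concrete, here is a minimal userland sketch in C. It is not the DragonFly or FreeBSD code: the names pcb_sketch, fpu_save_area, fpu_push_area and fpu_pop_area are hypothetical, and the 512-byte image is only assumed because that is the size of an FXSAVE image. It shows how retargeting a pointer lets a temporary save area be stacked on top of the userland one without copying either.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware FPU save image (FXSAVE uses 512 bytes). */
struct fpu_save_area {
    unsigned char image[512];
};

/* Hypothetical PCB: the save area is reached through a pointer
 * instead of being hardwired into the structure. */
struct pcb_sketch {
    struct fpu_save_area *fpu_save;  /* where the next FPU save would land */
    struct fpu_save_area  user_area; /* default target, holds userland state */
};

/* Point the PCB at a temporary save area and return the previous
 * target so the caller can put it back later ("stacking"). */
static struct fpu_save_area *
fpu_push_area(struct pcb_sketch *pcb, struct fpu_save_area *tmp)
{
    struct fpu_save_area *prev = pcb->fpu_save;

    assert(tmp != NULL);
    pcb->fpu_save = tmp;
    return prev;
}

/* Drop the temporary area by restoring the previous target.
 * Nothing is copied; the temporary contents are simply abandoned. */
static void
fpu_pop_area(struct pcb_sketch *pcb, struct fpu_save_area *prev)
{
    pcb->fpu_save = prev;
}
```

With a hardwired save area the same borrow would presumably need the userland image copied out and back in, which is the swap-out cost described above.
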
In DFly the kernel must still check to see whether userland has used
the FPU and save the state before it reuses the FPU in the kernel.
We don't bother to restore the state, we simply allow userland to take
another fault (the idea being that if userland is making several I/O
calls into the kernel in a batch, the FPU state is only saved once).
Once the kernel has done this and adjusted the FPU save area it can
use the FPU at a whim, even through blocking conditions, and then just
throw away the FPU context when it is done. We could theoretically
stack multiple kernel FPU contexts through this mechanism but I don't
see much advantage to it so I don't... I have a lockout bit so if the
kernel is already using the FPU and takes e.g. a preemptive interrupt,
it doesn't go and use the FPU within that preemption.
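
The save-once and lockout behaviour can be sketched as a small state machine. This is an illustration only, assuming a per-thread structure with two flags; thread_sketch, kernel_fpu_begin and kernel_fpu_end are hypothetical names, and the saves counter stands in for the actual hardware save.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread FPU bookkeeping. */
struct thread_sketch {
    bool user_fpu_live;   /* userland state currently sits in the FPU */
    bool kernel_fpu_busy; /* lockout bit: kernel has already borrowed the FPU */
    int  saves;           /* counts simulated saves of userland state */
};

/* Try to borrow the FPU for kernel use. Returns false when the kernel
 * is already using it (e.g. we are inside a preempting interrupt), in
 * which case the caller should fall back to a non-FPU code path. */
static bool
kernel_fpu_begin(struct thread_sketch *td)
{
    if (td->kernel_fpu_busy)
        return false;
    if (td->user_fpu_live) {
        td->saves++;               /* save userland state exactly once */
        td->user_fpu_live = false; /* userland will fault to reload it */
    }
    td->kernel_fpu_busy = true;
    return true;
}

/* Finish kernel use. The kernel context is thrown away and userland
 * state is deliberately not restored here. */
static void
kernel_fpu_end(struct thread_sketch *td)
{
    assert(td->kernel_fpu_busy); /* end without begin would be a bug */
    td->kernel_fpu_busy = false;
}
```

In this sketch a second borrow before userland touches the FPU again does not bump the save count, which mirrors the batch-of-I/O-calls case described above.
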
The use of the XMM registers is a CPU optimization. Modern CPUs,
especially AMD Athlons and Opterons, are more efficient with 128 bit
moves than with 64 bit moves. I experimented with all sorts of
configurations, including the use of special data caching instructions,
but they had so many special cases and degenerate conditions that
I found that simply using straight XMM instructions, reading as big
a glob as possible, then writing the glob, was by far the best solution.
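
A rough userland illustration of the read-a-glob-then-write-it pattern, using SSE2 intrinsics rather than the kernel's own assembly. The function name xmm_copy_sketch and the 64-byte glob (four XMM registers) are assumptions made for the example, not the sizes used in the real bcopy; buffers are assumed not to overlap, and unaligned loads and stores are used so that no alignment is required.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes from src to dst (non-overlapping). On SSE2 targets the
 * bulk is moved 64 bytes at a time: four 128-bit loads, then four
 * 128-bit stores. The tail (and everything on non-SSE2 targets) is
 * handed to memcpy. */
static void
xmm_copy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    while (n >= 64) {
        /* Read the whole glob first... */
        __m128i a = _mm_loadu_si128((const __m128i *)(s));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));
        /* ...then write it, touching only the destination buffer. */
        _mm_storeu_si128((__m128i *)(d), a);
        _mm_storeu_si128((__m128i *)(d + 16), b);
        _mm_storeu_si128((__m128i *)(d + 32), c);
        _mm_storeu_si128((__m128i *)(d + 48), e);
        s += 64;
        d += 64;
        n -= 64;
    }
#endif
    memcpy(d, s, n);
}
```

The loop keeps its working state in registers, so the only stores it issues on its own are to the destination buffer.
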
The key for fast block copying is to not issue any memory writes other
than those related directly to the data being copied. This avoids
unnecessary RAS cycles which would otherwise kill copying performance.
In tests I found that copying multi-page blocks in a single loop was
far more efficient than copying data page-by-page precisely because
page-by-page copying was too complex to be able to avoid extraneous
writes to memory unrelated to the target buffer in between each page copy.
-Matt