Fwd: 5-STABLE kernel build with icc broken

Thu Mar 31 19:15:21 PST 2005

    All I really did was implement a comment that DG had made many years
    ago in the PCB structure about making the FPU save area a pointer rather
    than hardwiring it into the PCB.

    This greatly reduces the complexity of the work required to allow
    the kernel to 'borrow' the FPU.   It basically allows the kernel
    to 'stack' save contexts rather than swap-out save contexts.  The
    result is that the cross-over point for the copy size where the FPU
    becomes economical is a much lower value (~2K rather than ~4-8K).  The
    FPU overhead differences between DFly and FreeBSD for bcopy only matter
    for buffers between 2K and 16K in size.  After that the copy itself
    overshadows the FPU setup overhead.
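
    To make the 'pointer rather than hardwired' idea concrete, here is a
    minimal userland sketch.  It is not the actual DFly code: struct
    fpu_save, struct thread_ctx, fpu_push() and fpu_pop() are hypothetical
    names, and the 512-byte image size is only an assumption borrowed from
    the FXSAVE layout.  The point is that 'stacking' a save context is just
    a pointer swap, with no copying of the save area itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a hardware FPU save image (assumed 512 bytes). */
struct fpu_save {
    unsigned char image[512];
};

/* Hypothetical, heavily simplified per-thread structure (not the real PCB). */
struct thread_ctx {
    struct fpu_save  user_save;  /* default save area for userland state */
    struct fpu_save *save_ptr;   /* where a context switch saves FPU state */
};

static void thread_init(struct thread_ctx *td)
{
    memset(td, 0, sizeof(*td));
    td->save_ptr = &td->user_save;   /* initially points at the user area */
}

/*
 * "Push" a kernel-owned save area: only the pointer changes, the user
 * save image is left untouched.  Returns the previous pointer so the
 * caller can put it back later.
 */
static struct fpu_save *fpu_push(struct thread_ctx *td,
                                 struct fpu_save *kernel_save)
{
    struct fpu_save *prev = td->save_ptr;

    td->save_ptr = kernel_save;
    return prev;
}

/* "Pop": restore the previous save-area pointer. */
static void fpu_pop(struct thread_ctx *td, struct fpu_save *prev)
{
    assert(prev != NULL);
    td->save_ptr = prev;
}
```

    A caller would typically keep the kernel save area on its own stack,
    push it before touching the FPU and pop it afterwards.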

    In DFly the kernel must still check to see whether userland has used
    the FPU and save the state before it reuses the FPU in the kernel.
    We don't bother to restore the state; we simply allow userland to take
    another fault (the idea being that if userland is making several I/O
    calls into the kernel in a batch, the FPU state is only saved once).

    Once the kernel has done this and adjusted the FPU save area it can
    use the FPU at a whim, even through blocking conditions, and then just
    throw away the FPU context when it is done.  We could theoretically
    stack multiple kernel FPU contexts through this mechanism but I don't
    see much advantage to it so I don't... I have a lockout bit so if the
    kernel is already using the FPU and takes e.g. a preemptive interrupt,
    it doesn't go and use the FPU within that preemption.
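
    The save-once and lockout logic described above can be sketched
    roughly as follows.  This is a userland model under stated
    assumptions, not the kernel code: struct cpu_fpu_state and the
    functions kern_fpu_begin() and kern_fpu_end() are hypothetical names,
    and the user_saves counter merely stands in for the real save
    operation so that the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the FPU ownership state. */
struct cpu_fpu_state {
    bool user_state_live;  /* FPU registers currently hold userland state */
    bool kernel_busy;      /* lockout bit: kernel is already using the FPU */
    int  user_saves;       /* counts userland state saves (illustration) */
};

/*
 * Try to borrow the FPU for kernel use.  Returns false if the kernel is
 * already using it (e.g. we are inside a preempting interrupt), in which
 * case the caller should fall back to a non-FPU copy.
 */
static bool kern_fpu_begin(struct cpu_fpu_state *s)
{
    if (s->kernel_busy)
        return false;                /* lockout: do not nest */
    if (s->user_state_live) {
        s->user_saves++;             /* stands in for saving user state */
        s->user_state_live = false;  /* user must fault to get it back */
    }
    s->kernel_busy = true;
    return true;
}

/*
 * Done with the FPU.  The kernel context is simply thrown away and the
 * userland state is NOT restored here; userland reloads it lazily on
 * its next FPU fault.
 */
static void kern_fpu_end(struct cpu_fpu_state *s)
{
    assert(s->kernel_busy);
    s->kernel_busy = false;
}
```

    Calling kern_fpu_begin()/kern_fpu_end() several times in a row without
    userland touching the FPU in between saves the user state only on the
    first call, which is the batching effect mentioned above.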

    The use of the XMM registers is a CPU optimization.  Modern CPUs,
    especially AMD Athlons and Opterons, are more efficient with 128 bit
    moves than with 64 bit moves.   I experimented with all sorts of
    configurations, including the use of special data caching instructions,
    but they had so many special cases and degenerate conditions that
    I found that simply using straight XMM instructions, reading as big
    a glob as possible, then writing the glob, was by far the best solution.
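
    As an illustration of 'read a big glob, then write the glob', here is
    a sketch using SSE2 intrinsics in userland C.  It is not the kernel
    routine: copy_xmm() is a hypothetical name, the 128-byte chunk (eight
    XMM registers) and the use of unaligned loads and stores are my
    assumptions for the example, and on a compiler without SSE2 it simply
    falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: copy len bytes between non-overlapping buffers. */
static void copy_xmm(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    /* Load 8 x 16 bytes first, then store them all: reads, then writes. */
    while (len >= 128) {
        __m128i r0 = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i r1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i r2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i r3 = _mm_loadu_si128((const __m128i *)(s + 48));
        __m128i r4 = _mm_loadu_si128((const __m128i *)(s + 64));
        __m128i r5 = _mm_loadu_si128((const __m128i *)(s + 80));
        __m128i r6 = _mm_loadu_si128((const __m128i *)(s + 96));
        __m128i r7 = _mm_loadu_si128((const __m128i *)(s + 112));

        _mm_storeu_si128((__m128i *)(d + 0), r0);
        _mm_storeu_si128((__m128i *)(d + 16), r1);
        _mm_storeu_si128((__m128i *)(d + 32), r2);
        _mm_storeu_si128((__m128i *)(d + 48), r3);
        _mm_storeu_si128((__m128i *)(d + 64), r4);
        _mm_storeu_si128((__m128i *)(d + 80), r5);
        _mm_storeu_si128((__m128i *)(d + 96), r6);
        _mm_storeu_si128((__m128i *)(d + 112), r7);

        s += 128;
        d += 128;
        len -= 128;
    }
#endif
    /* Tail (or the whole buffer when SSE2 is unavailable). */
    memcpy(d, s, len);
}
```

    Note that the loop issues no stores other than the ones into the
    destination buffer; the loop bookkeeping can stay in registers.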

    The key for fast block copying is to not issue any memory writes other
    than those related directly to the data being copied.  This avoids
    unnecessary RAS cycles which would otherwise kill copying performance.
    In tests I found that copying multi-page blocks in a single loop was
    far more efficient than copying data page-by-page precisely because
    page-by-page copying was too complex to be able to avoid extraneous
    writes to memory unrelated to the target buffer in between each page copy.

						-Matt