[PATCH] Mantaining turnstile aligned to 128 bytes in i386 CPUs

Wed Jan 17 20:47:42 UTC 2007

    The cost of using the FPU can simply be thought of in terms of how
    many bytes you have to copy for it to become worth using the FPU
    over a far less complex integer copy loop.

    This is really easy to find out, and it is also fairly easy to instrument
    a sysctl to set the value used in the comparison and run benchmarks to
    determine at what point using the FP unit becomes the better choice.
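
    As a rough sketch of that single tunable (not code from any actual
    kernel): the names copy_fpu_threshold, set_copy_fpu_threshold() and
    choose_copy_path() are made up for illustration, and the 2048 byte
    default is a placeholder rather than a measured value.  In a real
    kernel the variable would be exported through a sysctl and adjusted
    between benchmark runs.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical tunable.  In a kernel this would be exported through a
 * sysctl; the default below is a placeholder, not a measured value.
 */
static size_t copy_fpu_threshold = 2048;

enum copy_path { COPY_INTEGER, COPY_FPU };

/* Set the threshold, e.g. from a sysctl handler while benchmarking. */
static void
set_copy_fpu_threshold(size_t bytes)
{
	/* A zero threshold would send every copy down the FPU path. */
	assert(bytes > 0);
	copy_fpu_threshold = bytes;
}

/* Pick the copy path for a request of len bytes. */
static enum copy_path
choose_copy_path(size_t len)
{
	return (len >= copy_fpu_threshold) ? COPY_FPU : COPY_INTEGER;
}
```

    Benchmarking then amounts to sweeping the threshold and timing a
    fixed mix of copy sizes at each setting.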

    * Saving the FP state.  The kernel doesn't have to save or restore
      anything if userland was not using the floating point unit.  In
      fact, the kernel doesn't even need to FNINIT!  All the kernel needs
      to do is CLTS and FNCLEX to make the FP unit usable for media copy
      instructions, then set CR0_TS when it is finished.

      Gee, that's nice!  But if on the other hand userland is using the
      floating point unit in between every system call, then having the
      kernel try to use it does require calling fxsave and clearing
      npxthread, which means serious inefficiencies if userland is using
      the FP unit heavily.  Or, alternatively, it can fxsave AND restore
      the state when it is done at a total cost of around 70ns plus write
      bandwidth cruft.

      In fact, I would say that if userland is not using the FP unit,
      that is npxthread == NULL or npxthread != curthread, you should
      *DEFINITELY* use the FP unit.  Hands down, no question about it.

    * First, raw memory bandwidth is governed by RAS cycles.  The fewer RAS
      cycles you have, the higher the bandwidth.

      This means that the more data you can load into the cpu on the 'read'
      side of the copy before transitioning to the 'write' side, the better.

      With XMM you can load 128 *BYTES* a shot (eight 128 bit registers
      at 16 bytes each).  For large copies, nothing beats it.

    * Modern cpu hardware uses a 128 bit data path for 128 bit media
      instructions and can optimize the 128 bit operation all the way through
      to a cache line or to main memory.  It can't be beat.

      Alignment is critical.  If the data is not aligned, don't bother.  128
      bits means 16 byte alignment.

    * No extraneous memory writes, no uncached extraneous memory reads.
      If you do any writes to memory other than to the copy destination
      in your copy loop you screw up the cpu's write fifo and destroy
      performance.

      Systems are so sensitive to this that it is even better to spend the
      time linearly mapping large copy spaces into KVM and do a single
      block copy than to have an inner per-PAGE loop.

    * Use of prefetch or use of movntdq instead of movdqa is highly 
      problematic.  It is possible to use these to optimize very particular
      cases but the problem is they tend to nerf all OTHER cases. 

      I've given up trying to use either mechanism.  Instead, I prefer
      copying as large a block as possible to remove these variables from
      the cpu pipeline entirely.  The cpu has a write fifo anyway; you
      don't need prefetch instructions if you can use instructions to write
      to memory faster than available L2 cache bandwidth.  On some cpus
      this mandates the use of 64 or 128 bit media instructions or the
      cpu can't keep the write FIFO full and starts interleaving reads
      and writes on the wrong boundaries (creating more RAS cycles, which
      is what kills copy bandwidth).

    * RAS transitions also have to be aligned or you get boundary cases
      when the memory address transitions a RAS line.  This again mandates
      maximal alignment (even more than 16 bytes, frankly, which is why
      being able to do 128 byte blocks with XMM registers is so nice).
      Even though reads and writes are reblocked to the cache line size
      by the cpu, your inner loop can still transition a RAS boundary in
      the middle of a large block read if it isn't aligned.

      But at this point the alignment requirements start to get kinda
      silly.  128 byte alignment requirement?  I don't think so.  I 
      do a 16-byte alignment check in DragonFly as a pre-req for using
      XMM and that's it.
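
    To make the points above concrete, here is a userland sketch (not
    the DragonFly code) of a 16-byte alignment pre-check followed by a
    copy loop that loads 128 bytes on the read side before doing any
    writes.  The names xmm_copy_ok() and block_copy() are invented for
    this example.  It uses SSE2 intrinsics when the compiler provides
    them and falls back to memcpy() otherwise.  The compiler is free to
    reschedule the loads and stores, so a real kernel routine would
    more likely be hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

#define XMM_ALIGN 16u	/* one 128 bit register */
#define XMM_BLOCK 128u	/* eight 128 bit registers */

/* Nonzero if both pointers are 16-byte aligned. */
static int
xmm_copy_ok(const void *dst, const void *src)
{
	return ((((uintptr_t)dst | (uintptr_t)src) & (XMM_ALIGN - 1)) == 0);
}

/* Copy len bytes between non-overlapping buffers. */
static void
block_copy(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	assert(len == 0 || (d != NULL && s != NULL));
#ifdef __SSE2__
	if (xmm_copy_ok(d, s)) {
		while (len >= XMM_BLOCK) {
			__m128i r[8];
			int i;

			/* Read side: eight aligned 16-byte loads. */
			for (i = 0; i < 8; i++)
				r[i] = _mm_load_si128(
				    (const __m128i *)(s + 16 * i));
			/* Write side: ordinary aligned stores, not non-temporal. */
			for (i = 0; i < 8; i++)
				_mm_store_si128((__m128i *)(d + 16 * i), r[i]);
			s += XMM_BLOCK;
			d += XMM_BLOCK;
			len -= XMM_BLOCK;
		}
	}
#endif
	/* Tail bytes, unaligned buffers, or non-SSE2 builds: plain copy. */
	memcpy(d, s, len);
}
```

    Nothing but the destination is written inside the loop, and only
    the 16-byte check gates the XMM path, with no attempt at 128-byte
    alignment.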

    But, as I said in the beginning... all you need is just one variable.
    Copying data below that threshold is faster without the FP unit, copying
    data above that threshold is faster with the FP unit.  Implement it,
    test it, and see how you fare.  If you are paranoid about having to 
    save the FP state, then only use the FP unit when npxthread == NULL
    (no save required) or npxthread != curthread (save on behalf of a
    different thread required, which is ok)... It's that simple.
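
    The paranoid variant of that test can be sketched as below.  This
    is a userland model only: struct thread here is a stand-in, and
    classify_fp_owner() and fp_copy_allowed() are hypothetical names,
    with npxthread and curthread passed in as plain arguments rather
    than read from per-cpu state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct thread { int td_id; };	/* stand-in for the kernel's thread struct */

enum fp_owner {
	FP_UNOWNED,	/* npxthread == NULL: no save required */
	FP_OTHER,	/* npxthread != curthread: save on that thread's behalf */
	FP_CURRENT	/* npxthread == curthread: our own FP state is live */
};

/* Work out who owns the FP state relative to the running thread. */
static enum fp_owner
classify_fp_owner(const struct thread *npxthread,
    const struct thread *curthread)
{
	assert(curthread != NULL);
	if (npxthread == NULL)
		return FP_UNOWNED;
	return (npxthread != curthread) ? FP_OTHER : FP_CURRENT;
}

/* Paranoid policy: skip the FP unit when the current thread owns it. */
static bool
fp_copy_allowed(const struct thread *npxthread,
    const struct thread *curthread)
{
	return classify_fp_owner(npxthread, curthread) != FP_CURRENT;
}
```

    A copy routine would consult this check only for copies already
    above the size threshold, and take the integer loop otherwise.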

    Pinning is an issue with FreeBSD, one whose effect I cannot comment on.

    I don't know about AMD64.  You only have 64 bit general registers in 64
    bit mode so you may not be able to keep the write pipeline full.  But
    you do have 8 of them so you are roughly equivalent to MMX (but not
    XMM's 8 128 bit registers).

							-Matt