New TCP/IP checksum code and a HOWTO on how to modernize and fix FreeBSD's FP-unit use in the kernel (was Re: ether_crc32_[bl]e())

Fri Jun 4 12:56:38 PDT 2004

:Is someone interested in improving our IP checksum code too?
:
:On i386 it uses assembly language which "works ok" with gcc 3.x (so
:far), but it isn't guaranteed it will work with future versions of gcc.
:Intels C compiler already has problems with it (and it's verified it's
:because of bugs in the asm code), so in case of the use of icc a C
:version is used.
:
:All other architectures use a C version ("MD" code, even if it could be
:made MI... at least it could be shared between the big and little endian
:architectures).
:
:Matthew Dillon rewrote the IP checksum code in dragonfly:
:---snip---
:  Modified files:
:    sys/conf             files.i386 
:    sys/i386/i386        in_cksum.c 
:    sys/i386/include     in_cksum.h 
:    sys/netinet          igmp.c in.h ip_icmp.c 
:  Added files:
:    sys/i386/i386        in_cksum2.s 
:  Log:
:  Rewrite the IP checksum code.  Get rid of all the inline assembly garbage,
:  get rid of old APIs that are no longer used, and build a new 'core' checksum
:...

    I should add that I finally got tired of the original checksum code
    failing at high -On optimization levels with GCC2 and GCC3, not to
    mention the ridiculously unreadable source that tried to optimize down
    to the byte, which is just dumb.  That is why I rewrote it.

    At this point the code has been in our tree for quite a while, with no
    complaints, so I would regard it as 'extremely well tested'.

    I strongly recommend that FreeBSD adopt either this code or the core of
    this code and scrap that awful C-hybrid-inline-assembly junk.

    Also, final note: remember that IP/TCP checksums are 1's complement
    checksums, which means that they are byte-order-agnostic (the byte order
    of the result will be the same as the byte order of the data, and the
    result will be correct, so if the original data is in network byte order,
    the resulting 1's complement checksum using normal (non translating)
    instructions will also be in network byte order).  This is why one can
    simply use adcl instructions on an Intel/AMD cpu.  In fact, it might be
    possible to use 128-bit media instructions but that's probably overkill.
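
    To illustrate the byte-order point, here is a minimal C sketch (not the
    DragonFly code; cksum_native is a made-up name).  It sums 16-bit words
    loaded in native byte order, folds the carries back in, and complements.
    Stored back to memory with a plain store, the result should have its
    bytes in the same order as the data, whatever the cpu's endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: 1's complement checksum over 'len' bytes, using
 * native-order 16-bit loads and no byte swapping anywhere.
 */
static uint16_t
cksum_native(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t sum = 0;
	uint16_t w;

	assert(data != NULL || len == 0);

	while (len >= 2) {
		memcpy(&w, p, 2);	/* native-order load */
		sum += w;
		p += 2;
		len -= 2;
	}
	if (len == 1) {
		/* Pad a trailing odd byte with a zero byte, in memory order. */
		unsigned char tail[2] = { *p, 0 };

		memcpy(&w, tail, 2);
		sum += w;
	}

	/* End-around carry: fold everything above 16 bits back in. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

    For the sample bytes 00 01 f2 03 f4 f5 f6 f7 from RFC 1071, storing the
    returned value with memcpy() gives the bytes 22 0d on both little-endian
    and big-endian machines, and running the routine again over the data
    with those two bytes appended returns 0.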

    --

    You might also want to look at our new MMX/XMM optimized
    bcopy/copyin/copyout.  That was a lot harder to get right (and, most
    especially, it was a lot harder to make the FP state in kernel mode
    be properly saved and restored).  I lost a few filesystems on my
    test box getting the code right :-).  You would not be able to copy it
    directly since our FP state handling is very different from FreeBSD's
    now (which I will describe below), but you ought to be able to use the
    core MMX code.

    Right now FreeBSD is using old FP-stack instructions.  This runs about
    as fast as MMX on an Athlon but, generally speaking, it is a decrepit,
    obsolete use of the FP unit.

    On DFly I made the following changes:

    * I implemented the comment in the old FBSD code (I think DG or Dyson
      made the comment) about having a separate FP save area pointer.  This
      allows the kernel to use the FP unit trivially rather than having
      to copy/restore the user process's FP save area.  This saves an ungodly
      number of cycles in the copy path and greatly simplifies the ability
      of the kernel to use the FP unit.

      Our FP copy code's overhead is now such that we can use the FP unit
      for copies half the size as on FreeBSD and it will still be faster
      than an integer copy, and XMM copies are much, much faster (esp on
      Athlon64's and Opterons) versus the old FBSD code.

    * DFly guarantees that if the FP unit is marked unused, the FP state is
      such that no fninit is required prior to using FP instructions.  This
      saves ~50+ cycles in the best-case copy path.

    * FP-in-use-by-kernel is a per-cpu bit.

    * DFly does not try to optimize copies on pre-fxsave machines.  The 
      minimum required support is FXSAVE + MMX now.  If XMM is available
      (SSE2), then 128 bit media instructions will be used.  I saw no point
      in retaining code that was only just a bit faster than the integer
      code on old machines.

    * I scrapped the old integer copy code as being too complex and
      rewrote it using a middle-of-the-road integer copy (rather than
      having umpteen versions of integer copy).

    * If you attempt to use more of our code, remember that DFly does not
      preemptively migrate threads across cpus so our code doesn't have
      to worry about that.

    * I scrapped the separately-optimized copyin/copyout code and wrote a
      more generic pcb_oncall capability that allows the copy routine itself
      to push the restoration function on its stack, so the same optimized
      copy code is now used for ALL copies (memcpy, bcopy, copyin, copyout).
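
    The stacked restore-function idea in the last item can be modelled in
    plain C.  This is only a userland sketch under assumed names
    (onfault_stack, onfault_push, onfault_unwind, sim_copyin and so on are
    all hypothetical and are not the DragonFly API): each layer pushes its
    own cleanup, and a fault runs the cleanups innermost first.

```c
#include <assert.h>
#include <stddef.h>

#define ONFAULT_MAX 4			/* hypothetical nesting limit */

typedef void (*restore_fn)(void *arg);

struct onfault_frame { restore_fn fn; void *arg; };
struct onfault_stack { struct onfault_frame frames[ONFAULT_MAX]; int depth; };

static void
onfault_push(struct onfault_stack *s, restore_fn fn, void *arg)
{
	assert(s->depth < ONFAULT_MAX);
	s->frames[s->depth].fn = fn;
	s->frames[s->depth].arg = arg;
	s->depth++;
}

static void
onfault_pop(struct onfault_stack *s)
{
	assert(s->depth > 0);
	s->depth--;
}

/* What a fault handler would do: run every pushed cleanup, LIFO. */
static void
onfault_unwind(struct onfault_stack *s)
{
	while (s->depth > 0) {
		struct onfault_frame *f = &s->frames[--s->depth];

		f->fn(f->arg);
	}
}

/* Trace buffer so the order of cleanups can be observed. */
struct trace { char log[8]; int n; };

static void note(struct trace *t, char c) { if (t->n < 8) t->log[t->n++] = c; }
static void fp_restore(void *arg) { note(arg, 'F'); }	  /* release FP unit */
static void copyin_restore(void *arg) { note(arg, 'E'); } /* report the error */

/* Stand-in for the shared optimized copy; 'fault' simulates a bad address. */
static int
sim_bcopy(struct onfault_stack *s, struct trace *t, int fault)
{
	onfault_push(s, fp_restore, t);
	if (fault)
		return -1;		/* frames stay pushed for the unwind */
	onfault_pop(s);
	return 0;
}

/* Stand-in for copyin: pushes its own cleanup, then calls the shared copy. */
static int
sim_copyin(struct onfault_stack *s, struct trace *t, int fault)
{
	onfault_push(s, copyin_restore, t);
	if (sim_bcopy(s, t, fault) != 0) {
		onfault_unwind(s);
		return -1;
	}
	onfault_pop(s);
	return 0;
}
```

    In this model a simulated fault inside sim_bcopy() runs fp_restore
    first and copyin_restore second, and leaves the stack empty.  A real
    kernel would do the unwind from the trap handler rather than from the
    caller.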

    Caveats on doing any of this for FreeBSD:  The FP code in FreeBSD is
    extremely fragile, as I found to my horror when I first tried
    modifying it.  In fact, I think there may still be interrupt and/or
    cpu migration races in the current FreeBSD FP borrowing.  If anyone
    in FBSDland intends to make these changes, I recommend doing it one
    piece at a time, one commit per week:

    - week1: Redo the onfault API to allow the individual copy routines
	     to push their own restore function in a stackable manner.
	     That way copyin/copyout can push its restore function, then
	     call a general optimized bcopy routine which pushes ITS 
	     restore function (to clean up the FP unit).  i.e. onfault
	     failure handling can now be stacked in DFly which allows us
	     to use the FP optimized bcopy code for copyin/copyout.
	     (refer to the DFly codebase for how to do this).

    - week2: make the save area a pointer instead of fixed in pcb.  Just
	     point it at the PCB for this commit.  I recommend putting
	     the save pointer in the machine dependent thread (per-thread)
	     structure and not embedding it in the PCB.  Make sure fork()
	     does the right thing.

    - week3: change the existing FP optimized code to use the new pointer
	     method instead of the exchange-save-area method (create a 
	     fixed save area in the per-cpu data structure, do not allocate
	     the 512 bytes required for fxsave on the stack).  Keep the
	     global kernel-is-using-fp bit (make it per-cpu), and pin the cpu
	     for the duration of the FP copy.

    - week4: (rest)
	     give the last set of changes 2 weeks to settle and do intensive
	     testing to make sure there aren't any leaks, because a mistake
	     here can cause filesystem corruption.

    - week5: change the FP copy requirements to require FXSAVE/FXRSTOR and
	     adjust the existing FP copy code to use FXSAVE/FXRSTOR instead
	     of fnsave/frstor.

    - week6: rip out the old FP copy code and replace it with the new MMX/XMM
	     code.  Rip out the old integer copy code and just use a good
	     solid integer copy algorithm, such as ours, that works well with
	     586 and later cpus.  (import the DFLY FP copy core here.  It
	     would be the absolute last step).
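
    As a rough picture of where weeks 2 and 3 end up, here is a small C
    model.  It is one possible reading of the steps above, not the actual
    DragonFly or FreeBSD code: every name in it (md_thread, percpu,
    kern_fp_begin, kern_fp_end) is hypothetical, no real FP instructions
    are issued, and the caller is assumed to be pinned to the cpu already.

```c
#include <assert.h>
#include <stddef.h>

#define FXSAVE_SIZE 512			/* size of an FXSAVE image */

struct fpsave { unsigned char image[FXSAVE_SIZE]; };

/* Hypothetical machine-dependent per-thread data (week 2). */
struct md_thread {
	struct fpsave  pcb_save;	/* the user FP state lives here */
	struct fpsave *save_ptr;	/* where a save would currently go */
};

/* Hypothetical per-cpu data (week 3). */
struct percpu {
	struct fpsave kern_save;	/* fixed area, not on the stack */
	int	      kern_fp_in_use;	/* per-cpu "kernel is using FP" bit */
};

/* Week 2: the pointer exists but simply points at the PCB area. */
static void
md_thread_init(struct md_thread *td)
{
	td->save_ptr = &td->pcb_save;
}

/*
 * Week 3: claim the FP unit for a kernel copy.  Returns 1 on success,
 * 0 if the unit is already in use on this cpu (the caller would then
 * fall back to the integer copy).  Only a pointer is redirected; no
 * 512-byte save area is copied or exchanged.
 */
static int
kern_fp_begin(struct percpu *cpu, struct md_thread *td)
{
	if (cpu->kern_fp_in_use)
		return 0;
	cpu->kern_fp_in_use = 1;
	td->save_ptr = &cpu->kern_save;
	return 1;
}

/* Release the FP unit and point the thread back at its own save area. */
static void
kern_fp_end(struct percpu *cpu, struct md_thread *td)
{
	assert(cpu->kern_fp_in_use);
	td->save_ptr = &td->pcb_save;
	cpu->kern_fp_in_use = 0;
}
```

    The point of the model is only that begin/end touch a pointer and a
    flag.  Interrupt handling, cpu pinning and fork() are the hard parts
    and are left out entirely.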

    p.s. and if you need a kick in the pants, our PIPE code and any
    medium-sized block copies from the filesystem cache (e.g. using dd),
    which basically just tests copyin/copyout/bcopy performance, beats
    the crap out of FreeBSD-5 now on P4's and Athlon64/Opterons.  Both
    are able to take advantage of the MMX/XMM optimized copies, especially
    due to the far lower FP setup overhead our kernel has now due to the
    pointer save area change and other things.

	(after a few runs to pre-cache)

dhcp62# dd if=test.dat bs=32k | cat > /dev/null
335544320 bytes transferred in 0.570208 secs (588459681 bytes/sec) (DRAGONFLY)
dhcp61# dd if=test.dat bs=32k | cat > /dev/null
335544320 bytes transferred in 0.901231 secs (372317753 bytes/sec) (FREEBSD-5)

dhcp62# dd if=test.dat of=/dev/null bs=32k
335544320 bytes transferred in 0.283803 secs (1182313275 bytes/sec) (DRAGONFLY)
dhcp61# dd if=test.dat of=/dev/null bs=32k
335544320 bytes transferred in 0.378349 secs  (886864966 bytes/sec) (FREEBSD-5)

    (with witness turned off in FreeBSD-5, so you don't get that cop-out.
    With witness turned on the results are so horrible I won't even bother
    pasting them in, to save you guys the embarrassment).

    That is what being able to use an XMM based copy for copyin/copyout gives
    you.  I think it is well worth the effort, but a *lot* of effort is
    required if FreeBSD wants to do it right.  It took me three weeks to get
    it right in DragonFly working nearly full time, but you would have the
    advantage of learning from all my mistakes :-).

						-Matt