New TCP/IP checksum code and a HOWTO on how to modernize and fix
FreeBSD's FP-unit use in the kernel (was Re: ether_crc32_[bl]e())
Matthew Dillon
dillon at apollo.backplane.com
Fri Jun 4 12:56:38 PDT 2004
:Is someone interested in improving our IP checksum code too?
:
:On i386 it uses assembly language which "works ok" with gcc 3.x (so
:far), but it isn't guaranteed it will work with future versions of gcc.
:Intel's C compiler already has problems with it (and it's verified it's
:because of bugs in the asm code), so in case of the use of icc a C
:version is used.
:
:All other architectures use a C version ("MD" code, even if it could be
:made MI... at least it could be shared between the big and little endian
:architectures).
:
:Matthew Dillon rewrote the IP checksum code in dragonfly:
:---snip---
: Modified files:
: sys/conf files.i386
: sys/i386/i386 in_cksum.c
: sys/i386/include in_cksum.h
: sys/netinet igmp.c in.h ip_icmp.c
: Added files:
: sys/i386/i386 in_cksum2.s
: Log:
: Rewrite the IP checksum code. Get rid of all the inline assembly garbage,
: get rid of old APIs that are no longer used, and build a new 'core' checksum
:...
I should add that I finally got tired of the original checksum code
failing at high -On optimization levels with GCC2 and GCC3, not to
mention the ridiculously unreadable source that tried to optimize down
to the byte, which is just dumb. That is why I rewrote it.
At this point the code has been in our tree for quite a while, with no
complaints, so I would regard it as 'extremely well tested'.
I strongly recommend that FreeBSD adopt either this code or the core of
this code and scrap that awful C-hybrid-inline-assembly junk.
Also, final note: remember that IP/TCP checksums are 1's complement
checksums, which means that they are byte-order-agnostic (the byte order
of the result will be the same as the byte order of the data, and the
result will be correct, so if the original data is in network byte order,
the resulting 1's complement checksum using normal (non translating)
instructions will also be in network byte order). This is why one can
simply use adcl instructions on an Intel/AMD cpu. In fact, it might be
possible to use 128-bit media instructions but that's probably overkill.
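
To make the byte-order point concrete, here is a minimal portable C sketch
(not the DragonFly code, and not tuned in any way). It sums 16-bit words
loaded in the host's native order, folds the carries back in, and stores
the result with no byte swapping. The names cksum_native and cksum_store
are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative only: 1's complement sum over 16-bit words loaded in the
 * host's native byte order, with no byte swapping anywhere.
 */
static uint16_t
cksum_native(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	uint16_t w;

	while (len >= 2) {
		memcpy(&w, p, 2);	/* native-order load, alignment-safe */
		sum += w;
		p += 2;
		len -= 2;
	}
	if (len == 1) {			/* odd trailing byte, zero padded */
		unsigned char tail[2] = { p[0], 0 };
		memcpy(&w, tail, 2);
		sum += w;
	}
	while (sum >> 16)		/* end-around carry */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Store the result exactly as it sits in memory; no htons() is applied. */
static void
cksum_store(unsigned char out[2], const void *buf, size_t len)
{
	uint16_t c = cksum_native(buf, len);

	memcpy(out, &c, 2);
}
```

For the 20-byte IPv4 header 45 00 00 73 00 00 40 00 40 11 00 00 c0 a8 00 01
c0 a8 00 c7 (checksum field zeroed), cksum_store leaves the bytes b8 61 in
out on both little-endian and big-endian hosts, which is already the
network-order checksum.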
--
You might also want to look at our new MMX/XMM optimized
bcopy/copyin/copyout. That was a lot harder to get right (and, most
especially, it was a lot harder to make the FP state in kernel mode
be properly saved and restored). I lost a few filesystems on my
test box getting the code right :-). You would not be able to copy it
directly since our FP state handling is very different from FreeBSD's
now (which I will describe below), but you ought to be able to use the
core MMX code.
Right now FreeBSD is using old FP-stack instructions. This runs about
as fast as MMX on an Athlon but, generally speaking, it is a decrepit,
obsolete use of the FP unit.
On DFly I made the following changes:
* I implemented the comment in the old FBSD code (I think DG or Dyson
made the comment) about having a separate FP save area pointer. This
allows the kernel to use the FP unit trivially rather than having
to copy/restore the user process's FP save area. This saves an ungodly
number of cycles in the copy path and greatly simplifies the ability
of the kernel to use the FP unit.
Our FP copy code's overhead is now low enough that we can use the FP unit
for copies half the size as on FreeBSD and it will still beat an
integer copy, and XMM copies are much, much faster (esp on
Athlon64's and Opterons) versus the old FBSD code.
* DFly guarantees that if the FP unit is marked unused, the FP state is
such that no fninit is required prior to using FP instructions. This
saves ~50+ cycles in the best-case copy path.
* FP-in-use-by-kernel is a per-cpu bit.
* DFly does not try to optimize copies on pre-fxsave machines. The
minimum required support is FXSAVE + MMX now. If XMM is available
(SSE2), then 128 bit media instructions will be used. I saw no point
in retaining code that was only just a bit faster than the integer
code on old machines.
* I scrapped the old integer copy code as being too complex and
rewrote it using a middle-of-the-road integer copy (rather than
having umpteen versions of integer copy).
* If you attempt to use more of our code, remember that DFly does not
preemptively migrate threads across cpus so our code doesn't have
to worry about that.
* I scrapped the separately-optimized copyin/copyout code and wrote a
more generic pcb_oncall capability that allows the copy routine itself
to push the restoration function on its stack, so the same optimized
copy code is now used for ALL copies (memcpy, bcopy, copyin, copyout).
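
As a rough illustration of the save-area-pointer idea in the first bullet,
here is a toy userspace model in C. It is not DragonFly or FreeBSD code:
every name in it (toy_thread, toy_cpu, toy_kern_fp_begin and so on) is
hypothetical, the "FP state" is just a 512-byte buffer, and it skips the
step of saving any live user registers first. It only shows that borrowing
the FP unit becomes a pointer swap instead of a 512-byte copy out and back.

```c
#include <string.h>

#define FPSAVE_SIZE 512		/* size of an fxsave image */

struct fpsave {
	unsigned char img[FPSAVE_SIZE];
};

struct toy_thread {
	struct fpsave user_area;	/* the thread's normal save area */
	struct fpsave *save;		/* where a switch would save FP state */
};

struct toy_cpu {
	struct fpsave kern_area;	/* fixed per-cpu area for kernel FP use */
	int kern_fp_active;		/* per-cpu "kernel is using FP" flag */
};

static void
toy_thread_init(struct toy_thread *td)
{
	memset(&td->user_area, 0, sizeof(td->user_area));
	td->save = &td->user_area;
}

/* Begin kernel FP use: redirect the pointer, no 512-byte copy. */
static struct fpsave *
toy_kern_fp_begin(struct toy_cpu *cpu, struct toy_thread *td)
{
	struct fpsave *old = td->save;

	td->save = &cpu->kern_area;
	cpu->kern_fp_active = 1;
	return old;
}

/* End kernel FP use: point back at whatever was there before. */
static void
toy_kern_fp_end(struct toy_cpu *cpu, struct toy_thread *td,
    struct fpsave *old)
{
	cpu->kern_fp_active = 0;
	td->save = old;
}

/* Stand-in for a context switch dumping live FP registers via td->save. */
static void
toy_switch_save(struct toy_thread *td, unsigned char fill)
{
	memset(td->save->img, fill, FPSAVE_SIZE);
}
```

In this model a switch that happens while the kernel is borrowing the FP
unit writes into the per-cpu area and leaves the user's saved state alone.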
Caveats on doing any of this for FreeBSD: The FP code in FreeBSD is
extremely fragile, as I found to my horror when I first tried
modifying it. In fact, I think there may still be interrupt and/or
cpu migration races in the current FreeBSD FP borrowing. If anyone
in FBSDland intends to make these changes, I recommend doing it one
piece at a time, one commit per week:
- week1: Redo the onfault API to allow the individual copy routines
to push their own restore function in a stackable manner.
That way copyin/copyout can push its restore function, then
call a general optimized bcopy routine which pushes ITS
restore function (to clean up the FP unit). i.e. onfault
failure handling can now be stacked in DFly which allows us
to use the FP optimized bcopy code for copyin/copyout.
(refer to the DFly codebase for how to do this).
- week2: make the save area a pointer instead of fixed in pcb. Just
point it at the PCB for this commit. I recommend putting
the save pointer in the machine dependent thread (per-thread)
structure and not embedding it in the PCB. Make sure fork()
does the right thing.
- week3: change the existing FP optimized code to use the new pointer
method instead of the exchange-save-area method (create a
fixed save area in the per-cpu data structure, do not allocate
the 512 bytes required for fxsave on the stack). Keep the
global kernel-is-using-fp bit (make it per-cpu), and pin the cpu
for the duration of the FP copy.
- week4: (rest)
give the last set of changes 2 weeks to settle and do intensive
testing to make sure there aren't any leaks, because a mistake
here can cause filesystem corruption.
- week5: change the FP copy requirements to require FXSAVE/FXRSTOR and
adjust the existing FP copy code to use FXSAVE/FXRSTOR instead
of fnsave/frstor.
- week6: rip out the old FP copy code and replace it with the new MMX/XMM
code. Rip out the old integer copy code and just use a good
solid integer copy algorithm as we have that works well with
586 and later cpus. (import the DFLY FP copy core here. It
would be the absolute last step).
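
The stackable onfault handling from week1 can be pictured with a small
userspace simulation in C. This is a sketch under heavy assumptions, not
the DFly API: the names (onfault_push, onfault_unwind, sim_copyin,
sim_bcopy, SIM_EFAULT) are invented here, and a "fault" is just a flag
passed in by the caller rather than a real trap. What it does show is the
ordering: the inner bcopy handler cleans up the FP unit first, then the
outer copyin handler turns the fault into an error return.

```c
#include <stddef.h>
#include <string.h>

#define ONFAULT_MAX 4
#define SIM_EFAULT 14		/* stand-in error code for this sketch */

typedef void (*onfault_fn)(void *arg);

struct onfault_stack {
	struct { onfault_fn fn; void *arg; } ent[ONFAULT_MAX];
	int depth;
};

struct sim_state {
	struct onfault_stack stk;
	int fp_in_use;
	int error;
	char order[8];		/* handler order: 'F' = FP, 'C' = copyin */
	int n;
};

static int
onfault_push(struct onfault_stack *s, onfault_fn fn, void *arg)
{
	if (s->depth >= ONFAULT_MAX)
		return -1;
	s->ent[s->depth].fn = fn;
	s->ent[s->depth].arg = arg;
	s->depth++;
	return 0;
}

static void
onfault_pop(struct onfault_stack *s)
{
	if (s->depth > 0)
		s->depth--;
}

/* What the simulated trap handler does: run handlers innermost first. */
static void
onfault_unwind(struct onfault_stack *s)
{
	while (s->depth > 0) {
		s->depth--;
		s->ent[s->depth].fn(s->ent[s->depth].arg);
	}
}

static void
note(struct sim_state *st, char c)
{
	if (st->n < (int)sizeof(st->order) - 1)
		st->order[st->n++] = c;
}

static void
fp_restore(void *arg)
{
	struct sim_state *st = arg;

	st->fp_in_use = 0;	/* release the (pretend) FP unit */
	note(st, 'F');
}

static void
copy_fail(void *arg)
{
	struct sim_state *st = arg;

	st->error = SIM_EFAULT;
	note(st, 'C');
}

/* Generic copy: pushes ITS handler. Returns 0, or -1 after unwinding. */
static int
sim_bcopy(struct sim_state *st, void *dst, const void *src, size_t len,
    int fault)
{
	if (onfault_push(&st->stk, fp_restore, st) != 0) {
		onfault_unwind(&st->stk);	/* no room: fail like a fault */
		return -1;
	}
	st->fp_in_use = 1;
	if (fault) {
		onfault_unwind(&st->stk);
		return -1;
	}
	memcpy(dst, src, len);
	st->fp_in_use = 0;
	onfault_pop(&st->stk);
	return 0;
}

/* copyin: pushes its own handler, then reuses the generic copy. */
static int
sim_copyin(struct sim_state *st, void *dst, const void *src, size_t len,
    int fault)
{
	st->error = 0;
	if (onfault_push(&st->stk, copy_fail, st) != 0)
		return SIM_EFAULT;
	if (sim_bcopy(st, dst, src, len, fault) == 0)
		onfault_pop(&st->stk);
	return st->error;
}
```

A real kernel would get here from the trap handler and would not return
through the faulting code, so treat the control flow above as a stand-in
for the handler ordering only.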
p.s. and if you need a kick in the pants, our PIPE code and any
medium-sized block copies from the filesystem cache (e.g. using dd),
which basically just tests copyin/copyout/bcopy performance, beats
the crap out of FreeBSD-5 now on P4's and Athlon64/Opterons. Both
are able to take advantage of the MMX/XMM optimized copies, especially
due to the far lower FP setup overhead our kernel has now due to the
pointer save area change and other things.
(after a few runs to pre-cache)
dhcp62# dd if=test.dat bs=32k | cat > /dev/null
335544320 bytes transferred in 0.570208 secs (588459681 bytes/sec) (DRAGONFLY)
dhcp61# dd if=test.dat bs=32k | cat > /dev/null
335544320 bytes transferred in 0.901231 secs (372317753 bytes/sec) (FREEBSD-5)
dhcp62# dd if=test.dat of=/dev/null bs=32k
335544320 bytes transferred in 0.283803 secs (1182313275 bytes/sec) (DRAGONFLY)
dhcp61# dd if=test.dat of=/dev/null bs=32k
335544320 bytes transferred in 0.378349 secs (886864966 bytes/sec) (FREEBSD-5)
(with witness turned off in FreeBSD-5, so you don't get that cop-out.
With witness turned on the results are so horrible I won't even bother
pasting them in, to save you guys the embarrassment).
That is what being able to use an XMM based copy for copyin/copyout gives
you. I think it is well worth the effort, but a *lot* of effort is
required if FreeBSD wants to do it right. It took me three weeks to get
it right in DragonFly working nearly full time, but you would have the
advantage of learning from all my mistakes :-).
-Matt
More information about the freebsd-arch
mailing list