Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)
bde at zeta.org.au
Thu Mar 27 22:20:57 PST 2003
On Thu, 27 Mar 2003, Nate Lawson wrote:
> On Thu, 27 Mar 2003, Bruce Evans wrote:
> > On Wed, 26 Mar 2003, Mike Silbersack wrote:
> > > On Wed, 26 Mar 2003, Nate Lawson wrote:
> > > > I don't want to hijack the thread too much, but has thought gone into a
> > > > combined checksum and copy function? The first mention I can remember of
> > > > this is in RFC 817 p. 19-20.
> > Is this RFC old? Combined checksum and copy hasn't been a large
> > optimization since L1 caches became large enough, since to a first
> > approximation, everything is dominated by memory bandwidth and another
> > pass to calculate the checksum is free because copying left all the
> > data in the L1 cache.
> Yes, the RFC is old. However, there still may be performance advantages
> in ILP because while the data is being fetched the first time (for the
> copy), idle slots can be filled with an incremental checksum update.
I'm sure there are some advantages on some CPUs but doubt that they are
significant. I'll send some old code for filling pipelines in in_cksum() on
Pentium I's to a trimmed Cc list in separate mail. I never committed this
because the improvement was marginal on Pentium I's, and memory has become
slower relative to CPUs since Pentium I's were new.
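
For readers unfamiliar with the idea under discussion, here is a minimal
sketch of a combined checksum and copy in portable C. It is not the code
mentioned above and not anything in the FreeBSD tree; the name
copy_and_cksum() is hypothetical, and the checksum is assumed to be the usual
16-bit ones'-complement Internet checksum. Each byte is loaded once, stored
to the destination, and added into the running sum in the same pass.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): copy len bytes from src to dst
 * and return the 16-bit ones'-complement Internet checksum of the data.
 * Bytes are paired in network (big-endian) order.
 */
static uint16_t
copy_and_cksum(void *dst, const void *src, size_t len)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	uint64_t sum = 0;	/* wide accumulator, so carries are not lost */

	while (len >= 2) {
		/* Store and accumulate while the data is in registers. */
		d[0] = s[0];
		d[1] = s[1];
		sum += ((uint32_t)s[0] << 8) | s[1];
		s += 2;
		d += 2;
		len -= 2;
	}
	if (len == 1) {
		/* Odd trailing byte is padded with a zero low byte. */
		d[0] = s[0];
		sum += (uint32_t)s[0] << 8;
	}
	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Whether this beats a bcopy() followed by a separate in_cksum() pass depends
on the CPU and on whether the data is already in the L1 cache, which is
exactly the point being argued in this thread.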
> > > Heh, I don't think anyone has. What actually would make sense is for
> > > someone who feels like doing ASM timing to look at our bcopy routines /
> > > etc.
> > I spent a lot of time on this about 7 years ago. See ~bde/cache on
> > freefall for old versions of programs that try lots of different
> > copy/read/write checksum methods. Better hardware made the differences
> > between various methods relatively small. One can probably do better
> > (50%?) for largish (1K+ ?) buffers using SSE instructions on i386's
> > now.
> We definitely should have an SSE version for P3+. The 128 bit
> instructions make a big difference. And for checksumming, you can do 8
> packed adds at once.
Is it 8 * 128 bits at once? 8-way superscalar must be on the horizon if
not routine now. What is the state of the art for keeping 8 ALUs fed with
data (assuming that all the data is in the cache)?
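
One common way to give several ALUs independent work, shown here only as a
sketch in portable C and not as an answer for any particular CPU, is to keep
several accumulators with no data dependency between them and combine them at
the end. The function name cksum_unrolled() and the choice of four
accumulators are assumptions for illustration; an SSE version would instead
hold several 16-bit lanes in one 128-bit register.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration: ones'-complement checksum over nwords 16-bit
 * words using four independent accumulators. The adds in the main loop do
 * not depend on each other, so a superscalar CPU may issue them in parallel.
 * The result is in the same byte order as the input words.
 */
static uint16_t
cksum_unrolled(const uint16_t *w, size_t nwords)
{
	uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0, sum;
	size_t i = 0;

	for (; i + 4 <= nwords; i += 4) {
		a0 += w[i];
		a1 += w[i + 1];
		a2 += w[i + 2];
		a3 += w[i + 3];
	}
	/* Leftover words that do not fill a group of four. */
	for (; i < nwords; i++)
		a0 += w[i];

	/* Combine the partial sums, then fold the carries to 16 bits. */
	sum = a0 + a1 + a2 + a3;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

This works because ones'-complement addition is associative and commutative,
so the order in which the words are summed does not change the result.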