A slight change to tcpip_fillheaders...

Thu Jun 3 08:45:27 UTC 2010

On Wed, 2 Jun 2010, George Neville-Neil wrote:

> A while back another src developer mentioned that he had gotten better performance by changing
> tcpip_fillheaders() in the following way:

It's unlikely to make any significant or even measurable difference,
but...  I got apparent large (up to 20%) timing improvments in networking
by similar changes (mostly going the other way and avoiding and/or
bzero()), but further investigation showed that the improvements were
very mysterious and mostly unrelated to my code changes -- equal
improvements could be obtained by adding padding or reducing code size
in unrelated code.  This might be explained by thrashing or avoiding
thrashing of cache(s), but the effect was too large to easily explain,
and too mysterious to easily obtain or avoid.

> Index: tcp_subr.c
> ===================================================================
> --- tcp_subr.c	(revision 209083)
> +++ tcp_subr.c	(working copy)
> @@ -392,28 +392,19 @@
> 		struct ip *ip;
>
> 		ip = (struct ip *)ip_ptr;
> +		bzero(ip, sizeof(*ip));
> 		ip->ip_v = IPVERSION;
> 		ip->ip_hl = 5;
> 		ip->ip_tos = inp->inp_ip_tos;
> -		ip->ip_len = 0;
> -		ip->ip_id = 0;
> -		ip->ip_off = 0;
> 		ip->ip_ttl = inp->inp_ip_ttl;
> -		ip->ip_sum = 0;
> 		ip->ip_p = IPPROTO_TCP;
> 		ip->ip_src = inp->inp_laddr;
> 		ip->ip_dst = inp->inp_faddr;
> 	}
> +	bzero(th, sizeof(*th));
> 	th->th_sport = inp->inp_lport;
> 	th->th_dport = inp->inp_fport;
> -	th->th_seq = 0;
> -	th->th_ack = 0;
> -	th->th_x2 = 0;
> 	th->th_off = 5;
> -	th->th_flags = 0;
> -	th->th_win = 0;
> -	th->th_urp = 0;
> -	th->th_sum = 0;		/* in_pseudo() is called later for ipv4 */
> }
>
> /*

The version with bzero() is a small or null pessimization assuming
that the compiler is infinitiely smart.  If all the fields are initialized
by both versions, then the result is the same, so the compiler may
produce the same code for each.  The version with bzero() obvious
initializes all the fields, but might not.  Howver, bzero() is extern
in FreeBSD, and gcc doesn't attempt optimizations like inlining it
despite it being extern, so the version with bzero() is fundamentally
slower in theory.  My changes initially involved implementing  all
bzero() with small fixed size using __builtin_memset().  gcc will
inline these, hopefully using writes of a register or immediate constant
value of 0.  Many i386 versions of gcc are stupid about this and use
the slow i386 string instructions instead, but most use separate writes
of 0 if the length is small, and I try to choose the small size so
small that the separate writes are used.  For the version with the
explicit separate writes of 0, many of the fields are smaller than
the register size, but none are volatile so the compiler is free to
combine the writes.  gcc does optimizations like this.  I don't know
if it actually does them all or if it is inhibited by the writes of
nonzero for a few fields (C requires write ordering to appear to be
strict, but hopefully there aren't any aliasing problems in the above
which require it to actually be strict).

> I have tried this change with NetPIPE (NPtcp -b 100000) on a pair of machines using Intel igb devices and found
> that it provides no improvement, but I am wondering if other people want to try this and
> see if it improves throughput at all.  I was testing this on a Nehalem class machine, not sure if it
> might help on other architectures.

And don't forget, for such a change to be good, it needs to be good on most
supported machines and not too bad on other supported machines.  This is
especially hard to even test for changes related to memory and caches.

Bruce