A slight change to tcpip_fillheaders...

George Neville-Neil gnn at freebsd.org
Thu Jun 3 14:31:57 UTC 2010


On Jun 3, 2010, at 04:45 , Bruce Evans wrote:
> 
> On Wed, 2 Jun 2010, George Neville-Neil wrote:
> 
>> A while back another src developer mentioned that he had gotten better performance by changing
>> tcpip_fillheaders() in the following way:
> 
> It's unlikely to make any significant or even measurable difference,
> but...  I got apparent large (up to 20%) timing improvments in networking
> by similar changes (mostly going the other way and avoiding and/or
> bzero()), but further investigation showed that the improvements were
> very mysterious and mostly unrelated to my code changes -- equal
> improvements could be obtained by adding padding or reducing code size
> in unrelated code.  This might be explained by thrashing or avoiding
> thrashing of cache(s), but the effect was too large to easily explain,
> and too mysterious to easily obtain or avoid.
> 
>> Index: tcp_subr.c
>> ===================================================================
>> --- tcp_subr.c	(revision 209083)
>> +++ tcp_subr.c	(working copy)
>> @@ -392,28 +392,19 @@
>> 		struct ip *ip;
>> 
>> 		ip = (struct ip *)ip_ptr;
>> +		bzero(ip, sizeof(*ip));
>> 		ip->ip_v = IPVERSION;
>> 		ip->ip_hl = 5;
>> 		ip->ip_tos = inp->inp_ip_tos;
>> -		ip->ip_len = 0;
>> -		ip->ip_id = 0;
>> -		ip->ip_off = 0;
>> 		ip->ip_ttl = inp->inp_ip_ttl;
>> -		ip->ip_sum = 0;
>> 		ip->ip_p = IPPROTO_TCP;
>> 		ip->ip_src = inp->inp_laddr;
>> 		ip->ip_dst = inp->inp_faddr;
>> 	}
>> +	bzero(th, sizeof(*th));
>> 	th->th_sport = inp->inp_lport;
>> 	th->th_dport = inp->inp_fport;
>> -	th->th_seq = 0;
>> -	th->th_ack = 0;
>> -	th->th_x2 = 0;
>> 	th->th_off = 5;
>> -	th->th_flags = 0;
>> -	th->th_win = 0;
>> -	th->th_urp = 0;
>> -	th->th_sum = 0;		/* in_pseudo() is called later for ipv4 */
>> }
>> 
>> /*
> 
> The version with bzero() is a small or null pessimization assuming
> that the compiler is infinitiely smart.  If all the fields are initialized
> by both versions, then the result is the same, so the compiler may
> produce the same code for each.  The version with bzero() obvious
> initializes all the fields, but might not.  Howver, bzero() is extern
> in FreeBSD, and gcc doesn't attempt optimizations like inlining it
> despite it being extern, so the version with bzero() is fundamentally
> slower in theory.  My changes initially involved implementing  all
> bzero() with small fixed size using __builtin_memset().  gcc will
> inline these, hopefully using writes of a register or immediate constant
> value of 0.  Many i386 versions of gcc are stupid about this and use
> the slow i386 string instructions instead, but most use separate writes
> of 0 if the length is small, and I try to choose the small size so
> small that the separate writes are used.  For the version with the
> explicit separate writes of 0, many of the fields are smaller than
> the register size, but none are volatile so the compiler is free to
> combine the writes.  gcc does optimizations like this.  I don't know
> if it actually does them all or if it is inhibited by the writes of
> nonzero for a few fields (C requires write ordering to appear to be
> strict, but hopefully there aren't any aliasing problems in the above
> which require it to actually be strict).
> 
>> I have tried this change with NetPIPE (NPtcp -b 100000) on a pair of machines using Intel igb devices and found
>> that it provides no improvement, but I am wondering if other people want to try this and
>> see if it improves throughput at all.  I was testing this on a Nehalem class machine, not sure if it
>> might help on other architectures.
> 
> And don't forget, for such a change to be good, it needs to be good on most
> supported machines and not too bad on other supported machines.  This is
> especially hard to even test for changes related to memory and caches.

For what it's worth I checked the assembly for both versions as well.  The bzero
version does not inline, as you said, and the original does do a move of
0 for each and every field, again on Nehalem with our default version of
gcc.

I think that for now I will leave this alone, the code is clear either way,
and what I cared about was finding out if the code could be sped up.

Best,
George



More information about the freebsd-net mailing list