svn commit: r243594 - head/sys/netinet

Fri Dec 7 22:21:45 UTC 2012

Andre,

I wrote a test program that hoists the hash logic outside of the kernel.

There may be bugs in it; it was written quickly this AM before coffee.

What the program does is allocate a number of "fake sockets", set them 
all up to run against port 80, then inline the netinet hash function.

It then iterates looking up all the connections.

It outputs stats about the amount of memory used/wasted.

Please do have a look at both this output AND my code.  I'm more worried 
about my code because of lack of coffee (you should be too :))

My program is here:
http://mu.org/~bright/tcphash.tgz

Just run "make" then you can run it, usage is:
   arg1 maxsockets
   arg2 number of sockets per hashhead
   arg3 number of times to iterate the lookup stress test.

Here's a sample output.  It reports memory usage for the various table 
sizes; check the last line of each run for timing.

The sweet spot really appears to be 4.  This gives an overhead of 2 bytes 
per socket, which rounds to 0.00% overall wastage, and it is only about 
twice as slow as the 1:1 mapping (34.483 seconds versus 17.682 seconds):
+-zsh:87> i=4
+-zsh:87> ./subr_hash 10000000 4 5
sizeof(freebsd_socket) = 680, sizeof(hashhead) = 8
allocated 2097152 hashheads for 10000000 sockets.
memory wastage: %0.00 (144008 bytes out of 6816777216)
hash wastage: %0.86 (144008 bytes) (18001 wasted buckets out of 2097152 entries)
....
./subr_hash 10000000 $i 5  30.85s user 0.51s system 90% cpu 34.483 total

The rest of the output is here:

~/tcphash % set -x ; for i in 1 2 4 8 ; do time ./subr_hash 10000000 $i 5 ; done
+-zsh:87> i=1
+-zsh:87> ./subr_hash 10000000 1 5
sizeof(freebsd_socket) = 680, sizeof(hashhead) = 8
allocated 8388608 hashheads for 10000000 sockets.
memory wastage: %0.30 (20372816 bytes out of 6867108864)
hash wastage: %30.36 (20372816 bytes) (2546602 wasted buckets out of 8388608 entries)
0
1
2
3
4
./subr_hash 10000000 $i 5  15.80s user 0.49s system 92% cpu 17.682 total
+-zsh:87> i=2
+-zsh:87> ./subr_hash 10000000 2 5
sizeof(freebsd_socket) = 680, sizeof(hashhead) = 8
allocated 4194304 hashheads for 10000000 sockets.
memory wastage: %0.05 (3090992 bytes out of 6833554432)
hash wastage: %9.21 (3090992 bytes) (386374 wasted buckets out of 4194304 entries)
0
1
2
3
4
./subr_hash 10000000 $i 5  20.69s user 0.44s system 94% cpu 22.443 total
+-zsh:87> i=4
+-zsh:87> ./subr_hash 10000000 4 5
sizeof(freebsd_socket) = 680, sizeof(hashhead) = 8
allocated 2097152 hashheads for 10000000 sockets.
memory wastage: %0.00 (144008 bytes out of 6816777216)
hash wastage: %0.86 (144008 bytes) (18001 wasted buckets out of 2097152 entries)
0
1
2
3
4
./subr_hash 10000000 $i 5  30.85s user 0.51s system 90% cpu 34.483 total
+-zsh:87> i=8
+-zsh:87> ./subr_hash 10000000 8 5
sizeof(freebsd_socket) = 680, sizeof(hashhead) = 8
allocated 1048576 hashheads for 10000000 sockets.
memory wastage: %0.00 (704 bytes out of 6808388608)
hash wastage: %0.01 (704 bytes) (88 wasted buckets out of 1048576 entries)
0
1
2
3
4
./subr_hash 10000000 $i 5  52.16s user 0.74s system 86% cpu 1:01.51 total

~~~~~~~~~~~~~~~~~~~~~

About the "it's good enough" mentality.

Imagine yourself as a CTO/founder of a company.  You decide to check out 
FreeBSD because you heard it's so great, so you download it, install it 
on your 256GB server and blast it with connections.  You then perform 
the same test on Linux.

Linux out of the box does better.

Why would you take the time to tune FreeBSD?  It makes no sense.

It's the same thing as going to two car dealerships and trying two cars, 
with the requirement that you want the fastest car because you have to 
race TOMORROW.  You don't get the car that "might be faster, if you just 
took some time to tune it"; you just pick the faster car.

I am in no way trying to stomp on embedded, but people in embedded are 
EXPERTS for the most part anyhow.  (Or maybe we need an option EMBEDDED 
which sacrifices performance for space from the get-go to make it easier 
for those people.)  Still, please consider how FreeBSD got to be at 
Yahoo.

David Filo downloaded Linux and FreeBSD.  FreeBSD worked better out of 
the box.  He went with FreeBSD.

~~~~~~~~~~~~~~~~~~~~~

Last thing I will say is about the comments in the code.  For someone 
familiar with TCP such as yourself, this all makes sense at a glance, 
but for someone like me who has to context switch between many areas, 
it's a challenge to keep enough context in my head with the terseness of 
our comments to be effective all around.  If you feel the need to 
correct technical issues in the comments, then by all means do so, but 
honestly I find it dis-inviting to those of us who are not 
networking/tcp/netdev specialists to trim what essentially is 
documentation.  Names like powerof2rdfloor() make sense to you because 
you wrote it, not necessarily to the rest of us that have to come in 
later and figure out what is going on.

Feel free to remove the comments, but it will really just hamper me (and 
other generalists) from pitching in on this area as effectively.

~~~~~~~~~~~~~~~~~~~~~

At this point it's your call; it's your code.  Do what you want to, but 
please take into careful consideration the "out of the box CTO test", 
and how minor the memory overhead of 4 versus 8 is compared to how big 
the difference is in performance.

-Alfred

On 12/7/12 10:54 AM, Andre Oppermann wrote:
> On 07.12.2012 18:31, Alfred Perlstein wrote:
>> It's good to put the power of two thing in its own function.
>>
>> That said, about the comment changes,  hash size changes and auto 
>> tuning changes respectively....
>>
>> Comments: Dude... It's not like removing the helpful comments is 
>> going to speed up either compiling this code or how fast it can 
>> process packets.  Please leave those in.
>
> The comment sentence starting with "previously" is not useful anymore.
> Nobody cares what it was before, it's no longer relevant.  That's why
> such comments belong only in commit messages.
>
>> The hash: Your comment about it not being perfect is incorrect. By 
>> making it 1/8 you guarantee that it can never be perfect when fully 
>> loaded. If it is so important for you to guarantee that the minimum 
>> hash traversal on this hash is at LEAST FOUR when fully loaded then 
>> by all means make the change.  I think it's unbelievably wrong, penny 
>> wise/dollar foolish or maybe bit wise/CPU foolish but for some reason 
>> others agree with you for reasons that still do not make sense to me. 
>> So again by all means pessimize the hash table to save a few bytes.
>
> It's a bit more than a few bytes.  On my modest dev box the TCBHASH at
> 1/8 would be 256kB times two for TCBHASH and PORTHASH.  The hash will
> never be perfect as there isn't a perfect hash function for this
> because the input is unbounded.
>
> If you have more than 32k concurrent connections on a modest server
> then there may be a few hash collisions.  That's fine.  So far we've
> managed to survive with a hash table of only 512 slots.  Having a small
> chain on a hash slot isn't bad.  Those with > 32k concurrent
> connections tend to have a lot more RAM.  At 64GB the hash table will
> be 128k entries (2MB times two).  If you have more than 128k concurrent
> connections you either specifically tune your kernel anyway, or it is
> slightly less efficient than it possibly could be.
>
> The hash table size doesn't limit the number of concurrent connections.
> Only maxsockets does that.  With my other changes the maxsockets limit
> is massively increased (1/8 of physpages) compared to when it was
> dependent on maxusers.
>
> My point is that for the vast majority a smaller hash table
> (maxsockets / 8) is entirely sufficient and certainly better than the
> previous default of 512.  Anything more than that is IMHO excessive for
> the vast majority of users.  While a megabyte isn't that much we don't
> want to waste it too easily either.  There's no need to swing from one
> extreme to another.
>
>> The tuning: Additionally you've removed the informational output that 
>> shows what the input was and what it was changed to. It's as if you 
>> want to make it harder for my techs to figure out what had gone wrong.
>
> It still says "WARNING: TCB hash size not a power of 2, rounded down".
> If the admin can't figure out why his value isn't a power of 2 then all
> bets are off.  You wouldn't tell him either with your message.  He gets
> a warning message and told what happened.  From there he has to look it
> up in /boot/loader.conf anyway.
>
>> Tuning 2: Additionally it appears you've removed the safety net in 
>> this code for clipping it down to min 512.
>
> Nope, it's there: powerof2rdfloor(tcp_tcbhashsize, TCBMINHASHSIZE)
>
>> I don't get why this is such an issue, you do realize this code is 
>> run only once and is not in the critical path?  We should be 
>> optimizing this code for utility, user-friendliness and general 
>> readability, not as if it was part of TCP's fast path.
>
> User friendliness would be to either not have to worry about this at
> all or to update the related and relevant man pages like tuning(7).
> Users are unlikely to even find the right place to look in the
> source code.
>