Followup from Verisign after last week's developer summit

Bentkofsky, Michael MBentkofsky at verisign.com
Thu May 23 16:45:51 UTC 2013


I am adding freebsd-net to this and will re-summarize to get additional input. Thanks for all of the initial suggestions.

For the benefit of those on freebsd-net@, we are seeing significant contention on the V_tcbinfo lock under moderately high connection establishment and teardown rates (around 45-50k connections per second). Our profiling suggests that contention on V_tcbinfo effectively single-threads all TCP connection processing. Similar testing on Linux with equivalent hardware does not show this contention and achieves a much higher connection establishment rate. We can attach profiling and test details if anyone would like.
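For those who have not looked at this code path recently, a much-simplified sketch of the 9.x pattern that produces the contention is below. This is illustrative only, not the real code (which lives in sys/netinet/tcp_input.c), and the hash lookup is elided; the INP_* macros and V_tcbinfo are the real names:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <netinet/in_pcb.h>
#include <netinet/tcp_var.h>

/*
 * The inp would normally be found via the hash lookup under the same
 * info lock; the lookup arguments are elided here.
 */
static void
tcp_input_lock_sketch(struct inpcb *inp)
{
    /*
     * Any segment that might change connection state (SYN, FIN, RST,
     * and in practice most of the input path) takes the one global
     * pcbinfo lock exclusively, so every connection setup and
     * teardown in the system serializes right here.
     */
    INP_INFO_WLOCK(&V_tcbinfo);

    INP_WLOCK(inp);             /* per-connection lock, taken second */
    /* ... segment processing, state transitions ... */
    INP_WUNLOCK(inp);

    INP_INFO_WUNLOCK(&V_tcbinfo);
}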

JHB's input:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we've confirmed this).
- Look into a lock per hash bucket on the PCB lookup (see the sketch below).
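
To make the per-bucket idea concrete, here is a rough sketch of the shape it could take. The struct and function names (inpcbhash_bucket, in_pcblookup_bucket, b_lock, b_head) are invented for illustration; the inpcb fields and in_pcbref() are the real ones:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>

struct inpcbhash_bucket {
    struct mtx       b_lock;    /* protects this chain only */
    struct inpcbhead b_head;    /* hash chain of inpcbs */
};

static struct inpcb *
in_pcblookup_bucket(struct inpcbhash_bucket *b, struct in_addr faddr,
    u_short fport, struct in_addr laddr, u_short lport)
{
    struct inpcb *inp;

    mtx_lock(&b->b_lock);       /* contends only within this one bucket */
    LIST_FOREACH(inp, &b->b_head, inp_hash) {
        if (inp->inp_faddr.s_addr == faddr.s_addr &&
            inp->inp_fport == fport &&
            inp->inp_laddr.s_addr == laddr.s_addr &&
            inp->inp_lport == lport) {
            in_pcbref(inp);     /* hold the inp across the unlock */
            break;
        }
    }
    mtx_unlock(&b->b_lock);
    return (inp);
}

Note JHB's caveat in the quoted mail below: all setups against a single listen socket hash to the same bucket, so this only helps when traffic is spread across buckets.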

Jeff recommends:
- Changing the lock strategy so the hash lookup locking can effectively be pushed further down into the stack.
- Making the [list] iterators more complex, like those the hash lookup uses now (see the sketch after Jeff's reply below).

We are starting down these paths to try to break up the locking. We'll post some initial patch ideas soon. In the meantime, any additional suggestions are certainly welcome.

Finally, I will mention that we have enabled connection groups (PCBGROUP) in some of our testing with 9.1 and found no improvement for our particular workload of high connection establishment rates.
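
(For anyone reproducing that test: if I recall correctly the option is spelled PCBGROUP in sys/conf/options, i.e.

    options         PCBGROUP

in the kernel configuration file.)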

Thanks,
Mike

-----Original Message-----
From: Jeff Roberson [mailto:jroberson at jroberson.net] 
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson at freebsd.org; jeff at freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit

On Tue, 21 May 2013, Jeff Roberson wrote:

> On Tue, 21 May 2013, John Baldwin wrote:
>
>> On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
>>> Greetings gentlemen,
>>> 
>>> It was a pleasure to meet you all last week at the FreeBSD developer 
>>> summit.
>>> I would like to thank you for spending the time to discuss all the
>>> wonderful internals of the network stack. We also thoroughly enjoyed
>>> the discussion on receive side scaling.
>>> 
>>> I'm sure you will remember both Julien Charbon and me asking
>>> questions regarding the TCP stack implementation, specifically around
>>> the locking internals. I am hoping to follow up with a path forward
>>> so we might be able to improve the connection rate performance. Our
>>> internal testing has found that the V_tcbinfo lock prevents TCP
>>> scaling under high connection setup and teardown rates. In fact, we
>>> surmise that a "FIN flood" attack may make it possible to degrade
>>> server connections significantly.
>>>
>>> In short, we are interested in changing this locking strategy and
>>> hope to get input from someone with more familiarity with the
>>> implementation. We're willing to be part of the coding effort and are
>>> willing to submit our suggestions to the community. I think we might
>>> just need some occasional input.
>>> 
>>> Also, I will point out that our testing shows a significant
>>> performance gap between the two operating systems on the same
>>> multi-core hardware. We're able to drive over 200,000 connections per
>>> second on a Linux server compared to fewer than 50,000 on the FreeBSD
>>> server. We have kernel profiling details that we can share if you'd
>>> like.
>> 
>> I have seen similar results with a redis cluster at work (we ended up
>> deploying proxies to allow applications to reuse existing connections
>> to avoid this).  I believe Linux uses RCU for this table.  You could
>> perhaps use an rm lock instead of an rw lock.  One idea I considered
>> was to split the pcbhash lock up further so you had one lock per
>> hash bucket, so that you could allow concurrent connection
>> setup/teardown so long as they were referencing different buckets.
>> However, I did not think this would have been useful for the case at
>> work since those connections were insane (single packet request
>> followed by single packet reply with all the setup/teardown overhead)
>> and all going to the same listening socket (so all the setups would
>> hash to the same bucket).  Handling concurrent setup on the same
>> listen socket is a PITA but is in fact the common case.
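
(A minimal sketch of the rm lock idea above, assuming the pcbhash lock were converted; the lock variable and the lookup body are placeholders. rm locks, per rmlock(9), make the read path nearly free at the cost of expensive write locking, and connection setup/teardown would still be writers here, so this mostly helps the established-connection lookup path:)

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock pcbhash_rm;        /* placeholder for the table lock */

static void
pcbhash_rm_init(void)
{
    rm_init(&pcbhash_rm, "pcbhash");
}

static void
pcbhash_read_path(void)
{
    struct rm_priotracker tracker;

    rm_rlock(&pcbhash_rm, &tracker);    /* near-free, per-CPU read lock */
    /* ... hash chain walk as today ... */
    rm_runlock(&pcbhash_rm, &tracker);
}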
>
> I don't think it's simply a synchronization primitive problem.  It
> looks to me like the fundamental issue is that the table lock comes
> before the inp lock in the lock order, which means we have to grab it
> very early.  Presumably this is the classic container -> data
> structure, data structure -> container lock order problem.  This is
> made more complex by the same lock protecting the list of all pcbs,
> the port allocation, and parts of the hash.
>
> Have we tried to further decompose this lock?  I would experiment with 
> that as a first step.  Is this grabbed in so many places just due to 
> the complex lock order issue?  That seems to be the case.  There are 
> only a handful of fields marked as protected by the inp info lock.  Do 
> we know that this list is complete?
>
> My second step would be to attempt to turn the locking on its head,
> changing the lock order so the inp lock comes before the inp info
> lock.  You can resolve the lookup problem by adding an atomic
> reference count that holds the data structure while you drop the hash
> lock and before you acquire the inp lock.  Then you could re-validate
> the inp after lookup.  I suspect it's not that simple and that there
> are higher-level races you'll discover are being serialized by this
> big lock, but that's just a hunch.
>
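
(The lookup/ref/re-validate dance described above, which the reply below notes the hash lock already does, has roughly this shape. pcb_hash_find() is a hypothetical stand-in for the locked hash-chain walk; in_pcbref(), in_pcbrele_wlocked(), the INP_HASH_* macros, and the INP_TIMEWAIT/INP_DROPPED flags are, I believe, the real 9.x names:)

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <netinet/in_pcb.h>

/* Hypothetical locked hash-chain walk; returns an unreferenced inp or NULL. */
struct inpcb *pcb_hash_find(struct inpcbinfo *pcbinfo);

struct inpcb *
lookup_ref_revalidate(struct inpcbinfo *pcbinfo)
{
    struct inpcb *inp;

    INP_HASH_RLOCK(pcbinfo);        /* short hold: find and ref only */
    inp = pcb_hash_find(pcbinfo);
    if (inp != NULL)
        in_pcbref(inp);             /* pin it across the unlock */
    INP_HASH_RUNLOCK(pcbinfo);
    if (inp == NULL)
        return (NULL);

    INP_WLOCK(inp);                 /* taken with no table lock held */
    if (in_pcbrele_wlocked(inp))    /* last reference: the pcb went away */
        return (NULL);
    if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
        /* The connection changed while we were unlocked; caller retries. */
        INP_WUNLOCK(inp);
        return (NULL);
    }
    return (inp);                   /* write-locked and re-validated */
}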

I read some more.  We have already done this lookup/ref/etc. dance for
the hash lock.  It handles the hard cases of multiple inp_* calls and
synchronizing the ports, bind, connect, etc.  It looks like the list
locks have been optimized to make the iterators simple.  I think this
is backwards now.  We should make the iterators complex and the normal
setup/teardown path simple.  The iterators can follow a model like the
hash lock's, using sentinels to hold their place.  We have the same
pattern elsewhere.  It would allow you to acquire the INP_INFO lock
after the INP lock and push it much deeper into the stack.

Jeff
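
To make the sentinel model concrete, a rough sketch of such an iterator is below. Everything except the in_pcb names is invented; INP_SENTINEL is a hypothetical flag bit, and a real version would need to pick an unused one:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <netinet/in_pcb.h>

#define INP_SENTINEL 0x80000000     /* hypothetical inp_flags2 marker bit */

/*
 * Walk the global pcb list using a marker inpcb to hold our place in
 * ipi_listhead, so the info lock can be dropped while each pcb is
 * processed under its own lock.
 */
static void
pcblist_walk_sketch(struct inpcbinfo *pcbinfo, void (*cb)(struct inpcb *))
{
    struct inpcb marker, *inp;

    bzero(&marker, sizeof(marker));
    marker.inp_flags2 = INP_SENTINEL;

    INP_INFO_WLOCK(pcbinfo);
    LIST_INSERT_HEAD(pcbinfo->ipi_listhead, &marker, inp_list);
    while ((inp = LIST_NEXT(&marker, inp_list)) != NULL) {
        /* Advance the marker past the pcb we are about to visit. */
        LIST_REMOVE(&marker, inp_list);
        LIST_INSERT_AFTER(inp, &marker, inp_list);
        if (inp->inp_flags2 & INP_SENTINEL)
            continue;               /* another walker's place-holder */
        in_pcbref(inp);             /* keep inp alive across the unlock */
        INP_INFO_WUNLOCK(pcbinfo);

        INP_WLOCK(inp);
        cb(inp);                    /* work under only the inp lock */
        if (in_pcbrele_wlocked(inp) == 0)
            INP_WUNLOCK(inp);

        INP_INFO_WLOCK(pcbinfo);    /* resume from the marker */
    }
    LIST_REMOVE(&marker, inp_list);
    INP_INFO_WUNLOCK(pcbinfo);
}

Because the info lock is dropped before each inp is locked, the iterator no longer forces the info-before-inp order, which is what would let the normal setup/teardown path take the INP lock first.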


> What do you think, Robert?  If it would make improving the tcb locking
> simpler, it may fall under the umbrella of what Isilon needs, but I'm
> not sure that's the case.  Certainly my earlier attempts at deferred
> processing were made more complex by this arrangement.
>
> Thanks,
> Jeff
>
>> 
>> The best forum for discussing this is probably on net@ as there are 
>> likely other interested parties who might have additional ideas.  
>> Also, it might be interesting to look at how connection groups try to 
>> handle this.  I believe they use an alternate method of decomposing
>> the global lock into smaller chunks, and I think they might do 
>> something to help mitigate the listen socket problem (perhaps they 
>> duplicate listen sockets in all groups)?  Robert would be able to 
>> chime in on that, but I believe he is not really back home until next 
>> week.
>> 
>> --
>> John Baldwin
>> 
>

