TCP stack lock contention with short-lived connections

Mon May 26 13:37:01 UTC 2014

  Hi Navdeep

On 23/05/14 23:37, Navdeep Parhar wrote:
> On 05/23/14 13:52, Julien Charbon wrote:
>> On 23/05/14 14:06, Julien Charbon wrote:
>>> On 27/02/14 11:32, Julien Charbon wrote:
>>>> On 07/11/13 14:55, Julien Charbon wrote:
>>>>> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon
>>>>> <jcharbon at verisign.com> wrote:
>>>>>> I have put technical and how-to-repeat details in below
>>>>>> PR:
>>>>>>
>>>>>> kern/183659: TCP stack lock contention with short-lived
>>>>>> connections
>>>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
>>>>>>
>>>>>> We are currently working on this performance improvement
>>>>>> effort;  it will impact only the TCP locking strategy not
>>>>>> the TCP stack logic itself.  We will share on freebsd-net
>>>>>> the patches we made for reviewing and improvement
>>>>>> propositions;  anyway this change might also require enough
>>>>>> eyeballs to avoid tricky race conditions introduction in
>>>>>> TCP stack.
>>
>> Joined the two cumulative patches (tcp-scale-inp-list-v1.patch and
>> tcp-scale-pcbinfo-rlock-v1.patch) we discussed the most at BSDCan
>> 2014.
>>
>> [...]
>>
>> _However_ it would be a miracle that this change does not introduce
>> new race condition(s) (hence the 'alpha' tag in commit message).
>> Most of TCP stack locking strategy being now on inpcb lock
>> shoulders.  That said, from our tests point of view, this change is
>> completely stable: No kernel/lock assertion, no unexpected TCP
>> behavior, stable performance results.  Moreover, before tagging
>> this change as 'beta' we need to test more thoroughly these
>> features:
>>
>> - VNET, - PCBGROUP/RSS/TCP timer per cpu, - TCP Offloading (we need
>> a NIC with a good TCP offloading support)
>
> I can assess the impact (and fix any fallout) on the parts of the
> kernel that deal with TCP_OFFLOAD, and the TOE driver in
> dev/cxgbe/tom.  But I was hoping to do that only after there was
> general agreement on net@ that these locking changes are sound and
> should be taken into HEAD. Lack of reviews seems to be holding this
> back, correct?

  Correct, these changes have been reviewed and tested only internally 
yet.  As we aren't finding any issues, we share them for wider testing, 
comments and reviews.

  An advice for reviewers:  When reading code don't be fooled by 
remaining ipi_lock read lock (INP_INFO_RLOCK), you should consider 
ipi_lock as completely deleted in TCP stack.  All TCP code that was 
under ipi_lock umbrella is now only protected by inp_lock.  Just keep 
that in mind.

  Below, just for your information, more details on context of these 
changes:

  o The rough consensus at BSDCan was that there is a shared interest 
for scalability improvement of TCP workloads with potential high rate of 
connections establishment and tear-down.

  o Our requirements for this task were:
   - Achieve more than 100k TCP connections per second without dropping 
a single packet in reception
   - Use a strategy that does not require to change all network stacks 
in a row (TCP, UDP, RAW, etc.)
   - Be able to progressively introduce better performance, leveraging 
already in place mutex strategy
   - Keep the TCP stack stable (obviously)

  o Solutions we did try to implement and that failed:

   #1 Decompose ipi_lock in finer grained locks on ipi_hashbase's bucket 
(i.e. add a mutex in struct inpcbhead).  Did not work as the induced 
change was quite big, and keeping network stacks shared code (in 
in_pcb.{h, c}) clean was much more difficult than expected.

   #2 Revert the lock ordering from:

ipi_lock > inpcb lock > ipi_hash_lock, pcbgroup locks

  to:

inpcb lock > ipi_lock, ipi_hash_lock, pcbgroup locks

   The change was just a all-or-nothing giant patch, it did not handle 
the full inp list traversal functions and having a different lock 
ordering between TCP stack and other network stacks was quite a 
challenge to maintain.

  My 2 cents.

--
Julien