panic with tcp timers

Fri Jun 17 14:51:22 UTC 2016

 Hi Gleb,

On 6/17/16 6:53 AM, Gleb Smirnoff wrote:
>   At Netflix we are observing a race in TCP timers with head.
> The problem is a regression, that doesn't happen on stable/10.
> The panic usually happens after several hours at 55 Gbit/s of
> traffic.
> 
> What happens is that tcp_timer_keep finds t_tcpcb being
> NULL. Some coredumps have tcpcb already initialized,
> with non-NULL t_tcpcb and in TCPS_ESTABLISHED state. Which
> means that other CPU was working on the tcpcb while
> the faulted one was working on the panic. So, this all looks
> like a use after free, which conflicts with new allocation.
> 
> Comparing stable/10 and head, I see two changes that could
> affect that:
> 
> - callout_async_drain
> - switch to READ lock for inp info in tcp timers
> 
> That's why you are in To, Julien and Hans :)
> 
> We continue investigating, and I will keep you updated.
> However, any help is welcome. I can share cores.

 Thanks for sharing.  Let me run our TCP tests on a recent version of
HEAD to see if by chance I can reproduce it.  If I am not able to
reproduce it I will ask for debug kernel and cores and see if I can help.

 Few notes here:

 -  Around 2 months ago I did test HEAD with callout_async_drain() in
TCP timers with our TCP QA testsuite but no kernel panic.  That said I
did not let our test run during several hours.

 - At Verisign we run 10 with READ lock for inp info in tcp timers
change.  Again, it does not mean this change has no impact here.

 My 2 cents.

--
Julien

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 496 bytes
Desc: OpenPGP digital signature
URL: <http://lists.freebsd.org/pipermail/freebsd-current/attachments/20160617/e89e2b8e/attachment.sig>