Panic in the udp_input() under heavy load

Tue Dec 27 20:56:40 UTC 2011

So it's actually happening:

Nov  8 21:38:02 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff05e5798bd0
Nov 13 03:34:49 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff02e5b05930
Nov 30 04:18:11 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff03b2d2e000
Nov 30 20:24:12 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff03a35e33f0
Nov 30 22:03:20 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff03a6349690
Dec  5 03:33:01 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff02e0c9e930
Dec  9 06:02:06 dal09 kernel: BZZT! Something is terribly wrong, up == NULL! inp = 0xffffff038a4fea80

I'd love to try the socket closure locking patch that Robert
suggested, but I'm kinda loaded right now. Robert, would it be too much
to ask you to provide a version of the patch that applies to the latest
8-STABLE for a test? I'd give it a spin on 2-3 production boxes. And
yes, those servers do a lot of socket ops per second, I'd say on the
order of hundreds if not thousands per second.

-Maxim

On 11/7/2011 3:25 PM, Bjoern A. Zeeb wrote:
> On Mon, 7 Nov 2011, Maxim Sobolev wrote:
>
>> On 11/7/2011 2:57 PM, Maxim Sobolev wrote:
>>> On 11/7/2011 10:24 AM, Bjoern A. Zeeb wrote:
>>>> Unlikely; the inp is properly locked there and the udp info attach
>>>> better still be valid there; your problem is most likely elsewhere;
>>>> try to see if you have other threads and see what they do at the same
>>>> time, etc. You would need to race with udp_detach(); you also want
>>>> to make sure that the inp still looks sane from either ddb or a dump
>>>> and we are not talking about random memory corruption here.
>>>
>>> Well, as you can see from the trace it points pretty strongly to that
>>> piece of code. And as I said this panic is completely reproducible,
>>> we've seen it at least 5 times to date in exactly this location.
>>> Unfortunately the trace is rather long, so we could not capture it in
>>> full until we switched to the 80x50 mode.
>>>
>>> If it were memory corruption it would be just a random fault, while
>>> here we have it failing at this point reliably.
>>>
>>> Unfortunately the panic happens in the driver thread context (I
>>> believe), so the KDB/dump is not working. After panicking the machine
>>> just hangs there. Keyboard is not working and I need to do a hard reset.
>>>
>>> Is there any other explanation that you can think of? Is it possible for
>>> some other portion of the code (i.e. network driver, DMA engine etc) to
>>> trash this structure by writing something out of bounds? Or something
>>> along those lines?
>>
>> OK, I've put the following catch to prove the case:
>>
>>        up = intoudpcb(inp);
>>        if (up == NULL) {
>>                printf("BZZT! Something is terribly wrong, up == NULL!\n");
>>                INP_RUNLOCK(inp);
>>                goto badunlocked;
>>        }
>>        if (up->u_tun_func == NULL) {
>>
>> I am going to give it a spin on the two busiest boxes and see if I can
>> log anything.
>
> Now if you are clever you'd also log the inp there, as the above only
> proves that something is wrong but still does not help us figure out
> what.
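
For reference, the check with the inp address included in the message
might look roughly like the userspace sketch below. This is only an
illustration, not the code that ran on the boxes: the two structs are
cut-down stand-ins for the kernel ones, and check_udpcb() is a
hypothetical helper name. The message text follows the log lines at the
top of this mail.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace stand-ins for the kernel structures; only the
 * fields this sketch needs are modeled.
 */
struct udpcb {
	void *u_tun_func;	/* placeholder for the tunneling callback */
};

struct inpcb {
	void *inp_ppcb;		/* per-protocol control block pointer */
};

/* Modeled on the kernel's intoudpcb(): inpcb -> attached udpcb. */
#define intoudpcb(inp)	((struct udpcb *)(inp)->inp_ppcb)

/*
 * check_udpcb() is a hypothetical helper. It returns 0 when a udpcb is
 * attached. Otherwise it formats a diagnostic that includes the inp
 * address into buf and returns -1.
 */
static int
check_udpcb(const struct inpcb *inp, char *buf, size_t len)
{
	const struct udpcb *up = intoudpcb(inp);

	if (up == NULL) {
		snprintf(buf, len,
		    "BZZT! Something is terribly wrong, up == NULL! inp = %p",
		    (const void *)inp);
		return (-1);
	}
	return (0);
}
```

In the kernel the message would go straight to printf(), and the caller
would still have to drop the inp lock before bailing out, as in the
quoted snippet.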