Old panic report...

George Neville-Neil gnn at freebsd.org
Mon Dec 3 16:23:49 UTC 2012


Howdy,

I was cleaning out my inbox and found this.  Whoever has InfiniBand on their plate at the
moment should take a look at this report.

Best,
George


Begin forwarded message:

> From: John Baldwin <jhb at freebsd.org>
> Subject: Fwd: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 30, 2012 3:17:25 EDT
> To: "George Neville-Neil" <gnn at freebsd.org>
> 
> FYI...
> 
> -- 
> John Baldwin
> 
> From: Olivier Cinquin <ocinquin at uci.edu>
> Subject: Re: Kernel panic caused by OFED mlx4 driver
> Date: May 26, 2012 12:20:30 EDT
> To: John Baldwin <jhb at freebsd.org>
> 
> 
> Hi John,
> I thought I'd let you know I have things working now. Thanks for your fix.
> 
> I also wanted to mention that I've identified another problem. It is unlikely to affect me in practice, and I don't know whether it falls within your areas of expertise and interest, but I thought I'd mention it. When running performance tests of the IP over InfiniBand connection, I found that iperf reported dismal numbers (around 1Mb/s). I did further testing and found much higher rates using the following:
> cat /dev/zero | ssh other_machines_ip "cat > /dev/null"
> and monitoring traffic with systat -ifs. The throughput of the latter test is limited by CPU usage of ssh. Using multiple instances of the above test running in parallel, I could get total throughput up to ~10Gb/s. However, if after reaching that throughput I launched another instance of the test, total throughput suddenly dropped back down to very low levels.
> 
> My guess is that there's a congestion management problem, which I have no idea how to solve (just to play around, I tried loading kernel modules cc_cubic.ko or cc_htcp.ko but that didn't address the problem). It doesn't matter that much to me because my usage is unlikely to produce rates above 10Gb/s, but other people might run into the problem (and the iperf results are misleading for all users).
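For anyone retrying this: loading cc_cubic.ko or cc_htcp.ko only makes the algorithm available. As far as mod_cc(4) describes it, new connections keep using the default until net.inet.tcp.cc.algorithm is changed. A minimal sketch follows, assuming FreeBSD 9 or later and root privileges. The select_cc helper and the RUN variable are hypothetical names introduced here, not part of the original report, and nothing in this thread shows that switching algorithms cures the throughput collapse.

```sh
# select_cc ALGO: load the cc_ALGO module, then make ALGO the default
# congestion control algorithm for new TCP connections.
# Hypothetical helper; set RUN=echo to print the commands instead of
# executing them.
select_cc() {
    algo=$1
    # Registers the algorithm with the kernel (reports an error if the
    # module is already loaded, which is harmless here).
    ${RUN:-} kldload "cc_${algo}"
    # Selects it as the default for connections opened from now on.
    ${RUN:-} sysctl net.inet.tcp.cc.algorithm="${algo}"
}
```

With RUN=echo set, "select_cc cubic" prints the two commands without running them. "sysctl net.inet.tcp.cc.available" lists the algorithms the kernel currently knows about, and existing connections would need to be restarted to pick up a new default.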
> 
> Best wishes,
> Olivier
> 
> 
> 
> 
> 
> On May 23, 2012, at 6:35 AM, John Baldwin wrote:
> 
>> On Tuesday, May 22, 2012 4:52:52 pm Olivier Cinquin wrote:
>>> Here you go...
>>> Olivier
>>> 
>>> 
>>> interrupt                          total       rate
>>> irq275: mlx4_core0                     0          0
>>> irq276: mlx4_core0                     0          0
>>> irq277: mlx4_core0                     0          0
>>> irq278: mlx4_core0                     0          0
>>> irq279: mlx4_core0                     0          0
>>> irq280: mlx4_core0                     0          0
>>> irq281: mlx4_core0                     0          0
>>> irq282: mlx4_core0                     0          0
>>> irq283: mlx4_core0                     0          0
>>> irq284: mlx4_core0                     0          0
>>> irq285: mlx4_core0                     0          0
>>> irq286: mlx4_core0                     0          0
>>> irq287: mlx4_core0                     0          0
>>> irq288: mlx4_core0                     0          0
>>> irq289: mlx4_core0                     0          0
>>> irq290: mlx4_core0                     0          0
>>> irq291: mlx4_core0                     0          0
>>> irq292: mlx4_core0                     0          0
>>> irq293: mlx4_core0                     0          0
>>> irq294: mlx4_core0                     0          0
>>> irq295: mlx4_core0                     0          0
>>> irq296: mlx4_core0                     0          0
>>> irq297: mlx4_core0                     0          0
>>> irq298: mlx4_core0                     0          0
>>> irq299: mlx4_core0                     0          0
>>> irq300: mlx4_core0                     0          0
>>> irq301: mlx4_core0                     0          0
>>> irq302: mlx4_core0                     0          0
>>> irq303: mlx4_core0                     0          0
>>> irq304: mlx4_core0                     0          0
>>> irq305: mlx4_core0                     0          0
>>> irq306: mlx4_core0                     0          0
>>> irq307: mlx4_core0                     0          0
>>> irq308: mlx4_core0                     0          0
>>> irq309: mlx4_core0                     0          0
>>> irq310: mlx4_core0                     0          0
>>> irq311: mlx4_core0                     0          0
>>> irq312: mlx4_core0                     0          0
>>> irq313: mlx4_core0                     0          0
>>> irq314: mlx4_core0                     0          0
>>> irq315: mlx4_core0                     0          0
>>> irq316: mlx4_core0                     0          0
>>> irq317: mlx4_core0                     0          0
>>> irq318: mlx4_core0                     0          0
>>> irq319: mlx4_core0                     0          0
>>> irq320: mlx4_core0                     0          0
>>> irq321: mlx4_core0                     0          0
>>> irq322: mlx4_core0                     0          0
>>> irq323: mlx4_core0                     0          0
>>> irq324: mlx4_core0                     0          0
>>> irq325: mlx4_core0                     0          0
>>> irq326: mlx4_core0                     0          0
>>> irq327: mlx4_core0                     0          0
>>> irq328: mlx4_core0                     0          0
>>> irq329: mlx4_core0                     0          0
>>> irq330: mlx4_core0                     0          0
>>> irq331: mlx4_core0                     0          0
>>> irq332: mlx4_core0                     0          0
>>> irq333: mlx4_core0                     0          0
>>> irq334: mlx4_core0                     0          0
>>> irq335: mlx4_core0                     0          0
>>> irq336: mlx4_core0                     0          0
>>> irq337: mlx4_core0                     0          0
>>> irq338: mlx4_core0                     0          0
>>> irq339: mlx4_core0                   426          0
>>> Total                            3076439        341
>> 
>> 64 interrupts, wow!  Well, that explains why you hit this bug then.  I'll 
>> commit the fix.
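The number of vectors a device has allocated can be read off the 'vmstat -ai' output rather than counted by eye. A small sketch, assuming the line layout shown in the listing above ("irqN: device total rate"); count_vectors is a hypothetical helper name, not something from the thread.

```sh
# count_vectors DEVICE: read `vmstat -ai` output on stdin and print how
# many interrupt lines belong to DEVICE (for example mlx4_core0).
# Hypothetical helper; assumes the first field is "irqN:" and the second
# field is the device name.
count_vectors() {
    awk -v dev="$1" '
        $1 ~ /^irq[0-9]+:$/ && $2 == dev { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}

# On the machine with the card installed:
#   vmstat -ai | grep -v stray | count_vectors mlx4_core0
```

The header and "Total" lines are skipped because their first field does not look like "irqN:".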
>> 
>>> 
>>> On May 22, 2012, at 1:42 PM, John Baldwin wrote:
>>> 
>>>> On Tuesday, May 22, 2012 2:48:52 pm Olivier Cinquin wrote:
>>>>> Thanks, that seems to have fixed the problem! Will this patch make it into the next release?
>>>>> I have no idea how many interrupts my card has. I'm happy to find out if you let me know how, if that can help you in any way.
>>>>> Should I expect everything to work fine now? I take it the card is recognized since ib0 is attached to mlx4_0 port 1
>>>>> 
>>>>> mlx4_core0: <mlx4_core> mem 0xdfe00000-0xdfefffff,0xdc800000-0xdcffffff irq 36 at device 0.0 on pci3
>>>>> mlx4_core: Mellanox ConnectX core driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> mlx4_core: Initializing mlx4_core
>>>>> vboxdrv: fAsync=0 offMin=0x123c offMax=0xec01
>>>>> vboxnet0: Ethernet address: 0a:00:27:00:00:00
>>>>> mlx4_en: Mellanox ConnectX HCA Ethernet driver v1.5.2 (July 2010)
>>>>> mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0-ofed1.5.2 (August 4, 2010)
>>>>> ib0: max_srq_sge=31
>>>>> ib0: max_cm_mtu = 0x10000, num_frags=16
>>>>> ib0: Attached to mlx4_0 port 1
>>>>> 
>>>>> 
>>>>> (I need to get cables before I can test connectivity between different machines).
>>>>> 
>>>>> Thanks again for your help!
>>>> 
>>>> Very interesting!  Can you get the output of 'vmstat -ai | grep -v stray'?
>>>> 
>>>> -- 
>>>> John Baldwin
>>>> 
>>> 
>>> 
>> 
>> -- 
>> John Baldwin
>> 
> 
> 
> 


