Panic With Large Network Copy

Wed May 30 14:27:29 UTC 2007

On May 29, 2007, at 4:26 PM, Kris Kennaway wrote:

> On Tue, May 29, 2007 at 03:36:49PM -0700, Scott Willson wrote:
>> I am seeing hard (often no core dump) crashes on a new AMD64 box
>> running 6.2 RELEASE. When I try to rsync 10+ GB of backup files to
>> the new box, I can reliably crash it after about 20 minutes; often
>> quicker if I do something else intensive at the same time, like
>> compile MySQL. Here are the box specs:
>> ASUS M2NPV-VM motherboard
>> AMD A64 3800+ 2.4G CPU
>> 2 x 1 GB SuperTalent DDR2 667 RAM
>> 2 x 500G Samsung SATA2 drives
>> MATSHITADVD-ROM SR-8585 DVD drive (ancient)
>>
>> Most times, I don't even get a core dump. Here's one I did get:
>> panic: double fault
>> Uptime: 20m26s
>> Dumping 2014 MB (2 chunks)
>>   chunk 0: 1MB (159 pages) ... ok
>>   chunk 1: 2014MB (515552 pages) 1998 1982 1966 1950 1934 1918 1902
>> 1886 1870 1854 1838 1822 1806 1790 1774 1758 1742 1726 1710 1694 1678
>> 1662 1646 1630 1614 1598 1582 1566 1550 1534 1518 1502 1486 1470 1454
>> 1438 1422 1406 1390 1374 1358 1342 1326 1310 1294 1278 1262 1246 1230
>> 1214 1198 1182 1166 1150 1134 1118 1102 1086 1070 1054 1038 1022 1006
>> 990 974 958 942 926 910 894 878 862 846 830 814 798 782 766 750 734
>> 718 702 686 670 654 638 622 606 590 574 558 542 526 510 494 478 462
>> 446 430 414 398 382 366 350 334 318 302 286 270 254 238 222 206 190
>> 174 158 142 126 110 94 78 62 46 30 14
>>
>> #0  doadump () at pcpu.h:172
>> 172             __asm __volatile("movq %%gs:0,%0" : "=r" (td));
>> (kgdb) backtrace
>> #0  doadump () at pcpu.h:172
>> #1  0x0000000000000004 in ?? ()
>> #2  0xffffffff803f6093 in boot (howto=260)
>>     at /usr/src/sys/kern/kern_shutdown.c:409
>> #3  0xffffffff803f6696 in panic (fmt=0xffffff0079a08be0 "X??y")
>>     at /usr/src/sys/kern/kern_shutdown.c:565
>> #4  0xffffffff80610e70 in dblfault_handler ()
>>     at /usr/src/sys/amd64/amd64/trap.c:680
>> #5  0xffffffff805fe2f2 in Xdblfault ()
>>     at /usr/src/sys/amd64/amd64/exception.S:192
>> #6  0xffffffff80439844 in m_tag_delete_chain (m=0x0, t=0x0)
>>     at /usr/src/sys/kern/uipc_mbuf2.c:346
>> #7  0xffffffff803eac0d in mb_dtor_mbuf (mem=0x0, size=0, arg=0x0)
>>     at /usr/src/sys/kern/kern_mbuf.c:338
>> #8  0xffffffff80592a24 in uma_zfree_arg (zone=0x0, item=0x0, udata=0x0)
>>     at /usr/src/sys/vm/uma_core.c:2270
>> #9  0xffffffff804371f0 in m_freem (mb=0x0) at uma.h:303
>> #10 0xffffffff80634125 in nve_ospackettx (ctx=0xffffff00798aac00,
>>     id=0xffffffffb19ea6d0, success=0)
>>     at /usr/src/sys/dev/nve/if_nve.c:1551
>
> This looks like an nve driver bug to me.  You may wish to try the
> nfe driver.
>
> Kris

Thanks for the suggestion, Kris.

I built a new kernel without nve, built the nfe driver from
nfe-20070512.tar.gz with e1000phy.patch applied, and enabled device
polling:
e1000phy0: <Marvell 88E1116 Gigabit PHY> on miibus0
e1000phy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX-FDX, auto
nfe0: Ethernet address: 00:1a:92:cb:b2:eb
nfe0: [FAST]

No more panics, but I see a lot of error messages under load:
May 29 20:25:17 brooklyn kernel: nfe0: tx v2 error 0x6204<UNDERFLOW>
May 29 20:28:15 brooklyn kernel: nfe0: watchdog timeout (missed Tx interrupts) -- recovering
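
To get a rough idea of how often each kind of message shows up, the log can be tallied with a short script. This is only a minimal sketch, not something from the original setup: it assumes syslog lines in the format shown above, and the helper name `tally_nfe_messages` and the sample lines are illustrative.

```python
import re
from collections import Counter

# Matches kernel messages from an nfe interface, e.g.
# "May 29 20:25:17 brooklyn kernel: nfe0: tx v2 error 0x6204<UNDERFLOW>"
NFE_LINE = re.compile(r"kernel: (nfe\d+): (.*)$")
# Matches the Tx error text and captures the flag names inside <...>.
TX_ERROR = re.compile(r"tx v\d+ error 0x[0-9a-fA-F]+<([^>]*)>")


def tally_nfe_messages(lines):
    """Count nfe kernel messages by kind (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = NFE_LINE.search(line.rstrip("\n"))
        if m is None:
            continue  # not an nfe kernel message
        message = m.group(2)
        err = TX_ERROR.match(message)
        if err:
            counts["tx error " + err.group(1)] += 1
        elif message.startswith("watchdog timeout"):
            counts["watchdog timeout"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the messages above.
    sample = [
        "May 29 20:25:17 brooklyn kernel: nfe0: tx v2 error 0x6204<UNDERFLOW>",
        "May 29 20:28:15 brooklyn kernel: nfe0: watchdog timeout "
        "(missed Tx interrupts) -- recovering",
    ]
    for kind, n in sorted(tally_nfe_messages(sample).items()):
        print(kind, n)
```

In practice the function could be fed an open log file instead of the sample list, since it only iterates over lines.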

The only odd thing about my current setup is that the server shares
an old hub with other old hardware, and the link appears to have come
up at only 10baseT, half-duplex:
ifconfig nfe0
nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
         options=8<VLAN_MTU>
         inet 192.168.1.154 netmask 0xffffff00 broadcast 192.168.1.255
         ether 00:1a:92:cb:b2:eb
         media: Ethernet autoselect (10baseT/UTP <half-duplex>)
         status: active
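
The negotiated media can also be pulled out of that output programmatically, for example to check several machines at once. This is a minimal sketch under stated assumptions: it expects the `media:` line format shown above, and the helper name `active_media` is illustrative, not part of any tool.

```python
import re

# Matches the "media:" line of ifconfig output, e.g.
# "media: Ethernet autoselect (10baseT/UTP <half-duplex>)"
# Group 1 is the active media type, group 2 the optional media options.
MEDIA_LINE = re.compile(
    r"media:\s+Ethernet\s+\S+\s+\(([^\s<)]+)(?:\s+<([^>]*)>)?\)"
)


def active_media(ifconfig_output):
    """Return (media, options) from ifconfig text, or None if not found.

    Hypothetical helper; options is None when no <...> part is present.
    """
    for line in ifconfig_output.splitlines():
        m = MEDIA_LINE.search(line)
        if m:
            return m.group(1), m.group(2)
    return None


if __name__ == "__main__":
    # Sample text copied from the ifconfig output above.
    sample = (
        "nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500\n"
        "         media: Ethernet autoselect (10baseT/UTP <half-duplex>)\n"
        "         status: active\n"
    )
    print(active_media(sample))
```

The function takes the text as an argument rather than running `ifconfig` itself, so it can be tried on saved output.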

For now, I've installed an old spare Ethernet card, and I see no
errors with it, so I'm going to stick with that. I'm also going to
follow up with the nfe driver's maintainer in case he's interested.

Scott