kern/180775: if_bxe driver broken with Broadcom BCM57711 cards

Sébastien RICCIO sr at swisscenter.com
Tue Jul 23 20:50:01 UTC 2013


>Number:         180775
>Category:       kern
>Synopsis:       if_bxe driver broken with Broadcom BCM57711 cards
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jul 23 20:50:01 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Sébastien RICCIO
>Release:        9.1-RELEASE
>Organization:
SwissCenter
>Environment:
FreeBSD filer-01-a 9.1-RELEASE-p4 FreeBSD 9.1-RELEASE-p4 #0 r253571M: Tue Jul 23 16:09:05 CEST 2013     root at filer-01-a:/usr/obj/usr/src/sys/GENERIC  amd64
>Description:
Hi!

We recently installed FreeBSD 9.1 (64-bit) on a Dell PowerEdge R510 system in which we have two BCM57711 cards (for a total of four 10 Gbit interfaces).

We're planning to use it as a storage filer using ZFS/NFS.

Currently, in testing, the filer is connected through two 10 Gbit interfaces to a 10GbE Dell PowerConnect switch that serves some Linux clients, which also use 10GbE cards.

We have run into a lot of trouble trying to get this setup working.

--

First issue:

Without any special tweaking, when we read from or write to the NFS server from a client, the network card crashes and becomes unresponsive. In the logs I can see:

Jul 19 11:49:26 filer-01-a kernel: bxe0: ---------- Begin crash dump ----------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------ Idle Check ------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CFC: AC > 1 - LCID 39 CID_CAM 0x7 Value is 0xc
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: VOQ_0, VOQ credit is not equal to initial credit. Values are 0xf8 0x140
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: P0 Byte credit is not equal to initial credit. Values are 0x5a1c 0x8000
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING CCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING XCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING BRB1: BRB is not empty. Value is 0x3
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING TCM: FIC0_INIT_CRD is not 64. Value is 0x30
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR TSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR XSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: bxe_idle_chk(): Failed with 4 error(s) and 0 warning(s)!
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------------------------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------ Idle Check ------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CFC: AC > 1 - LCID 39 CID_CAM 0x7 Value is 0xc
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: VOQ_0, VOQ credit is not equal to initial credit. Values are 0xf8 0x140
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: P0 Byte credit is not equal to initial credit. Values are 0x5a1c 0x8000
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING CCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING XCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING BRB1: BRB is not empty. Value is 0x4
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING TCM: FIC0_INIT_CRD is not 64. Value is 0x30
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING PRS: TCM current credit is not 0. Value is 0x10
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR TSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR XSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: bxe_idle_chk(): Failed with 4 error(s) and 0 warning(s)!
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------------------------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ----------  End crash dump  ----------

Even a reboot of the system is not enough. After rebooting, I can't ping any host on the network. It seems that the crash leaves the card in a bogus state that requires a complete power cycle to get the cards working again.
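
To compare the summary line printed by bxe_idle_chk() with the entries that were actually logged, the ERROR and WARNING lines of a dump like the one above can be tallied per interface. This is only a rough sketch: the function name tally_idle_check is made up for illustration, and the matching assumes the exact "bxeN: ERROR" / "bxeN: WARNING" layout seen in the log above.

```sh
#!/bin/sh
# Sketch: count ERROR and WARNING lines per bxe interface in a syslog
# excerpt read from standard input. tally_idle_check is a hypothetical
# helper name; the line layout is assumed to match the log above.

tally_idle_check() {
    awk '
        # Match "bxeN: " immediately followed by a severity keyword.
        match($0, /bxe[0-9]+: (ERROR|WARNING)/) {
            # Split "bxe0: ERROR" into interface name and severity.
            split(substr($0, RSTART, RLENGTH), f, ": ")
            count[f[1] " " f[2]]++
        }
        END {
            for (k in count) print k, count[k]
        }
    ' | sort    # awk array order is unspecified, so sort the result
}
```

It could be fed from the system log, for example with: tally_idle_check < /var/log/messages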

We found out that disabling tso4, txcsum and rxcsum on the cards prevents this from happening.

So, although I don't think it is a real fix, let's say we have a workaround for this by setting something like this in rc.conf:
ifconfig_bxe0="inet 10.50.50.11 netmask 255.255.255.0 mtu 9000  -tso4 -txcsum -rxcsum"
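
For testing the same workaround at runtime on all four ports without editing rc.conf, something like the following could be used. It is a minimal sketch: the function name disable_bxe_offload and the IFCONFIG override variable are hypothetical and exist only for this example; the ifconfig flags are the ones given above.

```sh
#!/bin/sh
# Sketch: disable TSO and checksum offload on every bxe interface named on
# the command line. disable_bxe_offload and IFCONFIG are hypothetical names
# for illustration; the flags are the ones from the rc.conf line above.

disable_bxe_offload() {
    # IFCONFIG defaults to the real ifconfig. Setting it to
    # "echo ifconfig" gives a dry run that only prints the commands.
    for ifn in "$@"; do
        case "$ifn" in
            bxe[0-9]*) ${IFCONFIG:-ifconfig} "$ifn" -tso4 -txcsum -rxcsum ;;
            *) ;;   # leave interfaces of other drivers untouched
        esac
    done
}
```

On FreeBSD the interface list can come from ifconfig -l, for example: disable_bxe_offload $(ifconfig -l)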

--

Second issue:

Issuing an ifconfig mtu 9000 on the interfaces randomly produces this error:

Jul 19 09:47:03 filer-01-a kernel: bxe0: /usr/src/sys/dev/bxe/if_bxe.c(10934): Memory allocation failure! Cannot fill fp[04] RX chain.
Jul 19 09:47:03 filer-01-a kernel: bxe0: /usr/src/sys/dev/bxe/if_bxe.c(3921): NIC initialization failed, aborting!
Jul 19 09:47:12 filer-01-a kernel: bxe3: /usr/src/sys/dev/bxe/if_bxe.c(10934): Memory allocation failure! Cannot fill fp[04] RX chain.
Jul 19 09:47:12 filer-01-a kernel: bxe3: /usr/src/sys/dev/bxe/if_bxe.c(3921): NIC initialization failed, aborting!

That sounds quite bad, and I can't reproduce it with an MTU of 1500. (But does it make sense to use an MTU of 1500 on a 10 Gbit local network...?)

--

Third issue:

Part 1)

We've tried aggregating two interfaces (each with an MTU of 9000) using lagg, like this:

ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24

This instantly crashes the kernel and causes the machine to reboot. The log says:

Jul 19 09:47:12 filer-01-a kernel: 
Jul 19 09:47:12 filer-01-a kernel: 
Jul 19 09:47:12 filer-01-a kernel: Fatal trap 12: page fault while in kernel mode
Jul 19 09:47:12 filer-01-a kernel: cpuid = 0; apic id = 20
Jul 19 09:47:12 filer-01-a kernel: fault virtual address        = 0x6d
Jul 19 09:47:12 filer-01-a kernel: fault code           = supervisor read data, page not present
Jul 19 09:47:12 filer-01-a kernel: instruction pointer  = 0x20:0xffffffff808d5879
Jul 19 09:47:12 filer-01-a kernel: stack pointer                = 0x28:0xffffff80003227f0
             --*** BOOOM REBOOT ***-- 
Jul 19 09:49:49 filer-01-a syslogd: kernel boot file is /boot/kernel/kernel

/var/crash/core.txt.0 contains:

Unread portion of the kernel message buffer:
Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 33
fault virtual address   = 0x6d
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff808d5879
stack pointer           = 0x28:0xffffff80003227f0
frame pointer           = 0x28:0xffffff8000322820
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (swi6: task queue)
trap number             = 12
panic: page fault
cpuid = 5
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bd8240 at trap_fatal+0x290
#3 0xffffffff80bd857d at trap_pfault+0x1ed
#4 0xffffffff80bd8b9e at trap+0x3ce
#5 0xffffffff80bc315f at calltrap+0x8
#6 0xffffffff8045da8c at bxe_free_buf_rings+0x4c
#7 0xffffffff8046c0d5 at bxe_init_locked+0x125
#8 0xffffffff80470cfe at bxe_ioctl+0x4fe
#9 0xffffffff8099d08f at if_setlladdr+0x1ff
#10 0xffffffff8174c94a at lagg_port_setlladdr+0x8a
#11 0xffffffff8092cf55 at taskqueue_run_locked+0x85
#12 0xffffffff8092d0da at taskqueue_run+0x3a
#13 0xffffffff808be8d4 at intr_event_execute_handlers+0x104
#14 0xffffffff808c0076 at ithread_loop+0xa6
#15 0xffffffff808bb9ef at fork_exit+0x11f
#16 0xffffffff80bc368e at fork_trampoline+0xe
Uptime: 39m41s
Dumping 1505 out of 32735 MB:..2%..11%..21%..31%..41%..52%..61%..71%..81%..91%

Reading symbols from /boot/kernel/zfs.ko...Reading symbols from /boot/kernel/zfs.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/zfs.ko
Reading symbols from /boot/kernel/opensolaris.ko...Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/opensolaris.ko
Reading symbols from /boot/kernel/if_lagg.ko...Reading symbols from /boot/kernel/if_lagg.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/if_lagg.ko
#0  doadump (textdump=Variable "textdump" is not available.
) at pcpu.h:224
224     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) #0  doadump (textdump=Variable "textdump" is not available.
) at pcpu.h:224
#1  0xffffffff808ea3a1 in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:448
#2  0xffffffff808ea897 in panic (fmt=0x1 <Address 0x1 out of bounds>)
    at /usr/src/sys/kern/kern_shutdown.c:636
#3  0xffffffff80bd8240 in trap_fatal (frame=0xc, eva=Variable "eva" is not available.
)
    at /usr/src/sys/amd64/amd64/trap.c:857
#4  0xffffffff80bd857d in trap_pfault (frame=0xffffff8000322740, usermode=0)
    at /usr/src/sys/amd64/amd64/trap.c:773
#5  0xffffffff80bd8b9e in trap (frame=0xffffff8000322740)
    at /usr/src/sys/amd64/amd64/trap.c:456
#6  0xffffffff80bc315f in calltrap ()
    at /usr/src/sys/amd64/amd64/exception.S:228
#7  0xffffffff808d5879 in free (addr=0xffffff80083e5000, 
    mtp=0xffffffff81198ba0) at uma_int.h:413
#8  0xffffffff8045da8c in bxe_free_buf_rings (sc=0xffffff8000c1c000)
    at /usr/src/sys/dev/bxe/if_bxe.c:3787
#9  0xffffffff8046c0d5 in bxe_init_locked (sc=0x0, load_mode=0)
    at /usr/src/sys/dev/bxe/if_bxe.c:4063
#10 0xffffffff80470cfe in bxe_ioctl (ifp=0xfffffe000ec59000, command=Variable "command" is not available.
)
    at /usr/src/sys/dev/bxe/if_bxe.c:9668
#11 0xffffffff8099d08f in if_setlladdr (ifp=0xfffffe000ec59000, 
    lladdr=0xfffffe00125da4c8 "", len=6) at /usr/src/sys/net/if.c:3304
#12 0xffffffff8174c94a in lagg_port_setlladdr (arg=Variable "arg" is not available.
)
    at /usr/src/sys/modules/if_lagg/../../net/if_lagg.c:495
#13 0xffffffff8092cf55 in taskqueue_run_locked (queue=0xfffffe000e833980)
    at /usr/src/sys/kern/subr_taskqueue.c:308
#14 0xffffffff8092d0da in taskqueue_run (queue=0xfffffe000e833980)
    at /usr/src/sys/kern/subr_taskqueue.c:322
#15 0xffffffff808be8d4 in intr_event_execute_handlers (p=Variable "p" is not available.
)
    at /usr/src/sys/kern/kern_intr.c:1262
#16 0xffffffff808c0076 in ithread_loop (arg=0xfffffe000e66c140)
    at /usr/src/sys/kern/kern_intr.c:1275
#17 0xffffffff808bb9ef in fork_exit (
    callout=0xffffffff808bffd0 <ithread_loop>, arg=0xfffffe000e66c140, 
    frame=0xffffff8000322c40) at /usr/src/sys/kern/kern_fork.c:992
#18 0xffffffff80bc368e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:602
#19 0x0000000000000000 in ?? ()
#20 0x0000000000000000 in ?? ()
#21 0x0000000000000001 in ?? ()
#22 0x0000000000000000 in ?? ()
#23 0x0000000000000000 in ?? ()
#24 0x0000000000000000 in ?? ()
#25 0x0000000000000000 in ?? ()
#26 0x0000000000000000 in ?? ()
#27 0x0000000000000000 in ?? ()
#28 0x0000000000000000 in ?? ()
#29 0x0000000000000000 in ?? ()
#30 0x0000000000000000 in ?? ()
#31 0x0000000000000000 in ?? ()
#32 0x0000000000000000 in ?? ()
#33 0x0000000000000000 in ?? ()
#34 0x0000000000000000 in ?? ()
#35 0x0000000000000000 in ?? ()
#36 0x0000000000000000 in ?? ()
#37 0x0000000000000000 in ?? ()
#38 0x0000000000000000 in ?? ()
#39 0x0000000000000000 in ?? ()
#40 0x0000000000000000 in ?? ()
#41 0x0000000000000000 in ?? ()
#42 0x0000000000000000 in ?? ()
#43 0x0000000000000005 in ?? ()
#44 0xffffffff81244180 in tdq_cpu ()
#45 0xfffffe000e698000 in ?? ()
#46 0x0000000000000000 in ?? ()
#47 0xffffff8000322b30 in ?? ()
#48 0xffffff8000322ad8 in ?? ()
#49 0xfffffe000e6728e0 in ?? ()
#50 0xffffffff8091352e in sched_switch (td=0x0, newtd=0xfffffe000e66c140, 
    flags=Variable "flags" is not available.
) at /usr/src/sys/kern/sched_ule.c:1921
Previous frame inner to this frame (corrupt stack?)
(kgdb) 

Okay, I guess this again has something to do with the MTU of 9000, but this time it completely panics the kernel. This is no good.


Part 2) Trying bonding with normal MTU 1500

ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24

This time, no error messages and no crash. Yiha!

But no. Even though everything seems to be correct, the bonding is not working. We can't ping any host on the network.
Also, lagg0 reports: no carrier

see:

bxe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
        ether 00:10:18:98:35:f8
        inet6 fe80::210:18ff:fe98:35f8%bxe0 prefixlen 64 scopeid 0x3 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-SR <full-duplex>)
        status: active
bxe2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
        ether 00:10:18:98:35:f8
        inet6 fe80::210:18ff:fe95:eaa0%bxe2 prefixlen 64 scopeid 0x5 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (10Gbase-SR <full-duplex>)
        status: active
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
        ether 00:10:18:98:35:f8
        inet6 fe80::7a2b:cbff:fe1a:eab1%lagg0 prefixlen 64 scopeid 0x14 
        inet 10.50.50.11 netmask 0xffffff00 broadcast 10.50.50.255
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        laggproto failover lagghash l2,l3,l4
        laggport: bxe2 flags=0<>
        laggport: bxe0 flags=1<MASTER>
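
In the output above neither laggport line carries the ACTIVE flag, which fits the "no carrier" status. A small filter like the one below could list such ports from ifconfig output. It is a sketch only: the function name inactive_laggports is made up, and it assumes that a usable port shows ACTIVE inside the angle brackets of its laggport line.

```sh
#!/bin/sh
# Sketch: read "ifconfig laggN" output on standard input and print every
# laggport whose flag list does not contain ACTIVE. inactive_laggports is
# a hypothetical helper name; the line layout is taken from the output above.

inactive_laggports() {
    # Fields on a laggport line: "laggport:", port name, "flags=N<...>".
    awk '$1 == "laggport:" && $3 !~ /ACTIVE/ { print $2 }'
}
```

For example: ifconfig lagg0 | inactive_laggports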

Please note that prior to installing FreeBSD, the machine was running Debian 7 GNU/Linux (64-bit), where we had the cards bonded with an MTU of 9000 without any crash or stability issue.
So it looks to me like there is something really wrong with the Broadcom driver on FreeBSD 9.1, at least with the NICs used in Dell servers.

Given that Broadcom itself doesn't supply drivers for FreeBSD, is there any possible fix?

>How-To-Repeat:
ifconfig bxe0 mtu 9000

or

ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24

or even

ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24
>Fix:
none known yet

>Release-Note:
>Audit-Trail:
>Unformatted:

