Very bad Realtek problems

Mon Oct 27 19:51:33 UTC 2014

Hi, all.

I've been having sporadic and serious problems with the Realtek gigabit
interface built into my motherboard. Periodically, it just freezes up. I've
tried several things to no avail: turning on DEVICE_POLLING, frobbing
bootloader options and sysctl settings, etc.

I had a solid week of function with the following:

hw.re.msi_disable="1"
hw.re.msix_disable="1"
dev.re.0.int_rx_mod=0     <-- this one says it can be a loader tunable, but
                              it didn't work that way - I had to set it from
                              sysctl.conf
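
For anyone wanting to reproduce this, here is a sketch of how those three
settings would be split between files. The placement of the two hw.re.*
entries in /boot/loader.conf is an assumption based on their quoted syntax;
the int_rx_mod entry goes in /etc/sysctl.conf, as noted above.

```
# /boot/loader.conf (assumed location for the two hw.re.* tunables)
hw.re.msi_disable="1"
hw.re.msix_disable="1"

# /etc/sysctl.conf (applied once the re0 device exists)
dev.re.0.int_rx_mod=0
```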

And then after a reboot, I locked up again on pushing the interface a little
with an rsync. However, I've seen interactive sessions lock the thing up too.
It's not just when I'm doing big transfers.

It's not clear what's happening. I have been capturing stats periodically
with 'sysctl dev.re.0.stats=1', but that doesn't always show a problem. For
instance, during one of the lock-ups last night, after a reboot, I got this:

re0 statistics:
Tx frames : 171306
Rx frames : 20271
Tx errors : 0
Rx errors : 0
Rx missed frames : 0
Rx frame alignment errs : 0
Tx single collisions : 0
Tx multiple collisions : 0
Rx unicast frames : 20271
Rx broadcast frames : 0
Rx multicast frames : 0
Tx aborts : 0
Tx underruns : 0

After running overnight, with sporadic automated transfers:

re0 statistics:
Tx frames : 4658945
Rx frames : 1258514
Tx errors : 0
Rx errors : 33
Rx missed frames : 0
Rx frame alignment errs : 3591
Tx single collisions : 0
Tx multiple collisions : 0
Rx unicast frames : 1255880
Rx broadcast frames : 2411
Rx multicast frames : 223
Tx aborts : 0
Tx underruns : 0

I was seeing the "Rx multicast frames" creep up each time I saw a freeze last
night, which was confusing in that I'm not sure why there'd be any multicast
traffic.
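
To make it easier to see which counters moved between two captures, a small
helper along these lines might be useful. It is only a sketch: the function
name re_stats_delta is made up, and it assumes each snapshot has been saved to
its own file in the "Name : value" layout shown above (for example, copied out
of the kernel message buffer by hand).

```shell
# Hypothetical helper: print the change in each counter between two saved
# "re0 statistics" snapshots. Counters that did not change are omitted.
# $1 = earlier snapshot file, $2 = later snapshot file
re_stats_delta() {
    awk -F' : ' '
        # First file: remember every "Name : value" pair.
        NF == 2 && NR == FNR { before[$1] = $2; next }
        # Second file: report counters whose value differs.
        NF == 2 && ($1 in before) {
            d = $2 - before[$1]
            if (d != 0) printf "%s: %d\n", $1, d
        }
    ' "$1" "$2"
}
```

Calling it as "re_stats_delta before.txt after.txt" prints one line per
counter that changed, in the order of the second file. Note that the counters
start over after a reboot, so a delta is only meaningful when both snapshots
come from the same boot.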

Here's the card from dmesg, with MSI/X turned off:

re0: <RealTek 8168/8111 B/C/CP/D/DP/E/F/G PCIe Gigabit Ethernet> port 0xe800-0xe8ff mem 0xfbfff000-0xfbffffff,0xfbff8000-0xfbffbfff irq 18 at device 0.0 on pci2
re0: Chip rev. 0x2c000000
re0: MAC rev. 0x00200000
miibus0: <MII bus> on re0
rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
rgephy0:  none, 10baseT, 10baseT-FDX, 10baseT-FDX-flow, 100baseTX,
100baseTX-FDX, 100baseTX-FDX-flow, 1000baseT, 1000baseT-master,
1000baseT-FDX, 1000baseT-FDX-master, 1000baseT-FDX-flow,
1000baseT-FDX-flow-master, auto, auto-flow
re0: Ethernet address: bc:ae:c5:bd:44:e7

The motherboard this interface is built into:

Base Board Information
        Manufacturer: ASUSTeK Computer INC.
        Product Name: M4A88T-M
        Version: Rev X.0x
        Serial Number: MF70B1G04201588
        Asset Tag: To Be Filled By O.E.M.
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis: To Be Filled By O.E.M.
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

In general I've been saying "ifconfig re0 down ; ifconfig re0 up" to kick the
interface, but last night a friendly person from IRC mentioned that I could
work around this by running a steady ping and frobbing mediatype when I see
the pings fail. So, I've got this running:

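# Ping the firewall once a second; renegotiate the link on re0 on failure.
while true
do
    # One probe (-c 1) with a one-second timeout (-t 1).
    if ! ping -c 1 -t 1 firewall > /dev/null 2>&1; then
        date
        echo "toggling re0"
        echo
        # Force a fixed media type, then return to autoselect.
        ifconfig re0 media 1000baseT mediaopt full-duplex,flowcontrol,master
        ifconfig re0 media autoselect mediaopt flowcontrol
        sleep 3
    fi
    sleep 1
done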

This has been noting failures sporadically throughout the day, but it's
allowing traffic to continue moving, albeit with the occasional hiccough.

This hardware has been running Debian for a couple years, and it's never had
so much as a short hiccough, so I have confidence that the hardware is fine.
It suggests that there's something the Linux driver is doing to handle this
hardware that FreeBSD isn't doing. For a while I was dual-booting and I'd see
errors with FreeBSD running that weren't there under Debian.

I'd started diving into the source, both Linux and FreeBSD, but I lack
sufficient exposure to ethernet driver code to be able to get a high-level
picture of what they're doing, and as such I haven't yet noticed any special-
case or hardware glitch handling that we're missing, although I might find
something eventually.

I'm struggling with finding a way to see what's actually happening with this.
I've toggled MSI and MSI-X handling, I've turned down interrupt handling
delays, and I've tried both I/O and memory register transfers, although I'm
not actually clear on what's happening differently there. I've had polling
variously enabled and disabled.

One thing to note is that last night's horror while I was trying to move some
back-up data was after rebooting from Windows. (Installed on a partition for
gaming...) It made me wonder if we're not fully setting up some state on the
card. I'd had what felt like a solid, glitchless week before that.

FWIW, I'm running 10.1-RC3 on this box and I've seen issues from early on
while I was still running 10.0-RELEASE.

Thanks in advance for clues. This is a showstopper for further deployment for
me, as I've got these Realtek on-board cards in several boxes, and while the
media frobbing largely works, it's not something I can inflict on my users.

-- 
Mason Loring Bliss  ((   If I have not seen as far as others, it is because
 mason at blisses.org   ))   giants were standing on my shoulders. - Hal Abelson