[Bug 218894] Network dropouts on em(4) due to jumbo cluster failures

Wed Apr 26 16:50:04 UTC 2017

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218894

            Bug ID: 218894
           Summary: Network dropouts on em(4) due to jumbo cluster
                    failures
           Product: Base System
           Version: 11.0-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: mandrews at bit0.com

This is a hard one to reproduce on demand, unfortunately.

After about two weeks of uptime, several of our systems of varying vintage
lose most or all network connectivity for several minutes, then recover on
their own without any intervention.

When these drops happen, "netstat -i" shows a jump in Ierrs, and "netstat -m"
shows a jump in "requests for jumbo clusters denied" for 9K.  So something is
causing jumbo allocations to fail, which in turn causes Ierrs, which in turn
causes temporary loss of connectivity.  When it starts happening, it usually
happens once or twice a day, and gets worse over time until a reboot clears it
up for a while.
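
One way to catch the moment this starts is to sample the denied counter
periodically and log the deltas.  A minimal Python sketch follows.  It assumes
the relevant "netstat -m" line has the shape
"a/b/c requests for jumbo clusters denied (4k/9k/16k)".  The helper names and
the sample text are hypothetical and are not output from the affected hosts.

```python
import re

# Assumed shape of the relevant `netstat -m` line on FreeBSD:
#   "0/1500/0 requests for jumbo clusters denied (4k/9k/16k)"
DENIED_RE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+) requests for jumbo clusters denied",
    re.MULTILINE,
)


def parse_jumbo_denied(netstat_m_output):
    """Return {'4k': n, '9k': n, '16k': n}, or None if the line is absent."""
    match = DENIED_RE.search(netstat_m_output)
    if match is None:
        return None
    return dict(zip(("4k", "9k", "16k"), map(int, match.groups())))


def denied_delta(previous, current):
    """Per-size increase between two parsed samples."""
    return {size: current[size] - previous[size] for size in current}


# Illustrative sample only; the numbers are made up.
SAMPLE = """\
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/1500/0 requests for jumbo clusters denied (4k/9k/16k)
"""

if __name__ == "__main__":
    print(parse_jumbo_denied(SAMPLE))
```

On a live system the input could come from
subprocess.run(["netstat", "-m"], capture_output=True, text=True).stdout,
run from cron every minute or so, so that a jump in the 9k counter can be
lined up against the Ierrs jump in "netstat -i".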

Some Googling suggests that this might be a memory fragmentation issue: after
some uptime, it may be hard for the kernel to find a contiguous block of
memory larger than 4K (1 page), and whatever defragmentation is supposed to
happen may not be happening.  I can no longer find the specific page that led
me to that wild theory, though; that was a few weeks ago.  Oopsie.

We only run jumbo frames on one VLAN, and we use an MTU of 5000 instead of 9000
because we have some Supermicro PDSMi+-based systems we can't yet get rid of
(grumble).  They use 82573L NICs that have hardware bugs and once choked on
anything bigger, plus 82573E NICs that don't do jumbos at all.  Whether the
5000-byte MTU is contributing to the problem, by making the kernel allocate
4K+1K rather than 4K+4K+1K, I'm not sure.
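
To put rough numbers on the 5000-versus-9000 question: the mbuf cluster sizes
are 2K, one page (4K on amd64), 9K and 16K.  The Python sketch below assumes
the driver picks the smallest cluster that holds the whole frame.  That
selection rule is an assumption and has not been checked against the em
source for this report.  The names pick_cluster and contiguous_pages are
made up for illustration.

```python
PAGE_SIZE = 4096  # amd64 base page size

# mbuf cluster sizes in bytes:
# MCLBYTES, MJUMPAGESIZE (one page), MJUM9BYTES, MJUM16BYTES
CLUSTER_SIZES = (2048, PAGE_SIZE, 9 * 1024, 16 * 1024)

ETHER_HDR_LEN = 14  # destination MAC + source MAC + ethertype
ETHER_CRC_LEN = 4   # frame check sequence
VLAN_TAG_LEN = 4    # 802.1Q tag


def max_frame_size(mtu, vlan=True):
    """Largest on-wire frame for a given MTU, optionally VLAN-tagged."""
    return mtu + ETHER_HDR_LEN + ETHER_CRC_LEN + (VLAN_TAG_LEN if vlan else 0)


def pick_cluster(mtu, vlan=True):
    """ASSUMPTION: the smallest cluster size that fits the whole frame."""
    frame = max_frame_size(mtu, vlan)
    for size in CLUSTER_SIZES:
        if frame <= size:
            return size
    raise ValueError("frame of %d bytes exceeds the largest cluster" % frame)


def contiguous_pages(cluster_size):
    """Number of 4K pages a single cluster of this size spans."""
    return -(-cluster_size // PAGE_SIZE)  # ceiling division


if __name__ == "__main__":
    for mtu in (1500, 5000, 9000):
        size = pick_cluster(mtu)
        print(mtu, size, contiguous_pages(size))
```

Under that assumption, MTU 5000 and MTU 9000 both land in the 9K pool, and
each 9K cluster spans three pages.  That would be consistent with the 9K
counter being the one that moves in "netstat -m", and it would mean the
smaller MTU by itself does not avoid multi-page allocations.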

I'm also running with "-tso" because leaving TSO on causes problems with NFS
stalls for us -- similar problems, but probably unrelated to this issue.

The affected systems have 82574L NICs and 82579LM NICs -- Supermicro X9SCM-F,
X8STi-F, X8SIE-F, X8DT6-F.

The one igb-based (I350) system we have, a Supermicro X9DRD-7LN4F, doesn't seem
to be affected by this issue at all.

This is 11.0-RELEASE, which has em driver 7.6.1 in it.  I have tried the
net/intel-em-kmod-7.6.2 port and it doesn't help.

Short of buying a crapload of igb or ixgbe cards and/or turning off jumbo
frames, any ideas on how to troubleshoot and fix this before 11.1-RELEASE?
Is there anything useful I can pull out of netstat, sysctl, etc.?
