misc/183390: 10gigabit networking problems

Mon Oct 28 11:10:00 UTC 2013

>Number:         183390
>Category:       misc
>Synopsis:       10gigabit networking problems
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Oct 28 11:10:00 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Antal Pataki
>Release:        9.2
>Organization:
Granaglia Ltd.
>Environment:
FreeBSD storagex.lan.granaglia.com 9.2-RELEASE FreeBSD 9.2-RELEASE #0 r255898: Thu Sep 26 22:50:31 UTC 2013     root@bake.isc.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
>Description:
Hardware: IBM x3500 M4 (2x E5-2620, 16GB RAM)
Intel X520-DA2 10Gbit NIC (PCI-Express x8)
IBM ServeRAID M1115 with 8x 600GB 15k rpm SAS disks.

System setup:
The system is installed on a geli-encrypted zpool.
The Intel 10Gbit NIC is directly connected to another IBM x3500 M4 (with the same Intel card), which is running VMware ESXi 5.5.

The system provides an NFS share to the ESXi system through the 10 gigabit connection.

The problem:

Without any load, if I ping the other machine through the 10 gigabit connection, the ping output looks like this:
root@storagex:~ # ping 10.3.3.2
PING 10.3.3.2 (10.3.3.2): 56 data bytes
(...cutoff...)
64 bytes from 10.3.3.2: icmp_seq=89 ttl=64 time=0.106ms
ping: sendto: File too large
64 bytes from 10.3.3.2: icmp_seq=91 ttl=64 time=0.092ms
..etc..etc.

Sometimes the "ping: sendto: File too large" message does not appear for many hours; sometimes it floods the console!

When this starts to happen, the logs on the other end, the ESXi machine, show that the StorageApdHandler process starts a timer for the NFS share because it did not receive the NFS heartbeat back.
After a few seconds, the ESXi machine starts to show in the log:
NFSLock: xxx: Stop accessing fd 0xxxxxxx x
After a few more seconds, the StorageApdHandler on the ESXi machine puts the NFS share into the All Paths Down state and drops the NFS connection.

After this, if I try to ping the FreeBSD machine from the ESXi machine, ESXi shows "host is down",
and on the FreeBSD machine ping keeps repeating the "ping: sendto: File too large" message.

The only thing that resolves this is running ifconfig ix1 down followed by ifconfig ix1 up.

After resetting the interface like this, the connection and the ping sometimes work for minutes, sometimes for hours, and then the situation described above starts again.
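
Until the cause is found, the manual reset described above could be automated. The following is only a sketch under assumptions: the function names, the IFACE and PEER variables, and the one-second pause between down and up are illustrative and not part of the original setup. It watches ping output for the error string and bounces the interface once when the error shows up.

```sh
#!/bin/sh
# Sketch only. IFACE, PEER and all function names are illustrative assumptions.
IFACE="${IFACE:-ix1}"
PEER="${PEER:-10.3.3.2}"

# Return 0 if a line of ping output contains the error seen in this report.
is_send_error() {
    case "$1" in
        *"sendto: File too large"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Bounce the interface, as done by hand above. The 1 second pause is a guess.
reset_iface() {
    ifconfig "$IFACE" down
    sleep 1
    ifconfig "$IFACE" up
}

# Run ping until the first send error, reset the interface once, then return.
watch_and_reset() {
    ping "$PEER" 2>&1 | while IFS= read -r line; do
        if is_send_error "$line"; then
            echo "$(date) send error on $IFACE, resetting"
            reset_iface
            break
        fi
    done
}
```

Run as root, something like `while :; do watch_and_reset; sleep 10; done` would keep the link up between failures. This only hides the symptom; it does not fix the underlying problem.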

I have screenshots of the "ping: sendto: File too large" message.

We tried the default ixgbe driver and the newest one from Intel's website.
The issue is the same with both drivers.

We observed that if the transfer rate over the 10Gbit connection exceeds 5Gbit/sec, the problem appears sooner, typically in 20-40 minutes, sometimes after 5 minutes.

If we leave the machines only pinging each other, the problem sometimes does not appear for days, but it eventually does.

>How-To-Repeat:
Install an Intel X520 10gbit NIC into a FreeBSD 9.2 system.

Connect it to another host via 10gbit ethernet. (We tried with ESXi 5.1 and 5.5.)

Start pinging the other end and leave it running for hours.

Generate some heavy traffic (utilise the connection above 5Gbit/sec), preferably via NFS to an ESXi 5.5 host on the other side.

Wait a few hours.
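
Since the failure can take hours or days to appear, it may help to log the ping output with timestamps and count the errors afterwards. A minimal sketch, assuming hypothetical helper names (log_ping, count_send_errors) and a log file path of your choosing:

```sh
#!/bin/sh
# Sketch only. The helper names and the log path are illustrative assumptions.

# Append every line of ping output (stdout and stderr) to a log file,
# prefixed with a UTC timestamp. Usage: log_ping <peer> <logfile>
log_ping() {
    ping "$1" 2>&1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$line"
    done >> "$2"
}

# Print how many logged lines contain the error from this report.
# grep -c exits non-zero when the count is 0, so mask the exit status.
count_send_errors() {
    grep -c 'sendto: File too large' "$1" || true
}
```

For example, `log_ping 10.3.3.2 /var/tmp/ping-ix1.log` in one terminal and `count_send_errors /var/tmp/ping-ix1.log` in another would show when the errors start relative to the traffic load.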
>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: