9.2 ixgbe tx queue hang (was: Network loss)

Markus Gebert markus.gebert at hostpoint.ch
Thu Mar 6 10:24:51 UTC 2014


(creating a new thread, because I’m no longer sure this is related to Johan’s thread that I originally used to discuss this)

On 27.02.2014, at 18:02, Jack Vogel <jfvogel at gmail.com> wrote:

> I would make SURE that you have enough mbuf resources of whatever size pool
> that you are using (2, 4, 9K), and I would try the code in HEAD if you had not.
> 
> Jack

Jack, we've upgraded some other systems on which I have more time to debug (no customer impact). Although those systems use the nfsclient too, I no longer think that NFS is the source of the problem (hence the new thread). I think it's the ixgbe driver and/or the card. When our problem occurs, it looks like a single tx queue gets stuck somehow (its buf_ring remains full).

I traced ping with dtrace to determine the source of the ENOBUFS it returns every few packets when things get weird:

# dtrace -n 'fbt:::return / arg1 == ENOBUFS && execname == "ping" / { stack(); }'
dtrace: description 'fbt:::return ' matched 25476 probes
CPU     ID                    FUNCTION:NAME
 26   7730            ixgbe_mq_start:return 
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`rip_output+0x229
              kernel`sosend_generic+0x3f6
              kernel`kern_sendit+0x1a3
              kernel`sendit+0xdc
              kernel`sys_sendto+0x4d
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff80d35667



The only way ixgbe_mq_start() could return ENOBUFS is when drbr_enqueue() encounters a full tx buf_ring. Since a new ping packet probably has no flow id, it should be assigned to a queue based on curcpu, which made me pin ping to individual CPUs to check whether it's always the same tx buf_ring that reports being full. This turned out to be true:

# cpuset -l 0 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.347 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.135 ms

# cpuset -l 1 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.184 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.232 ms

# cpuset -l 2 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available
ping: sendto: No buffer space available

# cpuset -l 3 ping 10.0.4.5
PING 10.0.4.5 (10.0.4.5): 56 data bytes
64 bytes from 10.0.4.5: icmp_seq=0 ttl=255 time=0.130 ms
64 bytes from 10.0.4.5: icmp_seq=1 ttl=255 time=0.126 ms
[…snip...]

The system has 32 cores. If ping runs on CPU 2, 10, 18 or 26, all of which map to the third tx buf_ring (with eight tx queues the queue index is curcpu % 8, so those four CPUs all land on index 2), it reliably returns ENOBUFS. If ping runs on any other CPU, and therefore uses any other tx queue, there is no packet loss at all.
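
To illustrate the mapping, here's a minimal sketch of the dispatch path in ixgbe_mq_start() as I understand it from the 9.2 driver. It's paraphrased from memory rather than copied from the source, and locking plus the deferred taskqueue path are left out:

/*
 * Simplified paraphrase of the 9.2 ixgbe tx dispatch path
 * (kernel context, sys/dev/ixgbe), not the literal driver source.
 */
static int
ixgbe_mq_start(struct ifnet *ifp, struct mbuf *m)
{
        struct adapter *adapter = ifp->if_softc;
        struct tx_ring *txr;
        int i, err;

        /* Pick a tx queue: use the flowid if the mbuf carries one,
         * otherwise fall back to the CPU the caller is running on. */
        if ((m->m_flags & M_FLOWID) != 0)
                i = m->m_pkthdr.flowid % adapter->num_queues;
        else
                i = curcpu % adapter->num_queues;

        txr = &adapter->tx_rings[i];

        /* If the selected buf_ring is full, ENOBUFS propagates all the
         * way up through ip_output()/sendto(), which is what ping sees. */
        err = drbr_enqueue(ifp, txr->br, m);
        if (err)
                return (err);

        /* ... drain the ring now or defer to the queue's taskqueue ... */
        return (0);
}

So a packet without a flow id always lands on queue (curcpu % num_queues), which is why pinning ping to a CPU with cpuset effectively selects a single tx buf_ring.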

So, when ENOBUFS is returned, it is not due to an mbuf shortage; it comes from the full buf_ring itself (a simplified view of the enqueue check follows the netstat output below). Not surprisingly, netstat -m looks pretty normal:

# netstat -m
38622/11823/50445 mbufs in use (current/cache/total)
32856/11642/44498/132096 mbuf clusters in use (current/cache/total/max)
32824/6344 mbuf+clusters out of packet secondary zone in use (current/cache)
16/3906/3922/66048 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/33024 9k jumbo clusters in use (current/cache/total/max)
0/0/0/16512 16k jumbo clusters in use (current/cache/total/max)
75431K/41863K/117295K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines
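
For completeness, the ENOBUFS comes straight out of the lock-free buf_ring when the producer index would run into the consumer. Roughly, reconstructed from memory of sys/sys/buf_ring.h rather than quoted verbatim (drop accounting, critical sections and debug checks omitted):

static __inline int
buf_ring_enqueue(struct buf_ring *br, void *buf)
{
        uint32_t prod_head, prod_next, cons_tail;

        do {
                prod_head = br->br_prod_head;
                prod_next = (prod_head + 1) & br->br_prod_mask;
                cons_tail = br->br_cons_tail;

                /* Ring full: the next producer slot would overrun the
                 * consumer, so the caller gets ENOBUFS regardless of how
                 * many mbufs are still free system-wide. */
                if (prod_next == cons_tail)
                        return (ENOBUFS);
        } while (!atomic_cmpset_int(&br->br_prod_head, prod_head, prod_next));

        br->br_ring[prod_head] = buf;
        /* ... wait for earlier enqueues to finish, then advance br_prod_tail ... */
        return (0);
}

In other words, once the consumer side of one queue stops making progress, every enqueue on that ring fails with ENOBUFS, no matter how healthy the mbuf zones look in netstat -m.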

In the meantime I've checked the commit log of the ixgbe driver in HEAD, and besides there being only small differences between HEAD and 9.2, I don't see a commit that fixes anything related to what we're seeing…

So, what’s the conclusion here? Firmware bug that’s only triggered under 9.2? Driver bug introduced between 9.1 and 9.2 when new multiqueue stuff was added? Jack, how should we proceed?


Markus



On Thu, Feb 27, 2014 at 8:05 AM, Markus Gebert
<markus.gebert at hostpoint.ch> wrote:

> 
> On 27.02.2014, at 02:00, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
>> John Baldwin wrote:
>>> On Tuesday, February 25, 2014 2:19:01 am Johan Kooijman wrote:
>>>> Hi all,
>>>> 
>>>> I have a weird situation here where I can't get my head around.
>>>> 
>>>> One FreeBSD 9.2-STABLE ZFS/NFS box, multiple Linux clients. Once in a
>>>> while the Linux clients lose their NFS connection:
>>>> 
>>>> Feb 25 06:24:09 hv3 kernel: nfs: server 10.0.24.1 not responding,
>>>> timed out
>>>> 
>>>> Not all boxes, just one out of the cluster. The weird part is that when I
>>>> try to ping a Linux client from the FreeBSD box, I have between 10 and 30%
>>>> packet loss - all day long, no specific timeframe. If I ping the Linux
>>>> clients - no loss. If I ping back from the Linux clients to the FBSD box -
>>>> no loss.
>>>> 
>>>> The errors I get when pinging a Linux client is this one:
>>>> ping: sendto: File too large
> 
> We were facing similar problems when upgrading to 9.2 and have stayed with
> 9.1 on affected systems for now. We've seen this on HP G8 blades with
> 82599EB controllers:
> 
> ix0 at pci0:4:0:0: class=0x020000 card=0x18d0103c chip=0x10f88086 rev=0x01
> hdr=0x00
>    vendor     = 'Intel Corporation'
>    device     = '82599EB 10 Gigabit Dual Port Backplane Connection'
>    class      = network
>    subclass   = ethernet
> 
> We didn't find a way to trigger the problem reliably. But when it occurs,
> it usually affects only one interface. Symptoms include:
> 
> - socket functions return the 'File too large' error mentioned by Johan
> - socket functions return 'No buffer space available'
> - heavy to full packet loss on the affected interface
> - "stuck" TCP connection, i.e. ESTABLISHED TCP connections that should
> have timed out stick around forever (socket on the other side could have
> been closed ours ago)
> - userland programs using the corresponding sockets usually got stuck too
> (can't find kernel traces right now, but always in network related syscalls)
> 
> Network is only lightly loaded on the affected systems (usually 5-20 Mbit/s,
> capped at 200 Mbit/s, per server), and netstat never showed any indication of
> resource shortage (like mbufs).
> 
> What made the problem go away temporarily was to ifconfig down/up the
> affected interface.
> 
> We tested a 9.2 kernel with the 9.1 ixgbe driver, which was not really
> stable. We also tested a few revisions between 9.1 and 9.2 to find out
> when the problem started. Unfortunately, the ixgbe driver turned out to be
> mostly unstable on our systems between these releases, worse than on 9.2.
> The instability was introduced shortly after 9.1 and fixed only very
> shortly before the 9.2 release. So no luck there. We ended up using 9.1 with
> backports of the 9.2 features we really need.
> 
> What we can't tell is whether it's the 9.2 kernel or the 9.2 ixgbe driver
> or a combination of both that causes these problems. Unfortunately we ran
> out of time (and ideas).
> 
> 
>>> EFBIG is sometimes used for drivers when a packet takes too many
>>> scatter/gather entries.  Since you mentioned NFS, one thing you can
>>> try is to
>>> disable TSO on the interface you are using for NFS to see if that
>>> "fixes" it.
>>> 
>> And please email if you try it and let us know if it helps.
>> 
>> I think I've figured out how 64K NFS read replies can do this,
>> but I'll admit "ping" is a mystery? (Doesn't it just send a single
>> packet that would be in a single mbuf?)
>> 
>> I think the EFBIG is returned by bus_dmamap_load_mbuf_sg(), but I
>> don't know if it can happen for an mbuf chain with < 32 entries?
> 
> We don't use the nfs server on our systems, but they're (new)nfsclients.
> So I don't think our problem is nfs related, unless the default rsize/wsize
> for client mounts is not 8K, which I thought it was. Can you confirm this,
> Rick?
> 
> IIRC, disabling TSO did not make any difference in our case.
> 
> 
> Markus
> 
> 




