kern/183381: Use of 9k buffers in if_em.c hangs with resource starvation

David Gilbert dave at daveg.ca
Mon Oct 28 05:20:01 UTC 2013


The following reply was made to PR kern/183381; it has been noted by GNATS.

From: David Gilbert <dave at daveg.ca>
To: bug-followup at FreeBSD.org
Cc:  
Subject: Re: kern/183381: Use of 9k buffers in if_em.c hangs with resource
 starvation
Date: Mon, 28 Oct 2013 01:12:12 -0400

 
 As promised, here is the email conversation:
 
 Subject: *Or it could be ZFS memory starvation and 9k packets (was Re:
 istgt causes massive jumbo nmbclusters loss)*
 ------------------------
 
 From: *Zaphod Beeblebrox* <zbeeble at gmail.com>
 Date: Sat, Oct 26, 2013 at 1:16 AM
 To: FreeBSD Net <freebsd-net at freebsd.org>, freebsd-fs <freebsd-fs at freebsd.org>
 
 
 At first I thought this was entirely the interaction of istgt and 9k
 packets, but after some observation (and a few more hangs) I'm
 reasonably positive it's a form of resource starvation related to ZFS
 and 9k packets.
 
 To reliably trigger the hang, I need to do something that triggers a
 demand for 9k packets (like istgt traffic, but also bit torrent traffic
 --- as you see the MTU is 9014) and it must have been some time since
 the system booted.  ZFS is fairly busy (with both NFS and SMB guests),
 so it generally takes quite a bit of the 8G of memory for itself.
 
 Now... the netstat -m output below shows 1399 9k bufs with 376 available.
 When the network gets busy, I've seen 4,000 or even 5,000 bufs in total...
 never near the 77k max.  After some time of lesser activity, the number
 of 9k buffers returns to this level.
 
 When the problem occurs, the number of denied buffers will shoot up at
 the rate of several hundred or even several thousand per second, but the
 system will not be "out" of memory.  Top will often show 800 meg in the
 free column when this happens.  While it's happening, when I'm logged
 into the console, none of these stats seem out of place, save that the
 number of denied 9k buffer allocations is climbing and the "cache" of 9k
 buffers will be less than 10 (but I've never seen it at 0).
 
 
 On Tue, Oct 22, 2013 at 3:42 PM, Zaphod Beeblebrox
 <zbeeble at gmail.com> wrote:
 
     I have a server
 
     FreeBSD virtual.accountingreality.com 9.2-STABLE FreeBSD 9.2-STABLE
     #13 r256549M: Tue Oct 15 16:29:48 EDT 2013
     root at virtual.accountingreality.com:/usr/obj/usr/src/sys/VRA  amd64
 
     That has an em0 with jumbo packets enabled:
 
     em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu
     9014
 
     It has (among other things): ZFS, NFS, iSCSI (via istgt) and Samba.
 
     Every day or two, it loses its ability to talk to the network.
     ifconfig down/up on em0 gives the message about not being able to
     allocate the receive buffers...
 
     With everything running, but with specifically iSCSI not used,
     everything seems good.  When I start hitting istgt, I see the denied
     stat for 9k mbufs rise very rapidly (this amount only took a few
     seconds):
 
     [1:47:347]root at virtual:/usr/local/etc/iet> netstat -m
     1313/877/2190 mbufs in use (current/cache/total)
     20/584/604/523514 mbuf clusters in use (current/cache/total/max)
     20/364 mbuf+clusters out of packet secondary zone in use (current/cache)
     239/359/598/261756 4k (page size) jumbo clusters in use
     (current/cache/total/max)
     1023/376/1399/77557 9k jumbo clusters in use (current/cache/total/max)
     0/0/0/43626 16k jumbo clusters in use (current/cache/total/max)
     10531K/6207K/16738K bytes allocated to network (current/cache/total)
     0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
     0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
     0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
     0/50199/0 requests for jumbo clusters denied (4k/9k/16k)
     0/0/0 sfbufs in use (current/peak/max)
     0 requests for sfbufs denied
     0 requests for sfbufs delayed
     0 requests for I/O initiated by sendfile
     0 calls to protocol drain routines
 
     ... the denied number rises... and somewhere in the millions or more
     the machine stops --- but even with the large number of denied 9k
     clusters, the "9k jumbo clusters in use" line will always indicate
     some available.
 
     ... so is this a tuning or a bug issue?  I've tried ietd ---
     basically it doesn't want to work with a zfs zvol, it seems (refuses
     to use it).
 
 
 
 ----------
 From: *Garrett Wollman* <wollman at hergotha.csail.mit.edu>
 Date: Sat, Oct 26, 2013 at 1:52 AM
 To: zbeeble at gmail.com
 Cc: net at freebsd.org
 
 
 In article
 <CACpH0MfEy50Y5QOZCdn2co_JmY_QPfVRxYwK-73W0WYsHB-Fqw at mail.gmail.com>
 you write:
 
 >Now... the netstat -m output below shows 1399 9k bufs with 376 available.
 >When the network gets busy, I've seen 4,000 or even 5,000 bufs in total...
 >never near the 77k max.  After some time of lesser activity, the number
 >of 9k buffers returns to this level.
 
 The network interface (driver) almost certainly should not be using 9k
 mbufs.  These buffers must be physically contiguous, and after not too
 much activity it will be nearly impossible to find the three physically
 contiguous pages each 9k buffer requires.
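 
 To make that concrete: with an MTU of 9014 the driver asks for a
 MJUM9BYTES cluster every time it refills a receive-ring slot, and each
 failed request is what netstat -m reports as a denied 9k jumbo cluster.
 A paraphrased sketch of that refill step (not the exact 9.2 if_em.c
 source; the function name is made up):
 
         #include <sys/param.h>
         #include <sys/mbuf.h>
 
         static struct mbuf *
         em_refill_one_sketch(int rx_mbuf_sz)
         {
                 struct mbuf *m;
 
                 /*
                  * With mtu 9014, rx_mbuf_sz is MJUM9BYTES, so UMA must
                  * find a physically contiguous 9k region.  A NULL
                  * return is counted as a denied 9k jumbo cluster
                  * request and leaves the ring slot empty.
                  */
                 m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rx_mbuf_sz);
                 if (m == NULL)
                         return (NULL);
                 m->m_len = m->m_pkthdr.len = rx_mbuf_sz;
                 /* ... the real driver then DMA-maps and posts it ... */
                 return (m);
         }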
 
 >> That has an em0 with jumbo packets enabled:
 >>
 >> em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9014
 
 I don't know for certain about em(4), but it very likely should not be
 using 9k mbufs.  Intel network hardware has done scatter-gather since
 nearly the year dot.  (Seriously, I wrote a network driver for the
 i82586 back at the very beginning of FreeBSD's existence, and *that*
 part had scatter-gather.  No jumbo frames, though!)
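 
 With scatter-gather, a 9014-byte frame can instead be received into a
 chain of page-sized clusters linked through m_next.  A minimal sketch
 of the idea, assuming the rest of the receive path can walk an mbuf
 chain; nothing below is from the driver (imin() is from sys/libkern.h):
 
         static struct mbuf *
         alloc_jumbo_chain(int frame_len)
         {
                 struct mbuf *head = NULL, *tail = NULL, *m;
                 int remaining = frame_len;
 
                 while (remaining > 0) {
                         /* Page-sized clusters never require
                            physically contiguous pages. */
                         m = m_getjcl(M_NOWAIT, MT_DATA,
                             head == NULL ? M_PKTHDR : 0, MJUMPAGESIZE);
                         if (m == NULL) {
                                 m_freem(head);  /* NULL-safe */
                                 return (NULL);
                         }
                         m->m_len = imin(remaining, MJUMPAGESIZE);
                         remaining -= m->m_len;
                         if (head == NULL)
                                 head = m;
                         else
                                 tail->m_next = m;
                         tail = m;
                 }
                 head->m_pkthdr.len = frame_len;
                 return (head);
         }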
 
 The entire existence of 9k and 16k mbufs is probably a mistake.  There
 should not be any network interfaces that are modern enough to do
 jumbo frames but ancient enough to require physically contiguous pages
 for each frame.  I don't know if the em(4) driver is written such that
 you can just disable the use of those mbufs, but you could try making
 this change.  Look for this code in if_em.c:
 
         /*
         ** Figure out the desired mbuf
         ** pool for doing jumbos
         */
         if (adapter->max_frame_size <= 2048)
                 adapter->rx_mbuf_sz = MCLBYTES;
         else if (adapter->max_frame_size <= 4096)
                 adapter->rx_mbuf_sz = MJUMPAGESIZE;
         else
                 adapter->rx_mbuf_sz = MJUM9BYTES;
 
 Comment out the last two lines and change the else if (...) to else.
 It's not obvious that the rest of the code can cope with this, but it
 does work that way on other Intel hardware so it seems like it may be
 worth a shot.
 
 -GAWollman
 
 ----------
 From: *Zaphod Beeblebrox* <zbeeble at gmail.com>
 Date: Sat, Oct 26, 2013 at 2:55 PM
 To: Garrett Wollman <wollman at hergotha.csail.mit.edu>
 Cc: net at freebsd.org
 
 
 To be clear, I made just this patch:
 
 Index: if_em.c
 ===================================================================
 --- if_em.c     (revision 256870)
 +++ if_em.c     (working copy)
 @@ -1343,10 +1343,10 @@
         */
         if (adapter->hw.mac.max_frame_size <= 2048)
                 adapter->rx_mbuf_sz = MCLBYTES;
 -       else if (adapter->hw.mac.max_frame_size <= 4096)
 +       else /*if (adapter->hw.mac.max_frame_size <= 4096) */
                 adapter->rx_mbuf_sz = MJUMPAGESIZE;
 -       else
 -               adapter->rx_mbuf_sz = MJUM9BYTES;
 +       /* else
 +               adapter->rx_mbuf_sz = MJUM9BYTES; */
 
         /* Prepare receive descriptors and buffers */
         if (em_setup_receive_structures(adapter)) {
 
 (which is against 9.2-STABLE if you're looking).
 
 The result is that no 9k clusters appear to be allocated.  I'm still
 running the system as before, but so far the problem has not recurred. 
 Of note, given your comment, is that this patch doesn't appear to break
 anything, either.  Should I send-pr it?
 
 ----------
 From: *Garrett Wollman* <wollman at bimajority.org>
 Date: Sat, Oct 26, 2013 at 7:18 PM
 To: Zaphod Beeblebrox <zbeeble at gmail.com>
 Cc: net at freebsd.org
 
 
 <<On Sat, 26 Oct 2013 14:55:19 -0400, Zaphod Beeblebrox
 <zbeeble at gmail.com> said:
 
 > The result is that no 9k clusters appear to be allocated.  I'm still
 > running the system as before, but so far the problem has not recurred.  Of
 > note, given your comment, is that this patch doesn't appear to break
 > anything, either.  Should I send-pr it?
 
 You bet.  Otherwise it will get lost.  Hopefully it can be assigned to
 whoever is maintaining this driver as a reminder.
 
 -GAWollman
 
 
 