mbuf cluster leaks in -CURRENT

Sat Dec 3 14:59:16 PST 2005

Robert Watson wrote:
> 
> Yesterday I sat down to run some benchmarks on phk's changes to the process
> time measurement system for scheduling, and discovered SMP boxes were wedging
> in [zonelimit] when running netperf tests.  I quickly tracked this down to an
> mbuf cluster leak:
> 
>    /zoo/rwatson/netperf/bin/netserver
>    while (1)
>            echo ""
>            netstat -m | grep mbuf
>            /zoo/rwatson/netperf/bin/netperf -l 30 >& /dev/null
>    end
> 
> Result of:
> 
> 769/641/1410 mbufs in use (current/cache/total)
> 768/204/972/25600 mbuf clusters in use (current/cache/total/max)
> 
> 769/4991/5760 mbufs in use (current/cache/total)
> 4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)
> 
> 769/8456/9225 mbufs in use (current/cache/total)
> 7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)
> 
> 769/11786/12555 mbufs in use (current/cache/total)
> 11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)
> 
> 769/15236/16005 mbufs in use (current/cache/total)
> 14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)
> 
> 769/18566/19335 mbufs in use (current/cache/total)
> 17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)
> 
> The reason for the wedge is that NFS based systems don't like running out of
> mbuf clusters.  It turns out that the reason I likely didn't notice this
> previously was that I was running the test boxes in question without ACPI, and
> for whatever reason, the race becomes many times more serious with ACPI turned
> on.  It was leaking without ACPI, but since it was slower, I wasn't noticing
> since I had the machines up for much shorter tests.  Here's a sampling of
> kernel dates and whether or not the leak was present in a kernel from the
> date, as well as the dates of a few changes I was worried were likely causes:
> 
> CVS Date                Description                             Leak?
> 2005/12/3               sample                                  yes
> 2005/11/28-2005/11/29   rwatson sosend changes                  -
> 2005/11/25              sample                                  yes
> 2005/11/15              sample                                  yes
> 2005/11/02-2005/11/05   andre cluster changes                   -
> 2005/10/25              sample                                  no
> 2005/10/15              sample                                  no
> 2005/10/1               sample                                  no
> 2005/09/27              rwatson removes mbuf counters           -
> 2005/09/16              sample                                  no
> 
> I've not really had a chance to investigate the details of the leak -- the
> number of used (allocated) mbufs remains low, but the cache number grows
> steadily.  However, the dates suggest that it was the mbuf cluster cleanup
> work you did that introduced the problem (although they don't guarantee it).

This looks like the same problem I described in rev. 1.14 of kern_mbuf.c:
mbuf+clusters from the packet zone (pre-combined m+c) never get freed back
to their original pools.  The numbers from netstat -m support that
assumption.  netstat -m doesn't (and can't) show the number of cached m+c
in the packet zone.  Mbufs in the packet zone are accounted as cached in the
mbuf zone because the packet zone is a secondary zone of it.  The clusters
shown as in use are not leaked; they are attached to all those mbufs in the
packet zone.  The cluster zone doesn't know about the packet zone and
accounts them as used.
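
A quick way to check this reading against the samples quoted above is to
compare successive readings: if the explanation holds, mbufs "current"
should stay flat while mbuf "cache" and cluster "current" grow at roughly
the same rate.  The following is only a minimal Python sketch; the parse()
helper and the variable names are made up for illustration and are not part
of any FreeBSD tool.

```python
import re

# Samples copied verbatim from the quoted netstat -m output.
SAMPLES = """\
769/641/1410 mbufs in use (current/cache/total)
768/204/972/25600 mbuf clusters in use (current/cache/total/max)
769/4991/5760 mbufs in use (current/cache/total)
4341/905/5246/25600 mbuf clusters in use (current/cache/total/max)
769/8456/9225 mbufs in use (current/cache/total)
7901/801/8702/25600 mbuf clusters in use (current/cache/total/max)
769/11786/12555 mbufs in use (current/cache/total)
11242/788/12030/25600 mbuf clusters in use (current/cache/total/max)
769/15236/16005 mbufs in use (current/cache/total)
14570/916/15486/25600 mbuf clusters in use (current/cache/total/max)
769/18566/19335 mbufs in use (current/cache/total)
17948/866/18814/25600 mbuf clusters in use (current/cache/total/max)
"""


def parse(text):
    """Return (mbufs_current, mbufs_cache, clusters_current) per sample."""
    mbufs = re.findall(r"^(\d+)/(\d+)/\d+ mbufs in use", text, re.M)
    clusters = re.findall(r"^(\d+)/\d+/\d+/\d+ mbuf clusters in use", text, re.M)
    return [(int(cur), int(cache), int(cl))
            for (cur, cache), cl in zip(mbufs, clusters)]


rows = parse(SAMPLES)

# Growth between consecutive samples.
cache_deltas = [b[1] - a[1] for a, b in zip(rows, rows[1:])]
cluster_deltas = [b[2] - a[2] for a, b in zip(rows, rows[1:])]

print("mbufs current:      ", sorted({r[0] for r in rows}))
print("mbuf cache growth:  ", cache_deltas)
print("cluster used growth:", cluster_deltas)
```

On these samples the mbufs "current" value is 769 throughout, while the two
growth columns stay within a few percent of each other after the first
interval.  That is consistent with clusters being attached to cached mbufs,
though it does not prove it.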

This pseudo-leak does not come from my changes (it is a UMA bug), but it is
amplified by kernel subsystems that make heavy use of m+c from the packet
zone.  My changes did trigger this problem too, by changing the way packets
get freed back to the UMA mbuf, cluster and packet zones, but that was
reverted in 1.14.  The other refcount changes do not cause any such
effect.  It may very well be that some part of the network stack switched
from allocating mbuf and cluster separately to pre-combined packets.  That
would explain the 'sudden' appearance of the problem.
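
To illustrate why a switch to pre-combined packets would change what
netstat -m reports, here is a toy model in Python.  It is a deliberately
simplified sketch, not the real UMA code: the ToyZones class, its counters
and its methods are all hypothetical, and it only encodes the accounting
described above (mbufs parked in the packet zone count as mbuf cache, their
clusters count as cluster "in use").

```python
from dataclasses import dataclass


@dataclass
class ToyZones:
    """Hypothetical counters; real UMA accounting is more involved."""
    inflight: int = 0       # mbuf+cluster pairs currently held by the stack
    mbuf_free: int = 0      # plain mbufs cached in the mbuf zone
    cluster_free: int = 0   # clusters cached in the cluster zone
    packet_free: int = 0    # pre-combined m+c parked in the packet zone

    def _take_parts(self):
        # Reuse cached parts if there are any, otherwise allocate fresh ones.
        self.mbuf_free = max(self.mbuf_free - 1, 0)
        self.cluster_free = max(self.cluster_free - 1, 0)

    def alloc_separate(self):
        self._take_parts()
        self.inflight += 1

    def free_separate(self):
        # Each part goes back to its own zone.
        self.inflight -= 1
        self.mbuf_free += 1
        self.cluster_free += 1

    def alloc_packet(self):
        if self.packet_free > 0:
            self.packet_free -= 1
        else:
            self._take_parts()
        self.inflight += 1

    def free_packet(self):
        # The pair stays combined and is never split back up.
        self.inflight -= 1
        self.packet_free += 1

    def netstat_view(self):
        """(current, cache) pairs as the two primary zones would see them."""
        return {
            "mbufs": (self.inflight, self.mbuf_free + self.packet_free),
            "clusters": (self.inflight + self.packet_free, self.cluster_free),
        }


separate = ToyZones()
packets = ToyZones()
for _ in range(100):
    separate.alloc_separate()
    packets.alloc_packet()
for _ in range(100):
    separate.free_separate()
    packets.free_packet()

print(separate.netstat_view())  # {'mbufs': (0, 100), 'clusters': (0, 100)}
print(packets.netstat_view())   # {'mbufs': (0, 100), 'clusters': (100, 0)}
```

In both runs nothing is held by the stack at the end, but in the packet
case the cluster zone still reports 100 clusters in use.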

The right fix is to have UMA free mbuf+clusters from the packet zone back
to their native zones.  This should not be done with high/low watermarks
but with a median and positive/negative deviation method.  To be efficient,
refills and drains to/from the packet zone should happen in batches and not
for single requests or frees.
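
One possible reading of that policy, sketched in Python.  Everything here
is an assumption for illustration: the rebalance() function, its parameter
names and the example numbers are hypothetical, and the real fix may look
quite different.  The idea is to do nothing while the packet zone's cache
stays within median +/- deviation, and otherwise to move whole batches
toward the median.

```python
def rebalance(cached, median, deviation, batch):
    """Return how many m+c pairs to move, in whole batches.

    Positive: refill the packet zone from the native zones.
    Negative: drain the packet zone back to the native zones.
    Zero:     cache is within median +/- deviation, leave it alone.
    """
    if batch <= 0 or deviation < batch:
        # Assumed precondition: with deviation >= batch, every move outside
        # the band is at least one full batch and never overshoots the median.
        raise ValueError("need 0 < batch <= deviation")

    if cached > median + deviation:
        excess = cached - median
        return -(excess // batch) * batch
    if cached < median - deviation:
        shortfall = median - cached
        return (shortfall // batch) * batch
    return 0


# Example with made-up numbers: median 512, deviation 128, batch 64.
for cached in (700, 600, 300):
    print(cached, "->", rebalance(cached, 512, 128, 64))
# 700 -> -128
# 600 -> 0
# 300 -> 192
```

Because moves are rounded down to whole batches, the cache lands near the
median rather than exactly on it, which avoids bouncing across a single
threshold on every alloc or free.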

I'll look into it tomorrow.  I may have to summon Bosko for some help on
the secondary zone stuff as he introduced this feature.

-- 
Andre