Kernel (7.3) crash due to mbuf leak?

David DeSimone fox at verio.net
Fri Jul 30 18:40:45 UTC 2010


After upgrading a couple of our systems from 7.2-RELEASE to 7.3-RELEASE,
we have started to see them run out of mbufs and crash every month or
so.  The panic string is:

    kmem_malloc(16384): kmem_map too small: 335233024 total allocated

The actual panic signature (backtrace) shows the memory allocation
failure occurring in the filesystem code, but I do not think that is
where the problem lies.  It is clear to me that the system is slowly
leaking mbufs until no kernel memory is left, and the filesystem is just
an innocent bystander that happens to ask for memory and fail to get it.
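
For what it's worth, I believe the ceiling being hit here is
vm.kmem_size (kmem_map is sized from it), so comparing that sysctl
against the 335233024 bytes reported in the panic should confirm
whether the map really was exhausted:

    sysctl vm.kmem_size vm.kmem_size_max
    sysctl kern.ipc.nmbclusters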

Here is some netstat -m output from a couple of the crash dumps:

    fs0# netstat -m -M vmcore.0

    882167/2902/885069 mbufs in use (current/cache/total)
    351/2041/2392/25600 mbuf clusters in use (current/cache/total/max)
    351/1569 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/199/199/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
    221249K/5603K/226853K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs0# netstat -m -M vmcore.1

    894317/2905/897222 mbufs in use (current/cache/total)
    345/2013/2358/25600 mbuf clusters in use (current/cache/total/max)
    350/1358 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/263/263/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
    224274K/5804K/230078K bytes allocated to network (current/cache/total)
    0/1/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs1# netstat -m -M vmcore.0

    857844/2890/860734 mbufs in use (current/cache/total)
    317/2139/2456/25600 mbuf clusters in use (current/cache/total/max)
    350/1603 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/263/263/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/19200 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/12800 16k jumbo clusters in use (current/cache/total/max)
    215098K/6052K/221151K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

I also note that my currently running systems are both well on their way
to crashing again:

    fs0# netstat -m 

    766618/2927/769545 mbufs in use (current/cache/total)
    276/2560/2836/25600 mbuf clusters in use (current/cache/total/max)
    276/1772 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/550/550/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
    192207K/8051K/200259K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/7/6656 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs0# uptime
     1:00PM  up 18 days, 13:52, 1 user, load averages: 0.00, 0.00, 0.00

    fs1# netstat -m

    126949/3356/130305 mbufs in use (current/cache/total)
    263/1917/2180/25600 mbuf clusters in use (current/cache/total/max)
    263/1785 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/295/295/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
    32263K/5853K/38116K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/7/6656 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    fs1# uptime
     1:00PM  up 8 days, 17:23, 1 user, load averages: 0.00, 0.00, 0.00

Note that mbuf usage appears to grow as a function of uptime, which is a
classic indication of a leak.
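
To pin down the growth rate, something as simple as the following
sampler ought to chart it over time (the log path is just an example):

    # append a timestamped netstat -m snapshot once an hour
    while :; do
        date
        netstat -m
        sleep 3600
    done >> /var/log/mbuf-usage.log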


Can anyone give me some pointers on how to analyze these crash dumps, or
my running systems, to determine which network subsystem is leaking
these mbufs?
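
So far the only poking around I know how to do on the dumps is roughly
the following (paths assume the stock kernel and the default dump
directory), and I am not sure what to look for once inside kgdb:

    kgdb /boot/kernel/kernel /var/crash/vmcore.0
    (kgdb) bt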

The services on these systems are extremely simple:

    SSH (though nobody logs in)
    sendmail
    qmail
    ntpd (client only)
    named (BIND)

Firewalling is performed by an uncomplicated PF policy.

No special network features are in use (no VLANs or the like):

    em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	    options=19b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4>
	    ether 00:30:48:XX:XX:XX
	    inet XXX.XXX.XXX.XX netmask 0xfffffff8 broadcast XXX.XXX.XXX.XX
	    media: Ethernet autoselect (1000baseTX <full-duplex>)
	    status: active

What can I do to troubleshoot this problem?  Is there any accounting
built into the mbuf subsystem that would help me track down the leak?
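
If nothing else, I assume the per-UMA-zone counters can at least break
the usage out by zone (mbuf, mbuf_cluster, mbuf_packet, and so on),
though not by which subsystem allocated them:

    vmstat -z | egrep -i 'ITEM|mbuf'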

-- 
David DeSimone == Network Admin == fox at verio.net
  "I don't like spinach, and I'm glad I don't, because if I
   liked it I'd eat it, and I just hate it." -- Clarence Darrow

