kern/144330: [nfs] mbuf leakage in nfsd with zfs

Mon Mar 22 14:00:13 UTC 2010

The following reply was made to PR kern/144330; it has been noted by GNATS.

From: Rick Macklem <rmacklem at uoguelph.ca>
To: Daniel Braniss <danny at cs.huji.ac.il>
Cc: Mikolaj Golub <to.my.trociny at gmail.com>,
        Jeremy Chadwick <freebsd at jdc.parodius.com>, freebsd-fs at FreeBSD.org,
        Kai Kockro <kkockro at web.de>, bug-followup at FreeBSD.org,
        gerrit at pmp.uni-hannover.de
Subject: Re: kern/144330: [nfs] mbuf leakage in nfsd with zfs 
Date: Mon, 22 Mar 2010 10:04:46 -0400 (EDT)

 On Mon, 22 Mar 2010, Daniel Braniss wrote:

 >
 > well, it's much better!, but no cookies yet :-)
 >

 Well, that's good news. I'll try and get dfr to review it and then
 commit it. Thanks Mikolaj, for finding this.

 > from comparing graphs in
 > 	ftp://ftp.cs.huji.ac.il/users/danny/freebsd/mbuf-leak/
 > store-01-e.ps: a production server running newfsd - now up almost 20 days
 > 	notice that the average used mbuf is below 1000!
 >
 > store-02.ps: kernel without last patch, classic nfsd
 > 	the leak is huge.
 >
 > store-02++.ps: with latest patch
 > 	the leak is much smaller but I see 2 issues:
 > 		- the initial leap to over 2000, then a smaller leak.

 The initial leap doesn't worry me. That's just a design constraint.
 A slow leak after that is still a problem. (I might have seen the
 slow leak in testing here. I'll poke at it and see if I can reproduce
 that.)

 >
 > could someone explain replay_prune() to me?
 >
 I just looked at it and I think it does the following:
   - When it thinks the cache is too big (either too many entries
     or too much mbuf data), it loops until the cache is no longer
     too big or it can't free any more entries. (When an entry is
     freed, rc_size and rc_count are reduced.) The loop runs from
     the end of the tailq, so it frees the least recently used
     entries first.
   - The test for rce_repmsg.rm_xid != 0 avoids freeing entries
     that are in progress, since rce_repmsg is all zeroed until
     the reply has been generated.
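
 A rough userland sketch of that pruning behaviour, as I read it, is
 below. It is not the kernel code: a hand-rolled doubly linked list
 stands in for the tailq, and struct entry, struct cache, cache_add(),
 cache_free(), cache_too_big(), cache_prune() and the max_count and
 max_size limits are all made-up names for illustration. Only the roles
 of rc_count, rc_size and the "xid is zero while in progress" test come
 from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache entry; fields only mirror the roles described above. */
struct entry {
    struct entry *prev, *next;
    uint32_t xid;   /* 0 while the reply is still being generated */
    size_t size;    /* bytes of reply data held by this entry */
};

/* Hypothetical cache: head is most recently used, tail is least recently used. */
struct cache {
    struct entry *head, *tail;
    size_t count, size;         /* play the roles of rc_count and rc_size */
    size_t max_count, max_size; /* assumed limits */
};

/* Allocate an entry and link it at the head (most recently used end). */
static struct entry *
cache_add(struct cache *c, uint32_t xid, size_t size)
{
    struct entry *e = calloc(1, sizeof(*e));

    assert(e != NULL);
    e->xid = xid;
    e->size = size;
    e->next = c->head;
    if (c->head != NULL)
        c->head->prev = e;
    else
        c->tail = e;
    c->head = e;
    c->count++;
    c->size += size;
    return e;
}

/* Unlink and free one entry, reducing the cache totals. */
static void
cache_free(struct cache *c, struct entry *e)
{
    if (e->prev != NULL)
        e->prev->next = e->next;
    else
        c->head = e->next;
    if (e->next != NULL)
        e->next->prev = e->prev;
    else
        c->tail = e->prev;
    c->count--;
    c->size -= e->size;
    free(e);
}

/* True when either limit is exceeded. */
static int
cache_too_big(const struct cache *c)
{
    return c->count > c->max_count || c->size > c->max_size;
}

/* Free least recently used completed entries until the cache fits.
 * Returns the number of entries freed. */
static size_t
cache_prune(struct cache *c)
{
    size_t freed = 0;

    while (cache_too_big(c)) {
        struct entry *e = c->tail;

        /* Walk from the tail, skipping in-progress entries (xid == 0). */
        while (e != NULL && e->xid == 0)
            e = e->prev;
        if (e == NULL)
            break;      /* nothing left that may be freed */
        cache_free(c, e);
        freed++;
    }
    return freed;
}
```

 Note that the loop gives up as soon as only in-progress entries
 remain, so the cache can stay over its limits for a while.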

 I did notice that replay_setsize() calls replay_prune() without
 locking the mutex first, so it doesn't look SMP safe to me for this
 case, but I doubt that would cause a slow leak. (I think
 replay_setsize() is only called when the number of mbuf clusters in
 the kernel changes. The unlocked call might cause a kernel crash if
 the tailq wasn't in a consistent state as it rattled through the list
 in the loop.)
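
 To illustrate the locking pattern I have in mind (a sketch only, not
 a tested patch): take the mutex, change the limit, prune, then drop
 the mutex. The userland example below uses a pthread mutex as a
 stand-in for the kernel mutex, and struct cache, cache_init(),
 cache_prune_locked(), cache_setsize() and the lock_held debug flag are
 all hypothetical names. In the kernel the equivalent calls would be
 mtx_lock()/mtx_unlock(), with mtx_assert() playing the role of the
 lock_held check.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical cache reduced to what the locking example needs. */
struct cache {
    pthread_mutex_t lock;
    int lock_held;      /* debug flag, set only while lock is held */
    size_t size;        /* current amount of cached data */
    size_t max_size;    /* limit that setsize changes */
};

static void
cache_init(struct cache *c, size_t size, size_t max_size)
{
    pthread_mutex_init(&c->lock, NULL);
    c->lock_held = 0;
    c->size = size;
    c->max_size = max_size;
}

/* Caller must hold c->lock. Trimming size is a stand-in for the
 * real loop that frees entries from the list. */
static void
cache_prune_locked(struct cache *c)
{
    assert(c->lock_held);
    if (c->size > c->max_size)
        c->size = c->max_size;
}

/* Change the limit and prune, all under the lock. */
static void
cache_setsize(struct cache *c, size_t new_max)
{
    pthread_mutex_lock(&c->lock);
    c->lock_held = 1;
    c->max_size = new_max;
    cache_prune_locked(c);
    c->lock_held = 0;
    pthread_mutex_unlock(&c->lock);
}
```

 With the assertion in the prune routine, an unlocked caller would be
 caught right away instead of corrupting the list under load.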

 rick