leaking lots of unreferenced inodes (pg_xlog files?)

Wed Jul 17 05:34:43 UTC 2013

On Wed, Jul 17, 2013 at 12:47:22AM +0200, Palle Girgensohn wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Kirk McKusick skrev:
> >> Date: Mon, 15 Jul 2013 10:51:10 +0100 Subject: Re: leaking lots of
> >> unreferenced inodes (pg_xlog files?) From: Dan Thomas
> >> <godders at gmail.com> To: Kirk McKusick <mckusick at mckusick.com> Cc:
> >> Palle Girgensohn <girgen at freebsd.org>, freebsd-fs at freebsd.org, Jeff
> >> Roberson <jroberson at jroberson.net>, Julian Akehurst
> >> <julian at pingpong.se> X-ASK-Info: Message Queued (2013/07/15
> >> 02:51:22) X-ASK-Info: Confirmed by User (2013/07/15 02:55:04)
> >> 
> >> On 11 June 2013 01:17, Kirk McKusick <mckusick at mckusick.com>
> >> wrote:
> >>> OK, good to have it narrowed down. I will look to devise some 
> >>> additional diagnostics that hopefully will help tease out the 
> >>> bug. I'll hopefully get back to you soon.
> >> Hi,
> >> 
> >> Is there any news on this issue? We're still running several
> >> servers that are exhibiting this problem (most recently, one that
> >> seems to be leaking around 10gb/hour), and it's getting to the
> >> point where we're looking at moving to a different OS until it's
> >> resolved.
> >> 
> >> We have access to several production systems with this problem and
> >> (at least from time to time) will have systems with a significant
> >> leak on them that we can experiment with. Is there any way we can
> >> assist with tracking this down? Any diagnostics or testing that
> >> would be useful?
> >> 
> >> Thanks, Dan
> > 
> > Hi Dan (and Palle),
> > 
> > Sorry for the long delay with no help / news. I have gotten 
> > side-tracked on several projects and have had little time to try and
> > devise some tests that would help find the cause of the lost space.
> > It almost certainly is a one-line fix (a missing vput or vrele
> > probably in some error path), but finding where it goes is the hard
> > part :-)
> > 
> > I have had little success in inserting code that tracks reference 
> > counts (too many false positives). So, I am going to need some help 
> > from you to narrow it down. My belief is that there is some set of 
> > filesystem operations (system calls) that are leading to the
> > problem. Notably, a file is being created, data put into it, then the
> > file is deleted (either before or after being closed).  Somehow a
> > reference to that file is persisting despite there being no valid
> > reference to it. Hence the filesystem thinks it is still live and is
> > not deleting it. When you do the forcible unmount, these files get 
> > cleared and the space shows back up.
> > 
> > What I need to devise is a small test program doing the set of system
> > calls that cause this to happen. The way that I would like to try and
> > get it is to have you `ktrace -i' your application and then run your
> > application just long enough to create at least one of these lost
> > files. The goal is to minimize the amount of ktrace data through
> > which we need to sift.
> > 
> > In preparation for doing this test you need to have a kernel compiled
> > with `option DIAGNOSTIC' or if you prefer, just add `#define
> > DIAGNOSTIC 1' to the top of sys/kern/vfs_subr.c. You will know you
> > have at least one offending file when you try to unmount the affected
> > filesystem and find it busy. Before doing the `umount -f', enable
> > busy printing using `sysctl debug.busyprt=1'. Then capture the
> > console output which will show the details of all the vnodes that had
> > to be forcibly flushed. Hopefully we will then be able to correlate
> > them back to the files (NAMI in the ktrace output) with which they
> > were associated. We may need to augment the NAMI data with the inode
> > number of the associated file to make the association with the
> > busyprt output. Anyway, once we have that, we can look at all the
> > system calls done on those files and create a small test program that
> > exhibits the problem. Given a small test program, Jeff or I can track
> > down the offending system call path and nail this pernicious bug once
> > and for all.
> > 
> > Kirk McKusick
> 
> Hi,
> 
> I have run ktrace -i on pg_ctl (which forks off all the postgresql
> processes) and I got two "busy" files that where "lost" after a few
> hours. dmesg reveals this:
> 
> vflush: busy vnode
> 0xfffffe067cdde960: tag ufs, type VREG
>     usecount 1, writecount 0, refcount 2 mountedhere 0
>     flags (VI(0x200))
>  VI_LOCKed    v_object 0xfffffe0335922000 ref 0 pages 0
>     lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723)
> 	ino 11047146, on dev da2s1d
> vflush: busy vnode
> 0xfffffe039f35bb40: tag ufs, type VREG
>     usecount 1, writecount 0, refcount 3 mountedhere 0
>     flags (VI(0x200))
>  VI_LOCKed    v_object 0xfffffe03352701d0 ref 0 pages 0
>     lock type ufs: EXCL by thread 0xfffffe01600eb8e0 (pid 56723)
> 	ino 11045961, on dev da2s1d
> 
> 
> I had to umount -f, so they where "lost".
> 
> So, now I have 55 GB ktrace output... ;)  Is there anything I can do to
> filter it, or shall I compress it and put it on a web server for you to
> fetch as it is?

I think that 55GB of ktrace is obviously useless.  The Kirk' idea was to
have an isolated test case that would only create the situation triggering
the leak, without irrelevant activity.  This indeed requires drilling down
and isolating the file activities to get to the core of problem.

FWIW, I and Peter Holm used the following alternative approach quite
successfully when tracking down other vnode reference leaks.  The approach
still requires some understanding of the specifics of the problematic
files to be useful, but not as much as isolated test.

Basically, you take the patch below, and set the VV_DEBUGVREF flag for
the vnode that has characteristics as much specific for the leaked vnode
as possible.  The patch has example of setting the flag for all new NFS 
vnodes.  You would probably want to do the same in vfs_vgetf(),
checking  e.g. for the partition where your leaks happen.  The limiting
of the vnodes for which the vref traces are accumulated is needed to
save the kernel memory.

Then after the leak was observed, you just print the vnode with ddb
command 'show vnode addr' and send the output to developer.

Index: sys/sys/vnode.h
===================================================================

--- sys/sys/vnode.h	(revision 248723)
+++ sys/sys/vnode.h	(working copy)
@@ -94,6 +94,13 @@ struct vpollinfo {
 
 #if defined(_KERNEL) || defined(_KVM_VNODE)
 
+struct debug_ref {
+	TAILQ_ENTRY(debug_ref) link;
+	int val;
+	const char *op;
+	struct stack stack;
+};
+
 struct vnode {
 	/*
 	 * Fields which define the identity of the vnode.  These fields are
@@ -169,6 +176,7 @@ struct vnode {
 	int	v_writecount;			/* v ref count of writers */
 	u_int	v_hash;
 	enum	vtype v_type;			/* u vnode type */
+	TAILQ_HEAD(, debug_ref) v_debug_ref;
 };
 
 #endif /* defined(_KERNEL) || defined(_KVM_VNODE) */
@@ -253,6 +261,7 @@ struct xvnode {
 #define	VV_DELETED	0x0400	/* should be removed */
 #define	VV_MD		0x0800	/* vnode backs the md device */
 #define	VV_FORCEINSMQ	0x1000	/* force the insmntque to succeed */
+#define	VV_DEBUGVREF	0x2000
 
 /*
  * Vnode attributes.  A field value of VNOVAL represents a field whose value
Index: sys/kern/vfs_subr.c
===================================================================
--- sys/kern/vfs_subr.c	(revision 248723)
+++ sys/kern/vfs_subr.c	(working copy)
@@ -71,6 +71,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/sched.h>
 #include <sys/sleepqueue.h>
 #include <sys/smp.h>
+#include <sys/stack.h>
 #include <sys/stat.h>
 #include <sys/sysctl.h>
 #include <sys/syslog.h>
@@ -871,6 +872,23 @@ static struct kproc_desc vnlru_kp = {
 };
 SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start,
     &vnlru_kp);
+
+MALLOC_DEFINE(M_RECORD_REF, "recordref", "recordref");
+static void
+v_record_ref(struct vnode *vp, int val, const char *op)
+{
+	struct debug_ref *r;
+
+	if ((vp->v_type != VREG && vp->v_type != VBAD) ||
+	    (vp->v_vflag & VV_DEBUGVREF) == 0)
+		return;
+	r = malloc(sizeof(struct debug_ref), M_RECORD_REF, M_NOWAIT |
+	    M_USE_RESERVE);
+	r->val = val;
+	r->op = op;
+	stack_save(&r->stack);
+	TAILQ_INSERT_TAIL(&vp->v_debug_ref, r, link);
+}
  
 /*
  * Routines having to do with the management of the vnode table.
@@ -1073,6 +1091,7 @@ alloc:
 			vp->v_vflag |= VV_NOKNOTE;
 	}
 	rangelock_init(&vp->v_rl);
+	TAILQ_INIT(&vp->v_debug_ref);
 
 	/*
 	 * For the filesystems which do not use vfs_hash_insert(),
@@ -1082,6 +1101,7 @@ alloc:
 	 */
 	vp->v_hash = (uintptr_t)vp >> vnsz2log;
 
+	TAILQ_INIT(&vp->v_debug_ref);
 	*vpp = vp;
 	return (0);
 }
@@ -2197,6 +2217,7 @@ vget(struct vnode *vp, int flags, struct thread *t
 			vinactive(vp, td);
 		vp->v_iflag &= ~VI_OWEINACT;
 	}
+	v_record_ref(vp, 1, "vget");
 	VI_UNLOCK(vp);
 	return (0);
 }
@@ -2211,6 +2232,7 @@ vref(struct vnode *vp)
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
 	VI_LOCK(vp);
 	v_incr_usecount(vp);
+	v_record_ref(vp, 1, "vref");
 	VI_UNLOCK(vp);
 }
 
@@ -2253,6 +2275,7 @@ vputx(struct vnode *vp, int func)
 		KASSERT(func == VPUTX_VRELE, ("vputx: wrong func"));
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
 	VI_LOCK(vp);
+	v_record_ref(vp, -1, "vputx");
 
 	/* Skip this v_writecount check if we're going to panic below. */
 	VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp,
@@ -2409,6 +2432,7 @@ void
 vdropl(struct vnode *vp)
 {
 	struct bufobj *bo;
+	struct debug_ref *r, *r1;
 	struct mount *mp;
 	int active;
 
@@ -2489,6 +2513,9 @@ vdropl(struct vnode *vp)
 	lockdestroy(vp->v_vnlock);
 	mtx_destroy(&vp->v_interlock);
 	mtx_destroy(BO_MTX(bo));
+	TAILQ_FOREACH_SAFE(r, &vp->v_debug_ref, link, r1) {
+		free(r, M_RECORD_REF);
+	}
 	uma_zfree(vnode_zone, vp);
 }
 
@@ -2888,6 +2915,8 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
 	va_list ap;
 	char buf[256], buf2[16];
 	u_long flags;
+	int ref;
+	struct debug_ref *r;
 
 	va_start(ap, fmt);
 	vprintf(fmt, ap);
@@ -2960,8 +2989,21 @@ vn_printf(struct vnode *vp, const char *fmt, ...)
 		    vp->v_object->resident_page_count);
 	printf("    ");
 	lockmgr_printinfo(vp->v_vnlock);
-	if (vp->v_data != NULL)
-		VOP_PRINT(vp);
+#if DDB
+	if (kdb_active) {
+		if (vp->v_data != NULL)
+			VOP_PRINT(vp);
+	}
+#endif
+
+	/* Getnewvnode() initial reference is not recorded due to VNON */
+	ref = 1;
+	TAILQ_FOREACH(r, &vp->v_debug_ref, link) {
+		ref += r->val;
+		printf("REF %d %s\n", ref, r->op);
+		stack_print(&r->stack);
+	}
+
 }
 
 #ifdef DDB
Index: sys/fs/nfsclient/nfs_clport.c
===================================================================
--- sys/fs/nfsclient/nfs_clport.c	(revision 248723)
+++ sys/fs/nfsclient/nfs_clport.c	(working copy)
@@ -273,6 +273,7 @@ nfscl_nget(struct mount *mntp, struct vnode *dvp,
 		/* vfs_hash_insert() vput()'s the losing vnode */
 		return (0);
 	}
+	vp->v_vflag |= VV_DEBUGVREF;
 	*npp = np;
 
 	return (0);
Index: sys/fs/nfsclient/nfs_clnode.c
===================================================================
--- sys/fs/nfsclient/nfs_clnode.c	(revision 248723)
+++ sys/fs/nfsclient/nfs_clnode.c	(working copy)
@@ -179,6 +179,7 @@ ncl_nget(struct mount *mntp, u_int8_t *fhp, int fh
 		/* vfs_hash_insert() vput()'s the losing vnode */
 		return (0);
 	}
+	vp->v_vflag |= VV_DEBUGVREF;
 	*npp = np;
 
 	return (0);
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 834 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20130717/6a1c89a0/attachment.sig>