leaking lots of unreferenced inodes (pg_xlog files?)

Fri May 31 18:25:45 UTC 2013

> Date: Thu, 30 May 2013 12:56:54 +0200
> From: Palle Girgensohn <girgen at FreeBSD.org>
> To: Kirk McKusick <mckusick at mckusick.com>
> CC: freebsd-fs at FreeBSD.org, Jeff Roberson <jroberson at jroberson.net>,
>         Dan Thomas <godders at gmail.com>, Julian Akehurst <julian at pingpong.se>
> Subject: Re: leaking lots of unreferenced inodes (pg_xlog files?)
> 
> Hello again!
> 
> I have now remounted the postgresql filesystem on a test server that
> experiences the same problem. The production server is not remounted
> yet; we're planning that in approximately a week's time, but I thought I
> could gain some time by running the suggested procedure on the test box.
> 
> The base problem was this:
> 
> # df -h /pgsql ; du -hxs /pgsql
> Filesystem     Size    Used   Avail Capacity  Mounted on
> /dev/da2s1d    128G    101G     17G    86%    /pgsql
>  82G	/pgsql
> 
> df says 101 GB used, but du only finds 82 GB, and fstat cannot find any
> open files that are unreferenced in the file system. Stopping postgresql
> does not help. It seems the OS is leaking inode references.
> 
> FreeBSD 9.1, postgresql 9.2.3 from port.
> 
> I ran the suggested commands (in the attached diskspacecheck) before
> stopping postgresql (before.log), after stopping postgresql but before
> unmounting /pgsql (before2.log), and then I unmounted /pgsql (had to run
> umount -f /pgsql, and it took about 20 seconds). I did not enter
> single-user mode, since I really did not have to this time (on the
> production server, the disk is /usr, so that will require more shutting
> down...).
> 
> I've attached the logs here. Hope it helps!
> 
> The commands run in diskspacecheck are:
> #! /bin/sh
> df -ih /pgsql
> vmstat -z
> vmstat -m
> sysctl debug
> fstat -f /pgsql
> 
> as suggested by Kirk.

Your results are very enlightening, especially the fact that you had
to do a forcible unmount of the filesystem. What that tells me is that
somehow we are getting vnodes that have phantom references. That is,
there is some system call where we get a reference on a vnode (vref,
vget, or similar) that does not ultimately have a corresponding drop
of the reference (vrele, vput, or similar). The net effect is that
the file is held open despite the fact that there are no longer any
connections to it. When you do the forcible unmount, the kernel walks
the list of vnodes associated with the filesystem and does a vgone on
each of them. That causes each to be inactivated which then triggers
the release of their associated disk space. The unmount takes 20
seconds because it has to process the release of all that space. My
guess is that there is an error path in some system call that is
missing the vrele or vput.
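
The visible symptom of such phantom references is space that df charges
to the filesystem but du cannot reach. A rough way to put a number on
that gap is sketched below. The helper name space_gap_kb is made up for
illustration and is not part of any FreeBSD tool; it also assumes that
the argument is the mount point itself. Some gap is normal on any
filesystem, so only a large or steadily growing value is interesting.

```shell
#!/bin/sh
# Sketch only: space_gap_kb is a hypothetical helper name.
# Prints (KB used according to df) minus (KB reachable according to du).
space_gap_kb() {
    fs=$1
    # -P keeps each filesystem on one output line; column 3 is "Used".
    df_used=$(df -kP "$fs" | awk 'NR == 2 { print $3 }')
    # -x stays on this filesystem, -s prints a single total.
    du_used=$(du -kxs "$fs" 2>/dev/null | awk '{ print $1 }')
    echo $((df_used - du_used))
}
```

With the figures reported above (101G used per df, 82G found by du)
the gap would come out at roughly 19G.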

Assuming that you are able to run some more tests on your test machine,
the next step in narrowing down the set of code to look at is to try
running your system with soft updates disabled. The idea is to find out
whether the mismatched references are in the soft updates code or are
in one of the filesystem system calls themselves. To disable soft updates,
run the command `tunefs -n disable /pgsql' on the unmounted /pgsql
filesystem. If the system then runs without the problem, I will know
to search the soft updates code. If the problem persists, then I'll
know to look in the system calls themselves. You may want to do some
preliminary tests to see how quickly the problem manifests itself.
You can do this by running it for a short time (10 minutes, say) and
then checking to see if you need to do a forcible unmount of the
filesystem. Once you establish how long you have to run before you
reliably have to do a forcible unmount, you will know how long to
run the test with soft updates turned off. If you find that running
with soft updates turned off makes your application run too slowly,
you can mount your filesystem asynchronously. Note, however, that you
should not run asynchronously if the data on the filesystem is critical
as you may end up with an unrecoverable filesystem after a power failure
or system crash. So only run asynchronously if you can afford to lose
your filesystem.
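
The test cycle described above might be scripted roughly as follows.
This is only a sketch: the function names, the FS variable, and the
DRY_RUN switch are illustrative assumptions, and it presumes that /pgsql
has an fstab entry so that a bare `mount /pgsql' works. postgresql
should be stopped before either function is called, since any other
user of the filesystem would also make a plain umount fail.

```shell
#!/bin/sh
# Sketch only: run, disable_softupdates and check_unmount are
# illustrative names, not existing tools.
FS=${FS:-/pgsql}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands; 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Soft updates can only be changed while the filesystem is unmounted.
disable_softupdates() {
    run umount "$FS" &&
    run tunefs -n disable "$FS" &&
    run mount "$FS"
}

# After a test run, try a plain (non-forced) unmount. If it fails,
# a forcible unmount would be needed, which is the symptom of interest.
# In dry-run mode the command is only printed, so this always
# reports success.
check_unmount() {
    if run umount "$FS"; then
        echo "plain umount succeeded"
    else
        echo "plain umount failed; umount -f would be needed"
        return 1
    fi
}
```

Running with DRY_RUN=1 first shows the commands without touching the
filesystem; setting DRY_RUN=0 executes them.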

Finally, it would be helpful if you could add two more commands to
your diskspacecheck.sh script:

	sysctl -a | egrep vnode
	mount -v

The first shows the vnode usage and the second shows the operational
state of your filesystems.
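
Put together, the extended script might look something like the sketch
below. The function wrapper, the optional mount-point argument, and the
"===" separator lines are additions for readability and are assumptions
on my part; the commands themselves are the five from the original
script plus the two listed above.

```shell
#!/bin/sh
# Sketch of the extended diskspacecheck; the function wrapper and the
# "=== ... ===" separators are illustrative additions.
diskspacecheck() {
    fs=${1:-/pgsql}
    echo "=== df -ih $fs ===";              df -ih "$fs"
    echo "=== vmstat -z ===";               vmstat -z
    echo "=== vmstat -m ===";               vmstat -m
    echo "=== sysctl debug ===";            sysctl debug
    echo "=== fstat -f $fs ===";            fstat -f "$fs"
    # The two additional commands: vnode usage and filesystem state.
    echo "=== sysctl -a | egrep vnode ==="; sysctl -a | egrep vnode
    echo "=== mount -v ===";                mount -v
}
```

It could then be called as, for example,
`diskspacecheck /pgsql > before.log 2>&1'.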

	Kirk McKusick