4.8 ffs_dirpref problem

Wed Oct 22 22:30:04 PDT 2003

I believe that you can solve your problem by tuning the existing
algorithm using tunefs. There are two parameters to control dirpref,
avgfilesize (which defaults to 16384) and filesperdir (which defaults
to 50). I suggest that you try using an avgfilesize of 4096 and
filesperdir of 1500. This is done by running tunefs on the unmounted
(or at least mounted read-only) filesystem as:

	tunefs -f 4096 -s 1500 /dev/<disk for my broken filesystem>

Note that this affects only future layout, so it needs to be done before you
put any data into the filesystem. If you are building the filesystem
from scratch, you can use:

	newfs -g 4096 -h 1500 ...

to set these fields. Please let me know if this solves your problem.
If it does not, I will ask Grigoriy Orlov <gluk at ptci.ru> if he has
any ideas on how to proceed.
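
As a rough illustration of why these values suit the workload described
in the quoted message below, the following sketch compares the space
implied per directory by the default and suggested settings with the
reported 1500 files of 2-6 KB. It assumes that dirpref sizes its
per-directory reservation as roughly avgfilesize * filesperdir; that
product and the helper name are illustrative assumptions, not a copy of
the kernel code.

```python
# Sketch only: assumes the per-directory reservation used by dirpref is
# approximately avgfilesize * filesperdir (an assumption for illustration).

def dir_reservation_bytes(avgfilesize, filesperdir):
    """Hypothetical helper: expected bytes of file data per directory."""
    return avgfilesize * filesperdir

# Defaults quoted in the message: avgfilesize=16384, filesperdir=50.
default = dir_reservation_bytes(16384, 50)

# Suggested values: avgfilesize=4096, filesperdir=1500.
suggested = dir_reservation_bytes(4096, 1500)

# Reported workload: about 1500 files of 2-6 KB in one directory.
workload_low = 1500 * 2 * 1024
workload_high = 1500 * 6 * 1024

print("default reservation:  ", default, "bytes")
print("suggested reservation:", suggested, "bytes")
print("workload range:       ", workload_low, "-", workload_high, "bytes")
```

Under that assumption the default settings expect far less data per
directory than this workload produces, while the suggested settings fall
inside the reported range.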

	Kirk McKusick

=-=-=-=-=-=-=

> Date: Tue, 21 Oct 2003 17:48:51 -0700
> From: Ken Marx <kmarx at vicor.com>
> To: freebsd-fs at freebsd.org
> Cc: Julian Elischer <julian at vicor-nb.com>,
> 	John Lynch <jpl at vicor.com>, Dave Parker Smith <davep at vicor.com>,
> 	Cayford Burrell <cburrell at vicor.com>,
> 	victor elischer <VicPE at aol.com>, Josh Howard <jrh at vicor.com>,
> 	Ken Marx <kmarx at vicor.com>
> Subject: 4.8 ffs_dirpref problem
> 
> Hi,
> 
> We have 560GB raids that were sometimes bogging down heavily
> in our production systems. Under 4.8-RELEASE (recently
> upgraded from 4.4) we find that when:
> 
> 	o the raid file system grows to over 85% capacity (with only
> 	  30% inode usage)
> 	o we create ~1500 or so 2-6kb files in a given dir
> 	o (note: soft updates NOT enabled)
> 
> We see:
> 
> 	o 100% cpu utilization, all in system
> 	o I/O transfer rates of ~200kb/sec, down from normal of 15-30MB/s
> 
> We profiled the kernel and found a large number of calls to
> ffs_alloc().  After many twisty passages, we finally diff'd 4.4
> with 4.8 ffs_alloc.c, and found a major difference in the
> ffs_dirpref() call. Hacking the 4.4 logic back in 'fixed' the
> problem: We can now fill the /raid entirely with no real
> noticeable performance degradation.
> 
> The nice comments for 4.4/4.8 versions of ffs_dirpref() seem to explain
> things fairly clearly:
> 
> 4.4 -  ffs_alloc.c,v 1.64.2.1 2000/03/16 08:15:53 ps:
> --------------------------------------
>  * The policy implemented by this algorithm is to select from
>  * among those cylinder groups with above the average number of
>  * free inodes, the one with the smallest number of directories.
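
[Editorial illustration, not part of the quoted message.] The 4.4 policy
described in the comment above can be sketched as follows. This is a
minimal model, not the kernel code: the CylinderGroup record and the
function name are hypothetical, and treating "above the average" as
"at least the average" is an assumption.

```python
from dataclasses import dataclass

@dataclass
class CylinderGroup:
    # Hypothetical per-group summary, for illustration only.
    index: int
    free_inodes: int
    ndirs: int

def dirpref_44(groups):
    """Pick a cylinder group for a new directory, 4.4 style:
    among groups with at least the average number of free inodes,
    choose the one holding the fewest directories."""
    avg_free = sum(g.free_inodes for g in groups) // len(groups)
    # Never empty: the group with the most free inodes always qualifies.
    candidates = [g for g in groups if g.free_inodes >= avg_free]
    return min(candidates, key=lambda g: g.ndirs).index

groups = [
    CylinderGroup(0, 100, 5),
    CylinderGroup(1, 300, 9),
    CylinderGroup(2, 250, 2),
    CylinderGroup(3, 50, 0),
]
chosen = dirpref_44(groups)
print("chosen cylinder group:", chosen)
```

Note that the choice depends only on per-group inode and directory
counts, with no reference to the parent directory's location.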
> 
> 4.8 - ffs_alloc.c,v 1.64.2.2 2001/09/21 19:15:21 dillon:
> -----------------------------------------
>  * The policy implemented by this algorithm is to allocate a
>  * directory inode in the same cylinder group as its parent
>  * directory, but also to reserve space for its files inodes
>  * and data. Restrict the number of directories which may be
>  * allocated one after another in the same cylinder group
>  * without intervening allocation of files.
>  *
>  * If we allocate a first level directory then force allocation
>  * in another cylinder group.
> 
> For us, the 4.4 policy seems far superior, at least when the file system
> approaches capacity.
> 
> We'd like to avoid local kernel hacks and keep with main line
> FreeBSD code. Is there some way that the old policy can be supported,
> perhaps via a tunefs or sysctl type option?
> 
> Actually, if the new policy can be fixed up to avoid the problem, that
> would of course be just as dandy.
> 
> Thanks very much,
> k
> -- 
> Ken Marx, kmarx at vicor-nb.com
> We need to hit the nail on the head and set the agenda regarding total 
> quality.
> 		- http://www.bigshed.com/cgi-bin/speak.cgi