Improving old-fashioned UFS2 performance with lots of inodes...

Tue Jun 28 23:47:28 UTC 2011

On Tue, Jun 28, 2011 at 04:14:00PM -0700, George Sanders wrote:
> > > with over 100 million inodes on the filesystem, things go slow.  Overall
> > > throughput is fine, and I have no complaints there, but doing any kind of
> > > operations with the files is quite slow.  Building a file list with rsync, or
> > > doing a cp, or a ln -s of a big dir tree, etc.
> > > 
> > > Let's assume that the architecture is not changing ... it's going to be FreeBSD
> > > 8.x, using UFS2, and raid6 on actual spinning (7200rpm) disks.
> > > 
> > > What can I do to speed things up ?
> > > 
> > > Right now I have these in my loader.conf:
> > > 
> > > kern.maxdsiz="4096000000"       # for fsck
> > > vm.kmem_size="1610612736"       # for big rsyncs
> > > vm.kmem_size_max="1610612736"   # for big rsyncs
> > 
> > On what exact OS version?  Please don't say "8.2", need to know
> > 8.2-RELEASE, -STABLE, or what.  You said "8.x" above, which is too
> > vague.  If 8.2-STABLE you should not be tuning vm.kmem_size_max at all,
> > and you probably don't need to tune vm.kmem_size either.
> 
> Ok, right now we are on 6.4-RELEASE, but it is our intention to move to 
> 8.2-RELEASE.

Oh dear.

I would recommend you focus solely on the complexity and pains of that
upgrade and not on the "filesystem situation" here.  The last thing
you need to do is to try and "work in" some optimisations or tweaks
while moving ahead by two major version releases.  Take baby steps in
this situation, otherwise there's going to be a mail about "problems
with the upgrade but is it related to this tuning stuff we did or the
filesystem problem or what happened and who changed what?" and you'll
quickly lose track of everything.

Re-visit the issue with UFS2 *after* you have done the upgrade.

> If the kmem loader.conf options are no longer relevant in 8.2-STABLE, should I
> assume that will also be the case when 8.3-RELEASE comes along ?

Correct.

> > I also do not understand how vm.kmem_size would affect rsync, since
> > rsync is a userland application.  I imagine you'd want to adjust
> > kern.maxdsiz and kern.dfldsiz (default dsiz).
> 
> Well, a huge rsync with 20+ million files dies with memory related errors, and
> continued to do so until we upped the kmem values that high.  We don't know
> why, but we know it "fixed it".

Again: I don't understand how adjusting vm.kmem_size or vm.kmem_size_max
would fix anything in regards to this.  However, adjusting kern.maxdsiz
I could see affecting this.  It would indicate your rsync process
becomes extremely large in size and exceeds maxdsiz, resulting in a
segfault or some other anomalous sigN error.
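
A minimal sketch of how one might check the data-size theory above from a
shell on the affected host.  It assumes "ulimit -d" reports the soft
data-segment limit in kilobytes (as sh and bash do); the helper name
fits_in_dsiz and the numbers passed to it are hypothetical, not taken
from the thread.

```shell
#!/bin/sh
# Report whether a process needing $1 kilobytes of data segment would fit
# under a data-size limit.  The limit defaults to this shell's soft limit
# ("ulimit -d"); a second argument overrides it for what-if checks.
fits_in_dsiz() {
    need_kb=$1
    limit_kb=${2:-$(ulimit -d)}
    if [ "$limit_kb" = "unlimited" ]; then
        echo "fits (limit is unlimited)"
        return 0
    fi
    if [ "$need_kb" -le "$limit_kb" ]; then
        echo "fits ($need_kb KB <= $limit_kb KB)"
        return 0
    fi
    echo "exceeds limit ($need_kb KB > $limit_kb KB)"
    return 1
}
```

Calling "fits_in_dsiz 5000000" with the peak size observed for the rsync
process (in KB) would then show whether the limit, rather than kernel
memory, is the more likely culprit.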

> > > and I also set:
> > > 
> > >  vfs.ufs.dirhash_maxmem=64000000
> > 
> > This tunable caps the memory used for hashing directories that have a
> > huge amount of files in them; AFAIK it does not apply to "large directory
> > structures" (as in directories within directories within directories).
> > It's obvious you're just tinkering with random sysctls hoping to gain
> > performance without really understanding what the sysctls do.  :-)  To
> > see if you even need to increase that, try "sysctl -a | grep
> > vfs.ufs.dirhash" and look at dirhash_mem vs. dirhash_maxmem, as well as
> > dirhash_lowmemcount.
> 
> No, we actually ALSO have huge directories, and we do indeed need this value.
>
> This is the one setting that we actually understand and have empirically 
> measured.

Understood.
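
The dirhash_mem vs. dirhash_maxmem comparison quoted above can be
sketched as a small script.  This is only an illustration: the sysctl
names are the ones mentioned in this thread, while the function names
dirhash_pct and dirhash_report are made up here, and dirhash_report is
only meaningful on a FreeBSD host with UFS dirhash available.

```shell
#!/bin/sh
# Percentage of the dirhash memory ceiling currently in use.
# $1 = bytes in use (dirhash_mem), $2 = ceiling in bytes (dirhash_maxmem).
dirhash_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Read the live values via sysctl and print a one-line summary.
dirhash_report() {
    mem=$(sysctl -n vfs.ufs.dirhash_mem)
    maxmem=$(sysctl -n vfs.ufs.dirhash_maxmem)
    lowmem=$(sysctl -n vfs.ufs.dirhash_lowmemcount)
    pct=$(dirhash_pct "$mem" "$maxmem")
    echo "dirhash: $mem of $maxmem bytes ($pct%), lowmemcount=$lowmem"
}
```

If the percentage sits near 100 during the slow operations, the ceiling
is probably being hit and raising dirhash_maxmem is justified.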

> > The only thing I can think of on short notice is to have multiple
> > filesystems (volumes) instead of one large 12TB one.  This is pretty
> > common in the commercial filer world.
> 
> Ok, that is interesting - are you saying create multiple, smaller UFS 
> filesystems on the single large 12TB raid6 array ?

Correct.  Instead of one large 12TB filesystem, try four 3TB filesystems
instead, or eight 2TB.

> Or are you saying create a handful of smaller arrays ?  We have to burn two 
> disks for every raid6 array we make, as I am sure you know, so we really can't split 
> it up into multiple arrays.

Nah, not multiple arrays, just multiple filesystems on a single array.

> We could, however, split the single raid6 array into multiple, formatted UFS2
> filesystems, but I don't understand how that would help with our performance ?
>
> Certainly fsck time would be much shorter, and we could bring up each filesystem
> after it fsck'd, and then move to the next one ... but in terms of live performance,
> how does splitting the array into multiple filesystems help ?  The nature of a 
> raid array (as I understand it) would have us beating all 12 disks regardless of 
> which UFS filesystems were being used.
> 
> Can you elaborate ?

Please read everything I've written below before responding (e.g. do not
respond in-line to this information).

Actually, I think elaboration is needed on your part.  :-)  I say that
with as much sincerity as possible.  All you've stated in this thread so
far is:

- "With over 100 million inodes on the filesystem, things go slow"
- "Building a list of files with rsync/using cp/ln -s in a very large
  directory tree" (does this mean a directory with a large amount of
  files in it?) "is slow"
- Some sort of concern over the speed of fsck
- You want to use more system memory/RAM for filesystem-level caching

http://lists.freebsd.org/pipermail/freebsd-fs/2011-June/011867.html

There's really nothing concrete provided here.  Developers are going to
need hard data, and I imagine you're going to get a lot of push-back
given how you're using the filesystem.  "Hard data" means you need to
start showing actual output from your filesystems, explain your
directory structures, etc.

Generally speaking, the below are No-Nos on most UNIX filesystems.  At
least these are things that I was taught very early on (early 90s), and
I imagine others were as well:

- Stick tons of files in a single directory
- Cram hundreds of millions of files on a single filesystem
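
As one example of the kind of hard data meant here, the following sketch
lists the directories holding the most entries.  The function name
top_dirs is hypothetical, and it assumes file names contain no newlines.
On a filesystem with 100 million inodes it will take a long time to run,
so it is best pointed at a subtree first.

```shell
#!/bin/sh
# Print the N directories under ROOT with the most entries, largest first.
# $1 = ROOT, $2 = N (defaults to 10).  -xdev keeps find on one filesystem.
top_dirs() {
    root=$1
    n=${2:-10}
    find "$root" -xdev -type d | while IFS= read -r d; do
        # Arithmetic expansion strips the padding some wc versions emit.
        count=$(( $(ls -A "$d" | wc -l) ))
        echo "$count $d"
    done | sort -rn | head -n "$n"
}
```

Each output line is an entry count followed by the directory path, e.g.
"top_dirs /some/subtree 20" for the twenty fullest directories.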

I would recommend looking into tunefs(8) as well; the -e, -f, and -s
arguments will probably interest you.
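
A rough sketch for estimating one input to those tunefs arguments: -s
takes the expected number of files per directory, which can be measured
from an existing tree.  The function name avg_files_per_dir is made up
for illustration, and whether changing the value actually helps this
workload is untested here.

```shell
#!/bin/sh
# Average number of regular files per directory under ROOT ($1),
# as an integer.  ROOT itself counts as a directory, so the divisor
# is never zero for an existing tree.
avg_files_per_dir() {
    root=$1
    files=$(( $(find "$root" -xdev -type f | wc -l) ))
    dirs=$(( $(find "$root" -xdev -type d | wc -l) ))
    echo $(( $files / $dirs ))
}
```

The current settings can be compared against the measured figure with
"tunefs -p" on the filesystem in question.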

Splitting things up into multiple filesystems would help with both the
1st and 3rd items on the 4-item list.  Solving the 2nd item is as simple
as: "then don't do that" (are you in biometrics perchance?  Biometrics
people have a tendency to abuse filesystems horribly :-) ), and the 4th
item I can't really comment on (WRT UFS).

Items 1, 3, and 4 are things that use of ZFS would help with.  I'm not
sure about the 2nd item.  If I were in your situation, I would strongly
consider moving to it *after* you finish your OS upgrades.

Furthermore, if you're going to consider using ZFS on FreeBSD, *please*
use RELENG_8 (8.2-STABLE) and not RELENG_8_2 (8.2-RELEASE).  There have
been *major* improvements between those two tags.  You can wait for
8.3-RELEASE if you want (which will obviously encapsulate those
changes), but it's your choice.

> > Regarding system RAM and UFS2: I have no idea, Kirk might have to
> > comment on that.
> > 
> > You could "make use" of system RAM for cache (ZFS ARC) if you were using
> > ZFS instead of native UFS2.  However, if the system has 64GB of RAM, you
> > need to ask yourself why the system has that amount of RAM in the first
> > place.  For example, if the machine runs mysqld and is tuned to use a
> > large amount of memory, you really don't "have" 64GB of RAM to play
> > with, and thus wouldn't want mysqld and some filesystem caching model
> > fighting over memory (e.g. paging/swapping).
> 
> Actually, the system RAM is there for the purpose of someday using ZFS - and
> for no other reason.  However, it is realistically a few years away on our
> timeline, unfortunately, so for now we will use UFS2, and as I said ... it
> seems a shame that UFS2 cannot use system RAM for any good purpose...
> 
> Or can it ?  Anyone ?

Like I said: the only person (I know of) who could answer this would be
Kirk McKusick.  I'm not well-versed in the inner workings and design of
filesystems; Kirk would be.  I'm not sure who else "knows" UFS around
here.

I think you need to figure out which of your concerns have priority.
Upgrading to ZFS (8.2-STABLE or later please) may solve all of your
performance issues; I wish I could say "it will" but I can't.  If
upgrading to that isn't a priority (re: "a few years from now"), then
you may have to live with your current situation, albeit painfully.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |