ZFS and large directories - caveat report

Jeremy Chadwick freebsd at jdc.parodius.com
Thu Jul 21 21:36:50 UTC 2011


On Thu, Jul 21, 2011 at 05:45:53PM +0200, Ivan Voras wrote:
> I'm writing this mostly for future reference / archiving and also if
> someone has an idea on how to improve the situation.
> 
> A web server I maintain was hit by DoS, which has caused more than 4
> million PHP session files to be created. The session files are
> sharded in 32 directories in a single level - which is normally more
> than enough for this web server as the number of users is only a
> couple of thousand. With the DoS, the number of files per shard
> directory rose to about 130,000.
> 
> The problem is: ZFS has proven horribly inefficient with such large
> directories. I have other, more loaded servers with similarly bad /
> large directories on UFS where the problem is not nearly as serious
> as here (probably due to the large dirhash). On this system, any
> operation which touches even only the parent of these 32 shards
> (e.g. "ls") takes seconds, and a simple "find | wc -l" on one of the
> shards takes > 30 minutes (I stopped it after 30 minutes). Another
> symptom is that SIGINT-ing such a find process takes 10-15 seconds
> to take effect (which likely means the process is stuck in an
> uninterruptible kernel operation for that long).
> 
> This wouldn't be a problem by itself, but operations on such
> directories eat IOPS - clearly visible with the "find" test case,
> making the rest of the services on the server suffer as collateral
> damage. Apparently there is a huge amount of seeking being done,
> even though I would think that for read operations all the data
> would be cached - and somehow the seeking from this operation takes
> priority / livelocks other operations on the same ZFS pool.
> 
> This is on a fresh 8-STABLE AMD64, pool version 28 and zfs version 5.
> 
> Is there an equivalent of UFS dirhash memory setting for ZFS? (i.e.
> the size of the metadata cache)

Ivan,

This is in no way an attempt to divert attention from the real issue
(bad ZFS performance with large numbers of files), but PHP has
configuration settings that can auto-reap stale sessions before they
pile up to 130,000 per directory.  Taken from our configuration file:

;
; 25% of the time (prob/divisor) we'll try to clean up leftover
; cruft in save_path.  Seems a lot of users enjoy leaving crusty
; session files lying around...
;
[session]
session.save_path = "/var/tmp/php_sessions"
session.gc_maxlifetime = 900
session.gc_probability = 25
session.gc_divisor = 100

With the above settings, roughly 1 out of every 4 session_start()
calls (gc_probability/gc_divisor = 25/100) will reap files in
save_path that have been idle longer than gc_maxlifetime (900 seconds
here).  In your case you'd want to lower gc_maxlifetime and raise the
gc_probability/gc_divisor ratio, the idea being to make PHP reap
sessions more aggressively.
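
For example (illustrative numbers only -- tune them for your own
traffic), something like this would expire sessions after 5 minutes
of inactivity and run the reaper on every single session_start():

[session]
session.gc_maxlifetime = 300
session.gc_probability = 1
session.gc_divisor = 1

Keep in mind the reaper itself has to stat/unlink files in save_path,
so on a directory this pathological the first few runs will generate
plenty of I/O of their own.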

Again: this doesn't solve the overall issue pertaining to ZFS, as
there's a multitude of other ways to create hundreds of thousands of
files on a system via a DoS.
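
As for your actual question: I'm not aware of a direct dirhash
equivalent.  ZFS metadata caching is bounded by the ARC, and on
8-STABLE there are loader tunables for it.  Something like the
following in /boot/loader.conf *might* let more directory metadata
stay cached (numbers are purely illustrative; whether this helps
your workload is something you'd have to test):

# Cap total ARC size and raise the share of it allowed to hold
# metadata (dnodes, directory blocks, etc.)
vfs.zfs.arc_max="4G"
vfs.zfs.arc_meta_limit="2G"

You can watch actual metadata usage at runtime via the
kstat.zfs.misc.arcstats.arc_meta_used and .arc_meta_limit sysctls.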

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |


