ZFS and large directories - caveat report
Ivan Voras
ivoras at freebsd.org
Thu Jul 21 15:46:38 UTC 2011
I'm writing this mostly for future reference / archiving, and also in
case someone has an idea on how to improve the situation.
A web server I maintain was hit by a DoS attack, which caused more than 4
million PHP session files to be created. The session files are sharded
into 32 directories at a single level - which is normally more than enough
for this web server, as the number of users is only a couple of thousand.
With the DoS, the number of files per shard directory rose to about 130,000.
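For reference, the shard selection looks roughly like the following sketch. The actual server code is PHP; the CRC-based hash, the base path and the session ID below are my illustrative assumptions, not the real implementation:

```shell
#!/bin/sh
# Hypothetical sketch of picking one of 32 shard directories for a
# session file (hash function, path and session ID are assumptions).
sessid="abc123def456"
# Hash the session ID (POSIX cksum emits a CRC) and reduce it mod 32:
crc=$(printf '%s' "$sessid" | cksum | awk '{print $1}')
shard=$(( crc % 32 ))
echo "/var/sessions/$(printf '%02d' "$shard")/sess_$sessid"
```

Any stable hash works here; the only requirement is that the same session ID always maps to the same shard directory.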
The problem is: ZFS has proven horribly inefficient with such large
directories. I have other, more heavily loaded servers with similarly
bad / large directories on UFS where the problem is not nearly as serious
as here (probably thanks to the large dirhash). On this system, any
operation which touches even just the parent of these 32 shards (e.g.
"ls") takes seconds, and a simple "find | wc -l" on one of the shards
did not finish within 30 minutes, at which point I stopped it. Another
symptom is that SIGINT-ing such a find process takes 10-15 seconds to
take effect (which likely means the kernel-side operation is
uninterruptible for that long).
This wouldn't be a problem by itself, but operations on such directories
eat IOPS - clearly visible with the "find" test case - making the rest of
the services on the server suffer as collateral damage. Apparently a huge
amount of seeking is being done, even though I would expect all the data
to be cached for read operations - and somehow the seeking from this
operation takes priority over / livelocks other operations on the same
ZFS pool.
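For anyone who wants to reproduce the test case, here is a minimal, self-contained version of the "find | wc -l" measurement on a synthetic shard directory (scaled down to 1,000 files; on the real server each shard held about 130,000, so scale up accordingly):

```shell
#!/bin/sh
# Build a synthetic shard directory full of empty "session files" and
# count its entries the same way as on the real server. Wrap the find
# in time(1) on a real filesystem to measure the traversal latency.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 1000 ]; do
    : > "$dir/sess_$i"      # create an empty session file
    i=$(( i + 1 ))
done
count=$(find "$dir" -type f | wc -l)
echo "$count"               # prints 1000
rm -rf "$dir"
```

On the affected pool, running this with a six-figure file count while watching gstat(8) in another terminal should show the seek storm described above.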
This is on a fresh 8-STABLE AMD64, pool version 28 and zfs version 5.
Is there an equivalent of the UFS dirhash memory setting for ZFS (i.e. a
tunable for the size of the metadata cache)?
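I don't know of a direct counterpart, but for comparison, these are the knobs I have been looking at (sysctl names as I understand them on 8-STABLE; treat the ZFS ones as an assumption on my part):

```shell
# UFS: memory available to dirhash (the setting I mean above):
sysctl vfs.ufs.dirhash_maxmem
# ZFS: the closest thing I have found is the ARC metadata limit,
# which caps how much of the ARC may hold metadata:
sysctl vfs.zfs.arc_meta_limit
sysctl kstat.zfs.misc.arcstats.arc_meta_used
```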