Suggestion for hardware for ZFS fileserver

Peter Eriksson peter at ifm.liu.se
Fri Dec 21 19:25:07 UTC 2018


Just to add a few pointers based on our experience with our FreeBSD-based filers.

> 1. RAM *must* be ECC.  No wiggle room here.  Undetected RAM corruption

100% agreed. ECC is definitely the way to go!

> 2. More RAM is better, up to a point, in that cache is faster than disk
> I/O in all cases as operations are avoided.  HOWEVER, there are
> pathologies in both the FreeBSD VFS and the ARC when considered as a
> group.  I and others have tried to eliminate the pathological behavior
> under certain workloads (and some of us have had long-running debates on
> same.)  Therefore, your workload must be considered -- simply saying
> "more is better" may not be correct for your particular circumstances.

Yes, our servers all have 256GB RAM and we’ve been having some performance issues every now and then that have forced us to adjust a couple of kernel settings in order to minimise the impact for the users. I’m sure we’ll run into more in the future.

Our setup:

A number of Dell servers with 256GB RAM each, “HBA330” (LSI 3008) controllers and 14 10TB 7200rpm SAS drives, dual SLOG SSDs and dual L2ARC SSDs connected to the network via dual 10Gbps ethernet, serving users via SMB (Samba), NFS and SFTP. Managed via Puppet.

Every user gets their own ZFS filesystem with a refquota set -> about 20,000 ZFS filesystems per frontend server. We have around 110K users (students & staff), of which around 3000 (roughly 500 per server) are active at the same time currently (mostly SMB for now, but NFS is growing). LZ4 compression is enabled on all.

Every filesystem gets a snapshot taken every hour (accessible via Windows “previous versions”). 
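To sketch the per-user layout and hourly snapshots described above (pool, dataset and quota values here are hypothetical; our actual naming and sizes differ, and the Samba side needs the shadow_copy2 VFS module configured separately):

```shell
# Hypothetical per-user dataset with a refquota and LZ4 compression:
zfs create -o refquota=25G -o compression=lz4 tank/home/alice

# Hourly snapshot named by timestamp, e.g. driven from cron:
zfs snapshot "tank/home/alice@hourly-$(date +%Y%m%d-%H00)"
```

Samba can then map those snapshot names to the @GMT tokens that Windows “previous versions” expects.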

1st level backups are done via rsync to secondary servers (HPs with big SAS disk cabinets, 70 disks per cabinet), so around 100K filesystems on the biggest one right now. Snapshots are taken on those too. No users have direct access to them.
We decided against using zfs send/recv since we wanted better “fault” isolation between the primary and secondary servers in case of ZFS corruption on the primary frontend servers. Considering the panic-causing bugs with zfs send+recv that have been reported, this was probably a good choice :-)
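A per-filesystem backup pass of the kind described might look roughly like this (host, dataset and path names are made up, and -A/-X require rsync built with ACL/xattr support):

```shell
# Pull one user's filesystem from the frontend to the backup box:
# -a archive mode, -H hard links, -A ACLs, -X extended attributes.
rsync -aHAX --delete frontend1:/home/alice/ /backup/home/alice/

# Then snapshot the backup copy as well:
zfs snapshot "backup/home/alice@$(date +%Y%m%d)"
```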


This has caused some interesting problems… 

First thing we noticed was that booting would take forever… Mounting the 20-100K filesystems _and_ enabling them to be shared via NFS is not done efficiently at all: for each filesystem, /etc/zfs/exports is re-read (a couple of times) before one line is appended to the end. Repeat 20-100,000 times… Not to mention the big kernel lock for NFS (“hold all NFS activity while we flush and reinstall all sharing information per filesystem”) being done by mountd…

Wish list item #1: A BerkeleyDB-based ’sharetab’ that replaces the horribly slow /etc/zfs/exports text file.
Wish list item #2: A reimplementation of mountd and the kernel interface to allow a “diff” between the contents of the DB-based sharetab above to be fed into the kernel, instead of the brute-force way it’s done now..

(I’ve written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet though. I have looked at item #2 a bit too but not done anything about it.)


And then we have the Puppet “facter” process that does an inventory of the systems, doing things like “zfs list” (listing all filesystems, then trying to upload the result to PuppetDB, and failing due to too much data) and “zfs upgrade” (to get the first line of output with the ZFS version, which has the side effect of also doing a recursive walk through all the filesystems, taking something like 6 hours on the main backup server…). Solved that one with some binary patching and a wrapper script around /sbin/zfs :-)

Wish list item #3: A “zfs version” command that just prints the ZFS version, which Puppet ‘facter’ could use instead.
Wish list item #4: A better way to disable the ZFS filesystem enumeration that facter does (we currently binary-patch /sbin/zfs into /lbin/zfs (which doesn’t exist) in the libfacter shared library…).
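As an illustration (this is not our actual script), wrapper logic of that kind could dispatch on the zfs subcommand and short-circuit the two expensive cases, roughly:

```shell
# Hypothetical wrapper logic for a script installed where the patched
# libfacter looks (/lbin/zfs in our case); REAL_ZFS is the untouched binary.
REAL_ZFS=${REAL_ZFS:-/sbin/zfs}

zfs_wrapped() {
    case "$1" in
        upgrade)
            # "zfs upgrade -v" only lists the supported versions and
            # skips the recursive walk over every filesystem.
            "$REAL_ZFS" upgrade -v
            ;;
        list)
            # Depth 0 limits the output to the pools themselves, so
            # facter does not try to push 100K dataset names to PuppetDB.
            "$REAL_ZFS" list -d 0
            ;;
        *)
            # Everything else passes through unchanged.
            "$REAL_ZFS" "$@"
            ;;
    esac
}
```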
 
Another thing we noticed was that when the rsync backups were running, lots of things would start to go really slow, things one at first didn’t think would be affected, like everything that stats /etc/nsswitch.conf (or other files/devices) on the (separate) root disks (mirrored ZFS). Access times would inflate 100x or more. It turns out the rsync processes that stat() all the users’ files would cause the kernel vnode table to fill up, and older entries would get flushed out, like the vnode entry for /etc/nsswitch.conf (and basically everything else). So we increased the kern.maxvnodes setting a number of times… Currently we’re running with a vnode table size of 20 million (but we probably should increase it even more). It uses a number of GBs of RAM, but for us it’s worth it.
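The tuning itself is just a sysctl (the value below matches what we run; size it to your RAM):

```shell
# Raise the vnode table cap at runtime:
sysctl kern.maxvnodes=20000000

# Make it persistent across reboots:
echo 'kern.maxvnodes=20000000' >> /etc/sysctl.conf
```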

Wish list item #5: Separate vnode tables per ZFS pool, or some way to “protect” “important” vnodes...


Another thing we noticed recently was that the ARC, which had a default cap of about 251GB, would use all of that, and then we would have a three-way fight over memory between the ARC, the vnode table and the 500+ ’smbd’ user processes that use quite a lot of RAM per process (and all the others), causing the kernel pagedaemon to work a lot and the Samba smbd & winbindd daemons to become really slow (multi-second login times for users).

Solved that one by capping the ARC to 128GB… (We can probably increase it a bit, but 128GB seems to be plenty for us right now.) Now the ARC gets 50% of the machine and the rest has plenty of RAM to play with.
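The ARC cap is a loader tunable on FreeBSD, set in bytes (128GB shown, matching our setting):

```shell
# /boot/loader.conf: cap the ARC at 128GB (128 * 2^30 bytes).
echo 'vfs.zfs.arc_max="137438953472"' >> /boot/loader.conf
```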

Dunno what to wish for here though :-)



> 4. On FreeBSD I prefer GELI on the base partition to which ZFS is then
> pointed as a pool member for encryption at the present time.  It's
> proven, uses AES hardware acceleration on modern processors and works. 
> Read the documentation carefully and understand your options for keying
> (e.g. password only, password + key file, etc) and how you will manage
> the security of the key component(s).

Yep, we use GELI+ZFS too on one server. Works fine!
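For reference, a minimal GELI-under-ZFS setup along those lines (device name, key path and pool name are illustrative only; geli init as shown uses passphrase + key file, and you should read geli(8) carefully before committing data):

```shell
# One-time init of the provider: AES-XTS, 256-bit key, key file + passphrase.
geli init -e AES-XTS -l 256 -K /root/da1.key /dev/da1

# Attach it (creates /dev/da1.eli) and build the pool on the .eli device:
geli attach -k /root/da1.key /dev/da1
zpool create tank /dev/da1.eli
```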

- Peter



More information about the freebsd-fs mailing list