Suggestion for hardware for ZFS fileserver

Peter Eriksson peter at ifm.liu.se
Sat Dec 22 00:46:55 UTC 2018



> On 22 Dec 2018, at 00:49, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> 
> Peter Eriksson wrote:
> [good stuff snipped]
>> This has caused some interesting problems…
>> 
>> First thing we noticed was that booting would take forever… Mounting the 20-100k filesystems _and_ enabling them to be shared via NFS is not done efficiently at all (for each filesystem it re-reads /etc/zfs/exports (a couple of times) before appending one line to the end). Repeat 20-100,000 times… Not to mention the big kernel lock for NFS “hold all NFS activity while we flush and reinstall all sharing information per filesystem” being done by mountd…
> Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> file systems would have been a large server. Scaling to 10,000 or more file
> systems wasn't even conceivable back then.

Yeah, for a normal user with non-silly numbers of filesystems this is a non-issue. Anyway, it’s the kind of issue that I like to think about how to solve. It’s fun :-)


>> Wish list item #1: A BerkeleyDB-based ’sharetab’ that replaces the horribly slow /etc/zfs/exports text file.
>> Wish list item #2: A reimplementation of mountd and the kernel interface to allow a “diff” between the contents of the DB-based sharetab above be input into the kernel instead of the brute-force way it’s done now..
> The parser in mountd for /etc/exports is already an ugly beast and I think
> implementing a "diff" version will be difficult, especially figuring out what needs
> to be deleted.

Yeah, I tried to decode it (this summer) and I think I sort of got the hang of it eventually. 


> I do have a couple of questions related to this:
> 1 - Would your case work if there was an "add these lines to /etc/exports"?
>     (Basically adding entries for file systems, but not trying to delete anything
>      previously exported. I am not a ZFS guy, but I think ZFS just generates another
>      exports file and then gets mountd to export everything again.)

Yeah, the ZFS library that the zfs commands use just reads and updates the separate /etc/zfs/exports text file (and has mountd read both /etc/exports and /etc/zfs/exports). The problem is that what it basically does when you tell it to “zfs mount -a” (mount all filesystems in all zpools) is a big loop (pseudocode):

for P in ZPOOLS; do
  for Z in ZFILESYSTEMS-AND-SNAPSHOTS in $P; do
    mount $Z
    if $Z has the “sharenfs” option; then
      open /etc/zfs/exports
      read until a matching line is found and replace it with the new options;
        if no match is found, append the options at the end
      close /etc/zfs/exports
      signal mountd
        (which then opens /etc/exports and /etc/zfs/exports and does its magic)
    fi
  done
done

All wrapped up in a Solaris compatibility layer in libzfs. Actually I think it even reads the /etc/zfs/exports file twice for each loop iteration due to some abstractions.

Btw, things got really “fun” when the hourly snapshots we were taking (adding 10-20k new snapshots every hour, and we didn’t expire them fast enough in the beginning) triggered the code above and that code took longer than 1 hour to execute (mountd was 100% busy getting signalled and rereading, flushing and reinstalling exports into the kernel all the time), so it basically never finished. Luckily we didn’t have any NFS clients accessing the servers at that time :-)
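(Back-of-the-envelope: if sharing N filesystems means re-scanning an exports file that already holds up to N lines before each append, one full “zfs mount -a” pass costs roughly N²/2 line reads. With N around 100,000 that is on the order of five billion reads, plus about N signals to mountd, each of which re-parses both exports files and reloads the whole export table into the kernel.)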

This summer I wrote some code to use a Btree BerkeleyDB file instead, and modified the libzfs code and the mountd daemon to use that database for much faster lookups (no need to read the whole /etc/zfs/exports file all the time) and additions. Worked pretty well actually, and wasn’t that hard to add. I also wanted to add the possibility of specifying the “exports” arguments “Solaris”-style, so one could say things like:

	/export/staff 	vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3

But I never finished that (Solaris-style export options) part…

We’ve lately been toying with putting the NFS sharing options into a separate “private” ZFS attribute (separate from the official “sharenfs” one) and having another tool read those instead and generate the “exports” file, so that the file can be built in one go and mountd signalled just once after all filesystems have been mounted. Unfortunately that would mean that nothing is shared until all of them have been mounted, but we think it would take less time all in all.
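Roughly the shape I have in mind for that tool, as a sketch only (the “liu:sharenfs” property name, the paths and the exact output format are just placeholders, not something that exists today):

#!/bin/sh
# Sketch: build the whole exports file in one pass and poke mountd once,
# instead of rewriting the file and signalling once per filesystem.
# "liu:sharenfs" is a made-up private property; "on" means default options.

TMP=/etc/zfs/exports.new

zfs get -H -t filesystem -o name,value liu:sharenfs |
while read -r fs opts; do
        [ "$opts" = "-" ] && continue               # property not set, skip
        mntpt=$(zfs get -H -o value mountpoint "$fs")
        [ "$opts" = "on" ] && opts=""               # share with default options
        printf '%s\t%s\n' "$mntpt" "$opts"
done > "$TMP"

mv "$TMP" /etc/zfs/exports
kill -HUP "$(cat /var/run/mountd.pid)"              # one reload instead of N

The real thing would of course have to deal with legacy/unset mountpoints, quoting and so on, but the point is: one pass over the datasets, one write of the file, one signal to mountd.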

We also modified the FreeBSD boot scripts to make sure we first mount the most important ZFS filesystems that are needed on the boot disks (not just /), and then mount (and share via NFS) the rest in the background, so we can log in to the machine as root early (no need for everything to have been mounted before we get a login prompt).
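The change is basically of this shape (a hedged sketch; the dataset names are invented, and the real version lives in the rc scripts rather than in a stand-alone script):

#!/bin/sh
# Mount only what is needed to bring the machine up and let root log in.
for fs in zroot/var zroot/usr/local zroot/usr/home; do
        zfs mount "$fs"
done

# Everything else (the mounting and the per-filesystem NFS sharing described
# above) continues in the background while logins already work.
zfs mount -a &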

(Right now a reboot of the bigger servers takes an hour or two before all filesystems are mounted and exported.)

 
> 2 - Are all (or maybe most) of these ZFS file systems exported with the same
>      arguments?
>      - Here I am thinking that a "default-for-all-ZFS-filesystems" line could be
>         put in /etc/exports that would apply to all ZFS file systems not exported
>         by explicit lines in the exports file(s).
>      This would be fairly easy to implement and would avoid trying to handle
>      1000s of entries.

For us most of them have exactly the same export arguments. (We set the options on the top-level filesystems (/export/staff, /export/students etc.) and then all the home dirs inherit those.)
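I.e. something along these lines (the dataset names and the option string are just an illustration, not our exact settings):

# Set the share options once, on the top-level filesystem ...
zfs set sharenfs='-sec=krb5:krb5i:krb5p -network 130.236.0.0/16' export/staff

# ... and every home directory created under it inherits them.
zfs create export/staff/pelle
zfs get -s inherited sharenfs export/staff/pelle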

> In particular, #2 above could be easily implemented on top of what is already
> there, using a new type of line in /etc/exports and handling that as a special
> case by the NFS server code, when no specific export for the file system to the
> client is found.
> 
>> (I’ve written some code that implements item #1 above and it helps quite a bit. Nothing near production quality yet though. I have looked at item #2 a bit too but not done anything about it.)
> [more good stuff snipped]
> Btw, although I put the questions here, I think a separate thread discussing
> how to scale to 10000+ file systems might be useful. (On freebsd-fs@ or
> freebsd-current@. The latter sometimes gets the attention of more developers.)

Yeah, probably a good idea!

- Peter

> rick
> 
> 


