Suggestion for hardware for ZFS fileserver

Sami Halabi sodynet1 at gmail.com
Sat Dec 22 14:49:36 UTC 2018


Hi,

What SAS HBA card do you recommend with 16/24 internal ports and 2
external ports that is recognized and works well with FreeBSD ZFS?
Sami

On Sat, 22 Dec 2018, 2:48, Peter Eriksson <peter at ifm.liu.se> wrote:

>
>
> > On 22 Dec 2018, at 00:49, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> >
> > Peter Eriksson wrote:
> > [good stuff snipped]
> >> This has caused some interesting problems…
> >>
> >> First thing we noticed was that booting would take forever… Mounting
> >> the 20-100k filesystems _and_ enabling them to be shared via NFS is
> >> not done efficiently at all (for each filesystem it re-reads
> >> /etc/zfs/exports (a couple of times) before appending one line to the
> >> end). Repeat 20-100,000 times… Not to mention the big kernel lock for
> >> NFS (“hold all NFS activity while we flush and reinstall all sharing
> >> information per filesystem”) being done by mountd…
> > Yes, /etc/exports and mountd were implemented in the 1980s, when a dozen
> > file systems would have been a large server. Scaling to 10,000 or more
> > file systems wasn't even conceivable back then.
>
> Yeah, for a normal user with a non-silly number of filesystems this is a
> non-issue. Anyway, it’s the kind of problem I like thinking about how to
> solve. It’s fun :-)
>
>
> >> Wish list item #1: A BerkeleyDB-based ’sharetab’ that replaces the
> >> horribly slow /etc/zfs/exports text file.
> >> Wish list item #2: A reimplementation of mountd and the kernel
> >> interface to allow a “diff” between the contents of the DB-based
> >> sharetab above to be input into the kernel, instead of the brute-force
> >> way it’s done now.
> > The parser in mountd for /etc/exports is already an ugly beast and I
> > think implementing a "diff" version will be difficult, especially
> > figuring out what needs to be deleted.
>
> Yeah, I tried to decode it (this summer) and I think I sort of got the
> hang of it eventually.
>
>
> > I do have a couple of questions related to this:
> > 1 - Would your case work if there was an "add these lines to
> >     /etc/exports"?
> >     (Basically adding entries for file systems, but not trying to delete
> >      anything previously exported. I am not a ZFS guy, but I think ZFS
> >      just generates another exports file and then gets mountd to export
> >      everything again.)
>
> Yeah, the ZFS library that the zfs commands use just reads and updates the
> separate /etc/zfs/exports text file (and has mountd read both /etc/exports
> and /etc/zfs/exports). The problem is that what it basically does when you
> tell it to “zfs mount -a” (mount all filesystems in all zpools) is a big
> (pseudocode):
>
> for P in ZPOOLS; do
>   for Z in FILESYSTEMS-AND-SNAPSHOTS-IN-$P; do
>     mount $Z
>     if $Z has the “sharenfs” option; then
>       open /etc/zfs/exports
>       read until a matching line is found and replace it with the new
>         options; if none is found, append the options
>       close /etc/zfs/exports
>       signal mountd
>         (which then opens /etc/exports and /etc/zfs/exports and does
>         its magic)
>     fi
>   done
> done
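> As an illustration of the quadratic cost, here is a minimal runnable
> sketch of that rescan-and-append pattern (the paths and export options
> are invented for the example; the real logic lives in libzfs):

```shell
# Each "share" step re-scans the whole exports file before appending a
# single line, so sharing N filesystems costs N full passes over a file
# that is itself growing -- O(N^2) work overall.
EXPORTS=$(mktemp)
for fs in /export/a /export/b /export/c; do
    # full scan for an existing entry (the expensive part at 100k entries)
    if grep -q "^${fs}[[:space:]]" "$EXPORTS"; then
        :   # a real implementation would rewrite the matching line here
    else
        printf '%s\t-network 130.236.0.0/16\n' "$fs" >> "$EXPORTS"
    fi
    # ...at this point mountd would be signalled and would re-read
    # /etc/exports and /etc/zfs/exports from scratch
done
nlines=$(wc -l < "$EXPORTS" | tr -d ' ')
rm -f "$EXPORTS"
```

> (Each added filesystem pays for re-reading every entry added before it,
> which is why 20-100k filesystems make booting take forever.)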
>
> All wrapped up in a Solaris compatibility layer in libzfs. Actually I
> think it even reads the /etc/zfs/exports file twice for each loop
> iteration due to some abstractions. Btw, things got really “fun” when the
> hourly snapshots we were taking (adding 10-20k new snapshots every hour,
> and we didn’t expire them fast enough in the beginning) triggered the code
> above and that code took longer than 1 hour to execute - mountd was 100%
> busy getting signalled and rereading, flushing and reinstalling exports
> into the kernel all the time, and basically never finished. Luckily we
> didn’t have any NFS clients accessing the servers at that time :-)
>
> This summer I wrote some code to instead use a Btree BerkeleyDB file and
> modified the libzfs code and the mountd daemon to use that database for
> much faster lookups (no need to read the whole /etc/zfs/exports file all
> the time) and additions. It worked pretty well actually and wasn’t that
> hard to add. I also wanted to add the possibility of specifying “exports”
> arguments Solaris-style, so one could say things like:
>
>         /export/staff   vers=4,sec=krb5:krb5i:krb5p,rw=130.236.0.0/16,sec=sys,ro=130.236.160.0/24:10.1.2.3
>
> But I never finished that (Solaris-style exports options) part…
>
> We’ve lately been toying with putting the NFS sharing stuff into a
> separate “private” ZFS attribute (separate from the official “sharenfs”
> one) and having another tool read those instead and generate another
> “exports” file, so that file can be generated in “one go” and mountd
> signalled just once after all filesystems have been mounted.
> Unfortunately that would mean that they won’t be shared until after all
> of them have been mounted, but we think it would take less time
> all-in-all.
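> As a sketch of that “one go” generation (names invented for the example:
> the private property might be called nfs:share, and in practice the
> pairs would come from something like `zfs get -H -o name,value
> nfs:share`, followed by a single SIGHUP to mountd):

```shell
# Build the whole exports file in a single pass, instead of one
# read-modify-append cycle per filesystem as libzfs does today.
OUT=$(mktemp)
gen_exports() {
    # stdin: one "<filesystem> <export options>" pair per line
    while read -r fs opts; do
        printf '%s\t%s\n' "$fs" "$opts"
    done > "$OUT"
}
# hypothetical top-level filesystems with inherited options
gen_exports <<'EOF'
/export/staff -sec=krb5:krb5i:krb5p -network 130.236.0.0/16
/export/students -sec=krb5:krb5i:krb5p -network 130.236.0.0/16
EOF
entries=$(wc -l < "$OUT" | tr -d ' ')
rm -f "$OUT"
```

> (One write and one signal, however many filesystems there are, instead
> of one rewrite and one signal per filesystem.)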
>
> We also modified the FreeBSD boot scripts so that we make sure to first
> mount the most important ZFS filesystems that are needed on the boot
> disks (not just /), and then we mount (and share via NFS) the rest in the
> background, so we can log in to the machine as root early (no need for
> everything to have been mounted before we get a login prompt).
>
> (Right now a reboot of the bigger servers takes an hour or two before all
> filesystems are mounted and exported.)
>
>
> > 2 - Are all (or maybe most) of these ZFS file systems exported with the
> >      same arguments?
> >      - Here I am thinking that a "default-for-all-ZFS-filesystems" line
> >         could be put in /etc/exports that would apply to all ZFS file
> >         systems not exported by explicit lines in the exports file(s).
> >      This would be fairly easy to implement and would avoid trying to
> >      handle 1000s of entries.
>
> For us most have exactly the same exports arguments. (We set options on
> the top level filesystems (/export/staff, /export/students etc) and then
> all the home dirs inherit those.)
>
> > In particular, #2 above could be easily implemented on top of what is
> > already there, using a new type of line in /etc/exports and handling
> > that as a special case by the NFS server code, when no specific export
> > for the file system to the client is found.
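> > For illustration, such a default line might look something like this
> > (purely hypothetical syntax - no such keyword exists in exports(5)
> > today, and the options shown are made up):

```
# Hypothetical /etc/exports fragment: explicit entries are matched first,
# and the invented ZFSDEFAULT line applies to any ZFS filesystem that has
# no explicit entry of its own.
/export/special -maproot=root -network 10.1.2.0/24
ZFSDEFAULT -sec=krb5:krb5i:krb5p -network 130.236.0.0/16
```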
> >
> >> (I’ve written some code that implements item #1 above and it helps
> >> quite a bit. Nothing near production quality yet though. I have looked
> >> at item #2 a bit too but not done anything about it.)
> > [more good stuff snipped]
> > Btw, although I put the questions here, I think a separate thread
> > discussing how to scale to 10000+ file systems might be useful. (On
> > freebsd-fs@ or freebsd-current at . The latter sometimes gets the
> > attention of more developers.)
>
> Yeah, probably a good idea!
>
> - Peter
>
> > rick
> >
> >
>
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
>