some ZFS questions
Paul Kraus
paul at kraus-haus.org
Fri Aug 8 16:36:14 UTC 2014
On Aug 8, 2014, at 3:06, Scott Bennett <bennett at sdf.org> wrote:
> Well, I need space for encrypted file systems and for unencrypted file
> systems at a roughly 1:3 ratio. I have four 2 TB drives for the purpose
> already, but apparently need at least two more. If zvol+geli+UFS is not the
> way and using multiple slices/partitions is not the way either, then how
> should I set it up?
How much data do you need to store?
With four 2TB drives I would set up either a 2x2 mirror (two striped 2-way mirrors) if I needed random I/O performance, or a 4-drive RAIDz2 if I needed reliability (the RAIDz2 configuration has a substantially higher MTTDL than a 2-way mirror).
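As a rough sketch, assuming the four drives show up as ada1 through ada4 (the pool name and device names are only placeholders):

zpool create tank mirror ada1 ada2 mirror ada3 ada4

gives you the 2x2 layout (two striped 2-way mirrors), while

zpool create tank raidz2 ada1 ada2 ada3 ada4

gives you the single 4-drive RAIDz2 vdev. Both give you roughly half the raw capacity; the mirrored pool has better random I/O, while the RAIDz2 pool survives *any* two drives failing.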
How much does the geli encryption cost in terms of space and speed? Is there a strong reason not to encrypt ALL the data? It can live in different zfs datasets (the ZFS term for a filesystem). In fact, I am NOT a fan of using the base dataset that is created with every zpool; I always create additional zfs datasets below the root of the zpool.
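If you do go the encrypt-everything route, the usual FreeBSD approach is geli providers underneath the pool and plain zfs datasets on top. A minimal sketch, with placeholder device, pool, and dataset names (repeat the geli steps for each drive):

geli init -s 4096 /dev/ada1
geli attach /dev/ada1
zpool create tank raidz2 ada1.eli ada2.eli ada3.eli ada4.eli
zfs create tank/private
zfs create tank/shared

The datasets can then each have their own properties (quotas, compression, mountpoints) even though they all live in the same encrypted pool.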
Note that part of the reason it is not recommended to create more than one vdev per physical device is that load on one zpool can then affect the performance of the other. It also means that you cannot readily predict the performance of *either*, as they will interact with each other. Neither of the above may apply to you, but knowing *why* can help you choose to ignore a recommendation :-)
>
> I see. I had gathered from the zpool(8) man page's two forms of the
> "replace" subcommand that the form shown above should be used if the failing
> disk were still somewhat usable, but that the other form should be used if
> the failing disk were already a failed disk. I figured from that that ZFS
> would try to get whatever it could from the failing disk and only recalculate
> from the rest for blocks that couldn't be read intact from the failing disk.
> If that is not the case, then why bother to have two forms of the "replace"
> subcommand? Wouldn't it just be simpler to unplug the failing drive, plug
> in the new drive, and then use the other form of the "replace" subcommand,
> in which case that would be the *only* form of the subcommand?
I suspect that is legacy usage. The behavior of resilver (and scrub) operations changed a fair bit in the first couple of years of ZFS's use in the real world. One of the HUGE advantages of the OpenSolaris project was the very fast feedback from the field directly to the developers. You still see that today in the OpenZFS project. While I am not a developer, I do subscribe to the ZFS developer mailing list to read what is being worked on and why.
> In any case, that is why I was asking what would happen in the
> mid-rebuild failure situation. If both subcommands are effectively identical,
> then I guess it shouldn't be a big problem.
IIRC, at some point the replace operation (resilver) was modified to use a “failing” device to speed the process if it is still available. You still need to read the data and compare it to the checksum, but it can be faster if you have the bad drive around for some of the data. But my memory here may also be faulty; this is a good question to ask over on the ZFS list.
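For reference, the two forms look like this (pool and device names are made up; see zpool(8)):

zpool replace tank ada2 ada5
zpool replace tank ada2

The first form is for when the old drive is still attached and the new drive was added as a separate device (ada5 here); the old drive stays in the pool as part of a temporary “replacing” vdev until the resilver completes. The second form is for when you physically swapped the drive in the same slot. Either way ZFS resilvers onto the new device.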
> How does one set a limit? Is there an undocumented sysctl variable
> for it?
$ sysctl -a | grep vfs.zfs
to find all the zfs handles (not all may be tunable)
Set them in /boot/loader.conf
vfs.zfs.arc_max="nnnM" is what you want :-)
If /boot/loader.conf does not exist, create it; it uses the same format as /boot/defaults/loader.conf (but do not change things there, as they may be overwritten by OS updates/upgrades).
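As a concrete (made-up) example, to cap the ARC at 4 GB you would put

vfs.zfs.arc_max="4G"

in /boot/loader.conf, reboot, and then check that it took with

sysctl vfs.zfs.arc_max

Pick the actual value based on how much RAM you want left for everything else; 4G is only an illustration.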
> However, no one seems to have tackled my original question 4) regarding
> "options KVA_PAGES=n". Care to take a stab at it?
See the writeup at https://wiki.freebsd.org/ZFSTuningGuide
I have not needed to make these tunings, so I cannot confirm them, but they have been out there for long enough that I suspect if they were wrong (or bad) they would have been corrected or removed.
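For what it is worth, the i386 part of that guide boils down to building a custom kernel with a larger kernel virtual address space, roughly

options KVA_PAGES=512

in your kernel configuration file, plus the vm.kmem_size and vfs.zfs.arc_max tunables in /boot/loader.conf. The 512 value is the guide's suggestion, not something I have tested, so read the wiki page before committing to it.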
> If ZFS has no way to prevent thrashing in that situation, then that is
> a serious design deficiency in ZFS.
Before you start making claims about “design deficiencies” in ZFS I suggest you take a good hard look at the actual design and the criteria it was designed to fulfill. ZFS was NOT designed to be easy on drives. Nor was it designed to be easy on any of the other hardware (CPU or RAM). It WAS designed to be as fault tolerant as any physical system can be. It WAS designed to be incredibly scalable. It WAS designed to be very portable. It was NOT designed to be cheap.
> Does that then leave me with just the zvol+geli+UFS way to proceed?
> I mean, I would love to be wealthy enough to throw thrice as many drives
> into this setup, but I'm not. I can get by with using a single set of drives
> for the two purposes that need protection against device failure and silent
> data corruption and then finding a smaller, cheaper drive or two for the
> remaining purposes, but devoting a whole set of drives to each purpose is
> not an option. If ZFS really needs to be used that way, then that is another
> serious design flaw,
You seem to be annoyed that ZFS was not designed for your specific requirements. I would not say that ZFS has a “serious design flaw” simply because it was not designed for the exact configuration you need. What you need is the Oracle implementation of encryption under ZFS, which you can get by paying for it.
--
Paul Kraus
paul at kraus-haus.org