some ZFS questions

Fri Aug 8 07:06:20 UTC 2014

Andrew Berg <aberg010 at my.hennepintech.edu> wrote:

> On 2014.08.07 03:16, Scott Bennett wrote:
> >      On Wed, 6 Aug 2014 03:49:37 -0500 Andrew Berg
> > <aberg010 at my.hennepintech.edu> wrote:
> >>On 2014.08.06 02:32, Scott Bennett wrote:
> >>>      I have a number of questions that I need answered before I go about
> >>> setting up any raidz pools.  They are in no particular order.
> >>> 
> >>> 	1) What is the recommended method of using geli encryption with
> >>> 	ZFS?
> >>
> >>> Does one first create .eli devices and then specify those
> >>> 	.eli devices in the zpool(8) command as the devices to include
> >>> 	in the pool? 
> >>This.
> > 
> >      Oh.  Well, that's doable, if not terribly convenient, but it brings up
> > another question.  After a reboot, for example, what does ZFS do while the
> > array of .eli devices is being attached one by one?  Does it see the first
> > one attached without the others in sight and decide it has a failed pool?
> Once you bring the .eli devices back online, zpool will see them and your pool
> will be back online. Before then, it won't really do anything but tell you the
> disks are not available and therefore, neither is your pool. The status of the
> pool is 'unavailable', not 'faulted'.

     Okay.  That's good.
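     As a rough sketch of that sequence, attaching the providers and then
checking on the pool might look like the following.  The pool name "tank"
and the ada*p2 partitions are made-up names, and the helper only prints the
commands instead of running them.  Depending on the setup, an explicit
"zpool import" may also be needed; that part is an assumption, not something
confirmed in this thread.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that would attach each geli provider
# and then check the pool built on the resulting .eli devices.
# "tank" and the ada*p2 partitions are hypothetical names.
print_unlock_steps() {
    pool="$1"; shift
    for part in "$@"; do
        # geli attach asks for the passphrase/key and creates /dev/<part>.eli
        echo "geli attach /dev/${part}"
    done
    # Once every .eli member exists, the pool should no longer be unavailable
    echo "zpool status ${pool}"
}

print_unlock_steps tank ada0p2 ada1p2 ada2p2 ada3p2
```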
>
> >>mercilessly thrash disks; standard reads and writes are given higher priority
> >>in the scheduler than resilver and scrub operations.
> > 
> >      If two pools use different partitions on a drive and both pools are
> > rebuilding those partitions at the same time, then how could ZFS *not*
> > be hammering the drive?
> A good reason not to set up your pools like that.

     Well, I need space for encrypted file systems and for unencrypted file
systems at a roughly 1:3 ratio.  I have four 2 TB drives for the purpose
already, but apparently need at least two more.  If zvol+geli+UFS is not the
way and using multiple slices/partitions is not the way either, then how
should I set it up?
>
> >>> 	3) If a raidz2 or raidz3 loses more than one component, does one
> >>> 	simply replace and rebuild all of them at once?  Or is it necessary
> >>> 	to rebuild them serially?  In some particular order?
> >>AFAIK, replacement of several disks can't be done in a single command, but I
> >>don't think you need to wait for a resilver to finish on one before you can
> >>replace another.
> > 
> >      That looks good.  What happens if a "zpool replace failingdrive newdrive"
> > is running when the failingdrive actually fails completely?
> Assuming you don't trigger some race condition (which would be rare if you're
> using decent controllers), nothing special. A disk doesn't need to be present
> and functioning to be replaced.

     I see.  I had gathered from the zpool(8) man page's two forms of the
"replace" subcommand that the form shown above should be used if the failing
disk were still somewhat usable, but that the other form should be used if
the failing disk were already a failed disk.  I figured from that that ZFS
would try to get whatever it could from the failing disk and only recalculate
from the rest for blocks that couldn't be read intact from the failing disk.
If that is not the case, then why bother to have two forms of the "replace"
subcommand?  Wouldn't it just be simpler to unplug the failing drive, plug
in the new drive, and then use the other form of the "replace" subcommand,
in which case that would be the *only* form of the subcommand?
     In any case, that is why I was asking what would happen in the
mid-rebuild failure situation.  If both subcommands are effectively identical,
then I guess it shouldn't be a big problem.
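     For comparison, the two forms in zpool(8) differ only in whether
new_device is given; when it is omitted it defaults to the old device, which
the man page describes as useful after a failed disk has been physically
replaced.  A sketch, with hypothetical pool and device names, that prints
each form rather than running it:

```shell
#!/bin/sh
# Dry-run sketch of the two "zpool replace" forms from zpool(8).
# Pool and device names are hypothetical; commands are printed, not run.
print_replace() {
    pool="$1"; old="$2"; new="$3"
    if [ -n "$new" ]; then
        # Form 1: the old device is named and the new one is a different node
        echo "zpool replace ${pool} ${old} ${new}"
    else
        # Form 2: new_device omitted, so it defaults to old_device; meant
        # for a disk that was physically swapped at the same device node
        echo "zpool replace ${pool} ${old}"
    fi
}

print_replace tank ada2p2.eli ada4p2.eli
print_replace tank ada2p2.eli
```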
>
> >>> 	5) When I upgrade to amd64, the usage would continue to be low-
> >>> 	intensity as defined above.  Will the 4 GB be enough?  I will not
> >>> 	be using the "deduplication" feature at all.
> >>It will be enough unless you are managing tens of TB of data. I recommend
> >>setting an ARC limit of 3GB or so. There is a patch that makes the ARC handle

     How does one set a limit?  Is there an undocumented sysctl variable
for it?
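     For what it's worth, on FreeBSD of this vintage the ARC ceiling is
normally set with the vfs.zfs.arc_max loader tunable in /boot/loader.conf.
A small sketch that prints such a line for a given number of GiB follows; the
helper name arc_max_line is made up for illustration, and the 3 GiB figure is
just the value suggested above.

```shell
#!/bin/sh
# Sketch: print a /boot/loader.conf line capping the ZFS ARC at N GiB.
# arc_max_line is a hypothetical helper; nothing is written to disk.
arc_max_line() {
    gib="$1"
    # Convert GiB to bytes so the value is unambiguous
    bytes=$(($gib * 1024 * 1024 * 1024))
    echo "vfs.zfs.arc_max=\"${bytes}\""
}

arc_max_line 3
```

The printed line would then be added to /boot/loader.conf by hand and takes
effect at the next boot.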
> > 
> >      3 GB for ARC plus whatever is needed for FreeBSD would not leave much room
> > for applications to run.  Maybe I won't be able to use ZFS if it requires
> > so vastly more page-fixed memory than UFS. :-(
> 3GB is the hard limit here. If applications need more, they'll get it. The only
> reason to set a limit at all is that the ARC currently has issues giving up
> memory gracefully. As I said, there's a patch in discussion to fix it.

     Oh, okay.  I misunderstood.  I can deal with it being just an upper
bound.
     However, no one seems to have tackled my original question 4) regarding
"options KVA_PAGES=n".  Care to take a stab at it?
>
> >      One thing I ran across was the following from the zpool(8) man page.
> > 
> >   "For pools to be portable, you must give the zpool command whole
> >   disks, not just slices, so that ZFS can label the disks with portable
> >   EFI labels. Otherwise, disk drivers on platforms of different endian-
> >   ness will not recognize the disks."
> Well, that is kind of confusing since slices != partitions and partitions
> aren't mentioned. Using slices is also something someone would generally not do
> with GPT. I'll look at that part of the man page and maybe bring it up on the

     While you're at it, take a gander also at the gpart(8) man page, wherein
the list of partition types includes "freebsd", in which one uses bsdlabel(8)
to create subpartitions (e.g., ada0p1b, where the bsdlabel could specify that
as having a "swap" subpartition type or a pointless "vinum" subpartition type;
see next), "freebsd-vinum", which is useless because gvinum(8) and GPT
partitioning are mutually incompatible, "freebsd-zfs", which the man page
says is for a "FreeBSD partition that contains a ZFS volume" (not a zpool!),
and "mbr", which can then be subdivided into slices and the slices further
subdivided by bsdlabels (think ada0p1s1b :-).

> doc and fs MLs.
>
> > If I have one raidzN comprising .eli partitions and another raidzN comprising
> > a set of unencrypted partitions on those same drives, will I be able to
> > export both raidzN pools from a 9-STABLE system and then import them
> > into, say, a 10-STABLE system on a different Intel amd64 machine?  By your
> > answer to question 1), it would seem that I need to have two raidzN pools,
> > although there might be a number of benefits to having both encrypted and
> > unencrypted file systems allocated inside a single pool were that an option.
> Having any physical disk be a part of more than one pool is not recommended
> (except perhaps for cache and log devices where failure is not a big deal). Not
> only can it cause thrashing as you mentioned above, but one disk dying makes
> both pools degraded. Lose two disks, and you lose both pools. If you need only

     If ZFS has no way to prevent thrashing in that situation, then that is
a serious design deficiency in ZFS.

> some things encrypted, perhaps something that works above the FS layer such as
> PEFS would be a better option for you.

     Hmm.  It doesn't look like PEFS will suffice.  1) It is not part of
the FreeBSD base system and therefore, for something as fundamental as a
file system, must be considered experimental.  2) Its documentation is even
thinner than that available for ZFS.  3) What documentation I did find seems
to suggest that PEFS may not conceal things like the fraction of the space
currently in use, the directory structure (though the names are encrypted),
and possibly some other things (e.g., file owner, group, permissions, size).
     Does that then leave me with just the zvol+geli+UFS way to proceed?
I mean, I would love to be wealthy enough to throw thrice as many drives
into this setup, but I'm not.  I can get by with using a single set of drives
for the two purposes that need protection against device failure and silent
data corruption and then finding a smaller, cheaper drive or two for the
remaining purposes, but devoting a whole set of drives to each purpose is
not an option.  If ZFS really needs to be used that way, then that is another
serious design flaw, one nearly as bad as gvinum's insistence upon writing
its entire configuration at the end of the physical disk instead of at the
end of the GEOM device node given to it.
     Once again, thank you for the information you've provided.  I'll get
to the bottom of this stuff eventually with help from you and others on
this list, I'm sure.

                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************