some ZFS questions

Thu Aug 21 10:08:14 UTC 2014

Paul Kraus <paul at kraus-haus.org> wrote:
> On Aug 7, 2014, at 4:16, Scott Bennett <bennett at sdf.org> wrote:
>
> >     If two pools use different partitions on a drive and both pools are
> > rebuilding those partitions at the same time, then how could ZFS *not*
> > be hammering the drive?  The access arm would be doing almost nothing but
> > endless series of long seeks back and forth between the two partitions
> > involved.
>
> How is this different from real production use with, for example, a large database? Even with a single vdev per physical drive you generate LOTS of RANDOM I/O during a resilver. Remember that a ZFS resilver is NOT like other RAID resync operations. It is NOT a sequential copy of existing data. It is functionally a replay of all the data written to the zpool as it walks the UberBlock. The major difference between a resilver and a scrub is that the resilver is expecting to be writing data to one (or more) vdevs, while the scrub is mainly a read operation (still generating LOTS of random I/O) looking for errors in the read data (and correcting such when found).

     I wasn't thinking in terms of a large data base, but rather in terms
of my particular usage.  What I need all this space for is primarily archival
storage, so it will not be heavily accessed under normal circumstances, at
least not once the initial, mass loading has been done.  Speed is not expected
to be all that important on a day-to-day basis, but integrity of the data will
be.
     Some of the archives will be encrypted.  About three times as much space
as the encrypted portion is expected to be used for unencrypted data.  In the
event of a drive being replaced, what I wish to avoid would be to have two
rebuilding operations going on simultaneously on the same set of component
drives.  If I can combine the encrypted and unencrypted data into a single
pool in ZFS, then that eliminates the problem.
>
> >  When you're talking about hundreds of gigabytes to be written
> > to each partition, it could take months or even years to complete, during
> > which time something else is almost certain to fail and halt the rebuilds.
>
> In my experience it is not the amount of data to be re-written, but the number of writes that created the data. For example, a zpool that is mostly write once (a media library, for example, where each CD is written once, never changed, and read lots) will resilver much faster than a zpool with lots of small random writes and lots of deletions (like a busy database). See my blog post here: http://pk1048.com/zfs-resilver-observations/ for the most recent resilver I had to do on my home server. I needed to scan 2.84TB of data to rewrite 580GB, and it took just under 17 hours.

     I'll take a look at that article once I have X11 working again.  Thanks
for the URL.  That recovery time looks reasonable and reassures me quite a
bit.
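
For a rough sense of scale, the figures quoted above (2.84TB scanned and
580GB rewritten in just under 17 hours) can be turned into average rates.
This is only a back-of-the-envelope sketch: the helper name rate_mb_s is
made up here, decimal units (1 GB = 1000 MB) are assumed, and because the
run took slightly less than 17 hours the results are lower bounds.

```shell
#!/bin/sh
# rate_mb_s GIGABYTES HOURS
# Hypothetical helper: average throughput in MB/s, assuming decimal units.
rate_mb_s() {
    awk -v gb="$1" -v h="$2" 'BEGIN { printf "%.1f\n", gb * 1000 / (h * 3600) }'
}

rate_mb_s 2840 17    # data scanned across the pool
rate_mb_s 580 17     # data rewritten to the replacement drive
```

That works out to roughly 46 MB/s scanned across the pool and under 10 MB/s
written to the new drive, averaged over the whole resilver.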
>
> If I had two (or more) vdevs on each device (and I *have* done that when I needed to), I would have issued the first zpool replace command, waited for it to complete and then issued the other. If I had more than one drive fail, I would have handled the replacement of BOTH drives on one zpool first and then moved on to the second. This is NOT because I want to be nice and easy on my drives :-), it is simply because I expect that running the two operations in parallel will be slower than running them in series, mainly because large seeks are slower than short seeks.
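
The replace-one-pool-then-the-other approach described above could be
scripted roughly as follows. This is a sketch under assumptions, not a
tested procedure: the pool names (tank, vault) and partition names are
hypothetical, the function names are invented for illustration, and the
check relies on "zpool status" printing "resilver in progress" while a
resilver is running.

```shell
#!/bin/sh
# Sketch only: pool and device names used below are hypothetical.
ZPOOL=${ZPOOL:-zpool}          # command to run; overridable for dry runs
POLL_SECS=${POLL_SECS:-60}     # how often to re-check resilver status

# Block until "zpool status" no longer reports a resilver in progress.
wait_for_resilver() {
    while "$ZPOOL" status "$1" | grep -q 'resilver in progress'; do
        sleep "$POLL_SECS"
    done
}

# Replace one device, then wait for that pool's resilver to finish.
replace_then_wait() {
    "$ZPOOL" replace "$1" "$2" "$3" && wait_for_resilver "$1"
}

# Usage (hypothetical names), one pool after the other:
#   replace_then_wait tank  da1p1 da7p1
#   replace_then_wait vault da1p2 da7p2
```

Running the second replace only after the first resilver has finished keeps
the drive from seeking back and forth between the two partitions.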

     My concern was over a slightly different possible case, namely, a hard
failure of a component drive (e.g., makes ugly noises, doesn't spin, and/or
doesn't show up as a device recognized as such by the OS).  In that case,
either one has to physically connect a replacement device or a spare is
already on-line.  A spare would automatically be grabbed by a pool for
reconstruction, so I wanted to know whether the situation under discussion would
result in automatically initiated rebuilds of both pools at once.
>
> Also note from the data in my blog entry that the only drive being pushed close to its limits is the newly replaced drive that is handling the writes. The read drives are not being pushed that hard. YMMV as this is a 5 drive RAIDz2 and for the case of a 2-way mirror the read drive and write drive will be more closely loaded.

     FWIW, I'm now hoping to build a 6-drive, raidz2 pool for my archival
storage.
>
> >     That looks good.  What happens if a "zpool replace failingdrive newdrive"
> > is running when the failingdrive actually fails completely?
>
> A zpool replace is not a simple copy from the failing device to the new one, it is a rebuild of the data on the new device, so if the device fails completely it just keeps rebuilding. The example in my blog was of a drive that just went offline with no warning. I put the new drive in the same physical slot (I did not have any open slots) and issued the resilver command.

     Okay.  However, now you bring up another possible pitfall.  Are ZFS's
drives address- or name-dependent?  All of the drives I expect to use will be
external drives.  At least four of the six will be connected via USB 3.0.  The
other two may be connected via USB 3.0, Firewire 400, or eSATA.  In any case,
their device names in /dev will most likely change from one boot to another.
>
> Note that having the FreeBSD device driver echo the Vendor info, including drive P/N and S/N, to the system log is a HUGE help when replacing bad drives.

     I would imagine so.
>
> >> memory pressure more gracefully, but it's not committed yet. I highly recommend
> >> moving to 64-bit as soon as possible.
> > 
> >     I intend to do so, but "as soon as possible" will be after all this
> > disk trouble and disk reconfiguration have been resolved.  It will be done
> > via an in-place upgrade from source, so I need to have a place to run
> > buildworld and buildkernel.
>
> So the real world intrudes on perfection yet again :-) We do what we have to in order to get the job done, but make sure to understand the limitations and compromises you are making along the way.
>
> >  Before doing an installkernel and installworld,
> > I need also to have a place to run full backups.  I have not had a place to
> > store new backups for the last three months, which is making me more unhappy
> > by the day.  I really have to get the disk work *done* before I can move
> > forward on anything else, which is why I'm trying to find out whether I can
> > actually use ZFS raidzN in that cause while still on i386.
>
> Yes, you can. I have used ZFS on 32-bit systems (OK, they were really 32-bit VMs, but I was still running ZFS there, still am today and it has saved my butt at least once already).
>
     Okay, that's great.  I'll continue with my plans in that case.

> >  Performance
> > will not be an issue that I can see until later if ever.
>
> I have run ZFS on systems with as little as 1GB total RAM, just do NOT expect stellar (or even good) performance. Keep a close watch on the ARC size (FreeBSD 10 makes this easy with the additional status line in top for the ZFS ARC and L2ARC). You can also use arcstat.pl (get the FreeBSD version here https://code.google.com/p/jhell/downloads/detail?name=arcstat.pl ) to track ARC usage over time. On my most critical production server I leave it running with a 60 second sample so if something goes south I can see what happened just before.

     I'll give that a shot.
>
> Tune vfs.zfs.arc_max in /boot/loader.conf

     That looks like a huge help.  While initially loading a file system
or zvol, would there be any advantage to setting primarycache to "metadata",
as opposed to leaving it set to the default value of "all"?
>
> If I had less than 4GB of RAM I would limit the ARC to 1/2 RAM, unless this were solely a fileserver, then I would watch how much memory I needed outside ZFS and set the ARC to slightly less than that. Take a look at the recommendations here https://wiki.freebsd.org/ZFSTuningGuide for low RAM situations.

     Will do.  Hmm...I see again the recommendation to increase KVA_PAGES
from 260 to 512.  I worry about that because the i386 kernel says at boot
that it ignores all real memory above ~2.9 GB.  A bit farther along, during
the early messages preserved and available via dmesg(1), it says,

real memory  = 4294967296 (4096 MB)
avail memory = 3132100608 (2987 MB)
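
As a worked example, the "limit the ARC to 1/2 RAM" rule of thumb quoted
above can be applied to the avail memory figure from dmesg. This is a
sketch only: the helper name arc_max_line is invented here, and on i386
the kernel address space may well force a much smaller value than half of
RAM, so the result should be read as an upper bound to tune down from.

```shell
#!/bin/sh
# arc_max_line AVAIL_BYTES
# Hypothetical helper: print a /boot/loader.conf line capping the ARC at
# half of the given memory size (the rule of thumb quoted above).
arc_max_line() {
    printf 'vfs.zfs.arc_max="%s"\n' "$(( $1 / 2 ))"
}

arc_max_line 3132100608    # avail memory as reported by dmesg(1)
```

The printed line can be pasted into /boot/loader.conf; it takes effect at
the next boot.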

>
> >  I just need to
> > know whether I can use it at all with my presently installed OS or will
> > instead have to use gvinum(8) raid5 and hope for minimal data corruption.
> > (At least only one .eli device would be needed in that case, not the M+N
> > .eli devices that would be required for a raidzN pool.) Unfortunately,
> > ideal conditions for ZFS are not an available option for now.
>
> I am a big believer in ZFS, so I think the short term disadvantages are outweighed by the ease of migration and the long term advantages. So I would go the ZFS route.
>
     Yes, it does appear that way to me, too, provided I can actually get
there from here.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************