some ZFS questions

Fri Aug 8 16:03:59 UTC 2014

On Aug 7, 2014, at 4:16, Scott Bennett <bennett at sdf.org> wrote:

>     If two pools use different partitions on a drive and both pools are
> rebuilding those partitions at the same time, then how could ZFS *not*
> be hammering the drive?  The access arm would be doing almost nothing but
> endless series of long seeks back and forth between the two partitions
> involved.

How is this different from real production use with, for example, a large database? Even with a single vdev per physical drive you generate LOTS of RANDOM I/O during a resilver. Remember that a ZFS resilver is NOT like other RAID resync operations. It is NOT a sequential copy of existing data. It is functionally a replay of all the data written to the zpool as it walks the UberBlock. The major difference between a resilver and a scrub is that the resilver expects to be writing data to one (or more) vdevs, while the scrub is mainly a read operation (still generating LOTS of random I/O) that looks for errors in the data it reads (and corrects them when found).

>  When you're talking about hundreds of gigabytes to be written
> to each partition, it could take months or even years to complete, during
> which time something else is almost certain to fail and halt the rebuilds.

In my experience it is not the amount of data to be re-written that matters, but the number of writes that created the data. For example, a zpool that is mostly write once (a media library, for example, where each CD is written once, never changed, and read lots) will resilver much faster than a zpool with lots of small random writes and lots of deletions (like a busy database). See my blog post here: http://pk1048.com/zfs-resilver-observations/ for the most recent resilver I had to do on my home server. I needed to scan 2.84TB of data to rewrite 580GB, and it took just under 17 hours.

If I had two (or more) vdevs on each device (and I *have* done that when I needed to), I would have issued the first zpool replace command, waited for it to complete and then issued the other. If I had more than one drive fail, I would have handled the replacement of BOTH drives on one zpool first and then moved on to the second. This is NOT because I want to be nice and easy on my drives :-), it is simply because I expect that running the two operations in parallel will be slower than running them in series, mainly because large seeks are slower than short seeks.
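
The "one replace at a time" approach above can be sketched roughly as follows. This is only an illustration, not something from my own scripts: the function names are made up, the pool and device names in the example are hypothetical, and it assumes that zpool status reports a running resilver with the words "resilver in progress".

```sh
#!/bin/sh
# Sketch only: function names are hypothetical, as are any pool and
# device names passed in.

# Return 0 while "zpool status" reports a resilver running on the pool.
# Assumes the status output contains the words "resilver in progress".
resilver_running() {
    zpool status "$1" | grep -q "resilver in progress"
}

# Replace one device, then block until that pool's resilver finishes.
# usage: replace_and_wait pool old_device new_device [poll_seconds]
replace_and_wait() {
    pool=$1 old=$2 new=$3 interval=${4:-60}
    zpool replace "$pool" "$old" "$new" || return 1
    while resilver_running "$pool"; do
        sleep "$interval"
    done
}

# Example (not run here): finish the first pool before touching the second.
# replace_and_wait tank  ada1p1 ada4p1 &&
# replace_and_wait tank2 ada1p2 ada4p2
```

Chaining the two calls with && means the second replace is only issued once the first resilver has finished, so the drive is never seeking between two partitions for two rebuilds at once.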

Also note from the data in my blog entry that the only drive being pushed close to its limits is the new replacement drive that is handling the writes. The read drives are not being pushed that hard. YMMV, as this is a 5-drive RAIDz2; in the case of a 2-way mirror, the read drive and write drive will be more evenly loaded.

>     That looks good.  What happens if a "zpool replace failingdrive newdrive"
> is running when the failingdrive actually fails completely?

A zpool replace is not a simple copy from the failing device to the new one; it is a rebuild of the data on the new device, so if the failing device fails completely the rebuild just keeps going. The example in my blog was of a drive that just went offline with no warning. I put the new drive in the same physical slot (I did not have any open slots) and issued the zpool replace command.

Note that having the FreeBSD device driver echo the vendor info, including drive P/N and S/N, to the system log is a HUGE help when replacing bad drives.

>> memory pressure more gracefully, but it's not committed yet. I highly recommend
>> moving to 64-bit as soon as possible.
> 
>     I intend to do so, but "as soon as possible" will be after all this
> disk trouble and disk reconfiguration have been resolved.  It will be done
> via an in-place upgrade from source, so I need to have a place to run
> buildworld and build kernel.

So the real world intrudes on perfection yet again :-) We do what we have to in order to get the job done, but make sure to understand the limitations and compromises you are making along the way.

>  Before doing an installkernel and installworld,
> I need also to have a place to run full backups.  I have not had a place to
> store new backups for the last three months, which is making me more unhappy
> by the day.  I really have to get the disk work *done* before I can move
> forward on anything else, which is why I'm trying to find out whether I can
> actually use ZFS raidzN in that cause while still on i386.

Yes, you can. I have used ZFS on 32-bit systems (OK, they were really 32-bit VMs, but I was still running ZFS there, still am today and it has saved my butt at least once already).

>  Performance
> will not be an issue that I can see until later if ever.

I have run ZFS on systems with as little as 1GB total RAM; just do NOT expect stellar (or even good) performance. Keep a close watch on the ARC size (FreeBSD 10 makes this easy with the additional status line in top for the ZFS ARC and L2ARC). You can also use arcstat.pl (get the FreeBSD version here https://code.google.com/p/jhell/downloads/detail?name=arcstat.pl ) to track ARC usage over time. On my most critical production server I leave it running with a 60-second sample interval, so if something goes south I can see what happened just before.
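
If arcstat.pl is not at hand, a much cruder way to keep a history of the ARC size is to sample the sysctl directly. The following is a minimal sketch and not a replacement for arcstat.pl; the function names and the log path are hypothetical, and it relies on the FreeBSD sysctl kstat.zfs.misc.arcstats.size, which reports the current ARC size in bytes.

```sh
#!/bin/sh
# Sketch only: function names and the log path are hypothetical.

# Print one timestamped sample of the current ARC size in bytes.
arc_sample() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$(sysctl -n kstat.zfs.misc.arcstats.size)"
}

# Append a sample to a log file every N seconds (default 60).
# This loops until it is killed.
# usage: arc_log /var/log/arcsize.log [seconds]
arc_log() {
    while :; do
        arc_sample >> "$1"
        sleep "${2:-60}"
    done
}
```

Running arc_log in the background gives one line per minute to look back through after something goes wrong.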

Tune vfs.zfs.arc_max in /boot/loader.conf

If I had less than 4GB of RAM I would limit the ARC to 1/2 RAM, unless this were solely a fileserver; in that case I would watch how much memory I needed outside ZFS and set the ARC to slightly less than what remains. Take a look at the recommendations here https://wiki.freebsd.org/ZFSTuningGuide for low RAM situations.
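
As a rough illustration of the "half of RAM" rule, the small helper below prints a candidate line for /boot/loader.conf. It is a sketch under assumptions: the function name is made up, the value is given in bytes, and for 4GB or more it prints nothing because no figure is suggested above for that case.

```sh
#!/bin/sh
# Sketch only: arc_max_line is a hypothetical helper name.
# Under 4 GiB of RAM, suggest capping the ARC at half of RAM.
# For 4 GiB or more no figure is given above, so nothing is printed.
# usage: arc_max_line physmem_bytes
arc_max_line() {
    four_gib=4294967296
    if [ "$1" -lt "$four_gib" ]; then
        echo "vfs.zfs.arc_max=\"$(( $1 / 2 ))\""
    fi
}

# On FreeBSD: arc_max_line "$(sysctl -n hw.physmem)"
```

Note that hw.physmem usually reports a little less than the nominal RAM size, and that the fileserver case still needs the manual "watch what you need outside ZFS" approach rather than a fixed fraction.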

>  I just need to
> know whether I can use it at all with my presently installed OS or will
> instead have to use gvinum(8) raid5 and hope for minimal data corruption.
> (At least only one .eli device would be needed in that case, not the M+N
> .eli devices that would be required for a raidzN pool.) Unfortunately,
> ideal conditions for ZFS are not an available option for now.

I am a big believer in ZFS, so I think the short term disadvantages are outweighed by the ease of migration and the long term advantages. So I would go the ZFS route.

--
Paul Kraus
paul at kraus-haus.org