Hours of tiny transfers at the end of a ZFS resilver?

Paul Kraus paul at kraus-haus.org
Mon Feb 15 15:05:51 UTC 2016

On Feb 15, 2016, at 5:18, Andrew Reilly <areilly at bigpond.net.au> wrote:

> Hi Filesystem experts,
> I have a question about the nature of ZFS and the resilvering
> that occurs after a drive replacement in a raidz array.

How many snapshots do you have? I have seen this behavior on pools with many snapshots and ongoing snapshot creation during the resilver. The resilver gets to somewhere above 95% (usually 99.xxx% for me) and then slows to a crawl, often for days.

Most of the ZFS pools I manage have automated jobs to create hourly snapshots, so I am always creating snapshots.

More below...

> I have a fairly simple home file server that (by way of


> have had the system off-line for many hours (I guess).
> Now, one thing that I didn't realise at the start of this
> process was that the zpool has the original 512B sector size
> baked in at a fairly low level, so it is using some sort of
> work-around for the fact that the new drives actually have 4096B
> sectors (although they lie about that in smartctl -i queries):

Running 4K-native drives in a pool created with 512B sectors causes a performance hit. When I ran into this I rebuilt the pool from scratch as a 4K-native pool. If there is at least one 4K-native drive in a given vdev, the vdev will be created native 4K (at least under FreeBSD 10.x). My home server has a pool of mixed 512B and 4K drives; I made sure each vdev was built 4K.

The firmware in the drive that emulates 512B sectors is generally not very fast, and that is the crux of the performance problem. I just had to rebuild a pool because the 2TB WD Red Pro drives are 4K while the 2TB WD RE drives are 512B.
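A toy model may help show why that emulation hurts (this is illustrative Python of my own, not anything from the drive firmware or ZFS): a 512B logical write that does not cover a whole 4096B physical sector forces the drive into a read-modify-write cycle.

```python
# Toy model (not ZFS or firmware code): count how many physical 4K sectors
# a write touches and whether the drive must do a read-modify-write because
# the write only partially covers a physical sector.
PHYS = 4096  # physical sector size of a 4Kn/512e drive

def write_cost(offset, length, phys=PHYS):
    """Return (sectors_touched, needs_rmw) for a write at a byte offset."""
    first = offset // phys
    last = (offset + length - 1) // phys
    sectors = last - first + 1
    # RMW is needed when the write does not start and end on a physical
    # sector boundary -- the drive must read the old sector contents,
    # merge in the new bytes, and write the whole sector back.
    needs_rmw = (offset % phys != 0) or ((offset + length) % phys != 0)
    return sectors, needs_rmw

# An aligned 4 KiB write maps cleanly onto one physical sector:
print(write_cost(0, 4096))       # (1, False)
# A single 512B logical write forces a read-modify-write of a 4K sector:
print(write_cost(512, 512))      # (1, True)
# A 4 KiB write misaligned by 512B touches two sectors and still needs RMW:
print(write_cost(512, 4096))     # (2, True)
```

With a pool built at the native 4K sector size, ZFS never issues the misaligned sub-sector writes in the last two cases.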


> While clearly sub-optimal, I expect that the performance will
> still be good enough for my purposes: I can build a new,
> properly aligned file system when I do the next re-build.
> The odd thing is that after charging through the resilver using
> large blocks (around 64k according to systat), when they get to
> the end, as this one is now, the process drags on for hours with
> millions of tiny, sub-2K transfers:


The resilver process walks through the transaction groups (TXGs), replaying them onto the new (replacement) drive. This is different from traditional resync methods. It also means that the early TXGs will be large (from when you first loaded data) and that the size of later TXGs will vary with the amount of data written.
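That TXG-ordered walk is why the tail of a resilver crawls: a spinning disk delivers roughly the same IOPS regardless of I/O size, so millions of tiny blocks take hours even though they total only a few gigabytes. A rough sketch with illustrative numbers (my own ballpark figures, not measurements from your pool):

```python
# Rough model of the resilver tail: time is dominated by per-I/O cost,
# not bytes moved, once the block size drops. Numbers are illustrative.
def hours_to_copy(nblocks, iops):
    """Hours to resilver nblocks individual blocks at a fixed IOPS rate."""
    return nblocks / iops / 3600

small_blocks = 5_000_000   # millions of sub-2K blocks, as systat showed
disk_iops = 150            # ballpark random-I/O rate for a 7200 rpm drive

# ~9 hours for only ~10 GB of data -- consistent with "drags on for hours":
print(f"{hours_to_copy(small_blocks, disk_iops):.1f} h")
```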


> So there's a problem with the zpool status output: it's
> predicting half an hour to go based on the averaged 67M/s over
> the whole drive, not the <2MB/s that it's actually doing, and
> will probably continue to do so for several hours, if tonight
> goes the same way as last night.  Last night zpool status said
> "0h05m to go" for more than three hours, before I gave up
> waiting to start the next drive.

Yup, the code that estimates time to go is based on the overall average transfer rate, not the current one. In my experience the transfer rate peaks somewhere in the middle of the resilver.
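Quick arithmetic with the numbers from your report shows how misleading that estimate gets at the tail (the remaining-bytes figure is my assumption for illustration; the rates are the ones you quoted):

```python
# Back-of-the-envelope ETA comparison (illustrative, not zpool source code):
# zpool status divides bytes remaining by the average rate since the
# resilver started, not by the current rate.
def eta_hours(bytes_left, rate_bytes_per_s):
    return bytes_left / rate_bytes_per_s / 3600

remaining = 100 * 1024**3    # assume ~100 GiB of small blocks left
overall_avg = 67 * 1024**2   # 67 MB/s averaged over the whole drive
current = 2 * 1024**2        # <2 MB/s during the tiny-transfer phase

print(f"reported: {eta_hours(remaining, overall_avg):.1f} h")  # ~0.4 h
print(f"actual:   {eta_hours(remaining, current):.1f} h")      # ~14 h
```

Which is exactly the "0h05m to go for more than three hours" effect you saw.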

> Is this expected behaviour, or something bad and peculiar about
> my system?

Expected? I’m not sure if the designers of ZFS expected this behavior :-)

But it is the typical behavior and is correct.

> I'm confused about how ZFS really works, given this state.  I
> had thought that the zpool layer did parity calculation in big
> 256k-ish stripes across the drives, and the zfs filesystem layer
> coped with that large block size because it had lots of caching
> and wrote everything in log-structure.  Clearly that mental
> model must be incorrect, because then it would only ever be
> doing large transfers.  Anywhere I could go to find a nice
> write-up of how ZFS is working?

You really can’t think about ZFS the same way as older storage stacks, where the volume manager and the filesystem are separate layers; in ZFS they are fully integrated. For example, the stripe size (across all the top-level vdevs) is dynamic, changing with each write operation. I believe ZFS tries to include every top-level vdev in each write operation. In your case that does not apply, since you have only one top-level vdev, but note that performance scales with the number of top-level vdevs more than with the number of drives per vdev.

Also note that striping within a RAIDz<n> vdev is separate from the top level vdev striping.

Take a look here: http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ for a good discussion of ZFS striping for RAIDz<n> vdevs. And don’t forget to follow the links at the bottom of the page for more details.
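To make the RAIDZ allocation behavior concrete, here is a sketch of the rule that post describes, in my own words as Python (a paraphrase for illustration, not the actual ZFS source): each stripe row holds up to (ndisks - nparity) data sectors plus nparity parity sectors, and the total is padded up to a multiple of (nparity + 1).

```python
import math

# Sketch of the RAIDZ space-allocation rule (my re-derivation of the rule
# described in the blog post above; not ZFS source code).
def raidz_alloc_sectors(data_sectors, ndisks, nparity):
    """Sectors allocated for a block of data_sectors on a RAIDZ vdev."""
    # Each stripe row holds up to (ndisks - nparity) data sectors
    # plus nparity parity sectors.
    rows = math.ceil(data_sectors / (ndisks - nparity))
    total = data_sectors + rows * nparity
    # Pad the allocation up to a multiple of (nparity + 1) so that freed
    # space always remains usable.
    rem = total % (nparity + 1)
    return total + ((nparity + 1) - rem if rem else 0)

# A 128 KiB block on 4K sectors (32 data sectors), 5-disk RAIDZ1:
print(raidz_alloc_sectors(32, 5, 1))   # 40 sectors -> 25% parity overhead
# A single 4K block on the same vdev still costs two sectors:
print(raidz_alloc_sectors(1, 5, 1))    # 2 sectors -> 100% overhead
```

The second case is why small blocks are so much more expensive on RAIDZ than the nominal parity ratio suggests.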

P.S. For performance it is generally recommended to use mirrors, while for capacity use RAIDz<n>, all tempered by the mean time to data loss (MTTDL) you need. Hint: a 3-way mirror has about the same MTTDL as a RAIDz2.
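The intuition behind that hint can be checked with a toy failure-tolerance model (a big simplification of real MTTDL math; the 6-disk RAIDZ2 width is just an example I picked):

```python
# Toy failure-tolerance check, not a full MTTDL model: a mirror survives
# while at least one copy remains; RAIDZ<n> survives while concurrent
# failures do not exceed its parity count.
def mirror_survives(nway, nfailed):
    return nfailed < nway

def raidz_survives(ndisks, nparity, nfailed):
    return nfailed <= nparity

# Both a 3-way mirror and a 6-disk RAIDZ2 tolerate any 2 concurrent drive
# failures and can lose data on a 3rd -- hence the comparable MTTDL.
mirror_max = max(k for k in range(3 + 1) if mirror_survives(3, k))
raidz2_max = max(k for k in range(6 + 1) if raidz_survives(6, 2, k))
print(mirror_max, raidz2_max)   # 2 2
```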

Paul Kraus
paul at kraus-haus.org

More information about the freebsd-fs mailing list