Hours of tiny transfers at the end of a ZFS resilver?

Andrew Reilly areilly at bigpond.net.au
Mon Feb 15 10:19:09 UTC 2016

Hi Filesystem experts,

I have a question about the nature of ZFS and the resilvering
that occurs after a drive replacement in a raidz array.

I have a fairly simple home file server that (by way of
gradually replaced pieces and upgrades) has effectively been
doing great service since, well, forever, but its last re-build
replaced its main UFS file systems with a four-drive ZFS raidz
pool.  It's been going very nicely over the years, and now it's
full, so I've nearly finished replacing its 1TB drives with new
4TB ones.  I'm doing that the slow way, replacing one at a time
and resilvering before going on to the next, because that only
requires a minute or two of down-time for the drive swaps each
time.  Replacing the whole array and restoring from backup would
have had the system off-line for many hours (I guess).
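Roughly speaking, the procedure I've been following looks like the
loop below (a hypothetical dry-run sketch: DRYRUN=echo just prints
each step, and the autoexpand step at the end is my assumption about
how to grow the pool once all four members are 4TB):

```shell
# Hypothetical dry-run of the one-drive-at-a-time upgrade. DRYRUN=echo
# prints each command instead of running it; drop it on a real system.
DRYRUN=echo
for disk in ada0p1 ada1p1 ada3p1 ada4p1; do
    # physically swap the 1TB drive for the 4TB one, repartition, then:
    $DRYRUN zpool replace tank "$disk"
    # wait until zpool status reports the resilver finished with 0 errors
    # before touching the next drive
    $DRYRUN zpool status tank
done
# once every member is 4TB, let the pool grow into the new space
$DRYRUN zpool set autoexpand=on tank
```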

Now, one thing that I didn't realise at the start of this
process was that the zpool has the original 512B sector size
baked in at a fairly low level (ashift=9), so it is using some
sort of work-around for the fact that the new drives actually
have 4096B physical sectors (the drives advertise 512B logical
sectors, as the smartctl -i output below shows):

The four new drives appear to smartctl as:
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK1334PEHYSZ6S
LU WWN Device Id: 5 000cca 250dba043
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Feb 15 20:57:30 2016 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

They show up in zpool status as:
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Feb 15 07:19:45 2016
        3.12T scanned out of 3.23T at 67.4M/s, 0h29m to go
        798G resilvered, 96.48% done

	NAME                        STATE     READ WRITE CKSUM
	tank                        DEGRADED     0     0     0
	  raidz1-0                  DEGRADED     0     0     0
	    ada0p1                  ONLINE       0     0     0  block size: 512B configured, 4096B native
	    ada1p1                  ONLINE       0     0     0  block size: 512B configured, 4096B native
	    ada3p1                  ONLINE       0     0     0  block size: 512B configured, 4096B native
	    replacing-3             DEGRADED     0     0     0
	      17520966596084542745  UNAVAIL      0     0     0  was /dev/ada4p1/old
	      ada4p1                ONLINE       0     0     0  block size: 512B configured, 4096B native  (resilvering)

errors: No known data errors

While clearly sub-optimal, I expect that the performance will
still be good enough for my purposes: I can build a new,
properly aligned file system when I do the next re-build.
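My rough understanding of the cost of that mismatch, as
back-of-envelope arithmetic (my numbers, nothing measured):

```shell
# Sketch of the worst-case overhead: a pool created with ashift=9 can
# issue 512B-aligned I/O, but the platters only move whole 4096B physical
# sectors, so a lone 512B write can cost a read-modify-write cycle moving
# 8x the data that actually changed.
logical=512        # sector size baked into the pool (ashift=9)
physical=4096      # sector size the drive really uses
echo "worst-case write amplification: $((physical / logical))x"
```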

The odd thing is that after charging through the resilver with
large transfers (around 64K according to systat), the process
reaches the end, as this one has now, and then drags on for
hours making millions of tiny, sub-2K transfers:

Here's the systat -vmstat output right now.

   19 users    Load  0.28  0.30  0.27                  15 Feb 21:01

Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
        Tot   Share      Tot    Share    Free           in   out     in   out
Act  832452   38920  5151576   111272  258776  count
All  859068   52388  5349268   249964          pages     3
Proc:                                                            Interrupts
  r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  1026 total
            174      4417  126 4775 1026   45   90        cow         atkbd0 1
                                                       88 zfod        hdac1 16
 2.8%Sys   0.5%Intr  3.9%User  0.0%Nice 92.9%Idle         ozfod       ehci0 ehci
|    |    |    |    |    |    |    |    |    |           %ozfod   249 siis0 ohci
=+>>                                                      daefr   147 hpet0:t0
                                         4 dtbuf          prcfr   165 hpet0:t1
Namei     Name-cache   Dir-cache    213520 desvn        3 totfr   114 hpet0:t2
   Calls    hits   %    hits   %     12132 numvn          react       hdac0 259
   77130   77093 100                  4853 frevn          pdwak       xhci1 261
                                                       32 pdpgs       ahci0:ch0
Disks   da0  ada0  ada1  ada2  ada3  ada4 pass0           intrn   170 ahci0:ch1
KB/t   0.00  1.41  1.46  0.00  1.39  1.47  0.00   6013640 wire    151 ahci0:3
tps       0   173   138     0   176   151     0     77140 act      30 re0 266
MB/s   0.00  0.24  0.20  0.00  0.24  0.22  0.00   1732676 inact
%busy     0    18    15     0    17    99     0           cache
                                                   258776 free
                                                    29632 buf

So there's a problem with the zpool status output: it's
predicting half an hour to go based on the 67.4M/s averaged over
the whole drive, not the <2MB/s that it's actually doing now,
and will probably continue to do for several hours, if tonight
goes the same way as last night.  Last night zpool status said
"0h05m to go" for more than three hours before I gave up
waiting to start the next drive.
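If my guess is right that the estimate divides the bytes
remaining by the rate averaged over the whole resilver rather
than the current rate, the two figures diverge like this
(numbers rounded from the status output above):

```shell
# Sketch of where the two ETAs come from, assuming zpool status uses the
# whole-resilver average rate. Figures rounded from the status output.
total_mb=$((3307 * 1024))      # ~3.23T total to scan
scanned_mb=$((3194 * 1024))    # ~3.12T scanned so far
remaining_mb=$((total_mb - scanned_mb))
echo "zpool's estimate: $(( remaining_mb / 67 / 60 )) minutes"   # at the 67.4M/s average
echo "at the real rate: $(( remaining_mb / 2 / 60 )) minutes"    # at <2MB/s observed now
```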

Is this expected behaviour, or something bad and peculiar about
my system?

I'm confused about how ZFS really works, given this state.  I
had thought that the zpool layer did parity calculation in big
256K-ish stripes across the drives, and that the zfs filesystem
layer coped with that large block size because it has lots of
caching and writes everything in a log-structured fashion.
Clearly that mental model must be incorrect, because then it
would only ever be doing large transfers.  Is there anywhere I
could go to find a good write-up of how ZFS actually works?
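For what it's worth, the only story I've come up with that fits
the numbers is that raidz stripes are variable-width, sized per
ZFS block rather than fixed at 256K, so small (metadata?) blocks
would mean small per-disk transfers.  A sketch of that guess:

```shell
# Guesswork, not verified: if each ZFS block is split individually across
# the data disks of the raidz vdev, per-disk transfer size tracks block
# size, which would explain ~43K transfers early on and sub-2K at the end.
disks=4
data_cols=$((disks - 1))       # raidz1 on 4 disks: 3 data columns + 1 parity
for block in 131072 4096; do   # a 128K file block vs a small metadata block
    echo "$((block / 1024))K block -> $((block / data_cols)) bytes per data disk"
done
```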



More information about the freebsd-fs mailing list