Uneven load on drives in ZFS RAIDZ1

Mon Dec 19 21:53:22 UTC 2011

In the last episode (Dec 19), Stefan Esser said:
> On 19.12.2011 17:22, Dan Nelson wrote:
> > In the last episode (Dec 19), Stefan Esser said:
> >> for quite some time I have observed an uneven distribution of load
> >> between drives in a 4 * 2TB RAIDZ1 pool.  The following is an excerpt
> >> of a longer log of 10 second averages logged with gstat:
> >>
> >> dT: 10.001s  w: 10.000s  filter: ^a?da?.$
> >>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >>     0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
> >>     0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
> >>     0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
> >>     1     81     58   2007    4.6     22   1023    2.3   28.1| ada3
> > [...]
> 
> This is a ZFS only system. The first partition on each drive holds just
> the gptzfsloader.
> 
> pool        alloc   free   read  write   read  write
> ----------  -----  -----  -----  -----  -----  -----
> raid1       4.41T  2.21T    139     72  12.3M   818K
>   raidz1    4.41T  2.21T    139     72  12.3M   818K
>     ada0p2      -      -    114     17  4.24M   332K
>     ada1p2      -      -    106     15  3.82M   305K
>     ada2p2      -      -     65     20  2.09M   337K
>     ada3p2      -      -     58     18  2.18M   329K
> 
> The same difference of read operations per second as shown by gstat ...

I was under the impression that the parity blocks were scattered evenly
across all disks, but from reading vdev_raidz.c, it looks like that isn't
always the case.  See the comment at the bottom of the
vdev_raidz_map_alloc() function; it looks like it will toggle parity between
the first two disks in a stripe every 1MB.  These are not necessarily the
first two disks assigned to the vdev, since a stripe doesn't have to span
all disks as long as it has one parity block (a small sync write may hit
just two disks, essentially being written mirrored).  The imbalance should
only show up if you're writing full-width stripes in sequence: if you write
a 1TB file in one long stream, chances are that the file's parity blocks
will be concentrated on just two disks, and since parity isn't read during
normal operation, those two disks will get less I/O on later reads.  I
don't know why the code toggles parity between just the first two columns;
rotating it between all columns would give you an even balance.
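
A toy model of that toggle may make the effect easier to see.  It assumes
single parity, full-width stripes written back to back, and a parity column
that is simply column 0 or column 1 depending on bit 20 of the write offset.
The names parity_column() and data_reads_per_disk() are made up for
illustration; this is a sketch of the behaviour described above, not the
code from vdev_raidz.c.

```python
MB = 1 << 20  # the 1MB toggle interval mentioned in the comment


def parity_column(offset):
    """Assumed model: parity sits on column 1 while bit 20 of the
    offset is set, and on column 0 otherwise."""
    return 1 if offset & MB else 0


def data_reads_per_disk(ndisks, stripe_bytes, nstripes):
    """Count data chunks per disk for sequential full-width stripes.

    Normal reads skip the parity chunk, so these counts approximate
    the read load each disk sees when the data is read back later.
    """
    reads = [0] * ndisks
    for i in range(nstripes):
        p = parity_column(i * stripe_bytes)
        for d in range(ndisks):
            if d != p:
                reads[d] += 1
    return reads


print(data_reads_per_disk(ndisks=4, stripe_bytes=128 * 1024, nstripes=16))
```

With these (arbitrary) parameters the script prints [8, 8, 16, 16]: the two
columns that take turns holding parity end up with half the data chunks of
the other two.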

Is it always the last two disks that have less load, or does it slowly
rotate to different disks depending on the data that you are reading?  An
interesting test would be to idle the system, run a "tar cvf /dev/null
/raidz1" in one window, and watch iostat output in another window.  If the
load moves from disk to disk as tar reads different files, then my parity
guess is probably right.  If ada0 and ada1 are always busier, then you can
ignore me :)

Since it looks like the algorithm ends up creating two half-cold parity
disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
worse balancing, and a 5-disk set would be more even.
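
A back-of-the-envelope check of that guess, under the same assumption that
parity alternates evenly between exactly two disks and that only full-width
stripes are read.  read_shares() is a hypothetical helper, not something
from ZFS.

```python
def read_shares(ndisks):
    """Share of data reads per disk when parity alternates evenly
    between two disks (assumed model, full-width stripes only)."""
    # Over one pair of stripes each parity disk holds one data chunk
    # and every other disk holds two.
    chunks = [1, 1] + [2] * (ndisks - 2)
    total = sum(chunks)
    return [c / total for c in chunks]


for n in (3, 4, 5):
    shares = read_shares(n)
    # Last column: busiest disk relative to a perfectly even 1/n share.
    print(n, [round(s, 3) for s in shares], round(max(shares) * n, 3))
```

In this model the busiest disk carries 1.5x, 1.333x and 1.25x the even share
for 3, 4 and 5 disks respectively, so wider sets do come out more even.  The
2:1 ratio it gives between busy and idle disks is also in the neighbourhood
of the 114/106 vs. 65/58 read ops quoted above, though that may be
coincidence.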

-- 
	Dan Nelson
	dnelson at allantgroup.com