Uneven load on drives in ZFS RAIDZ1
Olivier Smedts
olivier at gid0.org
Mon Dec 19 15:02:13 UTC 2011
2011/12/19 Stefan Esser <se at freebsd.org>:
> Hi ZFS users,
>
> for quite some time I have observed an uneven distribution of load
> between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of
> a longer log of 10 second averages logged with gstat:
>
> dT: 10.001s  w: 10.000s  filter: ^a?da?.$
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
>     0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
>     0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
>     1     81     58   2007    4.6     22   1023    2.3   28.1| ada3
>
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     1    132    104   4036    4.2     27   1129    5.3   45.2| ada0
>     0    129    103   3679    4.5     26   1115    6.8   47.6| ada1
>     1     91     61   2133    4.6     30   1129    1.9   29.6| ada2
>     0     81     56   1985    4.8     24   1102    6.0   29.4| ada3
>
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     1    148    108   4084    5.3     39   2511    7.2   55.5| ada0
>     1    141    104   3693    5.1     36   2505   10.4   54.4| ada1
>     1    102     62   2112    5.6     39   2508    5.5   35.4| ada2
>     0     99     60   2064    6.0     39   2483    3.7   36.1| ada3
>
> This goes on for minutes without a change of roles. (I had assumed
> that other 10 minute samples might show relatively higher load on
> another subset of the drives, but it is always the first two, which
> receive some 50% more read requests than the other two.)
>
> The test consisted of minidlna rebuilding its content database for a
> media collection held on that pool. The unbalanced distribution of
> requests does not depend on the particular application, and it does
> not change when the drives with the highest load approach 100% busy.
>
> This is a -CURRENT built from yesterday's sources, but the problem has
> existed for quite some time (and should definitely be reproducible on
> -STABLE, too).
>
> The pool consists of a 4-drive raidz1 on an ICH10 (H67) without cache
> or log devices and without much ZFS tuning (only the maximum ARC size
> is set, which should not be relevant at all in this context):
>
> zpool status -v
>   pool: raid1
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid1       ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             ada0p2  ONLINE       0     0     0
>             ada1p2  ONLINE       0     0     0
>             ada2p2  ONLINE       0     0     0
>             ada3p2  ONLINE       0     0     0
>
> errors: No known data errors
>
> Cached configuration:
>         version: 28
>         name: 'raid1'
>         state: 0
>         txg: 153899
>         pool_guid: 10507751750437208608
>         hostid: 3558706393
>         hostname: 'se.local'
>         vdev_children: 1
>         vdev_tree:
>             type: 'root'
>             id: 0
>             guid: 10507751750437208608
>             children[0]:
>                 type: 'raidz'
>                 id: 0
>                 guid: 7821125965293497372
>                 nparity: 1
>                 metaslab_array: 30
>                 metaslab_shift: 36
>                 ashift: 12
>                 asize: 7301425528832
>                 is_log: 0
>                 create_txg: 4
>                 children[0]:
>                     type: 'disk'
>                     id: 0
>                     guid: 7487684108701568404
>                     path: '/dev/ada0p2'
>                     phys_path: '/dev/ada0p2'
>                     whole_disk: 1
>                     create_txg: 4
>                 children[1]:
>                     type: 'disk'
>                     id: 1
>                     guid: 12000329414109214882
>                     path: '/dev/ada1p2'
>                     phys_path: '/dev/ada1p2'
>                     whole_disk: 1
>                     create_txg: 4
>                 children[2]:
>                     type: 'disk'
>                     id: 2
>                     guid: 2926246868795008014
>                     path: '/dev/ada2p2'
>                     phys_path: '/dev/ada2p2'
>                     whole_disk: 1
>                     create_txg: 4
>                 children[3]:
>                     type: 'disk'
>                     id: 3
>                     guid: 5226543136138409733
>                     path: '/dev/ada3p2'
>                     phys_path: '/dev/ada3p2'
>                     whole_disk: 1
>                     create_txg: 4
>
> I'd be interested to know whether this behavior can be reproduced on
> other systems with raidz1 pools consisting of 4 or more drives. All it
> takes is generating some disk load and running the command:
>
> gstat -I 10000000 -f '^a?da?.$'
>
> to obtain 10 second averages.
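If the gstat curses display is inconvenient to log over a longer run,
plain iostat should give comparable per-drive 10 second averages that
can simply be redirected to a file (untested here, drive names assumed
to match the ones above):

# extended per-device statistics, one sample every 10 seconds
iostat -x -w 10 ada0 ada1 ada2 ada3 > iostat.log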
>
> I have not even tried to look at the scheduling of requests in ZFS,
> but I'm surprised to see higher than average load on just 2 of the 4
> drives, since RAIDZ parity should be spread evenly over all drives,
> and for each file system block a different subset of 3 out of 4 drives
> should be able to deliver the data without reconstructing it from
> parity (which would lead to an even distribution of load).
>
> I've got two theories about what might cause the observed behavior:
>
> 1) There is some meta data that is only kept on the first two drives.
> Data is evenly spread, but meta data accesses lead to additional reads.
>
> 2) The read requests are distributed in such a way that 1/3 goes to
> ada0, another 1/3 to ada1, and the remaining 1/3 is evenly distributed
> between ada2 and ada3.
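One way to tell the two theories apart (just a suggestion, I have not
tried it on this pool): run a metadata-dominated load and a
data-dominated load separately while watching the same gstat, and see
whether the skew shows up in both cases or only with metadata. Roughly
(the mount point and file name below are made up):

# mostly metadata: stat every file without reading any file contents
find /raid1 -ls > /dev/null

# mostly data: sequentially read one large file
dd if=/raid1/path/to/large/file of=/dev/null bs=1m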
>
>
> So: Can anybody reproduce this distribution of requests?
Hello,

Stupid question, but are your drives all exactly the same? I noticed
"ashift: 12", so at least one of them should be a 4K-sector drive. Are
you sure they are not mixed with 512-byte-per-sector drives?
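Something like this (assuming the drives really are ada0 through ada3,
as in the pool config) should show the logical sector size each drive
reports; on drives that report it, the stripesize line also hints at a
4K physical sector:

for d in ada0 ada1 ada2 ada3; do
        echo "== $d =="
        diskinfo -v $d | egrep 'sectorsize|stripesize'
done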
>
> Any idea why this is happening, and whether something should be
> changed in ZFS to distribute the load better (leading to higher file
> system performance)?
>
> Best regards, STefan
> _______________________________________________
> freebsd-current at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe at freebsd.org"
--
Olivier Smedts _
ASCII ribbon campaign ( )
e-mail: olivier at gid0.org - against HTML email & vCards X
www: http://www.gid0.org - against proprietary attachments / \
"Il y a seulement 10 sortes de gens dans le monde :
ceux qui comprennent le binaire,
et ceux qui ne le comprennent pas."