Uneven load on drives in ZFS RAIDZ1
Stefan Esser
se at freebsd.org
Mon Dec 19 19:09:03 UTC 2011
On 19.12.2011 16:42, Peter Maloney wrote:
> On 12/19/2011 03:22 PM, Stefan Esser wrote:
>> So: Can anybody reproduce this distribution of requests?
> I don't have a raidz1 machine, and no time to make you a special raidz1
> pool out of spare disks, but on my raidz2 I can only ever see unevenness
> when a disk is bad, or between different vdevs. But you only have one vdev.
Thanks for replying.
In my previous raidz1 pool consisting of 3*1TB, one of the drives had to
be replaced because it showed lots of recoverable errors when I
initially created the pool. The effects were much more drastic than
what I see now: Given identical request rates, the failed drive was 100%
busy when the other drives had busy percentages in the single-digit range.
But the observed differences seem to be caused by a different rate of
read requests issued towards the drives (the first two receive 30% of
the reads, each, while the last two receive 20% each). And this ratio
has been stable over months (I had already noticed this in summer, but
did not have time to start a thread at that time).
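For anybody who wants to compare their own pool, the read share per drive can be computed from per-device read counters, for example the read operation counts printed by "zpool iostat -v". The following is only a sketch: the function name read_share is made up for this example, and the counter values are invented numbers in the 30/30/20/20 ratio described above, not measurements.

```shell
# read_share (hypothetical helper): print each device's percentage of the
# total read requests. Input on stdin: one "device reads" pair per line.
read_share() {
    awk '{ name[NR] = $1; reads[NR] = $2; total += $2 }
         END {
             if (total == 0) exit 1      # nothing to divide by
             for (i = 1; i <= NR; i++)
                 printf "%s %.1f%%\n", name[i], 100 * reads[i] / total
         }'
}

# Invented counters, chosen only to reproduce the 30/30/20/20 ratio.
printf '%s\n' 'ada0 300' 'ada1 300' 'ada2 200' 'ada3 200' | read_share
```

An even distribution over four drives would show 25.0% on every line.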
> Check that your disks are identical (are they? we can only assume so
> since you didn't say so).
Yes, all 4 are identical.
> Show us output from:
> smartctl -i /dev/ada0
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116957
LU WWN Device Id: 5 0024e9 0049bee63
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:23:36 2011 CET
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always   -           0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always   -           0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always   -           10127
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           254
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline  -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           2300
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           228
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always   -           621067
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always   -           4
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always   -           28 (Min/Max 15/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always   -           0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always   -           2
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always   -           1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           264
> smartctl -i /dev/ada1
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116947
LU WWN Device Id: 5 0024e9 0049bee49
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:23:22 2011 CET
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always   -           0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always   -           0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always   -           10096
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           255
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline  -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           2316
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           231
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always   -           2175909
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always   -           1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always   -           26 (Min/Max 16/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always   -           0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always   -           1
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always   -           1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           264
> smartctl -i /dev/ada2
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116956
LU WWN Device Id: 5 0024e9 0049bee60
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:24:24 2011 CET
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always   -           0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always   -           0
  3 Spin_Up_Time            0x0023   067   066   025    Pre-fail  Always   -           10254
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           246
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline  -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           2300
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           227
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always   -           105259
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always   -           1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0002   064   056   000    Old_age   Always   -           28 (Min/Max 16/45)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always   -           0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always   -           0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always   -           0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           256
> smartctl -i /dev/ada3
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116946
LU WWN Device Id: 5 0024e9 0049bee47
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:24:55 2011 CET
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always   -           0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always   -           0
  3 Spin_Up_Time            0x0023   066   066   025    Pre-fail  Always   -           10472
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           250
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always   -           0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline  -           0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           2302
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           227
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always   -           239254
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always   -           1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always   -           0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always   -           27 (Min/Max 16/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always   -           0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always   -           2
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always   -           0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always   -           259
> Since your tests show read ms/r to be pretty even, I guess your disks
> are not broken. But the ms/w is slightly different. So I think it seems
> that the first 2 disks are slower for writing (someone once said that
My interpretation is that the first two have higher write latencies
since they receive more read requests.
> refurbished disks are like this, even if identical), or the hard disk
> controller ports they use are slower. For example, maybe your
> motherboard has 6 ports, and you plugged disks 1,2,3 into port 1,2,3 and
> disk 4 into port 5. Disk 3 and 4 would have their own channel, but disk
> 1 and 2 share one.
This is an ICH10 and the drives are connected to SATA II channels (the
SATA III channels are reserved for a planned SSD cache).
> So if the disks are identical, I would guess your hard disk controller
> is to blame. To test this, first back it up. Then *fix your setup by
> using labels*. ie. use gpt/somelabel0 or gptid/....... rather than
> ada0p2. Check "ls /dev/gpt*" output for options on what labels you have
> already. Then try swapping disks around to see if the load changes. Make
> sure to back up...
The drives are already labelled and I can easily modify the pool to
refer to GPT labels. But swapping drives should not cause any harm in
ZFS, whether labels or device names are used (the drives in the pool
are identified by their GUID).
> Swapping disks (or even removing one depending on controller, etc. when
> it fails) without labels can be bad.
Yes, I know (having seen my first Unix system more than 30 years ago).
I'll re-import the drives with "zpool import -d /dev/gpt ..." but need
to boot from an alternate boot device first.
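A rough sketch of that re-import, to be run from the alternate boot environment. The pool name "tank" is a placeholder and not the real pool name, and the run wrapper is a made-up helper that only prints each command so nothing is touched until the echo is replaced by real execution.

```shell
POOL=${POOL:-tank}          # placeholder pool name (assumption)

# Dry run: print each command instead of executing it.
run() { echo "+ $*"; }

reimport_by_label() {
    run zpool export "$1"                  # release the pool first
    run zpool import -d /dev/gpt "$1"      # search only /dev/gpt for members
    run zpool status "$1"                  # vdevs should now show gpt/... names
}

reimport_by_label "$POOL"
```

Restricting the search to /dev/gpt with -d is what makes the pool come back under the label names rather than the adaXpY device nodes.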
> eg.
> You have ada1 ada2 ada3 ada4.
> Someone spills coffee on ada2; it fries and cannot be detected anymore,
> and you reboot.
> Now you have ada1 ada2 ada3.
> Then things are usually still fine (even though ada3 is now ada2 and
> ada4 is now ada3, because there is some zfs superblock stuff to keep
> track of things), but if you also had an ada5 that was not part of the
> pool, or was a spare or a log or something other than another disk in
> the same vdev as ada1, etc., bad things happen when it becomes ada4.
> Unfortunately, I don't know exactly what people do to cause the "bad
> things" that happen. When this happened to me, it just said my pool was
> faulted or degraded or something, and set a disk or two to UNAVAIL or
> FAULTED. I don't remember it automatically resilvering them, but when I
> read about these problems, I think it seems like some disks were
> resilvered afterwards.
The recovery from partial pool failures and the collection of drives to
form a pool has been modified several times in the last two years and
should be quite robust by now. One thing to look out for is to not copy
a pool to new disk drives (I used to have 3*1TB, copied to 4*2TB) and
later connect a drive from the original pool with its ZFS metadata
intact at the end of the drive (I had cleared the first 1MB, but not the
last 1MB). This causes confusion if the name of the pool has not
changed. But other than that, I do not see much risk in ZFS pools built
from /dev nodes.
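ZFS keeps two copies of its label at the start and two at the end of each member device, which is why clearing only the first 1MB was not enough. As an illustration of the arithmetic for the tail, the sketch below prints (and deliberately does not execute) a dd command covering the last 1 MiB of a device with 512-byte sectors. The helper name tail_wipe_cmd, the device /dev/ada9 and the sector count are all invented for the example; the real values must come from the drive in question.

```shell
# tail_wipe_cmd DEV SECTORS (hypothetical helper): print a dd command that
# would zero the last 1 MiB of DEV, given its size in 512-byte sectors.
tail_wipe_cmd() {
    dev=$1
    sectors=$2
    count=2048                      # 2048 sectors * 512 bytes = 1 MiB
    [ "$sectors" -gt "$count" ] || return 1
    echo "dd if=/dev/zero of=$dev bs=512 seek=$((sectors - count)) count=$count"
}

# Invented device name and sector count, for illustration only.
tail_wipe_cmd /dev/ada9 1953525168
# prints: dd if=/dev/zero of=/dev/ada9 bs=512 seek=1953523120 count=2048
```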
> And last thing I can think of is to make sure your partitions are
> aligned, and identical. Show us output from:
> gpart show
They have all been created by a script that takes the device node name
as parameter and thus are identical.
=>        34  3907029101  ada0  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada1  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada2  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada3  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656        1792        - free -  (896k)
  3565160448   341868544     3  freebsd-swap  (163G)
  3907028992         143        - free -  (71k)
There is an unused 10% at the end of each device, and I have recently
made ada3p3 a swap device, just to be able to collect kernel dumps (no
swap is actually used; this is an 8GB RAM machine with 6GB assigned to
ARC and mostly low load).
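As for alignment: assuming these drives use 4 KiB physical sectors internally (which the "(AFT)" in the model family suggests, even though smartctl reports 512 bytes logical/physical), a partition is aligned when its start sector, counted in 512-byte units, is a multiple of 8. A small sketch of that check applied to the partition start sectors from the gpart output; the helper name aligned_4k is made up for this example.

```shell
# aligned_4k START (hypothetical helper): START is a partition start in
# 512-byte sectors; 8 such sectors make up one 4 KiB physical sector.
aligned_4k() {
    if [ $(( $1 % 8 )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

# Partition start sectors taken from the gpart output.
for start in 64 256 3565158656 3565160448; do
    echo "$start $(aligned_4k "$start")"
done
```

All four start sectors are multiples of 8, so under that assumption the partitions are 4 KiB aligned.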
Best regards, Stefan