raidz slowing down

Solon Lutz solon at pyro.de
Mon Oct 26 01:30:39 UTC 2009


> Did you ever get any response? I have a very similar sounding issue with 
> my raidz2. I've always assumed it was because the volume was nearly full 
> and maybe some fragmentation or something. All of my devices are on MPT 
> controllers, so I don't think that the highpoint device is an issue.

Nope, no responses...

Since I was working on a rescue operation, I didn't have the patience
to eliminate all possible sources of error, so I swapped out da1
(maybe a little slow or buggy?) and used the forensics variant of dd,
'dcfldd'. It has a split option, and I suspected that ZFS has problems
writing huge continuous data streams - so I split the 10TB into
100GB files, which took about 11 hours.
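
For reference, the invocation looked roughly like this - the output
path is just a placeholder and the exact flags are from memory:

  dcfldd if=/dev/da0 of=/tank/rescue/image.dd bs=1M split=100G splitformat=nnn

dcfldd then starts a new output file every 100GB, numbered 000, 001
and so on, instead of one monolithic 10TB stream.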

I don't know if this is a general problem, or if it only happens when the
input is delivered at a much higher data-rate. In this case, the HW-RAID/zpool
was able to deliver data at 600MB/s while the RAIDZ/zpool could only write
at 130MB/s.
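
For scale: 10TB in roughly 11 hours works out to an average of about
250MB/s into the raidz - nearly twice the ~130MB/s the straight dd
run managed.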

The dynamics of this 'slow-down', which I could watch via gstat, looked as if
access at the device level was desynchronizing completely.
In the end, before I quit the process, write speed was down to 5MB/s!
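
That was plain gstat sampling at one-second intervals (the dT: ~1.002s
/ w: 1.000s lines in the output quoted below). For anyone who wants to
poke at it, 'zpool iostat -v <pool> 1' gives a similar per-vdev view
from the ZFS side - <pool> being whatever the raidz pool is called.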

But as I mentioned earlier, I had no energy left for bug-hunting, due to a
bigger (still unsolved) problem at hand.

Maybe somebody else would like to investigate? I'm busy with ZFS forensics...

solon

> On Thu, 8 Oct 2009, Solon Lutz wrote:

>> I built a 9x hdd 11TB raidz for some rescue purposes and started
>> copying an image from another partition via "dd if=/dev/da0..." to it.
>> It consists of ad4 da1 da2 da3 da4 da5 da6 da7 da8; da1 to da8 are
>> connected via two Highpoint controllers.

>> In the beginning write speeds were quite fair:

>> dT: 1.002s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>    0    424      0      0    0.0    424  52483   33.9   84.6| ad4
>>    0      0      0      0    0.0      0      0    0.0    0.0| da0
>>   35    356      0      0    0.0    356  44584   76.4  124.5| da1
>>   35    296      0      0    0.0    296  36919   84.5  121.0| da2
>>   34    361      0      0    0.0    361  45111   75.5  124.7| da3
>>   35    346      0      0    0.0    346  43196   78.6  123.2| da4
>>   35    344      0      0    0.0    344  42940   80.0  124.7| da5
>>   35    343      0      0    0.0    343  42812   80.7  124.5| da6
>>   35    344      0      0    0.0    344  43051   79.8  123.9| da7
>>   34    342      0      0    0.0    342  42796   80.6  124.4| da8

>> Now, some 10 hours and 2.5TB later, it looks like this most of the time:

>> dT: 1.002s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>    0     10      0      0    0.0     10      6    0.8    0.2| ad4
>>    0      0      0      0    0.0      0      0    0.0    0.0| da0
>>    4     13      0      0    0.0     13      8  550.4  178.5| da1
>>    0     12      0      0    0.0     12      7    0.7    0.2| da2
>>    0     11      0      0    0.0     11      7    0.7    0.2| da3
>>    0     10      0      0    0.0     10      5    0.6    0.2| da4
>>    0     11      0      0    0.0     11      6    0.9    0.3| da5
>>    0     12      0      0    0.0     12      7    0.7    0.2| da6
>>    0     11      0      0    0.0     11      7    0.7    0.2| da7
>>    0      9      0      0    0.0      9      6    0.8    0.2| da8


>> da1 seems to be busy most of the time, and every few seconds all the other
>> devices write some data at nearly normal speed:

>> dT: 1.003s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>    0    254      0      0    0.0    254  31331   34.9   35.4| ad4
>>    0      0      0      0    0.0      0      0    0.0    0.0| da0
>>    4      0      0      0    0.0      0      0    0.0    0.0| da1
>>    0    254      0      0    0.0    254  31346  107.4  104.5| da2
>>    0    256      0      0    0.0    256  31345  108.1  104.0| da3
>>    0    255      0      0    0.0    255  31345  110.2  105.1| da4
>>   35    200      0      0    0.0    200  24912  143.3  115.0| da5
>>   35    211      0      0    0.0    211  26303  137.8  114.9| da6
>>   35    210      0      0    0.0    210  26079  139.3  114.9| da7
>>   35    209      0      0    0.0    209  25952  135.2  113.7| da8

>> Sometimes it even gets back to 'normal' behaviour, but never reaches
>> the speeds it once had:

>> dT: 1.002s  w: 1.000s
>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>   35    274      0      0    0.0    274  34334   44.2   66.6| ad4
>>    0   1166   1166 149243    0.1      0      0    0.0   14.3| da0
>>   35    120      0      0    0.0    120  14717   94.4   64.5| da1
>>   35     96      0      0    0.0     96  11665  113.9   64.3| da2
>>   35    100      0      0    0.0    100  12288   98.7   63.9| da3
>>   35    103      0      0    0.0    103  12496   93.4   59.4| da4
>>   34    112      0      0    0.0    112  13694  106.1   67.4| da5
>>   35     71      0      0    0.0     71   8596  115.3   66.8| da6
>>   35    116      0      0    0.0    116  14205  101.7   67.3| da7
>>   35     83      0      0    0.0     83  10066  112.2   65.9| da8

>> Syslog reports the following:

>> Oct  8 09:53:40 radium kernel: hptrr: start channel [0,0]
>> Oct  8 09:53:40 radium kernel: hptrr: channel [0,0] started successfully
>> Oct  8 09:57:44 radium kernel: hptrr: start channel [0,0]
>> Oct  8 09:57:45 radium kernel: hptrr: channel [0,0] started successfully
>> Oct  8 10:54:26 radium kernel: hptrr: start channel [0,0]
>> Oct  8 10:54:27 radium kernel: hptrr: channel [0,0] started successfully
>> Oct  8 11:10:29 radium kernel: hptrr: start channel [0,0]
>> Oct  8 11:10:30 radium kernel: hptrr: channel [0,0] started successfully
>> Oct  8 11:17:27 radium kernel: hptrr: start channel [0,0]
>> Oct  8 11:17:27 radium kernel: hptrr: channel [0,0] started successfully

>> Is this a problem with the hptrr driver, or is da1 failing?

>> Best regards,

>> Solon Lutz


>> +-----------------------------------------------+
>> | Pyro.Labs Berlin -  Creativity for tomorrow   |
>> | Wasgenstrasse 75/13 - 14129 Berlin, Germany   |
>> | www.pyro.de - phone + 49 - 30 - 48 48 58 58   |
>> | info at pyro.de - fax + 49 - 30 - 80 94 03 52    |
>> +-----------------------------------------------+

>> _______________________________________________
>> freebsd-fs at freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"
