slowdown of zfs (tx->tx)

Mon Jan 14 10:58:42 UTC 2013

Nicolas Rachinsky wrote:
> * Artem Belevich <art at freebsd.org> [2013-01-11 12:39 -0800]:
>> On Thu, Jan 10, 2013 at 11:34 PM, Nicolas Rachinsky
>> <fbsd-mas-0 at ml.turing-complete.org> wrote:
>>> * Nicolas Rachinsky <fbsd-mas-0 at ml.turing-complete.org> [2013-01-10 20:39 +0100]:
>>>> after replacing one of the controllers, all problems seem to have
>>>> disappeared. Thank you very much for your advice!
>>> Now the problem is back.
>>>
>>> After changing the controller, there were no more timeouts logged.
>>>
>>> No UDMA_CRC_Error_Count changed.
>>>
>> Is there anything special about ada8? It does seem to have noticeably
>> higher service time compared to other disks.
> Nothing I know of. The disks are Samsung HD103UJ and HD103SI, multiple
> of each type.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>    1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>    3 Spin_Up_Time            0x0007   073   073   011    Pre-fail  Always       -       8890
>    4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       32
>    5 Reallocated_Sector_Ct   0x0033   094   094   010    Pre-fail  Always       -       166
>    7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
>    8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10872
>    9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       5688
>   10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
>   11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
>   12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
>   13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
> 184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   078   069   000    Old_age   Always       -       22 (Min/Max 21/25)
> 194 Temperature_Celsius     0x0022   077   067   000    Old_age   Always       -       23 (Min/Max 21/26)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1259614646
> 196 Reallocated_Event_Count 0x0032   096   096   000    Old_age   Always       -       166
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   099   000    Old_age   Always       -       5
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
>
> Reallocated_Sector_Ct did not increase during the last days.
>
>
>> Could you run gstat with a 1-second interval? Some of the 5-second
>> samples show that ada8 is the bottleneck -- it has its request queue
>> full (L(q)=10) when all other drives were done with their jobs. And
>> that's a 5-sec average. Its write service time also seems to be a lot
>> higher than for other drives.
> Attached.  I have replaced ada8 with ada9, which is a Western Digital
> Caviar Black.
>
> Now ada0 and ada4 seem to be the bottleneck.
>
> But I don't understand the intervals without any disk activity.
>
>> Does the drive have its write cache disabled by any chance? That could
>> explain why it takes so much longer to service writes.
> No, camcontrol identify says it's enabled.
>
>> Can you remove ada8 and see if your performance goes back to normal?
> The problem still persists.
>
> Thank you for your help!
>
> Nicolas
>
Could it be that something else is occupying the pool?
I had to disable one of the security checks run by periodic:
daily_status_security_neggrpperm_enable="NO"

After I disabled that check, my pool was performing normally again.
If you do not have many snapshots, it is no problem, but with a lot of
snapshots, this check stalls the pool.
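
Since the stall seems to depend on how many snapshots exist, a rough way
to see whether a pool is in that situation is to count snapshots per
dataset. The following is only a sketch: the function names are made up
for illustration, and it assumes the zfs command is in PATH and that
"zfs list -H -t snapshot -o name" prints one dataset@snapshot name per
line. It says nothing about what number counts as "a lot".

```python
import subprocess
from collections import Counter


def count_snapshots(zfs_list_output):
    """Count snapshots per dataset.

    Expects the text printed by `zfs list -H -t snapshot -o name`,
    i.e. one `dataset@snapshot` name per line.
    """
    counts = Counter()
    for line in zfs_list_output.splitlines():
        name = line.strip()
        if "@" not in name:
            continue  # skip blank or unexpected lines
        dataset, _, _snapshot = name.partition("@")
        counts[dataset] += 1
    return counts


def list_snapshot_names():
    """Run zfs and return its raw output (assumes zfs is in PATH)."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    counts = count_snapshots(list_snapshot_names())
    print("total snapshots:", sum(counts.values()))
    # Show the datasets holding the most snapshots first.
    for dataset, n in counts.most_common(10):
        print(n, dataset)
```

If the total is large, disabling the check as shown above and watching
gstat during the next periodic run would show whether it was the cause.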

Regards,
Johan