slowdown of zfs (tx->tx)

Mon Jan 14 19:13:42 UTC 2013

On Mon, Jan 14, 2013 at 1:40 AM, Nicolas Rachinsky
<fbsd-mas-0 at ml.turing-complete.org> wrote:
>   5 Reallocated_Sector_Ct   0x0033   094   094   010    Pre-fail  Always       -       166
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1259614646
> 196 Reallocated_Event_Count 0x0032   096   096   000    Old_age   Always       -       166

> Reallocated_Sector_Ct did not increase during the last days.

It does not matter IMHO. That hard drive already has quite a few bad
sectors that ECC could not recover. There are apparently more
marginal sectors, but ECC is coping with them for now. Once enough
bits rot, you'll get more bad sectors. I personally would replace the
drive.
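
For anyone who wants to check this on several drives at once, here is a
rough Python sketch that pulls the raw values out of attribute lines
like the ones quoted above. The function names and the "any reallocated
sector is a warning" cutoff are my own assumptions for illustration,
not something smartctl or the drive vendor defines.

```python
# Sample lines copied from the quoted smartctl attribute table.
SAMPLE = """\
>   5 Reallocated_Sector_Ct   0x0033   094   094   010    Pre-fail  Always       -       166
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1259614646
> 196 Reallocated_Event_Count 0x0032   096   096   000    Old_age   Always       -       166
"""


def parse_smart_attributes(text):
    """Map attribute name -> dict of id, value, worst, thresh, raw.

    Only rows with ten whitespace-separated fields and an integer raw
    value are handled; anything else is skipped.
    """
    attrs = {}
    for line in text.splitlines():
        # Drop mail quoting ("> ") and leading spaces before splitting.
        fields = line.lstrip("> ").split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if not fields[9].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": int(fields[9]),
        }
    return attrs


def has_reallocated_sectors(attrs):
    """Assumed rule of thumb: any nonzero reallocated count is a warning."""
    entry = attrs.get("Reallocated_Sector_Ct")
    return entry is not None and entry["raw"] > 0


attrs = parse_smart_attributes(SAMPLE)
for name, entry in attrs.items():
    print(name, entry["raw"])
print("reallocated sectors present:", has_reallocated_sectors(attrs))
```

Note that the normalized value (094) is still well above the threshold
(010), so the drive does not report itself as failing; the sketch only
looks at the raw count.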

>> Could you run gstat with a 1-second interval? Some of the 5-second
>> samples show that ada8 is the bottleneck -- it has its request queue
>> full (L(q)=10) when all other drives were done with their jobs. And
>> that's a 5-sec average. Its write service time also seems to be a lot
>> higher than for other drives.
>
> Attached.  I have replaced ada8 with ada9, which is a Western Digital
> Caviar Black.
>
> Now ada0 and ada4 seem to be the bottleneck.
>
> But I don't understand the intervals without any disk activity.

It is puzzling. Is rsync still sleeping in tx->tx state? Try running
"procstat -kk <rsync-PID>" periodically. It prints the in-kernel stack
trace and may give a clue as to where and why rsync is stuck.
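
If you'd rather not run it by hand, something like the following Python
sketch would take the samples for you. The function names, the sample
count and the 1-second interval are arbitrary choices of mine; the only
real command involved is "procstat -kk <PID>".

```python
import subprocess
import time


def sample_kernel_stack(pid, run=subprocess.run):
    """Return the text printed by `procstat -kk <pid>`."""
    result = run(["procstat", "-kk", str(pid)],
                 capture_output=True, text=True)
    return result.stdout


def poll_kernel_stack(pid, count=10, interval=1.0,
                      sample=sample_kernel_stack, sleep=time.sleep):
    """Collect `count` timestamped stack samples, `interval` seconds apart."""
    samples = []
    for i in range(count):
        samples.append((time.strftime("%H:%M:%S"), sample(pid)))
        if i < count - 1:
            sleep(interval)
    return samples


# Example use on the affected machine (PID 1234 is a placeholder):
#   for stamp, stack in poll_kernel_stack(1234):
#       print(stamp)
#       print(stack)
```

If the same kernel functions show up in sample after sample while the
disks are idle, that is the place to look.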

--Artem