slowdown of zfs (tx->tx)

Thu Jan 10 01:15:10 UTC 2013

On Wed, Jan 9, 2013 at 8:26 AM, Nicolas Rachinsky
<fbsd-mas-0 at ml.turing-complete.org> wrote:
> * Artem Belevich <art at freebsd.org> [2013-01-08 12:47 -0800]:
>> On Tue, Jan 8, 2013 at 9:42 AM, Nicolas Rachinsky
>> <fbsd-mas-0 at ml.turing-complete.org> wrote:
>> >       NAME                      STATE     READ WRITE CKSUM
>> >         pool1                     DEGRADED     0     0     0
>> >           raidz2-0                DEGRADED     0     0     0
>> >             ada5                  ONLINE       0     0     0
>> >             ada8                  ONLINE       0     0     0
>> >             ada2                  ONLINE       0     0     0
>> >             ada3                  ONLINE       0     0     0
>> >             11846390416703086268  UNAVAIL      0     0     0  was /dev/dsk/ada1
>> >             ada6                  ONLINE       0     0     0
>> >             ada0                  ONLINE       0     0     1
>> >             ada7                  ONLINE       0     0     0
>> >             ada4                  ONLINE       0     0     3
>>
>> You seem to have some checksum errors which does suggest hardware troubles.
>
> I somehow missed these. Is there any way to learn when these checksum
> errors happen?

Not on FreeBSD (yet) as far as I can tell. Not explicitly, anyways.
Check /var/log/messages for any indications of SATA errors. There's a
good chance that there was a timeout at some point.
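
A rough sketch of how the log check could be scripted. The regular
expressions are assumptions: kernel message wording differs between
drivers and FreeBSD versions, so adjust them to what your system
actually logs. The function names are made up for this example.

```python
import re

# Assumed patterns, not an exhaustive list of what the kernel may print.
SUSPECT = re.compile(r"timeout|error|retr(?:y|ies)", re.IGNORECASE)
DEVICE = re.compile(r"\b(ada\d+|ahcich\d+)\b")


def suspicious_lines(lines):
    """Return (device, line) pairs for lines that name a disk or AHCI
    channel and also contain an error-like keyword."""
    hits = []
    for line in lines:
        dev = DEVICE.search(line)
        if dev and SUSPECT.search(line):
            hits.append((dev.group(1), line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/messages"):
    """Print every suspicious line found in the given log file."""
    with open(path, errors="replace") as log:
        for dev, line in suspicious_lines(log):
            print(f"{dev}: {line}")
```

Calling scan_log() as root prints the matching lines. Rotated logs
(messages.0.bz2 and so on) would need to be decompressed first.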

>> For starters, check smart info for all drives and see if they have any
>> relocated sectors.
>
> There are some disks with relocated sectors, but for both ada0 and
> ada4 Reallocated_Sector_Ct is 0.

Are there any UDMA errors? Those would suggest trouble with cabling.
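
As an illustration, the relevant counters can be pulled out of the
attribute table that "smartctl -A <device>" prints. This sketch assumes
the usual ten-column table layout with the raw value in the last
column, and it assumes the drive reports the counter under the name
UDMA_CRC_Error_Count; some vendors use a different name. The function
name is made up for this example.

```python
def smart_raw_values(smartctl_output,
                     names=("UDMA_CRC_Error_Count", "Reallocated_Sector_Ct")):
    """Extract raw values of the named SMART attributes from the text
    output of `smartctl -A`. Attributes that are absent are skipped."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            raw = fields[9]
            if raw.isdigit():
                values[fields[1]] = int(raw)
    return values
```

A non-zero and growing UDMA_CRC_Error_Count points at the cable or
connector rather than at the platters.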

>> Use gstat during your workload to see if any of the drives takes much
>> longer than others to handle its job.
>
> There is one disk sticking out a bit.

In a raid-z pool the number of transactions/second is determined by the
slowest disk. Check the ms/w column in gstat. Look for numbers
substantially higher than a typical seek time (10..20 ms is OK, 100 ms
is not).
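
A minimal sketch of that comparison, assuming you have already copied
the ms/w figures out of gstat into a dictionary by hand. The 20 ms and
100 ms limits come from the rule of thumb above; the factor of 3
relative to the pool median is an arbitrary assumption, and the
function name is made up for this example.

```python
from statistics import median


def slow_disks(ms_per_write, ok_ms=20.0, bad_ms=100.0, factor=3.0):
    """Return the sorted names of disks whose ms/w looks too high.

    A disk is flagged if it reaches bad_ms outright, or if it is above
    ok_ms and at least `factor` times the median of all disks.
    """
    if not ms_per_write:
        return []
    typical = median(ms_per_write.values())
    return sorted(
        disk
        for disk, ms in ms_per_write.items()
        if ms >= bad_ms or (ms > ok_ms and ms >= factor * typical)
    )
```

For example, slow_disks({"ada0": 12, "ada2": 15, "ada4": 140}) flags
only ada4 (hypothetical numbers, not taken from this pool).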

>
>> > There is almost no disk activity during this time.
>>
>> What kind of disk activity *is* there?
>
> What would be interesting?

Drives 'sticking out' being busy longer than their peers in the pool.
Excessive ms/r or ms/w in gstat. Unexpected reads or writes.

--Artem