ZFS + NFS poor performance after restarting from 100 day uptime

Josh Beard josh at signalboxes.net
Fri Mar 22 20:24:49 UTC 2013


On Fri, Mar 22, 2013 at 1:07 PM, Steven Hartland <killing at multiplay.co.uk> wrote:

>
>  ----- Original Message ----- From: Josh Beard
>>
>>> A snip of gstat:
>>>
>>> dT: 1.002s  w: 1.000s
>>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>>
>> ...
>
>>    4    160    126   1319   31.3     34    100    0.1  100.3| da1
>>>    4    146    110   1289   33.6     36     98    0.1   97.8| da2
>>>    4    142    107   1370   36.1     35    101    0.2  101.9| da3
>>>    4    121     95   1360   35.6     26     19    0.1   95.9| da4
>>>    4    151    117   1409   34.0     34    102    0.1  100.1| da5
>>>    4    141    109   1366   35.9     32    101    0.1   97.9| da6
>>>    4    136    118   1207   24.6     18     13    0.1   87.0| da7
>>>    4    118    102   1278   32.2     16     12    0.1   89.8| da8
>>>    4    138    116   1240   33.4     22     55    0.1  100.0| da9
>>>    4    133    117   1269   27.8     16     13    0.1   86.5| da10
>>>    4    121    102   1302   53.1     19     51    0.1  100.0| da11
>>>    4    120     99   1242   40.7     21     51    0.1   99.7| da12
>>>
>>> Your ops/s are maxing out your disks. You say "only", but ~190 ops/s
>>> is about what HDs will peak at, so whatever your machine is doing is
>>> causing it to max out the available IO for your disks.
>>>
>>> If you boot back to your previous kernel does the problem go away?
>>>
>>> If so you could look at the changes between the two kernel revisions
>>> for possible causes and, if needed, do a binary chop with kernel builds
>>> to narrow down the cause.
>>>
>>
>> Thanks for your response.  I booted with the old kernel (9.1-RC3) and the
>> problem disappeared!  We're getting 3x the performance with the previous
>> kernel compared to the 9.1-RELEASE-p1 kernel:
>>
>> Output from gstat:
>>
>>     1    362      0      0    0.0    345  20894    9.4   52.9| da1
>>     1    365      0      0    0.0    348  20893    9.4   54.1| da2
>>     1    367      0      0    0.0    350  20920    9.3   52.6| da3
>>     1    362      0      0    0.0    345  21275    9.5   54.1| da4
>>     1    363      0      0    0.0    346  21250    9.6   54.2| da5
>>     1    359      0      0    0.0    342  21352    9.5   53.8| da6
>>     1    347      0      0    0.0    330  20486    9.4   52.3| da7
>>     1    353      0      0    0.0    336  20689    9.6   52.9| da8
>>     1    355      0      0    0.0    338  20669    9.5   53.0| da9
>>     1    357      0      0    0.0    340  20770    9.5   52.5| da10
>>     1    351      0      0    0.0    334  20641    9.4   53.1| da11
>>     1    362      0      0    0.0    345  21155    9.6   54.1| da12
>>
>>
>> The kernels were compiled identically, using GENERIC with no modifications.
>> I'm no expert, but none of the svn commits I've looked through looks like
>> it would have any impact on this.  Any clues?
>>
>
> You're seeing a totally different profile there Josh: all writes and no
> reads, whereas before you were seeing mainly reads and some writes.
>
> So I would ask if you're sure you're seeing the same workload, or has
> something external changed too?
>
> Might be worth rebooting back to the new kernel and seeing if you
> still see the issue ;-)
>
>
>    Regards
>    Steve
>
>
Steve,

You're absolutely right.  I didn't catch that, but the total ops/s is
reaching quite a bit higher than before.  Things are certainly more
responsive than they have been, for what it's worth, so it "feels right."
I'm also not seeing the disks consistently railed at 100% busy like I was
before under similar testing (that was 50 machines just pushing data with
dd, roughly as sketched below).  I won't be able to get a good comparison
until Monday, when our students come back (this is a file server for a
public school district and is used for network homes).
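
For context, the dd test was nothing fancy; each client just pushed a big
sequential write at its NFS-mounted home, something along these lines (the
path and size here are illustrative, not the exact command we used):

    # roughly one client's worth of load, repeated across ~50 machines
    dd if=/dev/zero of=/home/testuser/ddtest.bin bs=64k count=16384
    rm /home/testuser/ddtest.bin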

Josh
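
P.S. For the binary chop you mentioned - assuming our /usr/src is an svn
checkout of releng/9.1, I take it the procedure would go roughly like this
(revision numbers below are placeholders, not actual suspects):

    # see what went in between the two kernels
    svn log /usr/src -r 242000:245000
    # check out a midpoint revision and rebuild
    svn update -r 243500 /usr/src
    cd /usr/src
    make buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC   # previous kernel is kept as /boot/kernel.old
    shutdown -r now
    # to fall back for one boot without reinstalling: nextboot -k kernel.old

Then repeat, halving the revision range each time, until the commit that
flips the gstat profile falls out.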

