ZFS "stalls" -- and maybe we should be talking about defaults?

Thu Mar 7 20:45:47 UTC 2013

----- Original Message ----- 
From: "Karl Denninger" <karl at denninger.net>

>>> I will test that but first I have to get the test machine to reliably
>>> stall so I know I'm not chasing my tail.
>>
>> Very sensible.
>>
>> Assuming you can reproduce it, one thing that might be interesting to
>> try is to eliminate all sync IO. I'm not sure if there are options in
>> Postgres to do this via configuration or if it would require editing
>> the code but this could reduce the problem space.
>>
>> If disabling sync IO eliminated the problem it would go a long way
>> to proving it isn't the IO volume or pattern per se but instead
>> related to the sync nature of said IO.
>>
> That can be turned off in the Postgres configuration.  For obvious
> reasons it's a very bad idea but it is able to be disabled without
> actually changing the code itself.
>
> I don't know if it shuts off ALL sync requests, but the documentation
> says it does.
>
> It's interesting that you ran into this with RRD going; the machine in
> question does pull RRD data for Cacti, but it's such a small piece of
> the total load profile that I considered it immaterial.
>
> It might not be.
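
For anyone repeating the test, the settings being discussed are most
likely fsync and synchronous_commit. A rough sketch follows, assuming a
throwaway test instance; the data directory path is hypothetical, and
whether this removes every sync request is exactly the open question
above.

```sh
# Sketch only: restart a TEST instance with forced syncs disabled.
# /usr/local/pgsql/data is a hypothetical data directory.
# Never do this on data you care about: a crash can corrupt the cluster.
pg_ctl -D /usr/local/pgsql/data restart \
    -o "-c fsync=off -c synchronous_commit=off -c full_page_writes=off"
```

If the stalls disappear with these set, that points at the sync nature
of the IO rather than its volume or pattern.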

We never did get to the bottom of it, but we did come up with a fix.

Instead of using straight RRD interaction, we switched all our code to
use rrdcached and put the files on an SSD-based pool. We have never had
an issue since.
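
As an illustration of that kind of setup (not the exact configuration
used here), rrdcached can batch updates in memory and write them out
periodically, which turns many small writes into fewer, larger ones.
The socket, pool and file paths below are hypothetical, and the
timings are arbitrary examples.

```sh
# Hypothetical paths; the pool/dataset layout is an assumption.
# -w: write interval, -z: random write delay, -f: flush interval for
# stale entries, -j: journal directory, -b/-B: restrict to base dir.
rrdcached -l unix:/var/run/rrdcached.sock \
    -p /var/run/rrdcached.pid \
    -b /ssdpool/rrd -B \
    -j /ssdpool/rrd-journal \
    -w 1800 -z 900 -f 3600

# Clients then go through the daemon rather than touching the files.
RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock
export RRDCACHED_ADDRESS
rrdtool update /ssdpool/rrd/example.rrd N:42
```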

    Regards
    Steve
