ZFS "stalls" -- and maybe we should be talking about defaults?

Thu Mar 7 19:30:58 UTC 2013

On 3/7/2013 1:27 PM, Steven Hartland wrote:
>
> ----- Original Message ----- From: "Karl Denninger" <karl at denninger.net>
> To: <freebsd-stable at freebsd.org>
> Sent: Thursday, March 07, 2013 7:07 PM
> Subject: Re: ZFS "stalls" -- and maybe we should be talking about
> defaults?
>
>
>
> On 3/7/2013 12:57 PM, Steven Hartland wrote:
>>
>> ----- Original Message ----- From: "Karl Denninger" <karl at denninger.net>
>>> Where I am right now is this:
>>>
>>> 1. I *CANNOT* reproduce the spins on the test machine with Postgres
>>> stopped in any way.  Even with multiple ZFS send/recv copies going on
>>> and the load average north of 20 (due to all the geli threads), the
>>> system doesn't stall or produce any notable pauses in throughput.  Nor
>>> does the system RAM allocation get driven hard enough to force paging.
>>> This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
>>> both stable and solid.
>>>
>>> 2. WITH Postgres running as a connected hot spare (identical to the
>>> production machine), allocating ~1.5G of shared, wired memory,  running
>>> the same synthetic workload in (1) above I am getting SMALL versions of
>>> the misbehavior.  However, while system RAM allocation gets driven
>>> pretty hard and reaches down toward 100MB in some instances it doesn't
>>> get driven hard enough to allocate swap.  The "burstiness" is very
>>> evident in the iostat figures with spates getting into the single digit
>>> MB/sec range from time to time but it's not enough to drive the system
>>> to a full-on stall.
>>>
>>>> There's pretty clearly a bad interaction here between Postgres wiring
>>>> memory and the ARC, when the latter is left alone and allowed to do
>>>> what it wants.  I'm continuing to work on replicating this on the test
>>>> machine... just not completely there yet.
>>>
>>> Another possibility to consider is how Postgres uses the FS. For
>>> example, does it request sync IO in ways not present in the system
>>> without it, causing the FS and possibly the underlying disk system to
>>> behave differently?
>>
>> That's possible but not terribly likely in this particular instance.
>> The reason is that I ran into this with the Postgres data store on a UFS
>> volume BEFORE I converted it.  Now it's on the ZFS pool (with
>> recordsize=8k as recommended for that filesystem) but when I first ran
>> into this it was on a separate UFS filesystem (which is where it had
>> resided for 2+ years without incident), so unless the Postgres
>> filesystem use on a UFS volume would give ZFS fits it's unlikely to be
>> involved.
>
> I hate to say it, but that sounds very similar to something we
> experienced with a machine here which was running high numbers of RRD
> updates. Again, we had the issue on UFS and saw the same thing when we
> moved to ZFS.
>
> I'll leave that there so as not to derail the investigation with what
> could be totally irrelevant info, but it may prove an interesting data
> point later.
>
> There are obvious common low level points between UFS and ZFS which
> may be the cause. One area which springs to mind is device bio ordering
> and barriers which could well be impacted by sync IO requests independent
> of the FS in use.
>
>>> One other option to test, just to rule it out, is what happens if you
>>> use the BSD scheduler instead of ULE.
>>
>> I will test that but first I have to get the test machine to reliably
>> stall so I know I'm not chasing my tail.
>
> Very sensible.
>
> Assuming you can reproduce it, one thing that might be interesting to
> try is to eliminate all sync IO. I'm not sure if there are options in
> Postgres to do this via configuration or if it would require editing
> the code but this could reduce the problem space.
>
> If disabling sync IO eliminated the problem it would go a long way
> to proving it isn't the IO volume or pattern per se but instead
> related to the sync nature of said IO.
>
That can be turned off in the Postgres configuration.  For obvious
reasons it's a very bad idea, but it can be disabled without actually
changing the code itself.

I don't know if it shuts off ALL sync requests, but the documentation
says it does.
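
For reference, a minimal sketch of what that change might look like,
assuming the setting in question is "fsync" in postgresql.conf (the
setting is not named above, so treat that as an assumption).
"synchronous_commit" is a separate, related Postgres setting included
here only for illustration:

```
# postgresql.conf (sketch for a disposable test machine only)
# Assumption: "fsync" is the option referred to in this message.
fsync = off                 # do not force WAL and data writes out to disk
synchronous_commit = off    # let commits return before the WAL is flushed
```

With these off, a crash or power loss can leave the database corrupt,
so this only makes sense on a test copy of the data.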

It's interesting that you ran into this with RRD going; the machine in
question does pull RRD data for Cacti, but it's such a small piece of
the total load profile that I considered it immaterial.

It might not be.

-- 
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC