ZFS "stalls" -- and maybe we should be talking about defaults?

Thu Mar 7 14:32:32 UTC 2013

On 3/7/2013 1:21 AM, Peter Jeremy wrote:
> On 2013-Mar-04 16:48:18 -0600, Karl Denninger <karl at denninger.net> wrote:
>> The subject machine in question has 12GB of RAM and dual Xeon
>> 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
>> local cache and the BBU for it.  The ZFS spindles are all exported as
>> JBOD drives.  I set up four disks under GPT, each with a single
>> freebsd-zfs partition; the partitions are labeled, and the providers
>> are then geli-encrypted and added to the pool.
> What sort of disks?  SAS or SATA?
SATA.  They're clean; they report no errors, no retries, no corrected
data (ECC) etc.  They also have been running for a couple of years under
UFS+SU without problems.  This isn't new hardware; it's an in-service
system.

>> also known good.  I began to get EXTENDED stalls with zero I/O going on,
>> some lasting for 30 seconds or so.  The system was not frozen but
>> anything that touched I/O would lock until it cleared.  Dedup is off,
>> incidentally.
> When the system has stalled:
> - Do you see very low free memory?
Yes.  Effectively zero.
> - What happens to all the different CPU utilisation figures?  Do they
>   all go to zero?  Do you get high system or interrupt CPU (including
>   going to 1 core's worth)?
No, they start to fall.  That figure is hard to trust, though: I am
geli-encrypting the spindles, so with no I/O there is nothing going
through geli and CPU use drops for that reason alone.  A falling figure
doesn't mean the CPU is actually idle.  I'm working on instrumenting
things sufficiently to try to peel that off -- I suspect the kernel is
spinning on something, but the trick is finding out what it is.
> - What happens to interrupt load?  Do you see any disk controller
>   interrupts?
None.
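
For the free-memory question above, here is a minimal sketch of the kind
of sampling that could be left running across a stall.  It is not the
instrumentation referred to in this message.  It assumes a FreeBSD host
where sysctl(8) prints "name: value" lines; the choice of which sysctls
to watch is an assumption, and the helper names are hypothetical.

```python
import subprocess

# sysctl names as found on FreeBSD; which ones to watch is an assumption.
NAMES = [
    "hw.pagesize",
    "vm.stats.vm.v_free_count",
    "kstat.zfs.misc.arcstats.size",
]

def parse_sysctl(text):
    """Parse 'name: value' lines as printed by sysctl(8) into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep and value.strip().isdigit():
            values[name.strip()] = int(value)
    return values

def summarize(values):
    """Return (free_mb, arc_mb) computed from parsed sysctl values."""
    mib = 1024 * 1024
    free_mb = values["vm.stats.vm.v_free_count"] * values["hw.pagesize"] / mib
    arc_mb = values["kstat.zfs.misc.arcstats.size"] / mib
    return free_mb, arc_mb

def sample():
    """Take one sample by running sysctl(8); only works on a FreeBSD host."""
    out = subprocess.run(["sysctl"] + NAMES,
                         capture_output=True, text=True, check=True)
    return summarize(parse_sysctl(out.stdout))
```

Calling sample() once a second and logging the pair with a timestamp
would show whether free memory bottoms out before or after the ARC stops
growing.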
>
> Would you be able to build a kernel with WITNESS (and WITNESS_SKIPSPIN)
> and see if you get any errors when stalls happen.
If I have to.  That's easy to do on the test box -- on the production
one, not so much.
> On 2013-Mar-05 14:09:36 -0800, Jeremy Chadwick <jdc at koitsu.org> wrote:
>> On Tue, Mar 05, 2013 at 01:09:41PM +0200, Andriy Gapon wrote:
>>> Completely unrelated to the main thread:
>>>
>>> on 05/03/2013 07:32 Jeremy Chadwick said the following:
>>>> That said, I still do not recommend ZFS for a root filesystem
>>> Why?
>> Too long a history of problems with it and weird edge cases (keep
>> reading); the last thing an administrator wants to deal with is a system
>> where the root filesystem won't mount/can't be used.  It makes
>> recovery or problem-solving (i.e. the server is not physically accessible
>> given geographic distances) very difficult.
> I've had lots of problems with a gmirrored UFS root as well.  The
> biggest issue is that gmirror has no audit functionality so you
> can't verify that both sides of a mirror really do have the same data.
I have root on a 2-drive RAID mirror (done in the controller) and that
has been fine.  The controller does scrubs on a regular basis
internally.  The problem is that if the two drives return clean reads
that differ (no ECC indications, etc.), it doesn't know which is the
correct copy.  The good news is that hasn't happened yet :-)

The risk of this happening as my data store continues to expand is one
of the reasons I want to move toward ZFS, but not necessarily for the
boot drives.  For the data store, however....

>> My point/opinion: UFS for a root filesystem is guaranteed to work
>> without any fiddling about and, barring drive failures or controller
>> issues, is (again, my opinion) a lot more risk-free than ZFS-on-root.
> AFAIK, you can't boot from anything other than a single disk (ie no
> graid).
Where I am right now is this:

1. With Postgres stopped, I *CANNOT* reproduce the spins on the test
machine in any way.  Even with multiple ZFS send/recv copies going on
and the load average north of 20 (due to all the geli threads), the
system doesn't stall or produce any notable pauses in throughput.  Nor
does the system RAM allocation get driven hard enough to force paging.

This is with NO tuning hacks in /boot/loader.conf.  I/O performance is
both stable and solid.

2. WITH Postgres running as a connected hot spare (identical to the
production machine) and allocating ~1.5G of shared, wired memory, the
same synthetic workload as in (1) above produces SMALL versions of the
misbehavior.  System RAM allocation gets driven pretty hard, with free
memory reaching down toward 100MB in some instances, but not hard
enough to allocate swap.  The "burstiness" is very evident in the
iostat figures, with throughput dropping into the single-digit MB/sec
range from time to time, but it's not enough to drive the system to a
full-on stall.
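
To put a number on that burstiness, the following is a minimal sketch
that flags runs of consecutive low-throughput samples in a series of
per-interval MB/sec figures (for example, one column copied out of
iostat output).  The function name, the threshold and the minimum run
length are all hypothetical choices, not values taken from this system.

```python
def find_stalls(samples_mb_s, threshold_mb_s, min_len):
    """Return (start, end) index pairs, end exclusive, for every run of at
    least min_len consecutive samples below threshold_mb_s."""
    runs = []
    start = None
    for i, rate in enumerate(samples_mb_s):
        if rate < threshold_mb_s:
            if start is None:
                start = i          # a low-throughput run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None           # throughput recovered
    # Close a run that is still open at the end of the series.
    if start is not None and len(samples_mb_s) - start >= min_len:
        runs.append((start, len(samples_mb_s)))
    return runs
```

With one-second samples, a threshold of 10 and a min_len of 2 would
report any dip into single-digit MB/sec that lasts two seconds or more,
while ignoring one-off low samples.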

There's pretty clearly a bad interaction here between Postgres wiring
memory and the ARC, when the latter is left alone and allowed to do what
it wants.  I'm continuing to work on replicating this on the test
machine... just not completely there yet.
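
One commonly discussed mitigation for this kind of contention is to cap
the ARC with the vfs.zfs.arc_max loader tunable.  Nothing in this
message says that was done; the tests above ran with no tuning at all.
As a rough sketch of the arithmetic only: the 12GB and ~1.5G figures
come from this thread, while the 2GB of headroom for the kernel and
other processes is purely an assumption, as are the helper names.

```python
MIB = 1024 * 1024

def arc_max_bytes(total_ram_mib, wired_app_mib, headroom_mib):
    """Candidate ARC cap in bytes: RAM left after wired application memory
    and an assumed headroom for everything else."""
    cap_mib = total_ram_mib - wired_app_mib - headroom_mib
    if cap_mib <= 0:
        raise ValueError("no memory left for the ARC under these assumptions")
    return cap_mib * MIB

def loader_conf_line(cap_bytes):
    """Format the cap as a /boot/loader.conf entry."""
    return 'vfs.zfs.arc_max="%d"' % cap_bytes

# 12GB of RAM, ~1.5G wired by Postgres, 2GB headroom (assumption).
print(loader_conf_line(arc_max_bytes(12 * 1024, 1536, 2048)))
```

Whether a cap like this is the right answer, or whether the defaults
should handle wired memory better on their own, is exactly the question
in the subject line.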

-- 
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC