ZFS "stalls" -- and maybe we should be talking about defaults?

Jeremy Chadwick jdc at koitsu.org
Wed Mar 6 05:08:11 UTC 2013

On Tue, Mar 05, 2013 at 06:56:02AM -0600, Karl Denninger wrote:
> { I've snipped lots of text.  For those who are reading this follow-up }
> { and wish to read the snipped portions, please see this URL: }
> { http://lists.freebsd.org/pipermail/freebsd-stable/2013-March/072696.html }

> > 1. Is compression enabled?  Has it ever been enabled (on any fs) in the
> > past (barring pool being destroyed + recreated)?
> >
> > 2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
> > (barring pool being destroyed + recreated)?

No answers to questions #1 and #2?  (Edit: see below -- I believe it's
implied that neither is used)

> > * Describing the stall symptoms; what all does it impact?  Can you
> >   switch VTYs on console when its happening?  Network I/O (e.g. SSH'd
> >   into the same box and just holding down a letter) showing stalls
> >   then catching up?  Things of this nature.
> When it happens on my system anything that is CPU-bound continues to
> execute.  I can switch consoles and network I/O also works.

Okay, it sounds like compression and dedup aren't in use and have never
been used.  The stalling problem seen with compression and dedup (it
appears if you use either feature, and worsens if you use both) results
in a full/hard system stall where *everything* is impacted, and has been
explained in the past (2nd URL has the explanation):


> If I have an iostat running at the time all I/O counters go to and
> remain at zero while the stall is occurring, but the process that is
> producing the iostat continues to run and emit characters whether it
> is a ssh session or on the physical console.  

What kind of an iostat?  iostat(8) or zpool iostat?

(Edit: last paragraph of this response says "zpool iostat", which is not
the same thing as iostat)

Why not gstat(8), e.g. gstat -I500ms, as well?  This provides I/O
statistics at a deeper layer (GEOM), below the ZFS layer.

Do the numbers actually change **while the system is stalling**?

The answer matters greatly, because it would help indicate if some
kernel API requests for I/O statistics are also blocking, or if only
*actual I/O (e.g. read() and write() requests)* are blocking.
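
One way to get at that answer is to timestamp every sample line the
stats tools emit; if the timestamps themselves gap during a stall, the
stat-gathering path is blocking too.  A minimal sketch (the "stamp"
helper is my own invention, not a standard tool; zpool/gstat obviously
only exist on the affected box):

```shell
# stamp: prefix each incoming line with the current time, so gaps
# between consecutive timestamps expose when the producer itself stalled.
stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date +%T)" "$line"
    done
}
# On the affected box (not runnable elsewhere):
#   zpool iostat 1  | stamp > /var/tmp/zpool-iostat.log
#   gstat -bI 500ms | stamp > /var/tmp/gstat.log
```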

> The CPUs are running and processing, but all threads block if they
> attempt access to the disk I/O subsystem, irrespective of the portion
> of the disk I/O subsystem they attempt to access (e.g. UFS, swap or
> ZFS)  I therefore cannot start any new process that requires image
> activation.

And now you'll need to provide a full diagram of your disk and
controller device tree, along with all partitions, slices, and
filesystem types.  It's best to draw this in ASCII in a tree-like
diagram.  It will take you 15-20 minutes to do.
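
For what it's worth, most of the raw material for that diagram can be
harvested mechanically before drawing it.  A sketch (the "section"
helper is mine, and the exact commands and their output formats vary by
FreeBSD release):

```shell
# section: print a labelled header, then run the given command,
# noting when a command isn't available rather than aborting.
section() {
    label=$1; shift
    echo "== $label =="
    "$@" 2>/dev/null || echo "(unavailable)"
}
# On the affected box, roughly:
#   { section controllers camcontrol devlist
#     section partitions  gpart show
#     section mounts      mount -p
#     section pools       zpool status
#   } > /var/tmp/io-topology.txt
```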

What's even more concerning:

This thread is about ZFS, yet you're saying applications block when they
attempt to do I/O to a filesystem ***other than ZFS***.  There must be
some kind of commonality here, i.e. a single controller is driving both
the ZFS and UFS disks, or something along those lines.  If there isn't,
then there is something within the kernel I/O subsystem that is doing
this.  Like I said: very deep, very knowledgeable kernel folks are the
only ones who can fix this.

> > * How long the stall is in duration (ex. if there's some way to
> >   roughly calculate this using "date" in a shell script)
> They're variable.  Some last fractions of a second and are not really
> all that noticeable unless you happen to be paying CLOSE attention. 
> Some last a few (5 or so) seconds.  The really bad ones last long enough
> that the kernel throws the message "swap_pager: indefinite wait buffer".

The message "swap_pager: indefinite wait buffer" indicates that some
part of the VM is trying to offload pages of memory to swap via standard
I/O write requests, and those writes have not completed within hz*20
ticks -- i.e. 20 seconds.  That's a very, very long time.
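
As for measuring the stalls with "date" (per my earlier question): a
crude probe along these lines would put numbers on them.  This is a
sketch, not a tool; the path and threshold are arbitrary:

```shell
# probe_once: time one tiny write-plus-sync to the given path, in whole
# seconds.  During a stall the write blocks, so the result balloons.
probe_once() {
    t0=$(date +%s)
    dd if=/dev/zero of="$1" bs=512 count=1 2>/dev/null
    sync
    t1=$(date +%s)
    echo $((t1 - t0))
}
# On the affected box, log anything slower than a second:
#   while :; do
#       d=$(probe_once /var/tmp/stallprobe)
#       [ "$d" -gt 1 ] && echo "$(date +%T) stall ~${d}s"
#       sleep 1
#   done
```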

> The machine in the general sense never pages.  It contains 12GB of RAM
> but historically (prior to ZFS being put into service) always showed "0"
> for a "pstat -s", although it does have a 20g raw swap partition (to
> /dev/da0s1b, not to a zpool) allocated.

The swap_pager message implies otherwise.  It may be that the programs
you're using poll at intervals of, say, 1 second, and swap-out + swap-in
occurs very quickly so you never see it.  (Edit: the next quoted
paragraph shows that there ARE pages of memory hitting swap, so "never
pages" is not accurate.)
I do not know the VM subsystem well enough to know what the criteria are
for offloading pages of memory to swap -- but it's obviously happening.
It may be due to memory pressure, or it may be due to "pages which have
not been touched in a long while" -- again, I do not know.  This is
where "vmstat -s" would be useful.  Possibly Alan Cox knows.
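
A before/after diff of those counters would at least show whether the
swap activity coincides with the stalls.  Sketch (the "delta" helper is
mine; it assumes vmstat -s's "value description" line format):

```shell
# delta: given two snapshot files of "value description" lines, print
# the difference for every counter whose value changed between them.
delta() {
    awk 'NR == FNR { a[substr($0, index($0, $2))] = $1; next }
         { k = substr($0, index($0, $2)); if ($1 != a[k]) print $1 - a[k], k }' "$1" "$2"
}
# On the affected box:
#   vmstat -s > /tmp/vm.before; sleep 60; vmstat -s > /tmp/vm.after
#   delta /tmp/vm.before /tmp/vm.after
```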

> During the stalls I cannot run a pstat (I tried; it stalls) but when it
> unlocks I find that there is swap allocated, albeit not a ridiculous
> amount.  ~20,000 pages or so have made it to the swap partition. This is
> not behavior that I had seen before on this machine prior to the stall
> problem, and with the two tuning tweaks discussed here I'm now up to 48
> hours without any allocation to swap (or any stalls.)

This would fall under the same category as your above statement, re:
that any kind of I/O blocks until "something" gets released.  The whole
thing smells of some kind of global mutex or semaphore, which then makes
me think of Giant, except that's mostly gone.

> > * Contents of /etc/sysctl.conf and /boot/loader.conf (re: "tweaking"
> >   of the system)
> /boot/loader.conf:
> {snip}
> vfs.zfs.arc_max=2000000000
> vfs.zfs.write_limit_override=1024000000
> {snip}
> The two ZFS-related entries at the end, if present, stop the stalls.

I'd like to know which of the two "stops the stalls".

The former limits ARC size (at least on FreeBSD it does; when I last
used Solaris, the same tunable there was a "recommendation" rather than
a hard limit), while the latter limits overall "write bandwidth" (for
lack of a better term).  If the former is what addresses the issue, then
memory
fragmentation or some ARC-related bug is the cause (again I'm
speculating).  Again: only low-level kernel folks are going to be able
to work this one out, with your help.

I am at a loss for this problem.  To me, in your case, it sounds like
you have a multitude of ZFS and UFS disks on the same controller, and it
may be that the **controller** is "wedging" on all these I/O requests.
I don't use arcmsr(4), and I don't know how to prove whether it's
arcmsr(4) that's doing this.

Part of me wonders if folks experiencing this are hitting some kind of
memory bus limit or something along those lines, and since ZFS tends to
shove everything into the ARC then periodically (vfs.zfs.txg.timeout)
flush gigantic amounts to disk, I wonder if there's some contention
between different drivers/pieces (arcmsr vs. zfs vs. VM vs. ???) causing
the issue.

Irrelevant comment: you should use human-readable values for those
tunables, for legibility (make sure you use quotes, and do so
consistently throughout loader.conf), ex.:

vfs.zfs.arc_max="2G"
vfs.zfs.write_limit_override="1G"

> {snip}
> sysctl.conf contains:
> {snip}
> net.inet.tcp.imcp_may_rst=0

Irrelevant comment: typo in the MIB name here; surprised you haven't
seen messages about this on your system consoles ("unknown oid").

> I suspect (but can't yet prove) that wiring shared memory is likely
> involved in this.  That makes a BIG difference in Postgres performance,
> but I can certainly see where a misbehaving ARC cache could "think" that
> the (rather large) shared segment that Postgres has (it currently
> allocates 1.5G of shared memory and wires it) can or might "get out of
> the way." 

Remove pgsql from the picture and see if you can reproduce the problem.
Like I said: a dedicated test box would do you well.  :-)

FreeBSD's classic shm_xxx(3) stuff has always been painful, in my
experience.  I had the wonderful pleasure of dealing with it when it
came to PHP/PECL's APC, and found that the mmap(2) mechanism works
significantly better (and I don't have to futz with stupid sysctls).
But this comment doesn't solve anything or do you any good.

> {snip}
> I'm quite sure I can reproduce the workload that causes the stalls;
> populating the backup pack as a separate zfs pool (with zfs send | zfs
> recv) was what led to it happening here originally.
> With that said I've got more than 24 hours on the box that exhibited the
> problem with the two tunables in /boot/loader.conf and a sentinel
> process that is doing a zpool iostat 5 looking for more than one "all
> zeros" I/O line sequentially. 
> It hasn't happened since I stuck those two lines in there and at this
> point two nightly backup runs have gone to completion along with some
> fairly heavy user I/O last evening which was plenty of load to provoke
> the misbehavior previously.
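
That sentinel is easy enough to sketch; something like the following
flags the second and later of consecutive all-zero sample lines.  The
field positions assume zpool iostat's default pool/alloc/free/ops/ops/
bw/bw layout, and the function name is mine:

```shell
# zero_watch: flag runs of 2+ consecutive all-zero "zpool iostat N"
# data lines ($4-$7 = read ops, write ops, read bw, write bw).
# Header and separator lines fail the numeric test and are ignored.
zero_watch() {
    awk '$4 == 0 && $5 == 0 && $6 == 0 && $7 == 0 { if (++z > 1) print "stall:", $0; next }
         { z = 0 }'
}
# On the affected box:
#   zpool iostat pool 5 | zero_watch
```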

I wish you had just added one of those lines instead of both.  Even with
just those 2 lines, there are still a great many possible causes.
My entire gut feeling at this point is that there's some kind of
controller (as in firmware or driver-level) nonsense going on.

You're going to need that test box up and reproducing the problem, and
then (I hate to tell you this) you're probably going to have to hire
someone from the FreeBSD Project -- as in pay them hourly -- to figure
this out.

Otherwise, I found this post of Freddie's to be interesting:


This is all I can say with regards to this thread at this point.  I have
absolutely nothing else of worth to add.  Anything else I'd say would
just be negative/condescending (toward ZFS) and would do no one any good.

| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                   http://jdc.koitsu.org/ |
| Mountain View, CA, US                                               |
| Making life hard for others since 1977.                PGP 4BD6C0CB |

More information about the freebsd-stable mailing list