ZFS "stalls" -- and maybe we should be talking about defaults?

Karl Denninger karl at denninger.net
Tue Mar 5 12:56:09 UTC 2013


On 3/5/2013 3:27 AM, Jeremy Chadwick wrote:
> On Tue, Mar 05, 2013 at 09:12:47AM -0000, Steven Hartland wrote:
>> ----- Original Message ----- From: "Jeremy Chadwick"
>> <jdc at koitsu.org>
>> To: "Ben Morrow" <ben at morrow.me.uk>
>> Cc: <freebsd-stable at freebsd.org>
>> Sent: Tuesday, March 05, 2013 5:32 AM
>> Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?
>>
>>
>>> On Tue, Mar 05, 2013 at 05:05:47AM +0000, Ben Morrow wrote:
>>>> Quoth Karl Denninger <karl at denninger.net>:
>>>>> Note that the machine is not booting from ZFS -- it is booting from and
>>>>> has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
>>>>> like a single "da0" drive to the OS) and that drive stalls as well when
>>>>> it freezes.  It's definitely a kernel thing when it happens as the OS
>>>>> would otherwise not have locked (just I/O to the user partitions) -- but
>>>>> it does.
>>>> Is it still the case that mixing UFS and ZFS can cause problems, or were
>>>> they all fixed? I remember a while ago (before the arc usage monitoring
>>>> code was added) there were a number of reports of serious problems
>>>> running an rsync from UFS to ZFS.
>>> This problem still exists on stable/9.  The behaviour manifests itself
>>> as fairly bad performance (I cannot remember if stalling or if just
>>> throughput rates were awful).  I can only speculate as to what the root
>>> cause is, but my guess is that it has something to do with the two
>>> caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.
>> In our case we have no UFS, so this isn't the cause of the stalls.
>> Spec here is
>> * 64GB RAM
>> * LSI 2008
>> * 8.3-RELEASE
>> * Pure ZFS
>> * Trigger MySQL doing a DB import, nothing else running.
>> * 4K disk alignment
> 1. Is compression enabled?  Has it ever been enabled (on any fs) in the
> past (barring pool being destroyed + recreated)?
>
> 2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
> (barring pool being destroyed + recreated)?
>
> I can speculate day and night about what could cause this kind of issue,
> honestly.  The possibilities are quite literally infinite, and all of
> them require folks deeply familiar with both FreeBSD's ZFS as well as
> very key/major parts of the kernel (ranging from VM to interrupt
> handlers to I/O subsystem).  (This next comment isn't for you, Steve,
> you already know this :-) )  The way different pieces of the kernel
> interact with one another is fairly complex; the kernel is not simple.
>
> Things I think that might prove useful:
>
> * Describing the stall symptoms; what all does it impact?  Can you
>   switch VTYs on console when it's happening?  Network I/O (e.g. SSH'd
>   into the same box and just holding down a letter) showing stalls
>   then catching up?  Things of this nature.
When it happens on my system anything that is CPU-bound continues to
execute.  I can switch consoles and network I/O also works.  If I have
an iostat running at the time all I/O counters go to and remain at zero
while the stall is occurring, but the process that is producing the
iostat continues to run and emit characters whether it is an ssh session
or on the physical console.  

The CPUs are running and processing, but all threads block if they attempt
to access the disk I/O subsystem, irrespective of which portion of it they
touch (e.g. UFS, swap or ZFS).  I therefore cannot start any new process
that requires image activation.

> * How long the stall is in duration (ex. if there's some way to
>   roughly calculate this using "date" in a shell script)
They're variable.  Some last fractions of a second and are not really
all that noticeable unless you happen to be paying CLOSE attention. 
Some last a few (5 or so) seconds.  The really bad ones last long enough
that the kernel throws the message "swap_pager: indefinite wait buffer".
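
Something along these lines could put rough numbers on the longer ones; a
minimal sketch, assuming sync(8) itself blocks for the duration of the
stall, which may not hold in every case:

#!/bin/sh
# Sketch: every second, time how long a sync(8) call takes; anything that
# blocks for a couple of seconds or more gets logged with a timestamp.
while :; do
    start=$(date +%s)
    sync
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -ge 2 ]; then
        echo "$(date): sync blocked for roughly ${elapsed}s"
    fi
    sleep 1
done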

The machine in the general sense never pages.  It contains 12GB of RAM
but historically (prior to ZFS being put into service) always showed "0"
for a "pstat -s", although it does have a 20g raw swap partition (to
/dev/da0s1b, not to a zpool) allocated.

During the stalls I cannot run a pstat (I tried; it stalls) but when it
unlocks I find that there is swap allocated, albeit not a ridiculous
amount.  Roughly 20,000 pages have made it to the swap partition.  This is
not behavior I had seen on this machine prior to the stall problem, and
with the two tuning tweaks discussed here I'm now up to 48 hours without
any allocation to swap (or any stalls).
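
A logger along these lines would catch when pages first start moving to
the swap device; a sketch that assumes a single swap device, so the second
line of swapinfo(8) output is the one of interest:

#!/bin/sh
# Sketch: log swap usage once a minute so the first allocation to swap can
# be lined up against a stall.  Column 3 of "swapinfo -k" is "Used" in KB.
while :; do
    used=$(swapinfo -k | awk 'NR == 2 { print $3 }')
    echo "$(date '+%Y-%m-%d %H:%M:%S') swap used: ${used} KB"
    sleep 60
done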

> * Contents of /etc/sysctl.conf and /boot/loader.conf (re: "tweaking"
>   of the system)
/boot/loader.conf:

kern.ipc.semmni=256
kern.ipc.semmns=512
kern.ipc.semmnu=256
geom_eli_load="YES"
sound_load="YES"
#
# Limit to physical CPU count for threads
#
kern.geom.eli.threads=8
#
# ZFS Prefetch does help, although you'd think it would not due to the
# adapter doing it already.  Wrong guess; it's good for 2x the performance.
# We limit the ARC to 2GB of RAM and the TXG write limit to 1GB.
#
#vfs.zfs.prefetch_disable="1"
vfs.zfs.arc_max=2000000000
vfs.zfs.write_limit_override=1024000000
--------------------------------

The first three are required for Postgres.  The geli thread limit has
been found to provide better performance under heavy load, as the system
will otherwise start 16 threads per geli-attached provider since the
CPUs support hyperthreading. 
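
(For anyone replicating the setup, that value doesn't have to be
hard-coded; a sketch that assumes hyperthreading exactly doubles hw.ncpu
on this class of hardware:)

#!/bin/sh
# Sketch: print a geli thread count equal to the physical core count,
# assuming the logical CPU count (hw.ncpu) is twice the physical count
# because of hyperthreading.  Add the result to /boot/loader.conf by hand.
ncpu=$(sysctl -n hw.ncpu)
echo "kern.geom.eli.threads=$((ncpu / 2))"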

The two ZFS-related entries at the end, if present, stop the stalls.

Geli is not used on the boot pack; da0 is an old-style MBR disk that is
physically composed of two 300GB drives in a mirror managed by the
adapter.  Swap resides on the traditional "b" slice of that pack; it is a
reasonably standard "old-style" setup in that regard, with separate root,
/home, /var and /usr slices.

sysctl.conf contains:

# $FreeBSD: src/etc/sysctl.conf,v 1.8 2003/03/13 18:43:50 mux Exp $
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#

# Uncomment this to prevent users from seeing information about processes
# that are being run under another UID.
#security.bsd.see_other_uids=0
#
# tuning for PostgreSQL
#
kern.ipc.shm_use_phys=1
kern.ipc.shmmax=4096000000
kern.ipc.shmall=1000000
kern.ipc.semmsl=512
kern.ipc.semmap=256

#
# IP Performance
#
kern.ipc.somaxconn=4096
kern.ipc.nmbclusters=32768
net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072
net.inet.tcp.inflight.enable=1
#
# Tune for asshole (DDOS) resistance
#
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.icmp.icmplim=10
net.inet.tcp.icmp_may_rst=0
net.inet.tcp.drop_synfin=1
net.inet.tcp.msl=7500
#
# Maxfiles
#
kern.maxfiles=65535
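
(A quick sanity check on the shared-memory numbers above, since shmall is
counted in pages while shmmax is in bytes; a sketch assuming the usual
4096-byte page size:)

#!/bin/sh
# Sketch: confirm that shmall (pages) and shmmax (bytes) describe roughly
# the same limit.
pagesize=$(sysctl -n hw.pagesize)
shmall=$(sysctl -n kern.ipc.shmall)
shmmax=$(sysctl -n kern.ipc.shmmax)
echo "shmall covers $((shmall * pagesize)) bytes; shmmax is ${shmmax} bytes"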

I suspect (but can't yet prove) that wiring shared memory is likely
involved in this.  It makes a BIG difference in Postgres performance, but I
can certainly see where a misbehaving ARC could "think" that the rather
large shared segment Postgres has (it currently allocates 1.5G of shared
memory and wires it) can or might "get out of the way."

But it most certainly won't with kern.ipc.shm_use_phys set.
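
One way to watch that theory would be to sample wired memory and the ARC
side by side; a sketch, assuming the stock arcstats sysctl is available:

#!/bin/sh
# Sketch: log wired memory and ARC size together every 10 seconds so any
# tug-of-war between the two shows up in the numbers.
pagesize=$(sysctl -n hw.pagesize)
while :; do
    wired=$(sysctl -n vm.stats.vm.v_wire_count)
    arc=$(sysctl -n kstat.zfs.misc.arcstats.size)
    echo "$(date '+%H:%M:%S') wired: $((wired * pagesize / 1048576)) MB  arc: $((arc / 1048576)) MB"
    sleep 10
done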

In normal operation that Postgres server is a hot-spare replication
machine that connects to Asheville; in the event of a catastrophic
failure there it would be promoted and the load would shift here.

> * "sysctl -a | grep zfs" before and after a stall -- do not bother
>   with those "ARC summaries" scripts please, at least not for this
> * "vmstat -z" before and after a stall
> * "vmstat -m" before and after a stall
> * "vmstat -s" before and after a stall
> * "vmstat -i" before, after, AND during a stall
>
> Basically, every person who experiences this problem needs to treat
> every situation uniquely -- no "me too" -- and try to find reliable 100%
> test cases for it.  That's the only way bugs of this nature (i.e.
> of a complex nature) get fixed.
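
To make sure those before/during/after snapshots get captured the same way
each time, I'll likely wrap them in something like this (a sketch; the
output directory is just a placeholder, and during a hard stall the
commands may of course block):

#!/bin/sh
# Sketch: dump the requested state into timestamped files.  Run once before
# the test, once during a stall if anything will still execute, once after.
dir=/var/tmp/stall-data
mkdir -p "$dir"
stamp=$(date '+%Y%m%d-%H%M%S')
sysctl -a | grep zfs > "$dir/sysctl-zfs.$stamp"
vmstat -z > "$dir/vmstat-z.$stamp"
vmstat -m > "$dir/vmstat-m.$stamp"
vmstat -s > "$dir/vmstat-s.$stamp"
vmstat -i > "$dir/vmstat-i.$stamp"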

I am fortunate enough to have an identical machine that's "cold" in the
rack and will work on spinning that up today; I'm going to attach another
pack to the backup and allow it to resilver, then use that "in anger" to
restore the spare box.

I'm quite sure I can reproduce the workload that causes the stalls;
populating the backup pack as a separate zfs pool (with zfs send | zfs
recv) was what led to it happening here originally.
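
The populate step was nothing exotic; roughly this shape, with "tank" and
"backup" as placeholder pool names rather than the real ones here:

# Sketch of the send/receive that was running when the stalls first showed
# up; a throwaway recursive snapshot feeds the new pool.
zfs snapshot -r tank@replicate
zfs send -R tank@replicate | zfs recv -F backup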

With that said, I've got more than 24 hours on the box that exhibited the
problem with the two tunables in /boot/loader.conf and a sentinel process
doing a "zpool iostat 5" looking for more than one "all zeros" I/O line in
sequence.

It hasn't happened since I stuck those two lines in there.  At this point
two nightly backup runs have gone to completion, along with some fairly
heavy user I/O last evening, which was plenty of load to provoke the
misbehavior previously.

-- 
-- Karl Denninger
/The Market Ticker ®/ <http://market-ticker.org>
Cuda Systems LLC

