ZFS "stalls" -- and maybe we should be talking about defaults?

Tue Mar 5 09:27:02 UTC 2013

On Tue, Mar 05, 2013 at 09:12:47AM -0000, Steven Hartland wrote:
> 
> ----- Original Message ----- From: "Jeremy Chadwick"
> <jdc at koitsu.org>
> To: "Ben Morrow" <ben at morrow.me.uk>
> Cc: <freebsd-stable at freebsd.org>
> Sent: Tuesday, March 05, 2013 5:32 AM
> Subject: Re: ZFS "stalls" -- and maybe we should be talking about defaults?
> 
> 
> >On Tue, Mar 05, 2013 at 05:05:47AM +0000, Ben Morrow wrote:
> >>Quoth Karl Denninger <karl at denninger.net>:
> >>> Note that the machine is not booting from ZFS -- it is
> >>> booting from and
> >>> has its swap on a UFS 2-drive mirror (handled by the disk adapter; looks
> >>> like a single "da0" drive to the OS) and that drive stalls as well when
> >>> it freezes.  It's definitely a kernel thing when it happens as the OS
> >>> would otherwise not have locked (just I/O to the user partitions) -- but
> >>> it does.
> >>
> >>Is it still the case that mixing UFS and ZFS can cause problems, or were
> >>they all fixed? I remember a while ago (before the arc usage monitoring
> >>code was added) there were a number of reports of serious problems
> >>running an rsync from UFS to ZFS.
> >
> >This problem still exists on stable/9.  The behaviour manifests itself
> >as fairly bad performance (I cannot remember if stalling or if just
> >throughput rates were awful).  I can only speculate as to what the root
> >cause is, but my guess is that it has something to do with the two
> >caching systems (UFS vs. ZFS ARC) fighting over large sums of memory.
> 
> In our case we have no UFS, so this isn't the cause of the stalls.
> Spec here is
> * 64GB RAM
> * LSI 2008
> * 8.3-RELEASE
> * Pure ZFS
> * Trigger: MySQL doing a DB import, nothing else running.
> * 4K disk alignment

1. Is compression enabled?  Has it ever been enabled (on any fs) in the
past (barring pool being destroyed + recreated)?

2. Is dedup enabled?  Has it ever been enabled (on any fs) in the past
(barring pool being destroyed + recreated)?
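For anyone unsure how to answer the two questions above, here is a rough
sketch. "tank" is a placeholder pool name (not one from this thread), and
the function names are mine. It assumes the pool's command history has not
been lost, since "zpool history" is where past "zfs set"/"zfs create -o"
commands would show up.

```shell
#!/bin/sh
# Sketch only. POOL is a placeholder; substitute your own pool name.
POOL=${POOL:-tank}

# Current compression/dedup settings on every dataset in the pool.
current_settings() {
    zfs get -r -o name,property,value,source compression,dedup "$POOL"
}

# Reads "zpool history" text on stdin and prints the lines that set
# compression or dedup to a value other than "off".  The pattern accepts
# any value that does not begin with "of" (on, lzjb, gzip, verify, ...).
ever_enabled() {
    grep -E '(compression|dedup)=([^o]|o[^f])' || true
}
```

Usage would be something like "current_settings" for the present state and
"zpool history tank | ever_enabled" for the past.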

I can speculate day and night about what could cause this kind of issue,
honestly.  The possibilities are quite literally infinite, and all of
them require folks deeply familiar with FreeBSD's ZFS as well as
very key/major parts of the kernel (ranging from VM to interrupt
handlers to I/O subsystem).  (This next comment isn't for you, Steve,
you already know this :-) )  The way different pieces of the kernel
interact with one another is fairly complex; the kernel is not simple.

Things I think that might prove useful:

* Describing the stall symptoms; what all does it impact?  Can you
  switch VTYs on console when it's happening?  Network I/O (e.g. SSH'd
  into the same box and just holding down a letter) showing stalls
  then catching up?  Things of this nature.
* How long the stall is in duration (ex. if there's some way to
  roughly calculate this using "date" in a shell script)
* Contents of /etc/sysctl.conf and /boot/loader.conf (re: "tweaking"
  of the system)
* "sysctl -a | grep zfs" before and after a stall -- do not bother
  with those "ARC summaries" scripts please, at least not for this
* "vmstat -z" before and after a stall
* "vmstat -m" before and after a stall
* "vmstat -s" before and after a stall
* "vmstat -i" before, after, AND during a stall
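A rough sketch of how the captures and the "date"-based timing above could
be scripted.  The output directory, the 1-second interval, the 3-second
threshold and the function names are all my assumptions, nothing more.  It
measures a stall only after the fact (a 1-second sleep that took far longer
than 1 second), and wall-clock steps (e.g. ntpd) would confuse it.  If
$OUT lives on storage that stalls, the writes will block too, so this will
not reliably get you the "during" case for "vmstat -i".

```shell
#!/bin/sh
# Sketch only: paths, interval and threshold are assumptions.
OUT=${OUT:-/var/tmp/zfs-stall}
THRESHOLD=${THRESHOLD:-3}

# snapshot LABEL: save the outputs listed above as $OUT/LABEL.*
snapshot() {
    mkdir -p "$OUT" || return 1
    sysctl -a 2>/dev/null | grep zfs > "$OUT/$1.sysctl-zfs" || true
    for flag in z m s i; do
        # Keep going even if one invocation fails.
        vmstat -"$flag" > "$OUT/$1.vmstat-$flag" 2>&1 || true
    done
}

# gap_seconds PREV NOW INTERVAL: seconds lost beyond the expected sleep.
gap_seconds() {
    echo $(( $2 - $1 - $3 ))
}

# watch_stalls: log (and snapshot) whenever "sleep 1" overruns badly.
watch_stalls() {
    prev=$(date +%s)
    while sleep 1; do
        now=$(date +%s)
        lost=$(gap_seconds "$prev" "$now" 1)
        if [ "$lost" -ge "$THRESHOLD" ]; then
            echo "$(date): stall of roughly ${lost}s" >> "$OUT/stalls.log"
            snapshot "after-$now"
        fi
        prev=$now
    done
}
```

Typical use would be "snapshot before" ahead of the workload that triggers
the problem, then "watch_stalls" left running in another session.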

Basically, every person who experiences this problem needs to treat
their situation as unique -- no "me too" -- and try to find a 100%
reliable test case for it.  That's the only way bugs of this nature
(i.e. of a complex nature) get fixed.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |