ZFS "stalls" -- and maybe we should be talking about defaults?

Tue Mar 5 05:40:40 UTC 2013

In article <8C68812328E3483BA9786EF15591124D at multiplay.co.uk>,
killing at multiplay.co.uk writes:

>Now, interesting you should say that: I've seen a stall recently on a
>ZFS-only box running on 6 x SSD RAIDZ2.
>
>The stall was caused by a fairly large mysql import, with nothing else
>running.
>
>When it happened I thought the machine had wedged, but minutes (not
>seconds) later, everything sprang into action again.

I have certainly seen what you might describe as "stalls", caused, so
far as I can tell, by kernel memory starvation.  I've seen it take as
much as half an hour to recover from these (which is too long for my
users).  Right now I have the ARC limited to 64 GB (on a 96 GB file
server) and that has made it more stable, but it's still not behaving
quite as I would like, and I'm looking to put more memory into the
system (to be used for non-ARC functions).  Looking at my munin
graphs, I find that backups in particular put very heavy pressure on
kernel memory, doubling the UMA allocations over steady-state, and it
takes about four or five hours for them to climb back down.  See
<http://people.freebsd.org/~wollman/vmstat_z-day.png> for an example.
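
For anyone wanting to try a similar cap, here is a minimal Python
sketch of the arithmetic.  It assumes the limit is set through the
vfs.zfs.arc_max loader tunable in /boot/loader.conf and that "GB"
means GiB; neither detail is stated above, and the helper name is
made up for illustration.

```python
GIB = 1024 ** 3  # assumption: "GB" above is read as GiB


def arc_max_loader_line(arc_gib: int) -> str:
    """Return a loader.conf line capping the ZFS ARC at arc_gib GiB.

    Hypothetical helper; it only formats the value in bytes.
    """
    return f'vfs.zfs.arc_max="{arc_gib * GIB}"'


if __name__ == "__main__":
    total_gib = 96  # RAM in the file server described above
    arc_gib = 64    # ARC cap described above
    print(arc_max_loader_line(arc_gib))
    print(f"left for non-ARC use: {total_gib - arc_gib} GiB")
```

The second line printed is the headroom that has to cover everything
else in the kernel and userland, which is why adding RAM without
raising the cap is one way to relieve the pressure.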

Some of the stalls are undoubtedly caused by internal fragmentation
rather than actual data in use.  (Solaris used to have this issue, and
some hooks were added to allow some amount of garbage collection with
the cooperation of the filesystem.)
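
One rough way to see how much memory sits in zones without holding
live data is to compare the USED and FREE item counts from
"vmstat -z".  The Python sketch below is only an illustration: the
sample text is made up, the column layout (SIZE, LIMIT, USED, FREE,
...) is assumed from typical FreeBSD output, and counting cached free
items is a proxy for, not a measurement of, fragmentation.

```python
# Made-up sample in the assumed "vmstat -z" layout; not real data.
SAMPLE = """\
ITEM                   SIZE  LIMIT     USED     FREE      REQ FAIL SLEEP

UMA Kegs:               208,      0,     101,       1,     101,   0,   0
zio_buf_4096:          4096,      0,    1000,    3000,   90000,   0,   0
arc_buf_hdr_t:          216,      0,    5000,     400,  120000,   0,   0
"""


def zone_usage(text):
    """Yield (zone, bytes_in_use, bytes_held_free) for each zone line."""
    for line in text.splitlines():
        name, sep, rest = line.rpartition(":")
        if not sep:
            continue  # header or blank line: no "name:" prefix
        fields = [f.strip() for f in rest.split(",")]
        if len(fields) < 4:
            continue  # not a zone line in the assumed layout
        size, _limit, used, free = (int(f) for f in fields[:4])
        yield name.strip(), size * used, size * free


def totals(text):
    """Return (total bytes in use, total bytes held in free items)."""
    used = free = 0
    for _zone, in_use, held_free in zone_usage(text):
        used += in_use
        free += held_free
    return used, free


if __name__ == "__main__":
    for zone, in_use, held_free in zone_usage(SAMPLE):
        print(f"{zone:20s} in use {in_use:>10d}  free {held_free:>10d}")
    print("totals (in use, free):", totals(SAMPLE))
```

A zone whose free bytes dwarf its in-use bytes long after a backup
has finished would be a candidate for the kind of reclamation hooks
mentioned above.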

-GAWollman