ZFS "stalls" -- and maybe we should be talking about defaults?

Tue Mar 5 03:52:51 UTC 2013

On Mon, 2013-03-04 at 20:58 -0600, Karl Denninger wrote:
> Stick this in /boot/loader.conf and see if your lockups go away:
> 
> vfs.zfs.write_limit_override=1024000000
> 

K.
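
For reference, a minimal sketch (not anything from Karl's setup) that
renders the tunables mentioned in this thread as loader.conf lines and
shows what the byte values work out to in binary units. The helper
names are hypothetical; the two byte values and the 25% ARC rule of
thumb are the ones quoted further down. Note that 1024000000 bytes is
a little under 1 GiB.

```python
# Hypothetical helpers for sizing the ZFS tunables discussed in this thread.

GIB = 1024 ** 3  # bytes in one GiB


def loader_conf_line(name: str, value_bytes: int) -> str:
    """Format one /boot/loader.conf assignment, in the form used in the thread."""
    return f"{name}={value_bytes}"


def as_gib(value_bytes: int) -> float:
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(value_bytes / GIB, 2)


def arc_max_for(ram_bytes: int, fraction: float = 0.25) -> int:
    """ARC cap as a fraction of RAM (25% is the rule of thumb quoted below)."""
    return int(ram_bytes * fraction)


# Byte values exactly as quoted in the thread.
TUNABLES = {
    "vfs.zfs.write_limit_override": 1024000000,
    "vfs.zfs.arc_max": 2000000000,
}

if __name__ == "__main__":
    for name, value in TUNABLES.items():
        print(f"{loader_conf_line(name, value)}    # about {as_gib(value)} GiB")
    # 256 GiB of RAM at 25% gives a 64 GiB ARC cap.
    print(arc_max_for(256 * GIB) // GIB, "GiB")
```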

> I've got a "sentinel" running that watches for zero-bandwidth zpool
> iostat 5s that has been running for close to 12 hours now and with the
> two tunables I changed it doesn't appear to be happening any more.
> 

I've also done this, as well as watching top and systat -vmstat. Disk
I/O stops but the system stays alive as seen through top, systat, and
the network. However, if I try to log in, the login won't complete.
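
A sentinel of the kind described above could look roughly like the
sketch below. This is not Karl's script, just an illustration. It
assumes the usual seven-column `zpool iostat` row (pool, alloc, free,
read ops, write ops, read bandwidth, write bandwidth) and that a zero
is printed as a bare "0". The pool name "tank" and the function names
are made up.

```python
import subprocess
import sys


def parse_bandwidth(line: str):
    """Return (pool, read_bw, write_bw) from one `zpool iostat` data row.

    Returns None for header lines, separator lines, and blank lines.
    Assumes the seven-column layout described above; a pool literally
    named "pool" would be mistaken for the header.
    """
    fields = line.split()
    if len(fields) != 7:
        return None
    if fields[0] == "pool" or set(fields[0]) == {"-"}:
        return None
    return fields[0], fields[5], fields[6]


def is_stalled(row) -> bool:
    """True when both bandwidth columns of a parsed row read zero."""
    _pool, read_bw, write_bw = row
    return read_bw == "0" and write_bw == "0"


def watch(pool: str, interval: int = 5) -> None:
    """Run `zpool iostat <pool> <interval>` and report zero-bandwidth samples."""
    proc = subprocess.Popen(
        ["zpool", "iostat", pool, str(interval)],
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in proc.stdout:
        row = parse_bandwidth(line)
        if row is not None and is_stalled(row):
            print(f"zero bandwidth on {row[0]}", file=sys.stderr)


if __name__ == "__main__":
    # "tank" is a placeholder pool name.
    watch(sys.argv[1] if len(sys.argv) > 1 else "tank")
```

On a pool that always has some write traffic, any flagged sample is
worth a look. On an idle pool, zeros are normal and this check would
just be noise.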

All of my systems are hardware RAID1 for the OS (LSI and Areca) and
typically a separate disk for swap. All other disks are ZFS.

> This system always has small-ball write I/Os going to it as it's a
> postgresql "hot standby" mirror backing a VERY active system and is
> receiving streaming logdata from the primary at a colocation site, so
> the odds of it ever experiencing an actual zero for I/O (unless there's
> a connectivity problem) are pretty remote.
> 

I am doing multi-TB sorts and GB database loads.

> If it turns out that the write_limit_override tunable is the one
> responsible for stopping the hangs I can drop the ARC limit tunable
> although I'm not sure I want to; I don't see much if any performance
> penalty from leaving it where it is and if the larger cache isn't
> helping anything then why use it?  I'm inclined to stick an SSD in the
> cabinet as a cache drive instead of dedicating RAM to this -- even
> though it's not AS fast as RAM it's still MASSIVELY quicker than getting
> data off a rotating plate of rust.
> 

I forgot to mention that my three 8.3 systems occasionally offline a
disk (one or two a week, total). I simply online the disk and after
resilver all is well. There are ~40 disks across those three systems.
Of my 9.1 systems, three are busy but have a smaller number of disks
(about eight across two volumes: RAIDz2 and mirror).

I also have a ZFS-on-Linux (CentOS) system for play (about 12 disks). It
did not exhibit problems when it was in use but it did teach me a lesson
on the evils of dedup. :)

> Am I correct that a ZFS filesystem does NOT use the VM buffer cache at all?
> 

Dunno.

> On 3/4/2013 8:07 PM, Dennis Glatting wrote:
> > I get stalls with 256GB of RAM with arc_max=64G (my limit is usually
> > 25%) on a 64 core system with 20 new 3TB Seagate disks under LSI2008 chips
> > without much load. Interestingly pbzip2 consistently created a problem
> > on a volume whereas gzip does not.
> >
> > Here, stalls happen across several systems; however, I have had fewer
> > problems under 8.3 than 9.1. If I go to hardware RAID5 (LSI2008 -- same
> > chips: IR vs IT) I don't have a problem.
> >
> >
> >
> >
> > On Mon, 2013-03-04 at 16:48 -0600, Karl Denninger wrote:
> >> Well now this is interesting.
> >>
> >> I have converted a significant number of filesystems to ZFS over the
> >> last week or so and have noted a few things.  A couple of them aren't so
> >> good.
> >>
> >> The subject machine in question has 12GB of RAM and dual Xeon
> >> 5500-series processors.  It also has an ARECA 1680ix in it with 2GB of
> >> local cache and the BBU for it.  The ZFS spindles are all exported as
> >> JBOD drives.  I set up four disks under GPT, each with a single labeled
> >> freebsd-zfs partition; the providers are then geli-encrypted and added
> >> to the pool.  When the same disks were running
> >> on UFS filesystems they were set up as a 0+1 RAID array under the ARECA
> >> adapter, exported as a single unit, GPT labeled as a single pack and
> >> then gpart-sliced and newfs'd under UFS+SU.
> >>
> >> Since I previously ran UFS filesystems on this config I know what
> >> performance level I achieved with that, and the entire system had been
> >> running flawlessly set up that way for the last couple of years.
> >> Presently the machine is running 9.1-Stable, r244942M.
> >>
> >> Immediately after the conversion I set up a second pool to play with
> >> backup strategies to a single drive and ran into a problem.  The disk I
> >> used for that testing is one that previously was in the rotation and is
> >> also known good.  I began to get EXTENDED stalls with zero I/O going on,
> >> some lasting for 30 seconds or so.  The system was not frozen but
> >> anything that touched I/O would lock until it cleared.  Dedup is off,
> >> incidentally.
> >>
> >> My first thought was that I had a bad drive, cable or other physical
> >> problem.  However, searching for that proved fruitless -- there was
> >> nothing being logged anywhere -- not in the SMART data, not by the
> >> adapter, not by the OS.  Nothing.  Sticking a digital storage scope on
> >> the +5V and +12V rails didn't disclose anything interesting with the
> >> power in the chassis; it's stable.  Further, swapping the only disk that
> >> had changed (the new backup volume) with a different one didn't change
> >> behavior either.
> >>
> >> The last straw was when I was able to reproduce the stalls WITHIN the
> >> original pool against the same four disks that had been running
> >> flawlessly for two years under UFS, and still couldn't find any evidence
> >> of a hardware problem (not even ECC-corrected data returns.)  All the
> >> disks involved are completely clean -- zero sector reassignments, the
> >> drive-specific log is clean, etc.
> >>
> >> Attempting to cut back the ARECA adapter's aggressiveness (buffering,
> >> etc) on the theory that I was tickling something in its cache management
> >> algorithm that was pissing it off proved fruitless as well, even when I
> >> shut off ALL caching and NCQ options.  I also set
> >> vfs.zfs.prefetch_disable=1 to no effect.  Hmmmm...
> >>
> >> Last night after reading the ZFS Tuning wiki for FreeBSD I went on a
> >> lark and limited the ARC cache to 2GB (vfs.zfs.arc_max=2000000000), set
> >> vfs.zfs.write_limit_override to 1024000000 (1GB) and rebooted.
> >>
> >> The problem instantly disappeared and I cannot provoke its return even
> >> with multiple full-bore snapshot and rsync filesystem copies running
> >> while a scrub is being done.
> >>
> >> I'm pinging between being I/O and processor (geli) limited now in normal
> >> operation and slamming the I/O channel during a scrub.  It appears that
> >> performance is roughly equivalent to, maybe a bit less than, what it was
> >> with UFS+SU -- but it's fairly close.
> >>
> >> The operating theory I have at the moment is that the ARC cache was in
> >> some way getting into a near-deadlock situation with other memory
> >> demands on the system (there IS a Postgres server running on this
> >> hardware although it's a replication server and not taking queries --
> >> nonetheless it does grab a chunk of RAM) leading to the stalls. 
> >> Limiting its grab of RAM appears to have resolved the contention
> >> issue.  I was unable to catch it actually running out of free memory
> >> although it was consistently into the low five-digit free page count and
> >> the kernel never garfed on the console about resource exhaustion --
> >> other than a bitch about swap stalling (the infamous "more than 20
> >> seconds" message.)  Page space in use near the time in question (I could
> >> not get a display while locked as it went to I/O and froze) was not
> >> zero, but pretty close to it (a few thousand blocks.)  That the system
> >> was driven into light paging does appear to be significant and
> >> indicative of some sort of memory contention issue as under operation
> >> with UFS filesystems this machine has never been observed to allocate
> >> page space.
> >>
> >> Anyone seen anything like this before and if so.... is this a case of
> >> bad defaults or some bad behavior between various kernel memory
> >> allocation contention sources?
> >>
> >> This isn't exactly a resource-constrained machine running x64 code with
> >> 12GB of RAM and two quad-core processors in it!
> >>
> >
> 

-- 
Dennis Glatting <dg at pki2.com>