ZFS hangs with 8.2-release

Jeremy Chadwick freebsd at jdc.parodius.com
Thu Dec 15 16:12:09 UTC 2011


On Thu, Dec 15, 2011 at 10:42:20AM -0500, Dan Pritts wrote:
> Hi all, as a followup to my notes from last week.
> 
> Short answer, I have followed most or all of the list's suggestions
> and I still get crashes when scrubbing.  In fact, it is now reliably
> crashing after <10 minutes.
> 
> Does anyone have any other suggestions?  Are the ZFS devs here, and
> would crash dumps be useful?
> 
> 
> Below are my responses to specific things that folks suggested.
> 
> 
> >do a memory test
> my colleague reminded me that we have run a test in the last month
> or two, since we started troubleshooting this.  24 hours with
> memtest86+ with no errors reported.  FWIW this system was stable
> running Solaris for several years.
> 
> >Recommendations to upgrade to 8.2-STABLE and then polite
> >explanations after I did it wrong
> We've upgraded to 8.2-STABLE and applied the 1-line patch suggested
> by Adam McDougall.
> 
> >FreeBSD netflow3.internet2.edu 8.2-STABLE FreeBSD 8.2-STABLE #1:
> >Mon Dec 12 15:45:06 UTC 2011
> >root at netflow3.internet2.edu:/usr/obj/usr/src/sys/GENERIC  amd64
> 
> And many recommendations from Adam McDougall that resulted in the
> following /boot/loader.conf.  I also tried removing all of the zfs
> and vm lines, same problems.
> 
> I think that something in here is causing the lockups - with the
> empty loader.conf it reboots instead of locking.
> >verbose_loading="YES"
> >rootdev="disk16s1a"
> >
> >#I have 16G of Ram
> >
> >vfs.zfs.prefetch_disable=1
> >vfs.zfs.txg.timeout="5"
> >vfs.zfs.arc_min="512M"
> >vfs.zfs.arc_max="4G"
> >vm.kmem_size="32G"

These settings are incorrect by my standards.  The Subject says
8.2-RELEASE, but since you've already moved to 8.2-STABLE, I would
strongly recommend you stay with that.  Regardless of which you run,
these are the settings you should be using in /boot/loader.conf:

vfs.zfs.arc_max="4G"

You could increase this value to 8G if you wanted, or maybe even 12G
(and possibly larger, but I would not recommend above 14G).  There is
"an art" to tuning this variable, as memory fragmentation and other
things I'd rather not get into can cause the ARC size to exceed that
variable sometimes (this is even further addressed in 8.2-STABLE --
consider it another reason to run that instead of -RELEASE).  So, you
need to "give it some headroom".

The extra "art" involved is that you want to make sure you don't give
too much memory to the ARC; e.g. if you have a big/fat mysqld running on
that system, you should probably diminish the ARC size so that you have
a "good balance" between what MySQL can/will use (based on mysql tunings
and some other loader.conf tunings) and what's available for ARC/kernel.
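
Purely as a hypothetical illustration of that balance (the numbers
below are made up, not a recommendation for your box):

  # 16G RAM, hypothetical split:
  #   mysqld (buffer pools, per-connection buffers)   ~6G
  #   kernel + everything else + headroom             ~4G
  #   ZFS ARC cap                                     ~6G
  vfs.zfs.arc_max="6G"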

Start small (4GB with a 16GB RAM system is fine), see how things are,
then increase it.  With a 16GB system I would go with 4GB -> 8GB ->
10GB, with about a week in between each test.  DO NOT pick a value like
16GB or 15GB; again, it's better to be safe than sorry, else you'll
experience a kernel panic.  :-)
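
Each step is just an edit to /boot/loader.conf and a reboot; for the
second step in that progression it would look something like:

  # /boot/loader.conf -- cap for the next week-long test
  vfs.zfs.arc_max="8G"

  # after the reboot, confirm the tunable took (value shown in bytes)
  sysctl vfs.zfs.arc_max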

Further comments (see the sysctl commands after this list to check the
current values):

1. vfs.zfs.txg.timeout defaults to 5 now.
2. There is no point in messing with vfs.zfs.arc_min.  It seems to
be calculated on its own, and reliably so.
3. vm.kmem_size should not be adjusted/touched at all.  Messing about
with this *could* cause a reboot or possibly stability problems (but the
latter would show up differently, not just a reboot).
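
You can read all three off a running system to confirm the defaults
and see what the kernel picked on its own:

  sysctl vfs.zfs.txg.timeout    # should already be 5
  sysctl vfs.zfs.arc_min        # auto-calculated; no need to override
  sysctl vm.kmem_size           # auto-sized; leave it alone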

Next let's talk about vfs.zfs.prefetch_disable="1".

For a very long time (years?), I strongly advocated this setting (which
disables prefetching).  The performance on my home storage system, as
well as our production systems in our co-lo, suffered when prefetching
was enabled; I/O throughput was generally blah.  You can find old posts
from me on the mailing list, and many posts from me on the web talking
about it and advocating this setting.

However, we have since removed that tunable entirely and leave
prefetching enabled.  We haven't noticed any massive performance loss
as such, so it's
very likely something was changed/improved in this regard.  Maybe ZFSv28
is what did it, I really don't know (meaning I am not sure which commit
may have addressed it).

Prefetching being enabled or disabled has absolutely no bearing on
stability, other than your drives may get taxed a tiny bit more (more
data read into the ARC in advance).

However, if you have the time (after you get this lock-up problem
solved), you can play with the setting if you want.  Find what works
best for you with your workload.  Be aware you should change the setting
and then let it sit for about a week or so if possible, to get a full
feel for the difference.
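
If you do experiment, here is a sketch of the toggle plus the counters
that show whether prefetch is earning its keep (on 8.x treat the
tunable as a loader.conf setting, so plan on a reboot per change):

  # /boot/loader.conf -- one test period with prefetching disabled
  vfs.zfs.prefetch_disable=1

  # current state and prefetch hit/miss counters from arcstats
  sysctl vfs.zfs.prefetch_disable
  sysctl kstat.zfs.misc.arcstats.prefetch_data_hits
  sysctl kstat.zfs.misc.arcstats.prefetch_data_misses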

Next let's talk about dedupe as well as compression.

I recommend not enabling either one of these unless you absolutely want
them/need them **and** are willing to suffer from sporadic "stalls" on
the system during ZFS I/O.  The stalling is roughly twice as bad if
you enable both.  "Stalls" means that while ZFS is writing (dedupe) or
writing/reading (compression), things like typing via SSH, or on the
console, or doing ANYTHING on the system, simply stop and then catch up
once ZFS finishes its work.  This is a known problem and has to do with
lack of "prioritisation queue" code for dedupe/compression on ZFS for
FreeBSD.  Solaris does not have this problem (it was solved by
implementing said prio queue thing).  I can refer you to the exact post
from Bob Friesenhahn on this subject if you wish to read it.  There is
no ETA on getting this fixed in FreeBSD (meaning I have seen no one
discuss fixing it or anything of that sort).
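
If you want to watch a stall happen, keep an eye on pool and disk I/O
while one hits; e.g. (pool name is made up):

  # per-vdev I/O, refreshed every second
  zpool iostat -v tank 1

  # per-disk latency/queue view from GEOM
  gstat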

Both of these features will tax your system CPU and memory more than if
you didn't use them.  If you do wish to use compression, I recommend
using the lzjb algorithm rather than gzip, as it diminishes the
stalling by quite a bit -- but it's still easily noticeable.
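
For reference, compression is a per-dataset property and only applies
to newly-written data; something like (dataset name is made up):

  zfs set compression=lzjb tank/data
  zfs get compression,dedup tank/data   # confirm what's actually on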

Finally, let's talk about your system problem:

Can you take ZFS out of the picture on this system?  If so, please do.
That would be a great way to start.  But I will give you my opinion: I
strongly doubt ZFS is responsible for your problem.  ZFS is probably
"tickling" another problem.

I'm inclined to believe your problem is hardware-related, or (and this
continues to get my vote because of the non-stop problems I keep
reading about with these damn controllers) firmware- or driver-related,
pertaining to your mpt(4) cards.

Let me recall what Dan said initially -- you have to read very closely
to understand the implications.  Quote:

> internal LSI mpt-driver hardware raid for boot.
> 3x LSI parallel-scsi cards for primary storage.  48 SATA disks
> attached.  Using Infortrend RAIDs as JBODs.

So you effectively have 4 LSI cards in this system.  Would you like me
to spend a few hours digging through mailing lists and PRs listing off
all the problems people continually report with either mpt(4), mps(4) or
mfi(4) on FreeBSD, ESPECIALLY when ZFS is in use?  Heck, there were even
commits done not too long ago to one of the drivers "to help relieve
problems when heavy I/O happens under ZFS".

Then there's the whole debacle with card firmware versions (and you
have FOUR cards!  :-) ).  Some people report problems with some firmware
versions, while others work great.  Then there's the whole
provided-by-FreeBSD vs. provided-by-LSI driver ordeal.  I don't even
want to get into this nonsense -- seriously, it's all on the mailing
lists, and it keeps coming up.  It would take me, as I said, hours to
put it all together and give you *LOTS* of references.
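
When you do start digging, the first thing anyone will ask for is the
exact card, driver and firmware details; roughly:

  # what the kernel probed for the mpt(4) devices at boot
  dmesg | grep -i mpt

  # PCI IDs for the LSI cards
  pciconf -lv | grep -B3 -i lsi

  # mptutil(8) in base can also report adapter/firmware info
  mptutil show adapter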

Finally, there is ALWAYS the possibility of bad hardware.  I don't mean
RAM -- I'm talking weird motherboard problems that are exacerbated when
doing lots of PCIe I/O, or drawing too much power -- neither of these
would be stress-tested by memtest86, obviously.  The number of
possibilities is practically endless, I'm sorry to say.  Hardware
troubleshooting 101 says replace things piece-by-piece until you figure
it out.  :-(

Otherwise, I'd consider just running OpenIndiana on this system,
assuming their LSI card driver support is good.

One last thing: http://people.internet2.edu/~danno/zfs/ returns HTTP 403
Forbidden, so I have no idea what your photos/screen shots contained, if
anything.  :-(

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |


