8.1R possible zfs snapshot livelock?

Tue May 17 12:23:26 UTC 2011

On Tue, May 17, 2011 at 02:55:54PM +0300, Andriy Gapon wrote:
> on 17/05/2011 14:29 Jeremy Chadwick said the following:
> > On Tue, May 17, 2011 at 01:48:04PM +0300, Andriy Gapon wrote:
> >> on 17/05/2011 10:30 Jeremy Chadwick said the following:
> >>> On Tue, May 17, 2011 at 02:43:44AM -0400, Charles Sprickman wrote:
> >>>> Does this sound familiar to anyone running ZFS with snapshots?
> >>>
> >>> Yes, and is exactly why I don't use them.  :-)
> >>
> >> You put a smiley, but is this an attempt at FUD?
> > 
> > I wish it were.  
> 
> The reason I asked is that I could have easily answered "No, that's why I use them
> all the time".  And I am sure many people would join me on this.
> So the way you originally described the issue was sufficiently non-specific and
> strong.

You're absolutely right -- and to me, your answer/experience holds much
more weight than my own.  But if you and I were presenting advocacy of
ZFS snapshots to a person who had experienced problems with it, their
reluctance to believe would be understandable, no?  They'd want some
form of reassurance that the problem they experienced was known or had
been fixed in some way.

I guess what I'm saying is that yes, my wording was strong -- it was an
opinion based on past experience.  Fact: I don't have any present-day
evidence to validate my opinion, since the ZFS code has changed greatly
between then and now.  But also fact: I did experience something very
similar to what Charles did.

Sympathy is sometimes all we admins/users have in situations like this.
:-)  But I do understand your point.

> > I experienced similar behaviour to Charles during the
> > early 8.x days (possibly 8.1-RELEASE, I forget; I may be thinking of
> > 8.0?) where ZFS snapshots would occasionally result in the kernel
> > deadlocking on ZFS-bound I/O.  The kernel was alive/responsive to some
> > degree but ZFS I/O would just indefinitely stall at that point,
> > requiring a full system reset.  No disk or controller problems (same
> > hardware I'm using today actually!).
> > 
> > I believe there were commits and improvements for snapshotting committed
> > between 8.1-RELEASE and 8.2-RELEASE, but I haven't bothered to test
> > them.  The experience left a very bad taste in my mouth and as such I
> > have avoided ZFS snapshots since.
> > 
> > I'd be willing to try them again assuming someone can at least confirm
> > that there were commits done to address snapshot concerns during the
> > past year or so.  But...
> > 
> > There are still some outstanding incidents that directly pertain to ZFS
> > snapshots, or are "related" to ZFS snapshots (meaning things like
> > send/recv which are commonly used alongside snapshots), which I remember
> > reading about but really saw no answer to:
> > 
> > * ZFS send | ssh zfs recv results in ZFS subsystem hanging; 8.1-RELEASE;
> >   February 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-February/010602.html
> > 
> > * Kernel panic during heavy disk I/O while "zfs recv" being used
> >   simultaneously; CURRENT (so ZFS v28?); April 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-April/011155.html
> > 
> > * ZFS snapshots taking an extremely long time to be deleted; RELENG_8_1;
> >   February 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-February/010797.html
> > 
> > * "zfs destroy -r" not working on filesystem-level snapshots but works
> >   on pool-level snapshots; RELENG_8 with ZFS v28 patch (and is specific
> >   to ZFS v28 given the info); May 2011:
> >   http://lists.freebsd.org/pipermail/freebsd-fs/2011-May/011412.html
> > 
> > Sorry to just rattle off a bunch of URLs and issues at once; it's not my
> > intention to slander work on ZFS or anything even remotely like that.
> > 
> > I'm just wondering given the number of problem reports that seem to come
> > in about snapshot or snapshot-related ZFS stuff, where we stand on
> > these?  This is mainly for Charles' benefit and not so much mine (our
> > rsnapshot/rsync-based backups work great for us at this time, sans the
> > stomping of atime).
> > 
> 
> Problem reports are always over-represented on the mailing lists.
> People rarely write that e.g. ZFS snapshot has flawlessly worked for them for the
> millionth time again today.  I am not aware of any known-but-not-fixed issues in
> this area.  Each problem report should be properly investigated individually.

Both absolutely correct and understood.

It just really sucks to be one of the people who experiences problems.
When you have a system that you've taken a lot of time to get up and
working, it runs reliably for weeks/months, then suddenly something like
the above happens, and you have to start weighing the pros and cons of
alternatives (using something other than snapshot capability, changing
filesystems, etc.).

It would help if folks had some guidelines for what information would be
helpful for kernel developers in the case of a ZFS deadlock of this
nature.  I would say the majority of the admin/user community (and this
includes me!), once at a "db>" prompt, have no clue how to proceed.

So for Charles' situation, the next time it happens what would be useful
for him to provide?  The best I could come up with was to induce doadump
then reboot to get the system up/working again, and then use kgdb
after-the-fact.
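
Roughly, that sequence would look like the sketch below.  This is only
my best guess at a workflow, not developer-endorsed guidance.  It
assumes a dump device was configured ahead of time (dumpdev in
rc.conf), that savecore wrote the dump to /var/crash, and that the dump
number is 0; the paths and the number are placeholders.

```
# /etc/rc.conf, set before the problem occurs so a dump device exists
dumpdev="AUTO"

# If userland still responds during the stall, kernel stacks of all
# threads can be captured first
procstat -kk -a > /var/tmp/procstat-kk.txt

# At the DDB prompt: write the crash dump, then reboot
db> call doadump
db> reset

# After reboot, once savecore has stored the dump (path is an assumption)
kgdb /boot/kernel/kernel /var/crash/vmcore.0
(kgdb) bt
```

If the ZFS stall also blocks writes to /var/tmp, the procstat output
would need to go somewhere off-pool (serial console, a UFS filesystem,
or over ssh).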

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |