EBS snapshot backups from a FreeBSD zfs file system: zpool freeze?

Jeremy Chadwick jdc at koitsu.org
Thu Jul 4 10:32:45 UTC 2013


On Thu, Jul 04, 2013 at 09:47:26AM +0100, Steven Hartland wrote:
> 
> ----- Original Message ----- From: "Jeremy Chadwick"
> <jdc at koitsu.org>
> To: "Steven Hartland" <killing at multiplay.co.uk>
> Cc: "Berend de Boer" <berend at pobox.com>; "freebsd-fs" <freebsd-fs at freebsd.org>
> Sent: Thursday, July 04, 2013 9:22 AM
> Subject: Re: EBS snapshot backups from a FreeBSD zfs file system: zpool freeze?
> 
> 
> >On Thu, Jul 04, 2013 at 09:06:57AM +0100, Steven Hartland wrote:
> >>----- Original Message ----- From: "Berend de Boer"
> >><berend at pobox.com>
> >>Jeremy>   Also, because nobody seems to warn others of this: if
> >>Jeremy> you go the ZFS route on FreeBSD, please do not use
> >>Jeremy> features like dedup or compression.
> >>
> >>While dedup is memory and sometimes CPU hungry, so HW spec
> >>should be considered before using it, compression is not, and
> >>I've not seen any valid reason not to use it should it fit
> >>your use case.
> >>
> >>We actually use compression extensively here and we've
> >>had nothing but positive results from it, so this sounds like
> >>FUD to me.
> >
> >The problem (the lack of separate, prioritised write threads for
> >dedup and compression, which causes interactivity stalls) is not
> >FUD, it's fact.  I explained this in the part of my reply to Berend
> >which you omitted, which included the proof and acknowledgement from
> >folks who are in-the-know (Bob Friesenhahn).  :/  Nobody has told me
> >"yeah, that got fixed", so there is no reason for me to believe
> >anything has changed.
> 
> Do you have any links to the discussion on this, Jeremy, as I'd be
> interested to read up on it when I have some spare time?

Warning up front: sorry for the long mail (I did try to keep it terse)
but most of it is demonstrating the problem.

Here are the most useful FreeBSD links, specifically the conversations
I've had over the years about this problem.  The first one is probably
the most relevant, since it's a statement from Bob himself explaining
the issue:

http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012726.html
http://lists.freebsd.org/pipermail/freebsd-fs/2011-October/012752.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072171.html
http://lists.freebsd.org/pipermail/freebsd-stable/2013-February/072178.html

To be clear (note the date and version): as of September 2011 I was able
to reproduce the problem on stable/8.

While you were writing your mail, I was off trying to track down the
technical details (specifically the source code changes in OpenSolaris
or its successors) that fixed it, i.e. what Bob alluded to.  I really
had to jab at search engines to find anything useful, and wasn't
getting anywhere until I found this:

http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/28192

This mentioned the OpenSolaris bug number 6586537.  I then poked about
svnweb and found that this fix was imported into FreeBSD with the "ZFS
version 15" import.  Commit log entry:

6586537	async zio taskqs can block out userland commands (142901-09)

Relevant revisions, dates, and branches for this:

r209962: Jul 2010: head:     http://svnweb.freebsd.org/base?view=revision&revision=209962
r212668: Sep 2010: stable/8: http://svnweb.freebsd.org/base?view=revision&revision=212668

And that head became stable/9 as of September 2011, I believe.

So my testing as of September 2011 would have included the fix for
6586537.  This makes me wonder if 6586537 is truly the issue I've been
describing or not.
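
For what it's worth, a quick way to sanity-check whether a given system
at least has the v15 import (this only confirms the import landed, not
that 6586537 necessarily covers what I'm describing):

# zpool upgrade -v | head -2    (reports the pool version the kernel supports; 15+ implies the import)
# zpool get version backups     (what version this particular pool is at)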

It's easy enough to test for on stable/9 today (zfs create with
compression enabled, run the dd, and in another window do stuff and see
what happens; zfs destroy later to clean up -- roughly the cycle
sketched below).  So let's see if it's still there almost 2 years
later...
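
For reference, the whole test cycle boils down to something like this
(the dataset name matches what I use below; swap the compression value
to taste):

# zfs create -o compression=lzjb -o mountpoint=/mnt backups/comptest
# dd if=/dev/zero of=/mnt/bigfile bs=64k
  (meanwhile, in another window: do interactive stuff and watch for stalls)
# zfs destroy backups/comptest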

Yup, still there, but it seems improved in some way, possibly due to a
combination of things.  This box is actually a C2Q (more powerful than
the one in Sep 2011) too, and is actively doing nothing.  Relevant bits:

# zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
backups  1.81T   463G  1.36T    24%  1.00x  ONLINE  -
data     2.72T   694G  2.04T    24%  1.00x  ONLINE  -
# zdb -C | grep ashift
            ashift: 12
            ashift: 12
# zfs create -o compression=lzjb -o mountpoint=/mnt backups/comptest
# zfs get all backups/comptest | grep compression
backups/comptest  compression           lzjb                   local

The "backups" pool is a single disk (WD20EFRX) running at SATA300 with
NCQ, backed by an Intel ICH9 in AHCI mode.  The disk is a 4K sector
drive where the gnop trick was used (proof above).  I could have used
the "data" pool (raidz1 driven by 3 disks (WD10EFRX) + gnop), but it
wouldn't matter -- the problem is consistent no matter what the pool.
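
(For anyone unfamiliar with the gnop trick mentioned above, it's
roughly the following at pool creation time -- the device name here is
just an example, substitute whatever the disk actually is:

# gnop create -S 4096 /dev/ada1
# zpool create backups /dev/ada1.nop
# zpool export backups
# gnop destroy /dev/ada1.nop
# zpool import backups

The .nop provider advertises a 4K sector size, so the pool gets created
with ashift=12, which sticks even after the pool is re-imported on the
bare device.)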

I can't demonstrate the problem using "while : ; do date ; sleep 1 ;
done" because sleep 1 isn't granular enough (yes I'm aware FreeBSD
sleep(1) supports more granularity) and because date/strftime doesn't
show microseconds.  So off into perl + Time::HiRes we go...

window1# date +%s ; dd if=/dev/zero of=/mnt/bigfile bs=64k
1372932367
^C123977+0 records in
123977+0 records out
8124956672 bytes transferred in 16.437748 secs (494286486 bytes/sec)
window2# perl -e 'use Time::HiRes qw(time sleep); $|=1; while(1) { print time, "\n"; sleep(0.2); }'

Now because even 0.2 seconds probably isn't granular enough, I ended up
pressing Enter in the middle of the running perl output every time I'd
notice that lines weren't coming across at consistent 0.2 second
intervals (I guess I have a good eye for this sort of thing).  So blank
lines are me noticing the pauses/delays I've been talking about:

1372932411.90407
1372932412.10415
1372932412.30513
1372932412.50614
1372932412.70713
1372932412.90813
1372932413.10913
1372932413.31013
1372932413.51112

1372932413.71213
1372932413.91315
1372932414.11413
1372932414.31513
1372932414.51615
1372932414.71714
1372932415.00015

1372932415.27278
1372932415.47316
1372932415.67416
1372932415.87514
1372932416.07615
1372932416.27715
1372932416.48115
1372932416.78215

1372932416.98614
1372932417.18717
1372932417.38814
1372932417.58912
1372932417.79016
1372932417.99115

1372932418.40577
1372932418.60617
1372932418.80715
1372932419.00813
1372932419.20913
1372932419.41013
1372932419.64116
1372932419.85516

1372932420.11614
1372932420.31716
1372932420.51813
1372932420.71913
1372932420.92016
1372932421.12115
1372932421.32216
1372932421.58213

1372932421.78316
1372932421.98416
1372932422.18515
1372932422.38613
1372932422.58713
1372932422.80118
1372932423.05617
1372932423.34016

1372932423.54116
1372932423.74215
1372932423.94314
1372932424.14415
1372932424.43316
1372932424.63417
1372932424.85514

1372932425.05613
1372932425.25715
1372932425.45813
1372932425.65913
1372932425.86017
1372932426.18416

1372932426.51216
1372932426.71312
1372932426.91413
1372932427.11515
1372932427.31613
1372932427.74915

1372932428.00214
1372932428.20315
1372932428.40415
1372932428.60514
1372932428.80613
1372932429.00713
1372932429.38115

1372932429.58214
1372932429.78316
1372932429.98417
1372932430.18519
1372932430.38614
1372932430.58713
1372932430.92817

1372932431.12914
1372932431.33012
1372932431.53115
1372932431.73214
1372932431.93313
1372932432.13413

1372932432.48115
1372932432.73414
1372932432.93514
1372932433.13616
1372932433.33713
1372932433.53817
1372932433.73915
1372932433.95151

1372932434.28214
1372932434.48316
1372932434.68414
1372932434.88515
1372932435.08614
1372932435.28712
1372932435.48916
1372932435.84146

1372932436.05013
1372932436.25117
^C

There's a fairly consistent pattern if you look closely: a pause about
every 8 lines of output.  Each line is 0.2 seconds apart, so roughly
every 1.6 seconds I'd see a pause lasting about 0.5 seconds.

And no, the above output *was not* being written to a file on ZFS, only
to stdout.  :-)
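
If anyone wants to reproduce this without relying on eyeballs: capture
the perl output to a file on a non-ZFS filesystem and feed it through
something like the one-liner below (the filename is hypothetical),
which prints the delta between consecutive timestamps so the gaps stand
out:

# perl -lne 'printf "%s  +%.3f\n", $_, $_ - $p if defined $p; $p = $_' /var/tmp/timestamps.txt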

What's interesting: I tried compression=gzip-9, which historically
behaved worse (I remember this clearly), but the stalls are about the
same.  Maybe that's because I'm using /dev/zero rather than
/dev/random; the catch is that /dev/random would tax the CPU (entropy,
etc.) more and skew the test.
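
If someone wants to rule out the /dev/zero factor without paying the
/dev/random CPU tax during the test itself, one option (a sketch; the
paths are just examples, and /var/tmp is assumed not to be on the
compressed pool) is to pre-generate incompressible data beforehand and
replay that into the compressed dataset:

# dd if=/dev/random of=/var/tmp/junk bs=64k count=16384
# dd if=/var/tmp/junk of=/mnt/bigfile bs=64k

That's roughly 1GB of random data; repeat or use a larger file if the
run needs to be longer.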

We didn't use compression at my previous job on Solaris (available CPU
time was very, very important given what the machines did), so I don't
have any context for comparison.

But: I can do this exact same procedure on the /backups filesystem/pool,
without compression of course, and there are no stalls -- just smooth
interactivity.

Now let me circle back to the conversation I had with Fabian earlier in
2013...

I have zero experience doing this "sched trace" stuff.  I do not speak
Python, but looking at /usr/src/tools/sched/schedgraph.py it appears to
do some kind of visual graphing (via X?  I have no clue from the code),
with "borders" and "colour" support -- this is not an X system, so
unless this Python script can generate image files somehow (I have no
image libraries installed on my system)...

My kernel does contain:

options         KTR
options         KTR_ENTRIES=262144
options         KTR_COMPILE=(KTR_SCHED)
options         KTR_MASK=(KTR_SCHED)

And I can follow the instructions at the top of the Python script and
provide the ktrdump somewhere if needed, but that's about it.  I don't
know if that would actually be of any help -- even though I have some
familiarity with userland profiling via the *_p.a libs, this is
something at a completely different level.

So if someone wants this, I need a bit of hand-holding to know what I'm
supposed to be doing.  The instructions in the Python script make me a
little wary, particularly since they don't say to re-set debug.ktr.mask
to 536870912 afterward, so I'm not sure what the implications of that
are.
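
For the record, my rough understanding of the capture procedure, based
on the script's header plus the mask reset it doesn't mention (so treat
this as my interpretation rather than gospel):

  (start the dd workload and wait for the stalls to show up)
# sysctl debug.ktr.mask=0             (stop tracing while the workload is still running)
# ktrdump -ct > /var/tmp/ktr.out      (dump the trace buffer with CPU numbers and timestamps)
# sysctl debug.ktr.mask=536870912     (re-enable KTR_SCHED tracing, i.e. 0x20000000)
  (then, on a box that has X/Tkinter: python schedgraph.py ktr.out)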

> >If a person considering use of compression on FreeBSD ZFS doesn't mind
> >that problem, then by all means use it.  It doesn't change the fact that
> >there's an issue, and one that folks should be made aware of up front.
> >It's not spreading FUD: it's spreading knowledge of a certain behaviour
> >that differs between FreeBSD and Solaris/Illumos.  The issue is a
> >deal-breaker for me; if it's not for you, great.
> 
> Sounds like it could well be use-case dependent then, as we've not had
> any problems with compression causing interactivity issues.  Quite the
> opposite, in fact: the reduced physical IO that compression results in
> has improved interactivity.
> 
> So I guess it's like everything: one size doesn't fit all, so tempering
> blanket statements about avoiding these features seems like the way
> to go :)

While I see the logic in what you're saying, I prefer to publicly
disclose the differences in behaviours between Illumos ZFS and FreeBSD
ZFS.

I'm well aware of the tremendous and positive effort to minimise those
differences (code-wise) -- I remember mm@ talking about this some time
ago -- but if this is somehow one of them, I do not see the harm in
telling people "FYI, there is this quirk/behavioural aspect specific to
FreeBSD that you should be aware of".

It doesn't mean ZFS on FreeBSD sucks, it doesn't mean it's broken, it
just means it's something that would completely surprise someone out of
the blue.  Imagine the thread: "my system intermittently stalls, even at
VGA console... does anyone know what's causing this?" -- I doubt anyone
would think to check ZFS.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


