ZFS sync / ZIL clarification

Peter Maloney peter.maloney at brockmann-consult.de
Mon Jan 30 07:47:55 UTC 2012


On 01/30/2012 05:30 AM, Mark Felder wrote:
> I believe I was told something misleading a few weeks ago and I'd like
> to have this officially clarified.
>
> NFS on ZFS is horrible unless you have sync = disabled. 
With ESXi = true.
With other clients = it depends on your definition of horrible.

> I was told this was effectively disabling the ZIL, which is of course
> naughty. Now I stumbled upon this tonight:
>
True only for the specific dataset you specified, e.g.:
zfs set sync=disabled tank/esxi

>> Just for the archives... sync=disabled won't disable the
>> ZIL, it'll disable waiting for a disk-flush on fsync etc.
Same thing... "waiting for a disk-flush" is the only time the ZIL is
used, from what I understand.

>> With a battery-backed controller cache, those flushes should go to
>> cache, and be pretty much free. You end up tossing away something for
>> nothing.
False, I guess. It would be nice, but how do you battery-back your RAM,
which ZFS uses as a write cache? (If you know something I don't,
please share.)
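
For reference, the sync property can be checked and changed per dataset;
something like this, reusing the example dataset from above:

# check the current setting: standard, always, or disabled
zfs get sync tank/esxi
# go back to the default behavior (honor sync writes / fsync)
zfs set sync=standard tank/esxi
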
>
> Is this accurate?

sync=disabled caused data corruption for me. So you do need a
battery-backed cache... unfortunately, the cache we are talking about is
in RAM, not in your IO controller. So put a UPS on the machine and you
are safe, except when you get a kernel panic (which is what caused my
corruption). But if you get something like the Gigabyte iRAM or the
Acard ANS-9010
<http://www.acard.com.tw/english/fb01-product.jsp?prod_no=ANS-9010&type1_title=%20Solid%20State%20Drive&idno_no=270>,
set it as your ZIL, and leave sync=standard, you should be safer. (I
don't know whether the iRAM works in FreeBSD, but someone
<http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html>
told me he uses the ANS-9010.)
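
Something like this is what I mean by setting it as the ZIL (the device
name is only a placeholder for whatever the iRAM or ANS-9010 shows up
as):

# add a dedicated log (ZIL) device to the pool
zpool add tank log /dev/da2
# keep the default sync semantics on the datasets
zfs set sync=standard tank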

And NFS with ZFS is not horrible, except with ESXi's built-in NFS client
that it uses for datastores. (The same someone who said he uses the
ANS-9010 also provides a 'patch' for the FreeBSD NFS server that
disables ESXi's stupid behavior without disabling sync entirely, but it
possibly also disables sync for other clients that use it responsibly
[a database, perhaps].)

Here
<http://www.citi.umich.edu/projects/nfs-perf/results/cel/write-throughput.html>
is a fantastic study about NFS; I don't know whether it resulted in
patches that are now in use, or how old it is [the newest reference is
2002, so at most 10 years old]. In my experience, the write caching in
use today still sucks. If I run async with sync=disabled, I still see a
huge improvement (20% on large files, up to 100% for smaller files
< 200 MB) using an ESXi virtual disk (with ext4 doing the write caching)
compared to NFS directly.
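
For comparison, a direct NFS mount from a Linux client looks something
like this (the server name and export path are placeholders):

# plain Linux NFS client mount; add "sync" to force synchronous writes
mount -t nfs -o noatime server:/tank/esxi /mnt/nfs
mount -t nfs -o sync,noatime server:/tank/esxi /mnt/nfs-sync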


Here begins the rant about ESXi, which may be off topic:

ESXi gets 7 MB/s with an SSD ZIL at 100% load, and 80 MB/s with a
ramdisk ZIL at 100% load (pathetic!).
Something I can't reproduce (I thought it was just a normal Linux client
with "-o sync" over 10 Gbps ethernet) got over 70 MB/s with the ZIL at
70-90% load.
Other clients mounted with "-o sync,noatime,..." or "-o noatime,..."
keep the ZIL at only 0-5% load (and only sporadically), but go faster
than 100 MB/s. I didn't test "async", and without "sync" they seem to go
the same speed.
Setting sync=disabled always gives around 100 MB/s and drops the load on
the ZIL to 0%.
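
If you want to watch the ZIL load yourself, something like this works
(the log device name is only an example):

# %busy of the log device on FreeBSD
gstat -f da2
# or per-vdev throughput, including the log vdev, every 5 seconds
zpool iostat -v tank 5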

The thing I can't reproduce might only have been possible on a pool that
I created with FreeBSD 8.2-RELEASE and then upgraded, which I no longer
have. Or maybe it was with "sync" but without "noatime".

I am going to test with 9000 MTU, and if it is not much faster, I am
giving up on NFS. My original plan was to use ESXi with a ZFS datastore
and a replicated backup. That works terribly using the ESXi NFS client.
Netbooting the OSes to bypass the ESXi client works much better, but is
still not good enough for many servers. NFS is poorly implemented, with
terrible write caching on the client side. Now my plan is to use FreeBSD
with VirtualBox and ZFS all in one system, and send replication
snapshots from there. I wanted to use ESXi, but I guess I can't.
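
For the jumbo frame test, the FreeBSD side would be something like the
following (the interface name is only an example; the switch and the
ESXi vmkernel port have to be set to MTU 9000 as well):

# enable jumbo frames on the storage NIC
ifconfig ix0 mtu 9000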

And the worst thing about ESXi is that if you have one client going at
7 MB/s, a second client has to share that 7 MB/s, and non-ESXi clients
will still go horribly slow. If you have 10 non-ESXi clients, each one
is limited to around 100 MB/s (again, I have only tested this with 1500
MTU so far), but together they can write much more.

Just now I tested 2 clients writing 100+100 MB/s (reported by GNU dd),
and 3 clients writing 50+60+60 MB/s (reported by GNU dd).
Output from "zpool iostat 5":
two clients:
tank        38.7T  4.76T      0  1.78K  25.5K   206M (matches 100+100)
three clients:
tank        38.7T  4.76T      1  2.44K   205K   245M (does not match
60+60+50)

(one client is a Linux netboot, and the others are using the Linux NFS
client)
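
The per-client numbers come from simple sequential writes, roughly like
this on each client (the mount point and size are only examples):

# sequential write test over NFS, reported by GNU dd
dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=4096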

But I am not an 'official', so this cannot be considered 'officially
clarified' ;)


> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"


-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney at brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------


