ZFS sync / ZIL clarification

Peter Maloney peter.maloney at brockmann-consult.de
Tue Jan 31 08:09:23 UTC 2012


On 01/30/2012 09:30 PM, Dennis Glatting wrote:
> On Mon, 2012-01-30 at 08:47 +0100, Peter Maloney wrote:
>> On 01/30/2012 05:30 AM, Mark Felder wrote:
>>> I believe I was told something misleading a few weeks ago and I'd like
>>> to have this officially clarified.
>>>
>>> NFS on ZFS is horrible unless you have sync = disabled. 
>> With ESXi = true
>> With others = it depends on your definition of horrible
>>
>>> I was told this was effectively disabling the ZIL, which is of course
>>> naughty. Now I stumbled upon this tonight:
>>>
>> True, but only for the specific dataset you specify, e.g.:
>> zfs set sync=disabled tank/esxi
>>
>>>> Just for the archives... sync=disabled won't disable the ZIL, it'll
>>>> disable waiting for a disk-flush on fsync etc.
>> Same thing... "waiting for a disk-flush" is the only time the ZIL is
>> used, from what I understand.
>>
>>>> With a battery-backed controller cache, those flushes should go to
>>>> cache, and be pretty much free. You end up tossing away something for
>>>> nothing.
>> False, I guess. It would be nice, but how do you battery-back your RAM,
>> which is what ZFS uses as a write cache? (If you know something I don't
>> know, please share.)
>>> Is this accurate?
>> sync=disabled caused data corruption for me. So you need to have a
>> battery-backed cache... unfortunately, the cache we are talking about is
>> in RAM, not in your I/O controller. So put a UPS on the box, and you are
>> safe except when you get a kernel panic (which is what happened to cause
>> my corruption). But if you get something like the Gigabyte iRAM or the
>> Acard ANS-9010
>> <http://www.acard.com.tw/english/fb01-product.jsp?prod_no=ANS-9010&type1_title=%20Solid%20State%20Drive&idno_no=270>,
>> set it as your ZIL, and leave sync=standard, you should be safer. (I
>> don't know if the iRAM works in FreeBSD, but someone
>> <http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html>
>> told me he uses the ANS-9010.)
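>>
>> Setting one up as a dedicated log device would look something like this
>> (a minimal sketch; "da6" and the dataset name are made up, adjust for
>> your pool):
>>
>>   zpool add tank log da6           # da6 = the battery-backed RAM drive
>>   zfs set sync=standard tank/esxi  # standard is the default anyway
>>
>> (or "zpool add tank log mirror da6 da7" if you have two of them)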
>>
>> And NFS with ZFS is not horrible, except with ESXi's built-in NFS client
>> that it uses for datastores. (The same someone who said he uses the
>> ANS-9010 also provides a 'patch' for the FreeBSD NFS server that
>> disables ESXi's sync-on-every-write behavior without disabling sync
>> entirely, but it possibly also disables it for other clients that use it
>> responsibly [a database, perhaps].)
>>
>> Here
>> <http://www.citi.umich.edu/projects/nfs-perf/results/cel/write-throughput.html>
>> is a fantastic study about NFS; I don't know whether it resulted in
>> patches now in use, or exactly how old it is (the newest reference is
>> from 2002, so at most 10 years). In my experience, the write caching in
>> use today still sucks. Even running async with sync=disabled, I still
>> see a huge improvement (20% on large files, up to 100% for smaller files
>> under 200 MB) when writing to an ESXi virtual disk (with ext4 doing the
>> write caching) compared to writing to NFS directly.
>>
>>
>> Here begins the rant about ESXi, which may be off topic:
>>
> ESXi 3.5, 4.0, 4.1, 5.0, or all of the above?
>
I didn't know 5.0.0 was available for free. Thanks for the notice.

My testing has been with 4.1.0 build 348481, but if you look around on
the net, you will find no official, sensible workarounds or fixes. VMware
doesn't even acknowledge that the issue is in the ESXi NFS client, even
though it is obvious, so I doubt the problem will be fixed any time soon.
Even using the "sync" option is discouraged, and ESXi actually does the
absolute worst thing and sends O_SYNC with every write (even when saving
the state of a VM; I turn sync off in ZFS while doing that, see the
commands below). Some groups have solutions that mitigate but do not
eliminate the problem. The issue also exists with other file systems and
platforms, but it seems worst on ZFS. I couldn't find anything equivalent
to those solutions that works on FreeBSD and ZFS. The closest is the
patch I mentioned above
(http://christopher-technicalmusings.blogspot.com/2011/06/speeding-up-freebsds-nfs-on-zfs-for-esx.html),
which could possibly result in data corruption for non-ESXi clients of
your NFS server that responsibly use the O_SYNC flag. I didn't test that
patch, because I would rather just throw away ESXi. I hate how much it
limits you (no software RAID, no file system choice, no rsync, no
firewall, no top or iostat, etc.). And it handles network interruptions
terribly... in some cases you need to reboot to get it to find all the
.vmx files again; in other cases, hacks work to reconnect the NFS mounts.
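
For completeness, the sync toggle around a VM state save is just
something like this (the dataset name is made up; "standard" is the ZFS
default, so the second command just puts it back):

  zfs set sync=disabled tank/esxi    # before saving the VM state
  zfs set sync=standard tank/esxi    # afterwards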

But many people simply switch to iSCSI. From what I've heard, iSCSI also
sucks on ESXi with the default settings, but a single setting fixes most
of the problem. I'm not sure whether this applies to FreeBSD or ZFS (I
didn't test it yet). Here are some pages from the StarWind forum (where
we can assume the servers are Windows-based):

Here they say "doing Write-Back Cache helps but not completely"
(Windows-specific):
http://www.starwindsoftware.com/forums/starwind-f5/esxi-iscsi-initiator-write-speed-t2398-15.html

And here is something (Windows-specific) about changing the ACK timing:
http://www.starwindsoftware.com/forums/starwind-f5/esxi-iscsi-initiator-write-speed-t2398.html

And here is another page that ended up in my bookmarks:
http://www.starwindsoftware.com/forums/starwind-f5/recommended-settings-for-esx-iscsi-initiator-t2296.html

Somewhere on those three pages, or linked from them (I can't find it
now), there are instructions to turn off "Delayed ACK" in ESXi:

in ESXi, click the host
click the "Configuration" tab
click "Storage Adapters"
find and select the "iSCSI Software Adapter"
click "Properties" (a blue link on the right, in the "Details" section)
click "Advanced" (the adapter must be enabled or this button is greyed out)
find the "Delayed ACK" option in there somewhere (at the end of the list
for me), and uncheck the box

And this is said to improve things considerably, but I didn't test iSCSI
at all on ESXi or ZFS.

I wanted to test iSCSI on ZFS, but I found zvols to be buggy, so I
decided to avoid them, and I am not very motivated to try again.

I guess I could work around the buggy zvols by using a file-backed md(4)
device (the FreeBSD equivalent of a loop device) instead of a zvol, as
sketched below... but I am always too busy. Give it a few months.
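
If I do try it, the rough idea would be one of these two backing stores
for the iSCSI target (just a sketch; the sizes, pool, and file names are
made up):

  # zvol (the variant that was buggy for me):
  zfs create -V 100G tank/iscsi-vol0

  # file-backed md(4) device instead:
  truncate -s 100G /tank/iscsi/disk0.img
  mdconfig -a -t vnode -f /tank/iscsi/disk0.img   # prints e.g. md0

and then export the zvol or the md device through the iSCSI target
daemon.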

>> ESXi gets 7 MB/s with an SSD ZIL at 100% load, and 80 MB/s with a
>> ramdisk ZIL at 100% load (pathetic!).
>> Something I can't reproduce (I thought it was just a normal Linux client
>> with "-o sync" over 10 Gbps Ethernet) got over 70 MB/s with the ZIL at
>> 70-90% load.
>> Other clients set to "-o sync,noatime,..." or "-o noatime,..." keep the
>> ZIL at only a random 0-5% load, but go faster than 100 MB/s. I didn't
>> test "async", and with or without "sync" they seem to go the same speed.
>> Setting sync=disabled always gives around 100 MB/s, and drops the load
>> on the ZIL to 0%.
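>>
>> For reference, the client mount options I mean look something like this
>> (the server name and export path are made up):
>>
>>   mount -t nfs -o sync,noatime nfsserver:/tank/export /mnt/tank
>>   mount -t nfs -o noatime nfsserver:/tank/export /mnt/tank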
>>
>> The thing I can't reproduce might have been only possible on a pool that
>> I created with FreeBSD 8.2-RELEASE and then upgraded, which I no longer
>> have. Or maybe it was with "sync" without "noatime".
>>
>> I am going to test with an MTU of 9000, and if it is not much faster, I
>> am giving up on NFS. My original plan was to use ESXi with a ZFS
>> datastore and a replicated backup. That works terribly using the ESXi
>> NFS client. Netbooting the OSes to bypass the ESXi client works much
>> better, but still not well enough for many servers. NFS is poorly
>> implemented, with terrible write caching on the client side. Now my plan
>> is to use FreeBSD with VirtualBox and ZFS all in one system, and send
>> replication snapshots from there. I wanted to use ESXi, but I guess I
>> can't.
>>
>> And the worst thing about ESXi is that if you have one client going
>> 7 MB/s, a second client has to share that 7 MB/s, and non-ESXi clients
>> will still go horribly slow. If you have 10 non-ESXi clients going at
>> 100 MB/s, each one is limited to around 100 MB/s (again, I only tested
>> this with a 1500 MTU so far), but together they can write much more.
>>
>> Just now I tested 2 clients writing 100+100 MB/s (reported by GNU dd),
>> and 3 clients writing 50+60+60 MB/s (reported by GNU dd).
>> Output from "zpool iostat 5":
>> two clients:
>> tank        38.7T  4.76T      0  1.78K  25.5K   206M (matches 100+100)
>> three clients:
>> tank        38.7T  4.76T      1  2.44K   205K   245M (does not match
>> 60+60+50)
>>
>> (one client is a Linux netboot, and the others are using the Linux NFS
>> client)
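>>
>> The dd runs above were just sequential writes of a big file over the NFS
>> mount, roughly like this (the path, block size, and count are made up):
>>
>>   dd if=/dev/zero of=/mnt/tank/testfile bs=1M count=8192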
>>
>> But I am not an 'official', so this cannot be considered 'officially
>> clarified' ;)
>>
>>


-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney at brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------


