Options for zfs inside a VM backed by zfs on the host

Fri Aug 28 16:27:32 UTC 2015

> On Aug 27, 2015, at 7:47 PM, Tenzin Lhakhang <tenzin.lhakhang at gmail.com> wrote:
> 
> On Thu, Aug 27, 2015 at 3:53 PM, Chad J. Milios <milios at ccsys.com> wrote:
> 
> Whether we are talking ffs, ntfs or zpool atop zvol, unfortunately there are really no simple answers. You must consider your use case, the host and vm hardware/software configuration, perform meaningful benchmarks and, if you care about data integrity, thorough tests of the likely failure modes (all far more easily said than done). I’m curious to hear more about your use case(s) and setups so as to offer better insight on what alternatives may make more/less sense for you. Performance needs? Are you striving for lower individual latency or higher combined throughput? How critical are integrity and availability? How do you prefer your backup routine? Do you handle that in guest or host? Want features like dedup and/or L2ARC up in the mix? (Then everything bears reconsideration, just about triple your research and testing efforts.)
> 
> Sorry, I’m really not trying to scare anyone away from ZFS. It is awesome and capable of providing amazing solutions with very reliable and sensible behavior if handled with due respect, fear, monitoring and upkeep. :)
> 
> There are cases to be made for caching [meta-]data in the child, in the parent, checksumming in the child/parent/both, compressing in the child/parent. I believe `gstat` along with your custom-made benchmark or test load will greatly help guide you.
> 
> ZFS on ZFS seems to be a hardly studied, seldom reported, never documented, tedious exercise. Prepare for accelerated greying and balding of your hair. The parent's volblocksize, child's ashift, alignment, and interactions involving raidz stripes (if used) can lead to problems ranging from slightly decreased performance and storage efficiency to pathological write amplification within ZFS, with performance and responsiveness crashing and sinking to the bottom of the ocean. Some datasets can become veritable black holes to vfs system calls. You may see ZFS reporting elusive errors, deadlocking or panicking in the child or parent altogether. With diligence though, stable and performant setups can be discovered for many production situations.
> 
> For example, for a zpool (whether used by a VM or not, locally, through iSCSI, ggate[cd], or whatever) atop a zvol which sits on a parent zpool with no redundancy, I would set primarycache=metadata checksum=off compression=off for the zvol(s) on the host(s) and for the most part just use the same zpool settings and sysctl tunings in the VM (or child zpool, whatever role it may play) that I would otherwise use on bare CPU and bare drives (defaults + compression=lz4 atime=off). However, that simple case is likely not yours.
> 
> With ufs/ffs/ntfs/ext4 and most other filesystems atop a zvol I use checksums on the parent zvol, and compression too if the child doesn’t do its own (as ntfs can), but I still cache only metadata on the host and let the child vm/fs cache real data.
> 
> My use case involves charging customers for their memory use so admittedly that is one motivating factor, LOL. Plus, I certainly don’t want one rude VM marching through host ARC unfairly evacuating and starving the other polite neighbors.
> 
> VM’s swap space becomes another consideration and I treat it like any other ‘dumb’ filesystem with compression and checksumming done by the parent but recent versions of many operating systems may be paging out only already compressed data, so investigate your guest OS. I’ve found lz4’s claims of an almost-no-penalty early-abort to be vastly overstated when dealing with zvols, small block sizes and high throughput so if you can be certain you’ll be dealing with only compressed data then turn it off. For the virtual memory pagers in most current-day OS’s though set compression on the swap’s backing zvol to lz4.
> 
> Another factor is the ZIL. One VM can hoard your synchronous write performance. Solutions are beyond the scope of this already-too-long email :) but I’d be happy to elaborate if queried.
> 
> And then there’s always netbooting guests from NFS mounts served by the host and giving the guest no virtual disks, don’t forget to consider that option.
> 
> Hope this provokes some fruitful ideas for you. Glad to philosophize about ZFS setups with y’all :)
> 
> -chad

> That was a really awesome read!  The idea of turning metadata on at the backend zpool and then data on the VM was interesting, I will give that a try. Please can you elaborate more on the ZILs and synchronous writes by VMs.. that seems like a great topic.

> I am right now exploring the question: are SSD ZILs necessary in an all SSD pool? and then the question of NVMe SSD ZILs on top of an all SSD pool.  My guess at the moment is that SSD ZILs are not necessary at all in an SSD pool during intensive IO.  I've been told that ZILs are always there to help you, but when your pool's aggregate IOPS is greater than that of a ZIL, it doesn't seem to make sense.. Or is it the latency of writing to a single disk vs striping across your "fast" vdevs?
> 
> Thanks,
> Tenzin

Well, the ZIL (ZFS Intent Log) is basically an absolute necessity. Without it, a call to fsync() could take over 10 seconds on a system serving a relatively light load. HOWEVER, a source of confusion is the terminology people often throw around. See, the ZIL is basically a concept, a method, a procedure. It is not a device. A 'SLOG' is what most people mean when they say ZIL. That is a Separate Log device. (ZFS ‘log’ vdev type; documented in man 8 zpool.) When you aren’t using a SLOG device, your ZIL is transparently allocated by ZFS from the main pool itself, roughly a little chunk of space reserved near the “middle” of it (unless you’ve gone out of your way to deliberately disable the ZIL entirely). At least, ZFS attempts to locate it there physically, but on SSDs or SMR HDs there’s no way to do so and no point in trying.

The other confusion often surrounding the ZIL is when it gets used. Most writes (in the world) would bypass the ZIL (built-in or SLOG) entirely anyway because they are asynchronous writes, not synchronous ones. Only the latter are candidates to clog a ZIL bottleneck. You will need to consider your workload specifically to know whether a SLOG will help, and if so, how much SLOG performance is required to not put a damper on the pool’s overall throughput capability. Conversely you want to know how much SLOG performance is overkill because NVMe and SLC SSDs are freaking expensive.
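One rough way to get a first feel for how much synchronous writes cost on a given dataset is to time the same small writes with and without fsync(). The following is only a minimal sketch in Python, not a proper benchmark: the function names, the 4 KiB write size and the count of 200 are arbitrary assumptions, and a real test should mimic your actual workload and run far longer.

```python
import os
import tempfile
import time


def time_writes(directory, count=200, size=4096, sync=False):
    """Return mean seconds per write of `size` bytes, fsync'ing each one if sync=True."""
    payload = os.urandom(size)  # incompressible data, so compression cannot flatter the result
    path = os.path.join(directory, "sync.dat" if sync else "async.dat")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            if sync:
                os.fsync(fd)  # block until the OS reports this write as stable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed / count


def compare(directory):
    """Time asynchronous and synchronous writes in the same directory."""
    return {
        "async": time_writes(directory, sync=False),
        "sync": time_writes(directory, sync=True),
    }


if __name__ == "__main__":
    # Assumption: replace the temporary directory with a path on the pool under test.
    with tempfile.TemporaryDirectory() as scratch:
        for mode, seconds in compare(scratch).items():
            print(f"{mode}: {seconds * 1e6:.1f} microseconds per write")
```

If the "sync" figure is not much worse than the "async" one for your real load, a SLOG is unlikely to buy you much.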

Now for many on the list this is going to be some elementary information, so I apologize, but I come across this question all the time: sync vs async writes. I’m sure there are many who might find this informative, and with ZFS the difference becomes more profound and important than with most other filesystems.

See, ZFS is always bundling up batches of writes into transaction groups (TXGs). Without extraneous detail it can be understood that basically these happen every 5 seconds (sysctl vfs.zfs.txg.timeout). So picture that ZFS typically has two TXGs it’s worried about at any given time: one is being filled in memory while the previous one is being flushed out to physical disk.

So when you write something asynchronously the operating system is going to say ‘aye aye captain’ and send you along your merry way very quickly but if you lose power or crash and then reboot, ZFS only guarantees you a CONSISTENT state, not your most recent state. Your pool may come back online and you’ve lost 5-15 seconds worth of work. For your typical desktop or workstation workload that’s probably no big deal. You lost 15 seconds of effort, you repeat it, and continue about your business.
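To make the difference concrete, here is a toy model in Python of which writes survive a crash. It is purely illustrative and is not how ZFS is implemented: the class ToyPool and its methods are made up for this sketch, there is only one open TXG, and the intent log is just a list. It also ignores ordering details, such as what fsync() does to earlier async writes to the same file.

```python
class ToyPool:
    """Toy model only: one open TXG in memory, one on-disk state, one intent log."""

    def __init__(self):
        self.on_disk = {}     # state as of the last completed TXG
        self.open_txg = {}    # writes accepted but not yet flushed to disk
        self.intent_log = []  # records of sync writes made since the last flush

    def write(self, key, value, sync=False):
        self.open_txg[key] = value
        if sync:
            # A sync write does not return until its log record is stable.
            self.intent_log.append((key, value))

    def txg_flush(self):
        """Commit the open TXG; its intent log records are now redundant."""
        self.on_disk.update(self.open_txg)
        self.open_txg = {}
        self.intent_log = []

    def crash_and_import(self):
        """Lose everything in memory, then replay the log over the last TXG."""
        self.open_txg = {}
        for key, value in self.intent_log:
            self.on_disk[key] = value
        self.intent_log = []
        return dict(self.on_disk)


pool = ToyPool()
pool.write("a", 1)             # async, but a TXG flush follows
pool.txg_flush()
pool.write("b", 2)             # async, still only in memory at the crash
pool.write("c", 3, sync=True)  # sync, recorded in the intent log
print(pool.crash_and_import())  # {'a': 1, 'c': 3}
```

The async write "b" is gone after the crash, yet the pool is still consistent. The sync write "c" comes back because its log record is replayed at import.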

However, imagine a mail server that received many many emails in just that short time and has told all the senders of all those messages “got it, thumbs up”. You cannot retract those assurances you handed out. You have no idea who to contact to ask to repeat themselves. Even if you did, it's likely the sending mail servers have long since forgotten about those particular messages. So, with each message you receive, after you tell the operating system to write the data you issue a call to fsync(new_message) and only after that call returns do you give the sender the thumbs up to forget the message and leave it in your capable hands to deliver it to its destination. Thanks to the ZIL, fsync() will typically return in milliseconds or less instead of the many seconds it could take for that write in a bundled TXG to end up physically saved. In an ideal world, the ZIL gets written to and never read again, its data just becoming stale and overwritten. (The data stays in the in-memory TXG, so it’s redundant in the ZIL once that TXG completes flushing.)
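The write-then-fsync-then-acknowledge pattern described above might look roughly like this in Python. This is a minimal sketch under stated assumptions, not code from any real mail server: store_message, receive and the acknowledge callback are hypothetical names, and a POSIX system is assumed. The extra fsync() on the spool directory is a common POSIX habit to make sure the new file name itself is durable; whether it is strictly needed depends on the filesystem.

```python
import os
import tempfile


def store_message(spool_dir, name, data):
    """Write `data` to spool_dir/name and return only after it has been fsync'ed."""
    path = os.path.join(spool_dir, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # the synchronous step: wait for stable storage
    finally:
        os.close(fd)
    # Also sync the directory so the new directory entry is on stable storage.
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


def receive(spool_dir, name, data, acknowledge):
    """Store the message durably first, and only then give the sender the thumbs up."""
    path = store_message(spool_dir, name, data)
    acknowledge(name)  # hypothetical callback standing in for the protocol's "OK"
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        receive(spool, "msg-0001", b"Subject: hello\r\n\r\nhi\r\n", print)
```

If the machine dies before fsync() returns, the sender never got its acknowledgement and will retry; if it dies afterwards, the message is already safe.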

The email server is the typical example of the use of fsync but there are thousands of others. Typically applications using central databases are written in a simplistic way to assume the database is trustworthy and fsync is how the database attempts to fulfill that requirement.

To complicate matters, consider VMs, particularly uncooperative, impolite, selfish VMs. Synchronous write IOPS are a particularly scarce and expensive resource which hasn’t been increasing as quickly and cheaply as, say, I/O bandwidth, CPU speeds and memory capacities. To make it worse, the IOPS numbers most SSD makers advertise on their so-called spec sheets are untrustworthy: they have no standard benchmark or enforcement, and the makers are seldom accountable to anybody not buying 10,000 units. (“The PS in IOPS stands for Per Second so we ran our benchmark on a fresh drive for one second and got 100,000 IOPS.” Well, good for you, that is useless to me. Tell me what you can sustain all day long a year down the road.) All this consolidation of VMs/containers/jails can really stress the sync I/O capability of even the biggest, baddest servers.

And FreeBSD, in all its glory, is not yet very well suited to the problem of multi-tenancy. (It’s great if all jails and VMs on a server are owned and controlled by one stakeholder who can coordinate their friendly coexistence.) My firm develops and supports a proprietary shim into ZFS and jails for enforcing the polite sharing of bandwidth, total IOPS and sync IOPS, which can be applied to groups whose membership is defined at the granularity of arbitrary ZFS datasets. So there, that's my shameless plug, LOL. However, there are brighter minds than I working on this problem, and I’m hoping some time either to participate in a more general development of such facilities with broader application in mainline FreeBSD or perhaps to open source my own work eventually. (I guess I’m being more shy than selfish with it, LOL.)

Hope that’s food for thought for some of you.

-chad