ZFS Direct IO (Was: Re: March OpenZFS Leadership Meeting)

Richard Laager rlaager at wiktel.com
Thu Mar 5 09:22:14 UTC 2020


On 3/4/20 2:58 PM, Matthew Ahrens wrote:
> * Directio (Mark M)
>   o User interface
>     + What happens to partial-block writes that are
>       DIRECTIO-requested?
>       # Nobody wants to argue against failing partial block writes
>
qemu is an application which uses O_DIRECT when configured in its
cache=none mode (and in its cache=directsync mode, which additionally
uses O_DSYNC):
https://documentation.suse.com/sles/11-SP4/html/SLES-kvm4zseries/cha-qemu-cachemodes.html
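
To make the flag mapping concrete, here is a minimal C sketch of what
those two cache modes boil down to at the open(2) level (this is not
qemu's actual code, and the image filename is made up):

    #define _GNU_SOURCE                 /* O_DIRECT needs this on glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* cache=none: bypass the host page cache; completion of a write
         * does not imply the data is on stable storage. */
        int fd_none = open("guest.raw", O_RDWR | O_DIRECT);

        /* cache=directsync: bypass the host page cache AND make each
         * write durable before it returns (O_DSYNC). */
        int fd_dsync = open("guest.raw", O_RDWR | O_DIRECT | O_DSYNC);

        if (fd_none < 0 || fd_dsync < 0)
            perror("open");
        if (fd_none >= 0)
            close(fd_none);
        if (fd_dsync >= 0)
            close(fd_dsync);
        return 0;
    }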

In this use case (qemu/virtualization), the key desired benefit of
O_DIRECT is that it bypasses the host's cache. I think the ideal mapping
here is for O_DIRECT to have the same effect as primarycache=metadata,
with ZFS /not/ requiring recordsize-sized writes (i.e. /not/ "fail[ing]
partial block writes").

VMs generally write in 512 B or 4 KiB blocks. It is almost certainly
not feasible to get a VM to write in e.g. 128 KiB blocks. If the
application is required to write in recordsize-sized blocks (and
assuming the application does not want to take on the read-modify-write
itself, which I think is the case here), this forces the administrator
to set recordsize to 512 B or 4 KiB. A recordsize that small is
detrimental in terms of compression ratios, metadata overhead, raidz
space overhead (if you use raidz), etc.
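
To put a rough number on the raidz space overhead point, here is a
back-of-the-envelope sketch in C (the 6-wide raidz2 with ashift=12
geometry is made up for illustration, and the function only
approximates the raidz allocation arithmetic as I understand it; it is
not ZFS code):

    #include <stdio.h>

    /* Approximate sectors allocated for one block on raidz:
     * data sectors plus parity per stripe, rounded up to a multiple
     * of (nparity + 1) sectors. */
    static int raidz_sectors(int data_sectors, int nparity, int width)
    {
        int dcols = width - nparity;
        int stripes = (data_sectors + dcols - 1) / dcols;
        int total = data_sectors + stripes * nparity;
        int rem = total % (nparity + 1);
        if (rem != 0)
            total += (nparity + 1) - rem;
        return total;
    }

    int main(void)
    {
        /* With 4 KiB sectors (ashift=12): a 4 KiB record is 1 data
         * sector, a 128 KiB record is 32 data sectors. */
        printf("4 KiB record:   %d sectors (3x the logical size)\n",
               raidz_sectors(1, 2, 6));
        printf("128 KiB record: %d sectors (1.5x the logical size)\n",
               raidz_sectors(32, 2, 6));
        return 0;
    }

With that geometry, a 4 KiB record takes 3 sectors (12 KiB) while a
128 KiB record takes 48 sectors (192 KiB), so the same data occupies
roughly twice the raw space at recordsize=4K, before even considering
the compression and metadata effects.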

A reasonable counter-argument is that, for this use case, the
administrator could set the proposed option to the current direct IO
behavior (i.e. just ignore O_DIRECT) and also set primarycache=metadata.
If requiring recordsize-sized blocks turns out to be enough of a win in
other use cases, at least we would have a decent fallback here.
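
For concreteness, that fallback on a VM image dataset would look
something like

    zfs set primarycache=metadata tank/vm-images

(dataset name made up), plus setting the proposed property, whatever it
ends up being called, to the value that preserves today's
ignore-O_DIRECT behavior.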

A further question was raised about what downsides caching has. For the
virtualization use case, I've always had a particular concern.
Fundamentally, caching here uses the host's RAM to speed up disk IO,
but on a virtualization host the vast majority of RAM is allocated to
guests. This leads to a capacity planning/predictability concern.
Imagine I have a host with e.g. 64 GiB of RAM, I've allocated 32 GiB of
RAM to guests, and everything is working fine. Can I start another guest
that uses 16 GiB of RAM? It seems that I should be able to, and if I'm
using uncached (on the host) IO, I definitely can. However, if I'm using
the host's RAM for caching, taking 16 GiB away from the cache could have
detrimental performance effects. In other words, I might be
(inadvertently) relying upon the caching to provide acceptable IO
performance. Eliminating the host cache ensures that all IO caching
happens in the guest's RAM, which makes RAM allocation easier to
understand and more predictable.

-- 
Richard
