[Bug 235774] [FUSE]: Need to evict invalidated cache contents on fuse_write_directbackend()

Bruce Evans brde at optusnet.com.au
Thu Mar 7 05:23:40 UTC 2019


[bugzilla kills replies, but is in the Cc list twice]

On Wed, 6 Mar 2019, Conrad Meyer wrote:

> On Wed, Mar 6, 2019 at 1:32 PM Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>
>> --- Comment #4 from Conrad Meyer <cem at freebsd.org> ---
>>> I think fuse's IO_DIRECT path is a mess.  Really all IO should go through the
>>> buffer cache, and B_DIRECT and ~B_CACHE are just flags that control the
>>> buffer's lifetime once the operation is complete.  Removing the "direct"
>>> backends entirely (except as implementation details of strategy()) would
>>> simplify and correct the caching logic.
>>
>> Hmm, I'm not sure that I agree that all I/O should go through the buffer cache,
>> in general. (I won't admit to knowing the fuse code well enough to comment
>> specifically on it.)
>
> The scope of the bug and comment you've replied to is just FUSE IO.
>
>> … having the NFS (or FUSE) client do a
>> large amount of writing to a file can flood the buffer cache and avoiding this
>> for the case where the client won't be reading the file would be nice.
>> What I am not sure is whether O_DIRECT is a good indicator of "doing a lot of
>> writing that won't be read back".
>
> This is the known failure mode of LRU cache policies plus finite cache
> size plus naive clients.  It's not specific to any particular
> filesystem.  You can either enlarge your LRU cache to incorporate the
> entire working set size, incorporate frequency of access in eviction
> policy, or have smart clients provide hints (e.g.,
> POSIX_FADV_DONTNEED).  O_DIRECT -> IO_DIRECT -> B_DIRECT is already
> used as a hint in the bufcache to release bufs/pages aggressively.

It is mostly a failure with naive clients.  Some are so naive that they even
trust the implementation of O_DIRECT to be any good.  Here the naive client
is mostly FUSE.

I fixed this in the md device using POSIX_FADV_DONTNEED and an optional
new caching option that turns this off.  Clients above md can still get
slowness by using block sizes too different from the block sizes (if any)
used by the backing storage, but unlike IO_DIRECT, POSIX_FADV_DONTNEED is
only a hint and it only discards full blocks from the buffer cache for
file systems that use the buffer cache.
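
For reference, the userland shape of that hint is roughly the following
(a minimal sketch only; the function name is made up and the actual md
change is in the kernel):

#include <err.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write a block, then tell the kernel the data will not be read back,
 * so file systems that use the buffer cache may drop it.  The advice
 * is only a hint; file systems that ignore it (e.g., zfs) do so silently.
 */
void
write_and_drop(int fd, const void *buf, size_t len, off_t off)
{

        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                err(1, "pwrite");
        (void)posix_fadvise(fd, off, (off_t)len, POSIX_FADV_DONTNEED);
}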

zfs doesn't use the buffer cache, and most of posix_fadvise(2), including
all of POSIX_FADV_DONTNEED, is just a stub that has no effect for it.
zfs also doesn't support IO_DIRECT, so the attempted pessimizations
from using IO_DIRECT for md had no effect.

ffs has a fairly bad implementation of IO_DIRECT.  For writing, it does
the write using the buffer cache and then kills the buffer.  The result
for full blocks is the same as for a normal write followed by
POSIX_FADV_DONTNEED.  The result for a partial block is to kill the
buffer while POSIX_FADV_DONTNEED would keep it.  For reading, it does
much the same unless the optional DIRECTIO option is configured.  Then
the buffer cache is not used at all.  This seems to make no significant
difference when all i/o is direct.  Normal methods using 1 buffer at a
time won't thrash the buffer cache.  Rawread uses a pbuf and pbufs are
a more limited resource with more primitive management, so it might
actually be slower.  zfs also doesn't support DIRECTIO.
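
Spelled out from userland (a hedged sketch with made-up function names):
for a full ffs block, (a) and (b) below leave the buffer cache in the same
state; for a partial block, (a) kills the buffer while (b) keeps it.

#include <err.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* (a) O_DIRECT write: ffs still goes through a buffer, then kills it. */
void
write_direct(const char *path, const void *buf, size_t len, off_t off)
{
        int fd;

        if ((fd = open(path, O_WRONLY | O_DIRECT)) == -1)
                err(1, "open");
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                err(1, "pwrite");
        close(fd);
}

/* (b) normal write followed by the DONTNEED hint. */
void
write_dontneed(const char *path, const void *buf, size_t len, off_t off)
{
        int fd;

        if ((fd = open(path, O_WRONLY)) == -1)
                err(1, "open");
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                err(1, "pwrite");
        (void)posix_fadvise(fd, off, (off_t)len, POSIX_FADV_DONTNEED);
        close(fd);
}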

md used to use IO_DIRECT only for reading.  With vnode backing on ffs,
only reading with the same block size as ffs was reasonably efficient.
IO_DIRECT prevents normal clustering even without DIRECTIO, so large
block sizes in md were not useful (ffs splits them up), and small block
sizes were very slow.  E.g., with 512-blocks in the client above md and
32K-blocks in ffs, reading 32K in the client 512 bytes at a time uses
64 reads of the same 32K-block in ffs.  Thanks to caching in the next
layer of storage this is usually not so bad, but it takes a lot of CPU
and a high iops rate in all layers to do 64 times as many i/o's.  Now
the ffs block is kept until it has all been read, so this only takes a
lot of CPU and a high iops rate in the layers between md and ffs, where
iops is limited only by CPU (including memory).
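
As a worked check of the numbers in that example (the 512 and 32K block
sizes are the ones used above):

#include <stdio.h>

int
main(void)
{
        const unsigned md_bs = 512;             /* block size of the client above md */
        const unsigned ffs_bs = 32 * 1024;      /* block size of the backing ffs */
        const unsigned reads = ffs_bs / md_bs;  /* 64 client reads per ffs block */

        /* With IO_DIRECT and no caching, each client read re-reads the ffs block. */
        printf("%u reads of the same %u-byte ffs block = %u bytes moved "
            "to deliver %u (%ux amplification)\n",
            reads, ffs_bs, reads * ffs_bs, ffs_bs, reads);
        return (0);
}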

md didn't use IO_DIRECT for writing, since it considered that to be too
slow.  But it was at worst only about 3 times slower than what md did.
md also didn't use any clustering, and it normally doesn't use async
writes (this is an unusable configuration option, since async writes
can hang), so it got much the same slowness as sync mounts in ffs.
The factor of 3 slowness is from having to do a read-modify-write to
write partial blocks.  This gave most of the disadvantages of not using
the buffer cache, but still gave double-caching.  Now writes in md are
cached 1 block at a time and double-caching is avoided for file systems
that support POSIX_FADV_DONTNEED.
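
The read-modify-write in question looks roughly like this from a direct
writer's point of view (a sketch with hypothetical names; it assumes the
write fits inside one backing block and that blk is a block-sized bounce
buffer).  With the buffer cache, the block is usually already resident,
so the extra read mostly disappears.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int
rmw_partial(int fd, char *blk, size_t blksize, const void *data, size_t len,
    off_t off)
{
        off_t blkoff = off - (off % (off_t)blksize);

        if (pread(fd, blk, blksize, blkoff) != (ssize_t)blksize)       /* read */
                return (-1);
        memcpy(blk + (off - blkoff), data, len);                        /* modify */
        if (pwrite(fd, blk, blksize, blkoff) != (ssize_t)blksize)      /* write */
                return (-1);
        return (0);
}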

Even non-naive clients like md have a hard time managing the block sizes.
E.g., to work as well as possible, md would first need to understand that
POSIX_FADV_DONTNEED is not supported by some file systems and supply
workarounds.  In general, the details of the caching policies and current
cache state in the lower layer(s) would have to be understood.  Even
posix_fadvise(2) doesn't understand much of that.  It is only implemented
at the vfs level where the details are not known except indirectly by their
effect on the buffer and object caches.

There is also some confusion and bugs involving *DONTNEED and *NOREUSE:
- vop_stdadvise() only supports POSIX_FADV_DONTNEED.  It does nothing for
   the stronger hint POSIX_FADV_NOREUSE.
- posix_fadvise() knows about this bug and converts POSIX_FADV_NOREUSE
   into POSIX_FADV_DONTNEED.
- ffs IO_DIRECT wants NOREUSE semantics (to kill the buffer completely).
   It gets this by not using VOP_ADVISE(), but using the buffer cache.
- the buffer cache has the opposite confusion and bugs.  It supports
   B_NOREUSE but not B_DONTNEED.  IO_DIRECT is automatically converted to
   B_NOREUSE when ffs releases the buffer.  This is how ffs kills the
   buffer without knowing the details.
- my initial fixes for md did more management, which would have worked with
   NOREUSE semantics.  md wants to kill the buffer too, but only when it
   is full.  I found that the DONTNEED semantics as implemented in
   vop_stdadvise() worked just as well.  But there is a problem with
   random small i/o's.  My initial fixes wanted to kill even small buffers
   when the next i/o is not contiguous.  But this prevents caching when
   caching is especially needed (it is only for sequential i/o's that the
   data is expected not to be needed again).  posix_fadvise() and
   vop_stdadvise() have even less idea how to handle random i/o's.  I
   think they just don't free partial blocks.  A userland analog of the
   policy md ended up with is sketched below.
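
Sketch of that policy (names and structure are hypothetical, not the
actual md code): drop a block only once sequential i/o has completed it;
random or partial i/o keeps its block cached, since that is exactly when
caching helps.

#include <fcntl.h>
#include <sys/types.h>

void
maybe_drop(int fd, off_t blksize, off_t off, off_t len, off_t prev_end)
{
        off_t blkstart = off - (off % blksize);

        if (off != prev_end)
                return;         /* random i/o: keep it cached */
        if (off + len - blkstart < blksize)
                return;         /* block not yet complete: keep it */
        (void)posix_fadvise(fd, blkstart, off + len - blkstart,
            POSIX_FADV_DONTNEED);
}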

Bruce

