Prefaulting for i/o buffers

Thu Mar 1 15:16:51 UTC 2012

On Thu, Mar 01, 2012 at 03:11:16PM +0000, Attilio Rao wrote:
> 2012/3/1, Konstantin Belousov <kostikbel at gmail.com>:
> > On Thu, Mar 01, 2012 at 02:50:40PM +0000, Attilio Rao wrote:
> >> 2012/3/1, Konstantin Belousov <kostikbel at gmail.com>:
> >> > On Thu, Mar 01, 2012 at 02:32:33PM +0000, Attilio Rao wrote:
> >> >> 2012/3/1, Pawel Jakub Dawidek <pjd at freebsd.org>:
> >> >> > On Thu, Mar 01, 2012 at 01:16:24PM +0200, Gleb Kurtsou wrote:
> >> >> >> On (25/02/2012 20:46), Pawel Jakub Dawidek wrote:
> >> >> >> > - "Every file system needs cache. Let's make it general, so that
> >> >> >> >   all file systems can use it!" Well, for VFS each file system is
> >> >> >> >   a separate entity, which is not the case for ZFS. ZFS can cache
> >> >> >> >   one block only once that is used by one file system, 10 clones
> >> >> >> >   and 100 snapshots, which all are separate mount points from VFS
> >> >> >> >   perspective. The same block would be cached 111 times by the
> >> >> >> >   buffer cache.
> >> >> >>
> >> >> >> Hmm. But this one is optional. Use vop_cachedlookup (or call
> >> >> >> cache_enter() on your own), add a number of cache_purge() calls.
> >> >> >> It's pretty much the library-like design you describe below.
> >> >> >
> >> >> > Yes, namecache is already library-like, but I was talking about the
> >> >> > buffer cache. I managed to bypass it eventually with suggestions from
> >> >> > ups@, but for a long time I was sure it isn't at all possible.
> >> >>
> >> >> Can you please clarify this? I really don't understand what you
> >> >> mean.
> >> >>
> >> >> >
> >> >> >> Everybody agrees that VFS needs more care. But there haven't been
> >> >> >> many concrete suggestions, or at least there is no VFS TODO list.
> >> >> >
> >> >> > Everybody agrees on that, true, but we disagree on the direction in
> >> >> > which we should move our VFS, i.e. make it more light-weight vs.
> >> >> > more heavy-weight.
> >> >>
> >> >> All I'm saying (and Gleb too) is that I don't see any benefit in
> >> >> replicating the whole vnode lifecycle at the inode level and in the
> >> >> filesystem-specific implementation.
> >> >> I don't see a simplification in the work to do, and I don't think this
> >> >> is going to be simpler for a single specific filesystem (not to
> >> >> mention legacy support, which means re-implementing inode handling
> >> >> for every filesystem we have now); we just lose generality.
> >> >>
> >> >> If you want a good example of a VFS primitive that was really
> >> >> UFS-centric and was mistakenly made generic, look at vn_start_write()
> >> >> and its siblings. I guess it was introduced just to cater to UFS
> >> >> snapshot creation and then it poisoned other consumers.
> >> >
> >> > vn_start_write() has nothing to do with filesystem code at all.
> >> > It is purely a VFS-layer operation, which shall not be called from fs
> >> > code at all. vn_start_secondary_write() is sometimes useful for the
> >> > filesystem itself.
> >> >
> >> > Suspension (not snapshotting) is very useful and makes it possible to
> >> > avoid some nasty issues with unmounts, remounts or guaranteed syncing
> >> > of the filesystem. The fact that only UFS utilizes this functionality
> >> > just shows that other filesystem implementors do not care about this
> >> > correctness, or that other filesystems are not maintained.
> >>
> >> I'm sure that when I looked into it only UFS suspension was being
> >> touched by it and it was introduced back in the days when snapshotting
> >> was sanitized.
> >>
> >> So what are the races that it is supposed to fix and that other
> >> filesystems don't care about?
> >
> > You cannot reliably sync the filesystem when other writers are active.
> > So, for instance, a loop over vnodes that fsyncs them in the unmount code
> > may never terminate. The same is true for rw->ro remounts.
> >
> > One possible solution there is to suspend writers. If the unmount is
> > successful, a writer will get a failure from the vn_start_write() call,
> > while it will proceed normally if the unmount is terminated or was not
> > started at all.
> 
> I don't think we implement that right now, IIRC, but it is an interesting idea.

What don't we implement right now? Take a look at r183074 (Sep 2008).
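
To make the suspension idea quoted above concrete, here is a minimal
user-space sketch of a writer gate. It is only a model built on POSIX
threads, not the kernel code: struct mount_model, the mm_*() functions and
the choice of ENXIO as the failure code are all hypothetical names picked
for illustration, and nothing here is taken from r183074.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model of a mount point; NOT the kernel's struct mount. */
struct mount_model {
	pthread_mutex_t	lock;
	pthread_cond_t	cv;
	int		writers;	/* writes currently in progress */
	bool		suspended;	/* new writers must wait */
	bool		unmounted;	/* set once an unmount has succeeded */
};

void
mm_init(struct mount_model *mp)
{
	pthread_mutex_init(&mp->lock, NULL);
	pthread_cond_init(&mp->cv, NULL);
	mp->writers = 0;
	mp->suspended = false;
	mp->unmounted = false;
}

/* Writer entry: sleeps while suspended, fails if the filesystem went away. */
int
mm_start_write(struct mount_model *mp)
{
	int error = 0;

	pthread_mutex_lock(&mp->lock);
	while (mp->suspended && !mp->unmounted)
		pthread_cond_wait(&mp->cv, &mp->lock);
	if (mp->unmounted)
		error = ENXIO;		/* arbitrary errno chosen for this sketch */
	else
		mp->writers++;
	pthread_mutex_unlock(&mp->lock);
	return (error);
}

/* Writer exit: wakes a suspender waiting for the writer count to drain. */
void
mm_finish_write(struct mount_model *mp)
{
	pthread_mutex_lock(&mp->lock);
	if (--mp->writers == 0)
		pthread_cond_broadcast(&mp->cv);
	pthread_mutex_unlock(&mp->lock);
}

/* Suspend: block new writers, then wait until active ones have finished. */
void
mm_suspend(struct mount_model *mp)
{
	pthread_mutex_lock(&mp->lock);
	mp->suspended = true;
	while (mp->writers > 0)
		pthread_cond_wait(&mp->cv, &mp->lock);
	pthread_mutex_unlock(&mp->lock);
}

/* Resume: release sleeping writers; they fail if the unmount succeeded. */
void
mm_resume(struct mount_model *mp, bool unmount_ok)
{
	pthread_mutex_lock(&mp->lock);
	mp->suspended = false;
	if (unmount_ok)
		mp->unmounted = true;
	pthread_cond_broadcast(&mp->cv);
	pthread_mutex_unlock(&mp->lock);
}
```

In this model the sync loop would run between mm_suspend() and mm_resume(),
where no writer can dirty anything behind its back, so the loop is
guaranteed to terminate. A writer sleeping in mm_start_write() either
continues normally (unmount aborted) or gets an error (unmount succeeded).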