Prefaulting for i/o buffers

Konstantin Belousov kostikbel at gmail.com
Sun Feb 26 14:13:39 UTC 2012


On Sun, Feb 26, 2012 at 03:02:54PM +0100, Attilio Rao wrote:
> > On 25 February 2012 22:03, Konstantin Belousov <kostikbel at gmail.com> wrote:
> > On Sat, Feb 25, 2012 at 06:45:00PM +0100, Attilio Rao wrote:
> >> On 25 February 2012 16:13, Pawel Jakub Dawidek <pjd at freebsd.org> wrote:
> >> > On Sat, Feb 25, 2012 at 01:01:32PM +0000, Attilio Rao wrote:
> >> >> On 3 February 2012 19:37, Konstantin Belousov <kostikbel at gmail.com> wrote:
> >> >> > The FreeBSD I/O infrastructure has a well-known deadlock caused
> >> >> > by vnode lock order reversal when buffers supplied to the read(2)
> >> >> > or write(2) syscalls are backed by an mmapped file.
> >> >> >
> >> >> > I previously published patches to convert the i/o path to use
> >> >> > VMIO, based on the Jeff Roberson proposal; see
> >> >> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> >> >> > deadlock. Since that work is very intrusive and did not get any
> >> >> > follow-up, it stalled.
> >> >> >
> >> >> > Below is a very lightweight patch whose only goal is to fix the
> >> >> > deadlock in the least intrusive way. This became possible after
> >> >> > FreeBSD gained the vm_fault_quick_hold_pages(9) and
> >> >> > vm_fault_disable_pagefaults(9) KPIs.
> >> >> > http://people.freebsd.org/~kib/misc/vm1.3.patch
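As an illustration of how these KPIs fit together, here is a minimal
sketch of one bounded chunk of read(2) performed with the user buffer
prefaulted and held, so that no page fault can sleep in the fault
handler while the vnode lock is held. The function name, the chunk
bound, and the locking around VOP_READ() are assumptions made for the
example, not the actual vm1 patch code:

/*
 * Minimal sketch, not the vm1 patch itself: service one bounded
 * chunk of read(2) with the user buffer prefaulted and held.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/uio.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_extern.h>
#include <vm/vm_map.h>
#include <vm/vm_page.h>

#define	HOLD_PAGES	16		/* hypothetical chunk bound */

static int
read_chunk_prefaulted(struct vnode *vp, struct uio *uio, int ioflag,
    struct ucred *cred)
{
	vm_page_t ma[HOLD_PAGES];
	vm_offset_t uaddr;
	vm_size_t len;
	int count, error, save;

	uaddr = (vm_offset_t)uio->uio_iov->iov_base;
	len = MIN(uio->uio_iov->iov_len,
	    HOLD_PAGES * PAGE_SIZE - (uaddr & PAGE_MASK));

	/*
	 * Fault the buffer pages in now, before any vnode lock is
	 * taken, and hold them so they stay resident.
	 */
	count = vm_fault_quick_hold_pages(&curproc->p_vmspace->vm_map,
	    uaddr, len, VM_PROT_WRITE, ma, HOLD_PAGES);
	if (count == -1)
		return (EFAULT);

	/*
	 * Forbid page faults for the current thread.  A fault during
	 * uiomove() now fails instead of recursing into vm_fault()
	 * while the vnode lock is held.
	 */
	save = vm_fault_disable_pagefaults();
	vn_lock(vp, LK_SHARED | LK_RETRY);
	error = VOP_READ(vp, uio, ioflag, cred);
	VOP_UNLOCK(vp, 0);
	vm_fault_enable_pagefaults(save);

	vm_page_unhold_pages(ma, count);
	return (error);
}

A cheaper variant would first attempt the whole i/o with page faults
disabled and fall back to the hold-pages path above only when that
attempt returns EFAULT, so non-mmapped buffers pay almost nothing.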
> >> >>
> >> >> Hi,
> >> >> I was reviewing:
> >> >> http://people.freebsd.org/~kib/misc/vm1.11.patch
> >> >>
> >> >> and I think it is great. It is simple enough and I don't have further
> >> >> comments on it.
> > Thank you.
> >
> > This spoiled an announcement I intended to send this weekend :)
> >
> >> >>
> >> >> However, as a side note, I was wondering whether we could one day
> >> >> get to the point of integrating rangelocks directly into the vnode
> >> >> lockmgr. It would be a huge patch, likely rewriting the locking of
> >> >> several vnode members, but I think it would be worth it in terms of
> >> >> interface cleanliness and lower overhead. Also, it would be
> >> >> interesting to consider merging our rangelock implementation with
> >> >> ZFS's, at some point.
> >> >
> >> > My personal opinion about rangelocks and many other VFS features we
> >> > currently have is that they are a good idea in theory, but in
> >> > practice they tend to overcomplicate VFS.
> >> >
> >> > My opinion is that we should move as much as we can into the
> >> > individual file systems. We try to implement everything in VFS
> >> > itself in the hope that this will simplify the file systems we
> >> > have. It then turns out that only one file system really uses the
> >> > stuff (most of the time it is UFS), and it is a PITA for all the
> >> > other file systems, as well as for maintaining VFS. VFS has become
> >> > so complicated over the years that maybe only a few people
> >> > understand it, and every single change to VFS carries a huge risk
> >> > of breaking some unrelated part.
> >>
> >> I think this is questionable, for the following reasons:
> >> - If the problem is filesystem writers having trouble understanding
> >> the necessary locking, we should really provide cleaner and more
> >> complete documentation. One could say the same about our VM
> >> subsystem, but at least there the code carries plenty of comments
> >> that help explain how to handle vm_object and vm_page locking over
> >> their lifetimes.
> >> - Our primitives may be more complicated than the
> >> 'all-in-the-filesystem' approach, but at least they offer a complete
> >> and centralized view of the resources allocated in the whole system,
> >> and they allow building better policies for managing them. One
> >> problem I see is that those policies are not fully implemented, are
> >> untuned, or have simply become outdated, which erodes one of the
> >> biggest benefits of making vnodes so generic.
> >>
> >> About the point I raised myself:
> >> - As long as the same path now involves both range locking and vnode
> >> locking, I don't see keeping them separate forever as a good idea.
> >> Merging them seems an important evolution to me, not only shrinking
> >> the number of primitives but also introducing less overhead and
> >> likely revamped scalability for vnodes (though I think this needs
> >> deep investigation).
> > The proper direction to move in there is to designate the vnode lock
> > for protecting the vnode structure, and have the range lock protect
> > i/o atomicity. This is somewhat done in the proposed patch (the
> > vnode lock no longer protects the whole i/o operation, but only the
> > chunked i/o transactions inside it).
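A hedged sketch of that split, assuming the rangelock KPI lands
roughly as in the patch (vn_rangelock_wlock() returning a cookie);
the chunk bound and the helper name are invented for illustration:

/*
 * Sketch: the range lock provides i/o atomicity for the whole
 * write, while the vnode lock covers only each bounded transaction.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/uio.h>
#include <sys/vnode.h>

#define	WRITE_CHUNK	(128 * 1024)	/* hypothetical transaction bound */

static int
write_atomic(struct vnode *vp, struct uio *uio, int ioflag,
    struct ucred *cred)
{
	void *cookie;
	ssize_t done, resid;
	int error;

	/* Exclude other i/o from [offset, offset + resid) entirely. */
	cookie = vn_rangelock_wlock(vp, uio->uio_offset,
	    uio->uio_offset + uio->uio_resid);
	error = 0;
	while (uio->uio_resid > 0 && error == 0) {
		/* Clamp this transaction to a bounded chunk. */
		resid = uio->uio_resid;
		uio->uio_resid = MIN(resid, WRITE_CHUNK);
		done = uio->uio_resid;

		/* The vnode lock is held only for the chunk. */
		vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
		error = VOP_WRITE(vp, uio, ioflag, cred);
		VOP_UNLOCK(vp, 0);

		done -= uio->uio_resid;	/* bytes moved this pass */
		uio->uio_resid = resid - done;
		if (done == 0)
			break;
	}
	vn_rangelock_unlock(vp, cookie);
	return (error);
}

Readers of disjoint ranges take vn_rangelock_rlock() and proceed
concurrently; only overlapping ranges serialize on the range lock,
and vnode lock hold times stay bounded by the chunk size.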
> >
> > Jeff's idea of using the page cache as the source of i/o data
> > (implemented in the VM6 patchset) pushes this much further. E.g., a
> > write typically does not take the vnode lock in write mode (though
> > sometimes it has to, in order to extend the vnode).
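For comparison, a rough sketch of that page-cache write path, under
the same caveat: the helper is hypothetical, the real VM6 code is far
more involved, and the user buffer must itself already be held (as
above) to avoid faulting while a cache page is busied:

/*
 * Rough sketch of a page-cache-backed write: copy user data directly
 * into the vnode's VM object, with no vnode lock taken.  Grossly
 * simplified: a real implementation must page in existing file data
 * (here invalid parts are just zeroed) and handle writes beyond EOF.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/uio.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>

static int
write_cached_page(struct vnode *vp, struct uio *uio)
{
	vm_object_t obj;
	vm_page_t m;
	int error, n, off;

	obj = vp->v_object;
	off = uio->uio_offset & PAGE_MASK;
	n = MIN(uio->uio_resid, PAGE_SIZE - off);

	VM_OBJECT_LOCK(obj);
	/* Find or allocate the cache page; it is returned busied. */
	m = vm_page_grab(obj, OFF_TO_IDX(uio->uio_offset),
	    VM_ALLOC_NORMAL | VM_ALLOC_RETRY);
	if (m->valid != VM_PAGE_BITS_ALL)
		vm_page_zero_invalid(m, TRUE);
	VM_OBJECT_UNLOCK(obj);

	/* The busy flag pins the page; move the user bytes into it. */
	error = uiomove_fromphys(&m, off, n, uio);

	VM_OBJECT_LOCK(obj);
	if (error == 0)
		vm_page_dirty(m);
	vm_page_wakeup(m);		/* unbusy the page */
	VM_OBJECT_UNLOCK(obj);
	return (error);
}

Extending the file would still take the vnode lock, and
whole-operation atomicity would still come from the range lock, as
noted above.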
> >
> > I will probably revive VM6 after this change lands.
> 
> About that, I guess we should be careful.
> The first requirement would be a very scalable VM subsystem, and
> recent benchmarks have shown that this is not yet the case (Florian,
> CC'ed, can share some pmc/LOCK_PROFILE analysis of pgsql which, even
> with the vmcontention patch, shows a lot of contention on the
> vm_object lock, the pmap lock, and vm_page_queue_lock. We have plans
> for each of them; we can discuss them in a separate thread if you
> prefer). This is just to say that we may need more work in the
> underlying areas to bring VM6 to the point where it really makes a
> difference.

The benchmarks done at that time demonstrated that VM6 does not cause
regressions for e.g. buildworld time, and shows improvements of around
10% for some postgresql loads.

The main benefit of VM6 on UFS is that writers no longer block readers
on disjoint i/o ranges. Also, due to the vm_page flags locking
improvements, I suspect the VM6 backpressure code could be simplified
and give an even larger benefit right now.

Anyway, I do not think VM6 can be put into HEAD quickly, and I want to
finish the VM1/prefaulting work right now.