more strict KPI for vm_pager_get_pages()

Gleb Smirnoff glebius at FreeBSD.org
Thu Apr 30 14:24:13 UTC 2015


  Hi!

  The reason to write down this patch emerges from the
projects/sendfile branch, where vm_pager_get_pages() is
used in the sendfile(2) system call. Although the new
sendfile works flawlessly, it makes some assumptions
about vnode_pager that theoretically may not be valid,
however always hold in our current code.

Going deeper in the problem I have found more important
points, which yielded in the suggested patch. To start,
let me display the current KPI assumptions:

1) vm_pager_get_pages() works on an array of consequtive
   array of pages. Pindex of (n+1)-th pages must be pindex
   of n-th + 1. One page is special, it is called reqpage.
2) vm_pager_get_pages() guarantees to swapin only the reqpage,
   and may skip or fail other pages for different reasons, that
   may vary from pager to pager.
3) There also is function vm_pager_has_page(), which reports
   availability of a page at given index in the pager, and also
   provides hints on how many consequtive pages before this one
   and after this one can be swapped in in single pager request.
   Most pagers return zeros in these hints. The vnode pager for
   UFS returns a strong promise, that one can later utilize in
   vm_pager_get_pages().
4) All pages must be busied on enter. On exit only reqpage
   will be left busied. The KPI doesn't guarantee that rest
   of the pages is still in place. The pager usually calls
   vm_page_readahead_finish() on them, which can either free,
   or put the page on active/inactive queue, using quite
   a strange approach to choose a queue.
5) The pages must not be wired, since vm_page_free() may be
   called on them. However, this is violated by several
   consumers of KPI, relying on lack of errors in the pager.
   Moreover, the swap pager has a special function to skip
   wired pages, while doing the sweep, to avoid this problem.
   So, passing wired pages to swapper is OK, while to the
   reset is not.
6) Pagers may replace a page in the object with a new one.
   The sg_pager actually does that. To protect from this
   event, consumers of vm_pager_get_pages() always run
   vm_page_lookup() over the array of pages to relookup the pages.
   However, not all consumers do this.

Speaking of pagers and their consumers:
- 11 consumers request array of size 1, a single page
- 3 consumers actually request array

My suggestion is to change the KPI assumptions to the following:

1) There is no reqpage. All pages are entered busied, all pages
   are returned busied and validated. If pager fails to validate
   all pages it must return error.
2) The consumer (not the pager!) is to decide what to do with the
   pages: vm_page_active, vm_page_deactivate, vm_page_flash or just
   vm_page_free them. The consumer also unbusies pages, if it
   wants to. The consumer is free to wire pages before the call.
3) Consumers must first query the pager via vm_pager_has_page(),
   and use the after/before hints to limit the size of the
   requested pages array.
4) In case if pager replaces pages, it must also update the array,
   so that consumer doesn't need to do relookup.

Doing this sweep, I also noticed that all pagers have a copy-pasted
code of zeroing invalid regions of partially valid pages. Also,
many pagers got a set of assertions copy and pasted from each
other. So, I decided to un-inline the vm_pager_get_pages(), bring
it to the vm_pager.c file and gather all these copy-pastes
into one place.

The suggested patch is attached. As expected, it simplifies and
removes quite a lot of code.

Right now it is tested on UFS only, testing NFS and ZFS is on my list.
There is one panic known, but it seems unrelated, and Peter pho@ says
that once it has been seen before.

-- 
Totus tuus, Glebius.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vm_pager_get_pages-new-KPI.diff
Type: text/x-diff
Size: 59722 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20150430/aff87e22/attachment.diff>


More information about the freebsd-arch mailing list