new vm_pager_get_pages() KPI, round 3
kostikbel at gmail.com
Mon Dec 14 11:13:48 UTC 2015
On Sat, Dec 05, 2015 at 08:29:40AM +0300, Gleb Smirnoff wrote:
> [first paragraph for arch subscribers, To: recepients may skip]
> This patch is kinda a prerequisite for the non-blocking sendfile(2),
> that was jointly developed by NGINX and Netflix in 2014 and has been
> running in Netflix production for a year, serving 35% of the whole
> North America (US, Canada, Mexico) Internet traffic.
> Technically, the new sendfile(2) doesn't require the new
> vm_pager_get_pages() KPI. We currently run it on the old KPI. However,
> kib@ suggested that we are abusing the KPI, carefully using its
> edge cases. To address this critic, back in spring, I suggested a KPI,
> where vm_pager_get_pages() offers all-or-none approach to the array of
> pages. Again, kib@ wasn't satisfied, as for "the main user" of
> vm_pager_get_pages, the vm_fault(), all-or-none approach isn't optimal.
> The problem was slowly debated through the summer. And then in October
> jeff@ suggested yet another extension of the KPI, which I have
> implemented and it is described below.
> [for those interested in new sendfile(2), skip to the last paragraph,
> for those willing to review new pager KPI, read on]
> The new KPI offers this prototype for vm_pager_get_pages():
> vm_pager_get_pages(vm_object_t object, vm_page_t pages, int count,
> int *rbehind, in *rahead);
> Where "count" stands for number of pages in the array. The rbehind
> and rahead if not NULL specify how many pages the caller is willing to
> allow the pager to pre-cache, if the pager can.
> Pager doesn't promise to do any read behind or read ahead. If it does,
> then only the pager is responsive for grabbing, busying, unbusying and
> queueing these pages. It also writes the actual values of completed
> read ahead and read behind back to the pointers.
> Pager promises to page in "count" pages or fail. Pager expects the
> pages to be busied, and returns them busied. For a multi page requests,
> the pager demands that the region is a valid region, that exists in
> the pager, which can be checked by preceding call to vm_pager_haspage().
> For single page requests, there is no such demand.
I fail to understand how the case of count > 1 and non-contiguous blocks
in the non-readahead case is handled by new vnode_pager_generic_getpages().
I do not understand how a hole somewhere in the requested range is handled.
Code has a comment that a hole must not appear in the range.
Both issues mean that vm_pager_has_page() still must be called before
pagein, for count > 1 use. E.g. the exec_map_first_page() uses *after
value returned from has_pages() to calculate count, which is an advisory
and not the contract. Same issue prevents converting GEM and TTM (and
probably md) to use the count > 1 KPI.
Same is true for swap pager, and this prevents the removal of the loop
Code assumes that the partially valid page may only appear in the last
position of the page run for the local pager, which again requires
pre-validation of the vm_pager_get_pages() on the caller side.
Overall this is not an KPI that was discussed. It seemingly does not
change semantic for count == 1 case, but is not what it should be for
count > 1. As discussed, new vm_pager_get_pages() was support to just
work for any count, doing the loop over the non-contig ranges or short
reads, and guaranteeing that all existing (or hole-filled) pages are
read until EOF is met. This KPI was supposed to:
- fix my compaints about short reads
- avoid excessive VOP_BMAP() call from has_pages before get_pages()
- allowed to remove the loops from all current get_pages() consumers,
it vm_thread_swapin(), GEM/TTM, image activator
- supposedly served your needs for sendfile(2)
in one go. What is present in your patch might only satisfy the last
item of the list.
> The net result is a win for both vm_fault() and for new sendfile().
> The vm_fault() no longer needs to do prepatory vm_pager_haspage(),
> which removes one I/O operation. The logic for read ahead/behind,
> which is strongly UFS/EXT-centric, moves into vnode_pager.c. So
> we no longer do useless operations when having a fault on ZFS.
> The vm_fault() now knows precisely the read ahead that happened,
> when updates fs.entry->next_read index. This reduces number of
> hardfaults by a tiny fraction (measured building world tree).
> The new sendfile() has a stronger KPI, that doesn't unbusy pages,
> that sendfile() needs to be kept busied.
> Also, the new KPI removes some ugly edges. E.g., since the old
> KPI used to unbusy and free pages in the array in case of an
> error, the pages could not be wired. However, there are places in
> kernel where we want to page in into a wired page. These places
> simply violated the assumption, relying on lack of errors in the
> pager. Moreover, the swap pager has a special function to skip
> wired pages, while doing the freeing sweep, to avoid hitting
> assertion. That means passing wired pages to swapper is kinda
> OK, while to any other pager it is not. So, we end up with
> vm_pager_get_pages() being not pager agnostic, while it is
> designed to be so. Now this is fixed.
More information about the freebsd-arch