new vm_pager_get_pages() KPI, round 3

Mon Dec 14 11:13:48 UTC 2015

On Sat, Dec 05, 2015 at 08:29:40AM +0300, Gleb Smirnoff wrote:
>   Hi,
> 
> [first paragraph for arch subscribers, To: recepients may skip]
> 
> This patch is kinda a prerequisite for the non-blocking sendfile(2),
> that was jointly developed by NGINX and Netflix in 2014 and has been
> running in Netflix production for a year, serving 35% of the whole
> North America (US, Canada, Mexico) Internet traffic.
> Technically, the new sendfile(2) doesn't require the new
> vm_pager_get_pages() KPI. We currently run it on the old KPI. However,
> kib@ suggested that we are abusing the KPI, carefully using its
> edge cases. To address this critic, back in spring, I suggested a KPI,
> where vm_pager_get_pages() offers all-or-none approach to the array of
> pages. Again, kib@ wasn't satisfied, as for "the main user" of
> vm_pager_get_pages, the vm_fault(), all-or-none approach isn't optimal.
> The problem was slowly debated through the summer. And then in October
> jeff@ suggested yet another extension of the KPI, which I have
> implemented and it is described below.
> 
> [for those interested in new sendfile(2), skip to the last paragraph,
> for those willing to review new pager KPI, read on]
> 
> The new KPI offers this prototype for vm_pager_get_pages():
> 
>  int
>  vm_pager_get_pages(vm_object_t object, vm_page_t pages[], int count,
>      int *rbehind, in *rahead);
> 
> Where "count" stands for number of pages in the array. The rbehind
> and rahead if not NULL specify how many pages the caller is willing to
> allow the pager to pre-cache, if the pager can.
> 
> Pager doesn't promise to do any read behind or read ahead. If it does,
> then only the pager is responsive for grabbing, busying, unbusying and
> queueing these pages. It also writes the actual values of completed
> read ahead and read behind back to the pointers.
> 
> Pager promises to page in "count" pages or fail. Pager expects the
> pages to be busied, and returns them busied. For a multi page requests,
> the pager demands that the region is a valid region, that exists in
> the pager, which can be checked by preceding call to vm_pager_haspage().
> For single page requests, there is no such demand.
I fail to understand how the case of count > 1 and non-contiguous blocks
in the non-readahead case is handled by new vnode_pager_generic_getpages().

I do not understand how a hole somewhere in the requested range is handled.
Code has a comment that a hole must not appear in the range.

Both issues mean that vm_pager_has_page() still must be called before
pagein, for count > 1 use.  E.g. the exec_map_first_page() uses *after
value returned from has_pages() to calculate count, which is an advisory
and not the contract.  Same issue prevents converting GEM and TTM (and
probably md) to use the count > 1 KPI.

Same is true for swap pager, and this prevents the removal of the loop
in vm_thread_swaping().

Code assumes that the partially valid page may only appear in the last
position of the page run for the local pager, which again requires
pre-validation of the vm_pager_get_pages() on the caller side.

Overall this is not an KPI that was discussed. It seemingly does not
change semantic for count == 1 case, but is not what it should be for
count > 1. As discussed, new vm_pager_get_pages() was support to just
work for any count, doing the loop over the non-contig ranges or short
reads, and guaranteeing that all existing (or hole-filled) pages are
read until EOF is met. This KPI was supposed to:
- fix my compaints about short reads
- avoid excessive VOP_BMAP() call from has_pages before get_pages()
- allowed to remove the loops from all current get_pages() consumers,
  it vm_thread_swapin(), GEM/TTM, image activator
- supposedly served your needs for sendfile(2)
in one go.  What is present in your patch might only satisfy the last
item of the list.

> 
> The net result is a win for both vm_fault() and for new sendfile().
> 
> The vm_fault() no longer needs to do prepatory vm_pager_haspage(),
> which removes one I/O operation. The logic for read ahead/behind,
> which is strongly UFS/EXT-centric, moves into vnode_pager.c. So
> we no longer do useless operations when having a fault on ZFS.
> 
> The vm_fault() now knows precisely the read ahead that happened,
> when updates fs.entry->next_read index. This reduces number of
> hardfaults by a tiny fraction (measured building world tree).
> 
> The new sendfile() has a stronger KPI, that doesn't unbusy pages,
> that sendfile() needs to be kept busied.
> 
> Also, the new KPI removes some ugly edges. E.g., since the old
> KPI used to unbusy and free pages in the array in case of an
> error, the pages could not be wired. However, there are places in
> kernel where we want to page in into a wired page. These places
> simply violated the assumption, relying on lack of errors in the
> pager. Moreover, the swap pager has a special function to skip
> wired pages, while doing the freeing sweep, to avoid hitting
> assertion. That means passing wired pages to swapper is kinda
> OK, while to any other pager it is not. So, we end up with
> vm_pager_get_pages() being not pager agnostic, while it is
> designed to be so. Now this is fixed.