vop_getpages for zfs

jhell jhell at DataIX.net
Fri Nov 5 22:17:04 UTC 2010


On 11/04/2010 12:56, Andriy Gapon wrote:
> on 10/10/2010 16:47 Kostik Belousov said the following:
>> I think that readahead requests should be satisfied, if possible. Esp.
>> in the case the readahead does not cause additional i/o.
> 
> So, per Kostik's suggestion, here is patch for your review and testing:
> http://people.freebsd.org/~avg/zfs_getpages.2.diff
> 
> The patch tries to fill as many of the passed-in pages as possible, as long as
> that can be done from a block (or even two blocks) that has to be read anyway
> to fill the requested page.
> This means that the optimization kicks in only if a vnode has a block size
> greater than the page size.
> I mentioned "two blocks" above because I tried to optimize the case when, for
> example, the block size is 6KB, the page size is 4KB, and the requested page
> covers parts of two adjacent blocks (e.g. the range from 4KB to 8KB).  Not sure
> if this was worth the trouble.
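
A quick user-space sketch of that two-block case (illustrative only; the
function and names below are made up for this example, they are not from the
patch):

    #include <stdio.h>

    /*
     * Which blocks does the page starting at pgoff overlap?  With a 6KB
     * block size and a 4KB page size, the page at offset 4096 spans
     * blocks 0 and 1, so both blocks have to be read to fill that page.
     */
    static void
    page_block_span(long pgoff, long pgsize, long blksize,
        long *firstblk, long *lastblk)
    {
            *firstblk = pgoff / blksize;
            *lastblk = (pgoff + pgsize - 1) / blksize;
    }

    int
    main(void)
    {
            long first, last;

            page_block_span(4096, 4096, 6144, &first, &last);
            printf("page at 4KB spans blocks %ld..%ld\n", first, last);
            return (0);
    }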
> 
> In other words, the code tries to avoid reading more blocks than is strictly
> required, because additional blocks may be far apart on a disk and thus read
> latency could increase.
> 
> You will also notice that I added a vop_bmap method to ZFS vops.
> This is needed to satisfy the logic in vnode_pager_haspage(); otherwise, for
> example, vm_fault() would never request any additional pages besides the
> required one.
> The zfs_bmap() implementation is really simple: it returns zeros for a_runp
> and a_runb (and some reasonable-ish but unused values in the other fields),
> and vnode_pager_haspage() figures out the number of eligible pages from the
> block size (f_iosize, actually[*]) and the page size.
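
A hedged kernel-side sketch of what a bmap method of that shape might look
like; the field handling below is my guess from the description above, not the
actual zfs_bmap():

    #include <sys/param.h>
    #include <sys/vnode.h>
    #include <sys/bufobj.h>

    /*
     * Sketch only: report no read-ahead/read-behind runs and fill the
     * remaining fields with plausible but unused values, leaving it to
     * vnode_pager_haspage() to size the page run from f_iosize.
     */
    static int
    zfs_bmap_sketch(struct vop_bmap_args *ap)
    {

            if (ap->a_bop != NULL)
                    *ap->a_bop = &ap->a_vp->v_bufobj;
            if (ap->a_bnp != NULL)
                    *ap->a_bnp = ap->a_bn;          /* identity mapping */
            if (ap->a_runp != NULL)
                    *ap->a_runp = 0;                /* no read-ahead run */
            if (ap->a_runb != NULL)
                    *ap->a_runb = 0;                /* no read-behind run */
            return (0);
    }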
> 
> Now that ZFS has a vop_getpages method, it is safe to provide this dummy-ish
> implementation of vop_bmap, because the operation will not be called from any
> place other than vnode_pager_haspage().  Code from vfs_bio.c and vfs_cluster.c
> is never called for ZFS, because ZFS does not use the buffer cache layer.
> 
> [*] There are a few interesting questions with respect to f_iosize usage in
> vnode_pager_haspage() and ZFS.
> For one, ZFS has a variable block size, and f_iosize for ZFS holds the
> currently configured maximum block ('record' in ZFS terms) size.  This should
> not result in non-optimal behavior for files with smaller block sizes, because
> those files would have smaller sizes too (within the block size).
> Unless, of course, the recordsize property is modified after some files are
> created on a ZFS filesystem.  In that case there could be large files with
> some smaller (previously maximum) block size, or files with a block size
> larger than the current recordsize == f_iosize.  Such situations may lead to
> sub-optimal performance.
> Another issue related to the above is that f_iosize for ZFS can change on the
> fly when an administrator runs zfs set recordsize=NNN, and that may happen
> concurrently with vnode_pager_haspage() using f_iosize for its calculations.
> Not sure how to handle this correctly.
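
To make the f_iosize arithmetic concrete, here is a rough user-space rendering
of the eligible-page calculation (names below are made up; the real
vnode_pager_haspage() also clamps the run at end of file and adds the runs
reported by bmap, which are zero here):

    #include <stdio.h>

    /*
     * With a_runp/a_runb reported as zero, the run of eligible pages
     * reduces to the pages that share the requested page's block, where
     * the block size comes from f_iosize.
     */
    static void
    eligible_pages(long pindex, long iosize, long pagesize,
        long *before, long *after)
    {
            long pagesperblock = iosize / pagesize;

            *before = pindex % pagesperblock;
            *after = pagesperblock - *before - 1;
    }

    int
    main(void)
    {
            long before, after;

            /* 128KB recordsize, 4KB pages, requested page index 5. */
            eligible_pages(5, 131072, 4096, &before, &after);
            printf("before=%ld after=%ld\n", before, after);  /* 5 and 26 */
            return (0);
    }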
> 
> I will appreciate your comments, test reports, reviews and suggestions.
> Thanks a lot!

Hey Andriy

Thought I would give you a heads up on this after seeing the post about
zfs_getpages.diff.

I applied this patch a while back, after seeing it posted to <you_know_where>@,
and got reliable crash dumps from it: basically a panic in vm_page_unwire or
something like that.  I believe the following is the backtrace from that.

(kgdb) #0  doadump () at pcpu.h:231
#1  0x80675251 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:419
#2  0x806754e5 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:592
#3  0x808e27ce in vm_page_unwire (m=0x816b46e0, activate=1)
    at /usr/src/sys/vm/vm_page.c:1564
#4  0x808d123a in vm_fault_unwire (map=0x81690088, start=2568372224,
    end=2568433664, fictitious=0) at /usr/src/sys/vm/vm_fault.c:1123
#5  0x808d8a33 in vm_map_delete (map=0x81690088, start=2568372224,
    end=2568433664) at /usr/src/sys/vm/vm_map.c:2619
#6  0x808d8d0d in vm_map_remove (map=0x81690088, start=2568372224,
end=Variable "end" is not available.
)
    at /usr/src/sys/vm/vm_map.c:2801
#7  0x808d6360 in kmem_free (map=0x81690088, addr=2568372224, size=61440)
    at /usr/src/sys/vm/vm_kern.c:211
#8  0x808cb601 in page_free (mem=0x99164000, size=61440, flags=34 '"')
    at /usr/src/sys/vm/uma_core.c:1069
#9  0x808ccf92 in uma_large_free (slab=0x85be7e6c)
    at /usr/src/sys/vm/uma_core.c:3021
#10 0x8065f7a5 in free (addr=0x99164000, mtp=0x80d8e114)
    at /usr/src/sys/kern/kern_malloc.c:506
#11 0x80cd49db in zil_lwb_write_done () from /boot/kernel/zfs.ko
#12 0x80cd99b1 in zio_done () from /boot/kernel/zfs.ko
#13 0x80cd7d3a in zio_execute () from /boot/kernel/zfs.ko
#14 0x80cd7f2e in zio_notify_parent () from /boot/kernel/zfs.ko
#15 0x80cd9a31 in zio_done () from /boot/kernel/zfs.ko
#16 0x80cd7d3a in zio_execute () from /boot/kernel/zfs.ko
#17 0x80c6656b in taskq_run_safe () from /boot/kernel/zfs.ko
#18 0x806af812 in taskqueue_run (queue=0x85a0ac40)
    at /usr/src/sys/kern/subr_taskqueue.c:239
#19 0x806afa07 in taskqueue_thread_loop (arg=0x85a3f830)
    at /usr/src/sys/kern/subr_taskqueue.c:360
#20 0x80647377 in fork_exit (callout=0x806af94a <taskqueue_thread_loop>,
    arg=0x85a3f830, frame=0xb4439d38) at /usr/src/sys/kern/kern_fork.c:845
#21 0x809126a4 in fork_trampoline () at
/usr/src/sys/i386/i386/exception.s:273
(kgdb)

And that coincided with the dates when I added the patch after seeing it on
the list:
changeset:   351:f1ca4eb51520
user:        J. Hellenthal <jhell at DataIX.net>
date:        Sun Oct 10 22:57:24 2010 -0400
summary:     Remove the zfs_getpages patch from Andriy Gapon

changeset:   350:bb885c047f0a
user:        J. Hellenthal <jhell at DataIX.net>
date:        Sun Oct 10 22:27:31 2010 -0400
summary:     zfs_getpages improvement from Andriy Gapon

If you would like, I can apply this patch again to provide more information if
it's needed, but at the moment I do not have it applied, nor do I have the core
that was generated.

-- 

 jhell,v

