git: b596b4496568 - main - zfs: zfs_getpages: Don't zero freshly allocated pages

From: Jean-Sébastien Pédron <dumbbell_at_FreeBSD.org>
Date: Mon, 20 Oct 2025 19:17:59 UTC
The branch main has been updated by dumbbell:

URL: https://cgit.FreeBSD.org/src/commit/?id=b596b4496568c21e1e8f9c21be913cfc5b71bd9b

commit b596b4496568c21e1e8f9c21be913cfc5b71bd9b
Author:     Jean-Sébastien Pédron <dumbbell@FreeBSD.org>
AuthorDate: 2025-10-16 08:16:18 +0000
Commit:     Jean-Sébastien Pédron <dumbbell@FreeBSD.org>
CommitDate: 2025-10-20 19:16:43 +0000

    zfs: zfs_getpages: Don't zero freshly allocated pages
    
    Initially, `zfs_getpages()` is provided with an array of busy pages by
    the vnode pager. It then tries to acquire the range lock; if a
    concurrent `zfs_write()` is running and `zfs_getpages()` fails to
    acquire that range lock, it unbusies the pages to avoid a deadlock
    with `zfs_write()`. After that, it grabs the pages again and retries
    acquiring the range lock, and so on.
    
    Once it has the range lock, it filters out the valid pages, then copies
    DMU data to the remaining invalid pages.
    
    The problem is that the freshly allocated, zeroed pages it grabbed
    itself are marked as valid. They are therefore skipped by the second
    part of the function and DMU data is never copied to them. This causes
    mapped pages to contain zeros instead of the expected file content.
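
    Before this change, the retry path looked roughly as follows. This is a
    simplified paraphrase for illustration only, not the verbatim
    `zfs_getpages()` code: read-ahead/read-behind bookkeeping is omitted,
    and `start`, `end`, `count`, `ma`, `object` and `zp` are the function's
    existing arguments and locals.

        zfs_locked_range_t *lr;
        uint64_t blksz, len;

        for (;;) {
            blksz = zp->z_blksz;
            len = roundup(end, blksz) - rounddown(start, blksz);

            lr = zfs_rangelock_tryenter(&zp->z_rangelock,
                rounddown(start, blksz), len, RL_READER);
            if (lr == NULL) {
                /*
                 * A concurrent zfs_write() holds the range lock.  Unbusy
                 * the pages supplied by the vnode pager so zfs_write()
                 * can make progress, then wait for the lock.
                 */
                for (int i = 0; i < count; i++)
                    vm_page_xunbusy(ma[i]);
                lr = zfs_rangelock_enter(&zp->z_rangelock,
                    rounddown(start, blksz), len, RL_READER);

                /*
                 * Grab the pages again.  With VM_ALLOC_ZERO, any page
                 * allocated at this point comes back zero-filled and
                 * marked valid.
                 */
                zfs_vmobject_wlock(object);
                (void) vm_page_grab_pages(object, OFF_TO_IDX(start),
                    VM_ALLOC_NORMAL | VM_ALLOC_WAITOK | VM_ALLOC_ZERO,
                    ma, count);
                zfs_vmobject_wunlock(object);
            }
            /* Retry if the block size changed while waiting. */
            if (blksz == zp->z_blksz)
                break;
            zfs_rangelock_exit(lr);
        }

        /* The valid-page filter later skips such pages, leaving them zeroed. */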
    
    This was discovered while working on RabbitMQ on FreeBSD. I could
    reproduce the problem easily with the following commands:
    
        git clone https://github.com/rabbitmq/rabbitmq-server.git
        cd rabbitmq-server/deps/rabbit
    
        gmake distclean-ct RABBITMQ_METADATA_STORE=mnesia \
          ct-amqp_client t=cluster_size_3:leader_transfer_stream_send
    
    The test suite fails because a sendfile(2) can run concurrently with a
    write(2) on the same file. This leads to sendfile(2), or a read(2)
    issued after the sendfile, sending/returning data that contains zeros,
    which causes a function to crash.
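
    As an illustration of the access pattern involved (and not the RabbitMQ
    test itself), the following hypothetical, stand-alone sketch has one
    thread append non-zero data to a file while another thread maps the
    already-written prefix and scans it for unexpected zeros. The file name
    and sizes are arbitrary, the file must live on a ZFS dataset, and this
    sketch is not guaranteed to trigger the race.

        #include <sys/types.h>
        #include <sys/mman.h>

        #include <err.h>
        #include <fcntl.h>
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        #define CHUNK   4096            /* arbitrary write size */
        #define CHUNKS  4096            /* arbitrary total: 16 MiB */

        static const char *path = "getpages-race.dat"; /* put on ZFS */
        static _Atomic off_t written;   /* bytes written so far */

        static void *
        writer(void *arg)
        {
            char buf[CHUNK];
            int fd;

            if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
                err(1, "open for writing");
            memset(buf, 'A', sizeof(buf));  /* non-zero payload */
            for (int i = 0; i < CHUNKS; i++) {
                if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                    err(1, "write");
                written = (off_t)(i + 1) * CHUNK;
            }
            close(fd);
            return (NULL);
        }

        static void *
        reader(void *arg)
        {
            int fd;

            while ((fd = open(path, O_RDONLY)) < 0)
                usleep(1000);           /* wait for the writer to create it */
            for (;;) {
                off_t limit = written;

                if (limit > 0) {
                    /* Map the already-written prefix and scan it. */
                    char *p = mmap(NULL, (size_t)limit, PROT_READ,
                        MAP_SHARED, fd, 0);
                    if (p == MAP_FAILED)
                        err(1, "mmap");
                    for (off_t i = 0; i < limit; i++)
                        if (p[i] == 0)
                            errx(1, "zero byte at offset %jd", (intmax_t)i);
                    munmap(p, (size_t)limit);
                }
                if (limit >= (off_t)CHUNKS * CHUNK)
                    break;
            }
            close(fd);
            return (NULL);
        }

        int
        main(void)
        {
            pthread_t wt, rt;

            pthread_create(&wt, NULL, writer, NULL);
            pthread_create(&rt, NULL, reader, NULL);
            pthread_join(wt, NULL);
            pthread_join(rt, NULL);
            printf("no unexpected zeros observed\n");
            return (0);
        }

    It can be compiled with `cc getpages-race.c -o getpages-race -lpthread`
    and run from a directory on a ZFS dataset.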
    
    The fix is to stop setting the `VM_ALLOC_ZERO` flag when
    `zfs_getpages()` grabs the pages again. The last page is then zeroed if
    it is invalid, in case it is only partially covered by the end of the
    file content. The other pages are either valid (and will be skipped) or
    will be entirely overwritten with file content.
    
    This patch was submitted to OpenZFS as openzfs/zfs#17851, where it was
    approved.
    
    Reviewed by:    avg, mav
    Obtained from:  OpenZFS
    OpenZFS commit: 8a3533a366e6df2ea770ad7d80b7b68a94a81023
    MFC after:      3 days
    Differential Revision: https://reviews.freebsd.org/D53219
---
 sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c b/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
index e5344497f1be..f34a2fd37a77 100644
--- a/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
+++ b/sys/contrib/openzfs/module/os/freebsd/zfs/zfs_vnops_os.c
@@ -4223,8 +4223,20 @@ zfs_getpages(struct vnode *vp, vm_page_t *ma, int count, int *rbehind,
 
 			zfs_vmobject_wlock(object);
 			(void) vm_page_grab_pages(object, OFF_TO_IDX(start),
-			    VM_ALLOC_NORMAL | VM_ALLOC_WAITOK | VM_ALLOC_ZERO,
+			    VM_ALLOC_NORMAL | VM_ALLOC_WAITOK,
 			    ma, count);
+			if (!vm_page_all_valid(ma[count - 1])) {
+				/*
+				 * Later in this function, we copy DMU data to
+				 * invalid pages only. The last page may not be
+				 * entirely filled though, if the file does not
+				 * end on a page boundary. Therefore, we zero
+				 * that last page here to make sure it does not
+				 * contain garbage after the end of file.
+				 */
+				ASSERT(vm_page_none_valid(ma[count - 1]));
+				vm_page_zero_invalid(ma[count - 1], FALSE);
+			}
 			zfs_vmobject_wunlock(object);
 		}
 		if (blksz == zp->z_blksz)