svn commit: r293439 - in head: lib/libc/sys sys/dev/ti sys/kern sys/sys usr.bin/netstat

Fri Jan 8 20:34:58 UTC 2016

Author: glebius
Date: Fri Jan  8 20:34:57 2016
New Revision: 293439
URL: https://svnweb.freebsd.org/changeset/base/293439

Log:
  New sendfile(2) syscall. A joint effort of NGINX and Netflix from 2013 and
  up to now.
  
  The new sendfile is the code that Netflix uses to send their multiple tens
  of gigabits of data per second. The new implementation features asynchronous
  I/O, when I/O operations are launched, but not awaited to be complete. An
  explanation of why such behavior is beneficial compared to old one is
  going to be too long for a commit message, so we will skip it here.
  
  Additional features of new syscall are extra flags, which provide an
  application more control over data sent. The SF_NOCACHE flag tells
  kernel that data shouldn't be cached after it was sent. The SF_READAHEAD()
  macro allows to specify readahead size in pages.
  
  The new syscalls is a drop in replacement. No modifications are required
  to applications. One can take nginx binary for stable/10 and run it
  successfully on head. Although SF_NODISKIO lost its original sense, as now
  sendfile doesn't block, and now means something completely different (tm),
  using the new sendfile the old way is absolutely safe.
  
  Celebrates:	Netflix global launch!
  Sponsored by:	Nginx, Inc.
  Sponsored by:	Netflix
  Relnotes:	yes

Modified:
  head/lib/libc/sys/sendfile.2
  head/sys/dev/ti/if_ti.c
  head/sys/kern/uipc_mbuf.c
  head/sys/kern/uipc_syscalls.c
  head/sys/sys/mbuf.h
  head/sys/sys/sf_buf.h
  head/sys/sys/socket.h
  head/usr.bin/netstat/mbuf.c

Modified: head/lib/libc/sys/sendfile.2
==============================================================================

--- head/lib/libc/sys/sendfile.2	Fri Jan  8 19:47:49 2016	(r293438)
+++ head/lib/libc/sys/sendfile.2	Fri Jan  8 20:34:57 2016	(r293439)
@@ -25,7 +25,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd January 7, 2010
+.Dd January 7, 2016
 .Dt SENDFILE 2
 .Os
 .Sh NAME
@@ -46,7 +46,7 @@
 The
 .Fn sendfile
 system call
-sends a regular file specified by descriptor
+sends a regular file or shared memory object specified by descriptor
 .Fa fd
 out a stream socket specified by descriptor
 .Fa s .
@@ -101,32 +101,55 @@ the system will write the total number o
 variable pointed to by
 .Fa sbytes .
 .Pp
-The
+The least significant 16 bits of
 .Fa flags
 argument is a bitmap of these values:
-.Bl -item -offset indent
-.It
-.Dv SF_NODISKIO .
-This flag causes any
-.Fn sendfile
-call which would block on disk I/O to instead
-return
-.Er EBUSY .
-Busy servers may benefit by transferring requests that would
-block to a separate I/O worker thread.
-.It
-.Dv SF_MNOWAIT .
-Do not wait for some kernel resource to become available,
-in particular,
-.Vt mbuf
-and
-.Vt sf_buf .
-The flag does not make the
-.Fn sendfile
-syscall truly non-blocking, since other resources are still allocated
-in a blocking fashion.
-.It
-.Dv SF_SYNC .
+.Bl -tag -offset indent
+.It Dv SF_NODISKIO
+This flag causes
+.Nm
+to return
+.Er EBUSY
+instead of blocking when a busy page is encountered.
+This rare situation can happen if some other process is now working
+with the same region of the file.
+It is advised to retry the operation after a short period.
+.Pp
+Note that in older
+.Fx
+versions the
+.Dv SF_NODISKIO
+had slightly different notion.
+The flag prevented
+.Nm
+to run I/O operations in case if an invalid (not cached) page is encountered,
+thus avoiding blocking on I/O.
+Starting with
+.Fx 11
+.Nm
+sending files off the
+.Xr ffs 7
+filesystem doesn't block on I/O
+(see 
+.Sx IMPLEMENTATION NOTES
+), so the condition no longer applies.
+However, it is safe if an application utilizes
+.Dv SF_NODISKIO
+and on
+.Er EBUSY
+performs the same action as it did in
+older
+.Fx
+versions, e.g.
+.Xr aio_read 2,
+.Xr read 2
+or
+.Nm
+in a different context.
+.It Dv SF_NOCACHE
+The data sent to socket will not be cached by the virtual memory system,
+and will be freed directly to the pool of free pages.
+.It Dv SF_SYNC
 .Nm
 sleeps until the network stack no longer references the VM pages
 of the file, making subsequent modifications to it safe.
@@ -134,6 +157,22 @@ Please note that this is not a guarantee
 been sent.
 .El
 .Pp
+The most significant 16 bits of
+.Fa flags
+specify amount of pages that
+.Nm
+may read ahead when reading the file.
+A macro
+.Fn SF_FLAGS
+is provided to combine readahead amount and flags.
+Example shows specifing readahead of 16 pages and
+.Dv SF_NOCACHE
+flag:
+.Pp
+.Bd -literal -offset indent -compact
+	SF_FLAGS(16, SF_NOCACHE)
+.Ed
+.Pp
 When using a socket marked for non-blocking I/O,
 .Fn sendfile
 may send fewer bytes than requested.
@@ -149,6 +188,18 @@ The
 .Fx
 implementation of
 .Fn sendfile
+doesn't block on disk I/O when it sends a file off the
+.Xr ffs 7
+filesystem.
+The syscall returns success before the actual I/O completes, and data
+is put into the socket later unattended.
+However, the order of data in the socket is preserved, so it is safe
+to do further writes to the socket.
+.Pp
+The
+.Fx
+implementation of
+.Fn sendfile
 is "zero-copy", meaning that it has been optimized so that copying of the file data is avoided.
 .Sh TUNING
 On some architectures, this system call internally uses a special
@@ -232,12 +283,10 @@ The
 argument
 is not a valid socket descriptor.
 .It Bq Er EBUSY
-Completing the entire transfer would have required disk I/O, so
-it was aborted.
-Partial data may have been sent.
-(This error can only occur when
+A busy page was encountered and
 .Dv SF_NODISKIO
-is specified.)
+had been specified.
+Partial data may have been sent.
 .It Bq Er EFAULT
 An invalid address was specified for an argument.
 .It Bq Er EINTR
@@ -310,9 +359,19 @@ first appeared in
 .Fx 3.0 .
 This manual page first appeared in
 .Fx 3.1 .
+In
+.Fx 10
+support for sending shared memory descriptors had been introduced.
+In
+.Fx 11
+a non-blocking implementation had been introduced.
 .Sh AUTHORS
-The
+The initial implementation of
 .Fn sendfile
 system call
 and this manual page were written by
 .An David G. Lawrence Aq Mt dg at dglawrence.com .
+The
+.Fx 11
+implementation was written by
+.An Gleb Smirnoff Aq Mt glebius at FreeBSD.org .

Modified: head/sys/dev/ti/if_ti.c
==============================================================================
--- head/sys/dev/ti/if_ti.c	Fri Jan  8 19:47:49 2016	(r293438)
+++ head/sys/dev/ti/if_ti.c	Fri Jan  8 20:34:57 2016	(r293439)
@@ -1634,7 +1634,7 @@ ti_newbuf_jumbo(struct ti_softc *sc, int
 			m[i]->m_data = (void *)sf_buf_kva(sf[i]);
 			m[i]->m_len = PAGE_SIZE;
 			MEXTADD(m[i], sf_buf_kva(sf[i]), PAGE_SIZE,
-			    sf_buf_mext, (void*)sf_buf_kva(sf[i]), sf[i],
+			    sf_mext_free, (void*)sf_buf_kva(sf[i]), sf[i],
 			    0, EXT_DISPOSABLE);
 			m[i]->m_next = m[i+1];
 		}
@@ -1699,7 +1699,7 @@ nobufs:
 		if (m[i])
 			m_freem(m[i]);
 		if (sf[i])
-			sf_buf_mext((void *)sf_buf_kva(sf[i]), sf[i]);
+			sf_mext_free((void *)sf_buf_kva(sf[i]), sf[i]);
 	}
 	return (ENOBUFS);
 }

Modified: head/sys/kern/uipc_mbuf.c
==============================================================================
--- head/sys/kern/uipc_mbuf.c	Fri Jan  8 19:47:49 2016	(r293438)
+++ head/sys/kern/uipc_mbuf.c	Fri Jan  8 20:34:57 2016	(r293439)
@@ -338,6 +338,9 @@ mb_free_ext(struct mbuf *m)
 	case EXT_SFBUF:
 		sf_ext_free(m->m_ext.ext_arg1, m->m_ext.ext_arg2);
 		break;
+	case EXT_SFBUF_NOCACHE:
+		sf_ext_free_nocache(m->m_ext.ext_arg1, m->m_ext.ext_arg2);
+		break;
 	default:
 		KASSERT(m->m_ext.ext_cnt != NULL,
 		    ("%s: no refcounting pointer on %p", __func__, m));
@@ -404,6 +407,7 @@ mb_dupcl(struct mbuf *n, const struct mb
 
 	switch (m->m_ext.ext_type) {
 	case EXT_SFBUF:
+	case EXT_SFBUF_NOCACHE:
 		sf_ext_ref(m->m_ext.ext_arg1, m->m_ext.ext_arg2);
 		break;
 	default:

Modified: head/sys/kern/uipc_syscalls.c
==============================================================================
--- head/sys/kern/uipc_syscalls.c	Fri Jan  8 19:47:49 2016	(r293438)
+++ head/sys/kern/uipc_syscalls.c	Fri Jan  8 20:34:57 2016	(r293439)
@@ -113,15 +113,6 @@ static int getpeername1(struct thread *t
 
 counter_u64_t sfstat[sizeof(struct sfstat) / sizeof(uint64_t)];
 
-/*
- * sendfile(2)-related variables and associated sysctls
- */
-static SYSCTL_NODE(_kern_ipc, OID_AUTO, sendfile, CTLFLAG_RW, 0,
-    "sendfile(2) tunables");
-static int sfreadahead = 1;
-SYSCTL_INT(_kern_ipc_sendfile, OID_AUTO, readahead, CTLFLAG_RW,
-    &sfreadahead, 0, "Number of sendfile(2) read-ahead MAXBSIZE blocks");
-
 static void
 sfstat_init(const void *unused)
 {
@@ -1858,13 +1849,12 @@ sf_ext_free(void *arg1, void *arg2)
 	sf_buf_free(sf);
 
 	vm_page_lock(pg);
-	vm_page_unwire(pg, PQ_INACTIVE);
 	/*
 	 * Check for the object going away on us. This can
 	 * happen since we don't hold a reference to it.
 	 * If so, we're responsible for freeing the page.
 	 */
-	if (pg->wire_count == 0 && pg->object == NULL)
+	if (vm_page_unwire(pg, PQ_INACTIVE) && pg->object == NULL)
 		vm_page_free(pg);
 	vm_page_unlock(pg);
 
@@ -1878,6 +1868,43 @@ sf_ext_free(void *arg1, void *arg2)
 }
 
 /*
+ * Same as above, but forces the page to be detached from the object
+ * and go into free pool.
+ */
+void
+sf_ext_free_nocache(void *arg1, void *arg2)
+{
+	struct sf_buf *sf = arg1;
+	struct sendfile_sync *sfs = arg2;
+	vm_page_t pg = sf_buf_page(sf);
+
+	sf_buf_free(sf);
+
+	vm_page_lock(pg);
+	if (vm_page_unwire(pg, PQ_NONE)) {
+		vm_object_t obj;
+
+		/* Try to free the page, but only if it is cheap to. */
+		if ((obj = pg->object) == NULL)
+			vm_page_free(pg);
+		else if (!vm_page_xbusied(pg) && VM_OBJECT_TRYWLOCK(obj)) {
+			vm_page_free(pg);
+			VM_OBJECT_WUNLOCK(obj);
+		} else
+			vm_page_deactivate(pg);
+	}
+	vm_page_unlock(pg);
+
+	if (sfs != NULL) {
+		mtx_lock(&sfs->mtx);
+		KASSERT(sfs->count > 0, ("Sendfile sync botchup count == 0"));
+		if (--sfs->count == 0)
+			cv_signal(&sfs->cv);
+		mtx_unlock(&sfs->mtx);
+	}
+}
+
+/*
  * sendfile(2)
  *
  * int sendfile(int fd, int s, off_t offset, size_t nbytes,
@@ -1974,103 +2001,252 @@ freebsd4_sendfile(struct thread *td, str
 }
 #endif /* COMPAT_FREEBSD4 */
 
+ /*
+  * How much data to put into page i of n.
+  * Only first and last pages are special.
+  */
+static inline off_t
+xfsize(int i, int n, off_t off, off_t len)
+{
+
+	if (i == 0)
+		return (omin(PAGE_SIZE - (off & PAGE_MASK), len));
+
+	if (i == n - 1 && ((off + len) & PAGE_MASK) > 0)
+		return ((off + len) & PAGE_MASK);
+
+	return (PAGE_SIZE);
+}
+
+/*
+ * Offset within object for i page.
+ */
+static inline vm_offset_t
+vmoff(int i, off_t off)
+{
+
+	if (i == 0)
+		return ((vm_offset_t)off);
+
+	return (trunc_page(off + i * PAGE_SIZE));
+}
+
+/*
+ * Pretend as if we don't have enough space, subtract xfsize() of
+ * all pages that failed.
+ */
+static inline void
+fixspace(int old, int new, off_t off, int *space)
+{
+
+	KASSERT(old > new, ("%s: old %d new %d", __func__, old, new));
+
+	/* Subtract last one. */
+	*space -= xfsize(old - 1, old, off, *space);
+	old--;
+
+	if (new == old)
+		/* There was only one page. */
+		return;
+
+	/* Subtract first one. */
+	if (new == 0) {
+		*space -= xfsize(0, old, off, *space);
+		new++;
+	}
+
+	/* Rest of pages are full sized. */
+	*space -= (old - new) * PAGE_SIZE;
+
+	KASSERT(*space >= 0, ("%s: space went backwards", __func__));
+}
+
+/*
+ * Structure describing a single sendfile(2) I/O, which may consist of
+ * several underlying pager I/Os.
+ *
+ * The syscall context allocates the structure and initializes 'nios'
+ * to 1.  As sendfile_swapin() runs through pages and starts asynchronous
+ * paging operations, it increments 'nios'.
+ *
+ * Every I/O completion calls sf_iodone(), which decrements the 'nios', and
+ * the syscall also calls sf_iodone() after allocating all mbufs, linking them
+ * and sending to socket.  Whoever reaches zero 'nios' is responsible to
+ * call pru_ready on the socket, to notify it of readyness of the data.
+ */
+struct sf_io {
+	volatile u_int	nios;
+	u_int		error;
+	int		npages;
+	struct file	*sock_fp;
+	struct mbuf	*m;
+	vm_page_t	pa[];
+};
+
+static void
+sf_iodone(void *arg, vm_page_t *pg, int count, int error)
+{
+	struct sf_io *sfio = arg;
+	struct socket *so;
+
+	for (int i = 0; i < count; i++)
+		vm_page_xunbusy(pg[i]);
+
+	if (error)
+		sfio->error = error;
+
+	if (!refcount_release(&sfio->nios))
+		return;
+
+	so = sfio->sock_fp->f_data;
+
+	if (sfio->error) {
+		struct mbuf *m;
+
+		/*
+		 * I/O operation failed.  The state of data in the socket
+		 * is now inconsistent, and all what we can do is to tear
+		 * it down. Protocol abort method would tear down protocol
+		 * state, free all ready mbufs and detach not ready ones.
+		 * We will free the mbufs corresponding to this I/O manually.
+		 *
+		 * The socket would be marked with EIO and made available
+		 * for read, so that application receives EIO on next
+		 * syscall and eventually closes the socket.
+		 */
+		so->so_proto->pr_usrreqs->pru_abort(so);
+		so->so_error = EIO;
+
+		m = sfio->m;
+		for (int i = 0; i < sfio->npages; i++)
+			m = m_free(m);
+	} else {
+		CURVNET_SET(so->so_vnet);
+		(void )(so->so_proto->pr_usrreqs->pru_ready)(so, sfio->m,
+		    sfio->npages);
+		CURVNET_RESTORE();
+	}
+
+	/* XXXGL: curthread */
+	fdrop(sfio->sock_fp, curthread);
+	free(sfio, M_TEMP);
+}
+
+/*
+ * Iterate through pages vector and request paging for non-valid pages.
+ */
 static int
-sendfile_readpage(vm_object_t obj, struct vnode *vp, int nd,
-    off_t off, int xfsize, int bsize, struct thread *td, vm_page_t *res)
+sendfile_swapin(vm_object_t obj, struct sf_io *sfio, off_t off, off_t len,
+    int npages, int rhpages, int flags)
 {
-	vm_page_t m;
-	vm_pindex_t pindex;
-	ssize_t resid;
-	int error, readahead, rv;
+	vm_page_t *pa = sfio->pa;
+	int nios;
 
-	pindex = OFF_TO_IDX(off);
-	VM_OBJECT_WLOCK(obj);
-	m = vm_page_grab(obj, pindex, (vp != NULL ? VM_ALLOC_NOBUSY |
-	    VM_ALLOC_IGN_SBUSY : 0) | VM_ALLOC_WIRED | VM_ALLOC_NORMAL);
+	nios = 0;
+	flags = (flags & SF_NODISKIO) ? VM_ALLOC_NOWAIT : 0;
 
 	/*
-	 * Check if page is valid for what we need, otherwise initiate I/O.
-	 *
-	 * The non-zero nd argument prevents disk I/O, instead we
-	 * return the caller what he specified in nd.  In particular,
-	 * if we already turned some pages into mbufs, nd == EAGAIN
-	 * and the main function send them the pages before we come
-	 * here again and block.
+	 * First grab all the pages and wire them.  Note that we grab
+	 * only required pages.  Readahead pages are dealt with later.
 	 */
-	if (m->valid != 0 && vm_page_is_valid(m, off & PAGE_MASK, xfsize)) {
-		if (vp == NULL)
-			vm_page_xunbusy(m);
-		VM_OBJECT_WUNLOCK(obj);
-		*res = m;
-		return (0);
-	} else if (nd != 0) {
-		if (vp == NULL)
-			vm_page_xunbusy(m);
-		error = nd;
-		goto free_page;
+	VM_OBJECT_WLOCK(obj);
+	for (int i = 0; i < npages; i++) {
+		pa[i] = vm_page_grab(obj, OFF_TO_IDX(vmoff(i, off)),
+		    VM_ALLOC_WIRED | VM_ALLOC_NORMAL | flags);
+		if (pa[i] == NULL) {
+			npages = i;
+			rhpages = 0;
+			break;
+		}
 	}
 
-	/*
-	 * Get the page from backing store.
-	 */
-	error = 0;
-	if (vp != NULL) {
-		VM_OBJECT_WUNLOCK(obj);
-		readahead = sfreadahead * MAXBSIZE;
+	for (int i = 0; i < npages;) {
+		int j, a, count, rv;
+
+		/* Skip valid pages. */
+		if (vm_page_is_valid(pa[i], vmoff(i, off) & PAGE_MASK,
+		    xfsize(i, npages, off, len))) {
+			vm_page_xunbusy(pa[i]);
+			SFSTAT_INC(sf_pages_valid);
+			i++;
+			continue;
+		}
 
 		/*
-		 * Use vn_rdwr() instead of the pager interface for
-		 * the vnode, to allow the read-ahead.
-		 *
-		 * XXXMAC: Because we don't have fp->f_cred here, we
-		 * pass in NOCRED.  This is probably wrong, but is
-		 * consistent with our original implementation.
+		 * Now 'i' points to first invalid page, iterate further
+		 * to make 'j' point at first valid after a bunch of
+		 * invalid ones.
 		 */
-		error = vn_rdwr(UIO_READ, vp, NULL, readahead, trunc_page(off),
-		    UIO_NOCOPY, IO_NODELOCKED | IO_VMIO | ((readahead /
-		    bsize) << IO_SEQSHIFT), td->td_ucred, NOCRED, &resid, td);
-		SFSTAT_INC(sf_iocnt);
-		VM_OBJECT_WLOCK(obj);
-	} else {
-		if (vm_pager_has_page(obj, pindex, NULL, NULL)) {
-			rv = vm_pager_get_pages(obj, &m, 1, NULL, NULL);
-			SFSTAT_INC(sf_iocnt);
-			if (rv != VM_PAGER_OK) {
-				vm_page_lock(m);
-				vm_page_free(m);
-				vm_page_unlock(m);
-				m = NULL;
-				error = EIO;
+		for (j = i + 1; j < npages; j++)
+			if (vm_page_is_valid(pa[j], vmoff(j, off) & PAGE_MASK,
+			    xfsize(j, npages, off, len))) {
+				SFSTAT_INC(sf_pages_valid);
+				break;
 			}
-		} else {
-			pmap_zero_page(m);
-			m->valid = VM_PAGE_BITS_ALL;
-			m->dirty = 0;
+
+		/*
+		 * Now we got region of invalid pages between 'i' and 'j'.
+		 * Check that they belong to pager.  They may not be there,
+		 * which is a regular situation for shmem pager.  For vnode
+		 * pager this happens only in case of sparse file.
+		 *
+		 * Important feature of vm_pager_has_page() is the hint
+		 * stored in 'a', about how many pages we can pagein after
+		 * this page in a single I/O.
+		 */
+		while (!vm_pager_has_page(obj, OFF_TO_IDX(vmoff(i, off)),
+		    NULL, &a) && i < j) {
+			pmap_zero_page(pa[i]);
+			pa[i]->valid = VM_PAGE_BITS_ALL;
+			pa[i]->dirty = 0;
+			vm_page_xunbusy(pa[i]);
+			i++;
 		}
-		if (m != NULL)
-			vm_page_xunbusy(m);
-	}
-	if (error == 0) {
-		*res = m;
-	} else if (m != NULL) {
-free_page:
-		vm_page_lock(m);
-		vm_page_unwire(m, PQ_INACTIVE);
+		if (i == j)
+			continue;
 
 		/*
-		 * See if anyone else might know about this page.  If
-		 * not and it is not valid, then free it.
+		 * We want to pagein as many pages as possible, limited only
+		 * by the 'a' hint and actual request.
+		 *
+		 * We should not pagein into already valid page, thus if
+		 * 'j' didn't reach last page, trim by that page.
+		 *
+		 * When the pagein fulfils the request, also specify readahead.
 		 */
-		if (m->wire_count == 0 && m->valid == 0 && !vm_page_busied(m))
-			vm_page_free(m);
-		vm_page_unlock(m);
-	}
-	KASSERT(error != 0 || (m->wire_count > 0 &&
-	    vm_page_is_valid(m, off & PAGE_MASK, xfsize)),
-	    ("wrong page state m %p off %#jx xfsize %d", m, (uintmax_t)off,
-	    xfsize));
+		if (j < npages)
+			a = min(a, j - i - 1);
+		count = min(a + 1, npages - i);
+
+		refcount_acquire(&sfio->nios);
+		rv = vm_pager_get_pages_async(obj, pa + i, count, NULL,
+		    i + count == npages ? &rhpages : NULL,
+		    &sf_iodone, sfio);
+		KASSERT(rv == VM_PAGER_OK, ("%s: pager fail obj %p page %p",
+		    __func__, obj, pa[i]));
+
+		SFSTAT_INC(sf_iocnt);
+		SFSTAT_ADD(sf_pages_read, count);
+		if (i + count == npages)
+			SFSTAT_ADD(sf_rhpages_read, rhpages);
+
+#ifdef INVARIANTS
+		for (j = i; j < i + count && j < npages; j++)
+			KASSERT(pa[j] == vm_page_lookup(obj,
+			    OFF_TO_IDX(vmoff(j, off))),
+			    ("pa[j] %p lookup %p\n", pa[j],
+			    vm_page_lookup(obj, OFF_TO_IDX(vmoff(j, off)))));
+#endif
+		i += count;
+		nios++;
+	}
+
 	VM_OBJECT_WUNLOCK(obj);
-	return (error);
+
+	if (nios == 0 && npages != 0)
+		SFSTAT_INC(sf_noiocnt);
+
+	return (nios);
 }
 
 static int
@@ -2178,42 +2354,37 @@ vn_sendfile(struct file *fp, int sockfd,
 	struct vnode *vp;
 	struct vm_object *obj;
 	struct socket *so;
-	struct mbuf *m;
+	struct mbuf *m, *mh, *mhtail;
 	struct sf_buf *sf;
-	struct vm_page *pg;
 	struct shmfd *shmfd;
 	struct sendfile_sync *sfs;
 	struct vattr va;
-	off_t off, xfsize, fsbytes, sbytes, rem, obj_size;
-	int error, bsize, nd, hdrlen, mnw;
+	off_t off, sbytes, rem, obj_size;
+	int error, softerr, bsize, hdrlen;
 
-	pg = NULL;
 	obj = NULL;
 	so = NULL;
-	m = NULL;
+	m = mh = NULL;
 	sfs = NULL;
-	fsbytes = sbytes = 0;
-	hdrlen = mnw = 0;
-	rem = nbytes;
-	obj_size = 0;
+	sbytes = 0;
+	softerr = 0;
 
 	error = sendfile_getobj(td, fp, &obj, &vp, &shmfd, &obj_size, &bsize);
 	if (error != 0)
 		return (error);
-	if (rem == 0)
-		rem = obj_size;
 
 	error = kern_sendfile_getsock(td, sockfd, &sock_fp, &so);
 	if (error != 0)
 		goto out;
 
-	/*
-	 * Do not wait on memory allocations but return ENOMEM for
-	 * caller to retry later.
-	 * XXX: Experimental.
-	 */
-	if (flags & SF_MNOWAIT)
-		mnw = 1;
+#ifdef MAC
+	error = mac_socket_check_send(td->td_ucred, so);
+	if (error != 0)
+		goto out;
+#endif
+
+	SFSTAT_INC(sf_syscalls);
+	SFSTAT_ADD(sf_rhpages_requested, SF_READAHEAD(flags));
 
 	if (flags & SF_SYNC) {
 		sfs = malloc(sizeof *sfs, M_TEMP, M_WAITOK | M_ZERO);
@@ -2221,37 +2392,27 @@ vn_sendfile(struct file *fp, int sockfd,
 		cv_init(&sfs->cv, "sendfile");
 	}
 
-#ifdef MAC
-	error = mac_socket_check_send(td->td_ucred, so);
-	if (error != 0)
-		goto out;
-#endif
-
 	/* If headers are specified copy them into mbufs. */
-	if (hdr_uio != NULL) {
+	if (hdr_uio != NULL && hdr_uio->uio_resid > 0) {
 		hdr_uio->uio_td = td;
 		hdr_uio->uio_rw = UIO_WRITE;
-		if (hdr_uio->uio_resid > 0) {
-			/*
-			 * In FBSD < 5.0 the nbytes to send also included
-			 * the header.  If compat is specified subtract the
-			 * header size from nbytes.
-			 */
-			if (kflags & SFK_COMPAT) {
-				if (nbytes > hdr_uio->uio_resid)
-					nbytes -= hdr_uio->uio_resid;
-				else
-					nbytes = 0;
-			}
-			m = m_uiotombuf(hdr_uio, (mnw ? M_NOWAIT : M_WAITOK),
-			    0, 0, 0);
-			if (m == NULL) {
-				error = mnw ? EAGAIN : ENOBUFS;
-				goto out;
-			}
-			hdrlen = m_length(m, NULL);
+		/*
+		 * In FBSD < 5.0 the nbytes to send also included
+		 * the header.  If compat is specified subtract the
+		 * header size from nbytes.
+		 */
+		if (kflags & SFK_COMPAT) {
+			if (nbytes > hdr_uio->uio_resid)
+				nbytes -= hdr_uio->uio_resid;
+			else
+				nbytes = 0;
 		}
-	}
+		mh = m_uiotombuf(hdr_uio, M_WAITOK, 0, 0, 0);
+		hdrlen = m_length(mh, &mhtail);
+	} else
+		hdrlen = 0;
+
+	rem = nbytes ? omin(nbytes, obj_size - offset) : obj_size - offset;
 
 	/*
 	 * Protect against multiple writers to the socket.
@@ -2272,21 +2433,13 @@ vn_sendfile(struct file *fp, int sockfd,
 	 * The outer loop checks the state and available space of the socket
 	 * and takes care of the overall progress.
 	 */
-	for (off = offset; ; ) {
+	for (off = offset; rem > 0; ) {
+		struct sf_io *sfio;
+		vm_page_t *pa;
 		struct mbuf *mtail;
-		int loopbytes;
-		int space;
-		int done;
-
-		if ((nbytes != 0 && nbytes == fsbytes) ||
-		    (nbytes == 0 && obj_size == fsbytes))
-			break;
+		int nios, space, npages, rhpages;
 
 		mtail = NULL;
-		loopbytes = 0;
-		space = 0;
-		done = 0;
-
 		/*
 		 * Check the socket state for ongoing connection,
 		 * no errors and space in socket buffer.
@@ -2362,49 +2515,58 @@ retry_space:
 				VOP_UNLOCK(vp, 0);
 				goto done;
 			}
-			obj_size = va.va_size;
+			if (va.va_size != obj_size) {
+				if (nbytes == 0)
+					rem += va.va_size - obj_size;
+				else if (offset + nbytes > va.va_size)
+					rem -= (offset + nbytes - va.va_size);
+				obj_size = va.va_size;
+			}
 		}
 
+		if (space > rem)
+			space = rem;
+
+		npages = howmany(space + (off & PAGE_MASK), PAGE_SIZE);
+
+		/*
+		 * Calculate maximum allowed number of pages for readahead
+		 * at this iteration.  First, we allow readahead up to "rem".
+		 * If application wants more, let it be, but there is no
+		 * reason to go above MAXPHYS.  Also check against "obj_size",
+		 * since vm_pager_has_page() can hint beyond EOF.
+		 */
+		rhpages = howmany(rem + (off & PAGE_MASK), PAGE_SIZE) - npages;
+		rhpages += SF_READAHEAD(flags);
+		rhpages = min(howmany(MAXPHYS, PAGE_SIZE), rhpages);
+		rhpages = min(howmany(obj_size - trunc_page(off), PAGE_SIZE) -
+		    npages, rhpages);
+
+		sfio = malloc(sizeof(struct sf_io) +
+		    npages * sizeof(vm_page_t), M_TEMP, M_WAITOK);
+		refcount_init(&sfio->nios, 1);
+		sfio->error = 0;
+
+		nios = sendfile_swapin(obj, sfio, off, space, npages, rhpages,
+		    flags);
+
 		/*
 		 * Loop and construct maximum sized mbuf chain to be bulk
 		 * dumped into socket buffer.
 		 */
-		while (space > loopbytes) {
-			vm_offset_t pgoff;
+		pa = sfio->pa;
+		for (int i = 0; i < npages; i++) {
 			struct mbuf *m0;
 
 			/*
-			 * Calculate the amount to transfer.
-			 * Not to exceed a page, the EOF,
-			 * or the passed in nbytes.
+			 * If a page wasn't grabbed successfully, then
+			 * trim the array. Can happen only with SF_NODISKIO.
 			 */
-			pgoff = (vm_offset_t)(off & PAGE_MASK);
-			rem = obj_size - offset;
-			if (nbytes != 0)
-				rem = omin(rem, nbytes);
-			rem -= fsbytes + loopbytes;
-			xfsize = omin(PAGE_SIZE - pgoff, rem);
-			xfsize = omin(space - loopbytes, xfsize);
-			if (xfsize <= 0) {
-				done = 1;		/* all data sent */
-				break;
-			}
-
-			/*
-			 * Attempt to look up the page.  Allocate
-			 * if not found or wait and loop if busy.
-			 */
-			if (m != NULL)
-				nd = EAGAIN; /* send what we already got */
-			else if ((flags & SF_NODISKIO) != 0)
-				nd = EBUSY;
-			else
-				nd = 0;
-			error = sendfile_readpage(obj, vp, nd, off,
-			    xfsize, bsize, td, &pg);
-			if (error != 0) {
-				if (error == EAGAIN)
-					error = 0;	/* not a real error */
+			if (pa[i] == NULL) {
+				SFSTAT_INC(sf_busy);
+				fixspace(npages, i, off, &space);
+				npages = i;
+				softerr = EBUSY;
 				break;
 			}
 
@@ -2417,56 +2579,59 @@ retry_space:
 			 * threads might exhaust the buffers and then
 			 * deadlock.
 			 */
-			sf = sf_buf_alloc(pg, (mnw || m != NULL) ? SFB_NOWAIT :
-			    SFB_CATCH);
+			sf = sf_buf_alloc(pa[i],
+			    m != NULL ? SFB_NOWAIT : SFB_CATCH);
 			if (sf == NULL) {
 				SFSTAT_INC(sf_allocfail);
-				vm_page_lock(pg);
-				vm_page_unwire(pg, PQ_INACTIVE);
-				KASSERT(pg->object != NULL,
-				    ("%s: object disappeared", __func__));
-				vm_page_unlock(pg);
+				for (int j = i; j < npages; j++) {
+					vm_page_lock(pa[j]);
+					vm_page_unwire(pa[j], PQ_INACTIVE);
+					vm_page_unlock(pa[j]);
+				}
 				if (m == NULL)
-					error = (mnw ? EAGAIN : EINTR);
+					softerr = ENOBUFS;
+				fixspace(npages, i, off, &space);
+				npages = i;
 				break;
 			}
 
-			/*
-			 * Get an mbuf and set it up as having
-			 * external storage.
-			 */
-			m0 = m_get((mnw ? M_NOWAIT : M_WAITOK), MT_DATA);
-			if (m0 == NULL) {
-				error = (mnw ? EAGAIN : ENOBUFS);
-				sf_ext_free(sf, NULL);
-				break;
-			}
-			/*
-			 * Attach EXT_SFBUF external storage.
-			 */
-			m0->m_ext.ext_buf = (caddr_t )sf_buf_kva(sf);
+			m0 = m_get(M_WAITOK, MT_DATA);
+			m0->m_ext.ext_buf = (char *)sf_buf_kva(sf);
 			m0->m_ext.ext_size = PAGE_SIZE;
 			m0->m_ext.ext_arg1 = sf;
 			m0->m_ext.ext_arg2 = sfs;
-			m0->m_ext.ext_type = EXT_SFBUF;
+			/*
+			 * SF_NOCACHE sets the page as being freed upon send.
+			 * However, we ignore it for the last page in 'space',
+			 * if the page is truncated, and we got more data to
+			 * send (rem > space), or if we have readahead
+			 * configured (rhpages > 0).
+			 */
+			if ((flags & SF_NOCACHE) == 0 ||
+			    (i == npages - 1 &&
+			    ((off + space) & PAGE_MASK) &&
+			    (rem > space || rhpages > 0)))
+				m0->m_ext.ext_type = EXT_SFBUF;
+			else
+				m0->m_ext.ext_type = EXT_SFBUF_NOCACHE;
 			m0->m_ext.ext_flags = 0;
-			m0->m_flags |= (M_EXT|M_RDONLY);
-			m0->m_data = (char *)sf_buf_kva(sf) + pgoff;
-			m0->m_len = xfsize;
+			m0->m_flags |= (M_EXT | M_RDONLY);
+			if (nios)
+				m0->m_flags |= M_NOTREADY;
+			m0->m_data = (char *)sf_buf_kva(sf) +
+			    (vmoff(i, off) & PAGE_MASK);
+			m0->m_len = xfsize(i, npages, off, space);
+
+			if (i == 0)
+				sfio->m = m0;
 
 			/* Append to mbuf chain. */
 			if (mtail != NULL)
 				mtail->m_next = m0;
-			else if (m != NULL)
-				m_last(m)->m_next = m0;
 			else
 				m = m0;
 			mtail = m0;
 
-			/* Keep track of bits processed. */
-			loopbytes += xfsize;
-			off += xfsize;
-
 			if (sfs != NULL) {
 				mtx_lock(&sfs->mtx);
 				sfs->count++;
@@ -2477,49 +2642,60 @@ retry_space:
 		if (vp != NULL)
 			VOP_UNLOCK(vp, 0);
 
+		/* Keep track of bytes processed. */
+		off += space;
+		rem -= space;
+
+		/* Prepend header, if any. */
+		if (hdrlen) {
+			mhtail->m_next = m;
+			m = mh;
+			mh = NULL;
+		}
+
+		if (m == NULL) {
+			KASSERT(softerr, ("%s: m NULL, no error", __func__));
+			error = softerr;
+			free(sfio, M_TEMP);
+			goto done;
+		}
+
 		/* Add the buffer chain to the socket buffer. */
-		if (m != NULL) {
-			int mlen, err;
+		KASSERT(m_length(m, NULL) == space + hdrlen,
+		    ("%s: mlen %u space %d hdrlen %d",
+		    __func__, m_length(m, NULL), space, hdrlen));
 
-			mlen = m_length(m, NULL);
-			SOCKBUF_LOCK(&so->so_snd);
-			if (so->so_snd.sb_state & SBS_CANTSENDMORE) {
-				error = EPIPE;
-				SOCKBUF_UNLOCK(&so->so_snd);
-				goto done;
-			}
-			SOCKBUF_UNLOCK(&so->so_snd);
-			CURVNET_SET(so->so_vnet);
-			/* Avoid error aliasing. */
-			err = (*so->so_proto->pr_usrreqs->pru_send)
-				    (so, 0, m, NULL, NULL, td);
-			CURVNET_RESTORE();
-			if (err == 0) {
-				/*
-				 * We need two counters to get the
-				 * file offset and nbytes to send
-				 * right:
-				 * - sbytes contains the total amount
-				 *   of bytes sent, including headers.

*** DIFF OUTPUT TRUNCATED AT 1000 LINES ***