[CFT/review] new sendfile(2)

Gleb Smirnoff glebius at FreeBSD.org
Thu May 29 10:21:01 UTC 2014


  Hello!

  At Netflix and Nginx we are experimenting with improving FreeBSD
wrt sending large amounts of static data via HTTP.

  One of the approaches we are experimenting with is a new sendfile(2)
implementation that doesn't block on the I/O done from the file
descriptor.

  The problem with classic sendfile(2) is that if the request
length is large enough, and file data is not cached in VM, then
the sendfile(2) syscall does not return until it fills the socket
buffer with data. On the modern Internet, socket buffers can be up
to 1 MB, thus the time taken by the syscall rises by an order of
magnitude. All that time, the nginx worker is blocked in the
syscall and doesn't process data from other clients. The best
current practice to mitigate that is known as "sendfile(2) +
aio_read(2)". This is a special mode of nginx operation on FreeBSD.
The sendfile(2) call is issued with the SF_NODISKIO flag, which
forbids the syscall to perform disk I/O, so it sends only data that
is cached by VM. If sendfile(2) reports that I/O needs to be done
(but is forbidden), then nginx does an aio_read(2) of a chunk of
the file. The data read is cached by VM as a side effect. Then
sendfile(2) is called again.
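
  The retry loop of that workaround can be sketched with a toy model.
This is not nginx code and it calls no real syscalls: the toy_* names,
the page-granular cache and the fixed file size are all assumptions
made for illustration. The only borrowed fact is that sendfile(2)
with SF_NODISKIO fails with EBUSY when it would need disk I/O.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES 8    /* hypothetical file size, in pages */

/* Toy stand-in for the VM cache: cached[i] is true if page i is resident. */
struct toy_file {
    bool cached[TOY_NPAGES];
};

/*
 * Models sendfile(2) with SF_NODISKIO: sends consecutive cached pages
 * starting at *off and advances *off.  Returns EBUSY if it stopped
 * because the next page would need disk I/O, 0 once the file is sent.
 */
static int
toy_sendfile_nodiskio(const struct toy_file *f, size_t *off)
{
    while (*off < TOY_NPAGES && f->cached[*off])
        (*off)++;
    return (*off < TOY_NPAGES ? EBUSY : 0);
}

/* Models a finished aio_read(2): as a side effect the page gets cached. */
static void
toy_aio_read_done(struct toy_file *f, size_t page)
{
    assert(page < TOY_NPAGES);
    f->cached[page] = true;
}

/*
 * Worker loop: try sendfile, on EBUSY read the missing page and retry.
 * Returns the number of sendfile calls needed to send the whole file.
 */
static int
toy_serve(struct toy_file *f)
{
    size_t off = 0;
    int calls = 0;

    for (;;) {
        calls++;
        if (toy_sendfile_nodiskio(f, &off) == 0)
            return (calls);
        /* A real worker would go back to its event loop here. */
        toy_aio_read_done(f, off);
    }
}
```

In this model every uncached page costs one extra sendfile call, so a
file with only its first two pages cached takes seven calls, while a
fully cached file takes one.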

  Now for the new sendfile. The core idea is that sendfile()
schedules the I/O, but doesn't wait for it to complete. It
returns immediately to the process, and I/O completion is
processed in kernel context. Unlike aio(4), no additional
threads in kernel are created. The new sendfile is a drop-in
replacement for the old one. Applications (like nginx) need
neither a recompile nor a configuration change. The SF_NODISKIO
flag is ignored.
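
  The "schedule the I/O and return" idea can be illustrated with a
small userland model. It is only a sketch: every toy_* name is
hypothetical, the pending queue stands in for whatever the kernel
uses to track outstanding reads, and nothing here is taken from the
patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAXIO 4     /* hypothetical limit on queued reads */

/* One outstanding read; done() runs when the I/O finishes. */
struct toy_io {
    void (*done)(void *arg);
    void *arg;
};

static struct toy_io toy_pending[TOY_MAXIO];
static size_t toy_npending;

/* Queue a read and return at once, without waiting for it. */
static bool
toy_schedule_io(void (*done)(void *arg), void *arg)
{
    if (toy_npending == TOY_MAXIO)
        return (false);
    toy_pending[toy_npending].done = done;
    toy_pending[toy_npending].arg = arg;
    toy_npending++;
    return (true);
}

/* Stand-in for the I/O completion path; no extra thread is involved. */
static void
toy_complete_all(void)
{
    size_t i;

    for (i = 0; i < toy_npending; i++)
        toy_pending[i].done(toy_pending[i].arg);
    toy_npending = 0;
}

/* Bytes a request has placed in the socket vs. bytes ready to go out. */
struct toy_request {
    size_t claimed;
    size_t ready;
};

/* Completion callback: the data that was read becomes sendable. */
static void
toy_mark_ready(void *arg)
{
    struct toy_request *r = arg;

    assert(r != NULL);
    r->ready = r->claimed;
}

/* Models the new sendfile: claim space, schedule the read, return. */
static bool
toy_sendfile_async(struct toy_request *r, size_t nbytes)
{
    if (!toy_schedule_io(toy_mark_ready, r))
        return (false);
    r->claimed = nbytes;
    r->ready = 0;
    return (true);
}
```

The caller gets control back as soon as toy_sendfile_async() returns;
the data becomes sendable only later, when toy_complete_all() runs
the callback.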

  The patch for review is available at:

https://phabric.freebsd.org/D102

And for those who prefer email attachments, it is also attached.
The patch consists of 3 logically separate changes:

1) Split of socket buffer sb_cc field into sb_acc and sb_ccc, where
sb_acc stands for "available character count" and sb_ccc is "claimed
character count". This allows us to write data to a socket that is
not ready yet. The data sits in the socket, consumes its space, and
keeps itself in the right order with earlier or later writes to the
socket. But it can be sent only after it is marked as ready. This
change is spread across many files.

2) A new vnode operation: VOP_GETPAGES_ASYNC(). This one lives in sys/vm.

3) Actual implementation of new sendfile(2). This one lives in
kern/uipc_syscalls.c
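
The accounting from change 1 can be pictured with a toy socket
buffer. Only the field names sb_acc and sb_ccc come from the patch
description; the chunk array and the toy_* helpers are hypothetical.
The rule that only the ready data in front of the first not-ready
chunk counts as available is an assumption drawn from the ordering
requirement stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAXCHUNKS 8     /* hypothetical buffer capacity, in chunks */

struct toy_chunk {
    size_t len;
    bool ready;             /* false while the disk read is in flight */
};

struct toy_sockbuf {
    struct toy_chunk chunks[TOY_MAXCHUNKS];
    size_t nchunks;
    size_t sb_acc;          /* available: bytes that may be sent now */
    size_t sb_ccc;          /* claimed: all bytes occupying the buffer */
};

/* Available data is the run of ready chunks at the head of the buffer. */
static void
toy_recount(struct toy_sockbuf *sb)
{
    size_t i;

    sb->sb_acc = 0;
    for (i = 0; i < sb->nchunks && sb->chunks[i].ready; i++)
        sb->sb_acc += sb->chunks[i].len;
}

/* Append a chunk; it claims space at once, ready or not. */
static size_t
toy_sbappend(struct toy_sockbuf *sb, size_t len, bool ready)
{
    size_t idx = sb->nchunks;

    assert(idx < TOY_MAXCHUNKS);
    sb->chunks[idx].len = len;
    sb->chunks[idx].ready = ready;
    sb->nchunks++;
    sb->sb_ccc += len;
    toy_recount(sb);
    return (idx);
}

/* Mark a chunk ready, e.g. from the I/O completion path. */
static void
toy_sbready(struct toy_sockbuf *sb, size_t idx)
{
    assert(idx < sb->nchunks);
    sb->chunks[idx].ready = true;
    toy_recount(sb);
}
```

Appending 100 ready bytes, 200 not-ready bytes and 50 ready bytes
gives sb_ccc of 350 but sb_acc of only 100: the last 50 bytes are
held back behind the pending chunk so the stream stays in order.
Once the middle chunk is marked ready, sb_acc becomes 350.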



  At Netflix, we already see improvements with the new sendfile(2).
We can send more data utilizing the same amount of CPU, and we can
push closer to 0% idle without experiencing short lags.

However, we have a somewhat modified VM subsystem that behaves
optimally for our task, but suboptimally for an average FreeBSD
system. I'd like someone from the community to try the new
sendfile(2) on another setup and see how it serves you.

  To be an early tester you need to check out the projects/sendfile
branch and build a kernel from it. The world from head/ will
run fine with it.

  svn co http://svn.freebsd.org/base/projects/sendfile
  cd sendfile
  ... build kernel ...

Limitations:
- No testing was done on serving files from NFS.
- No testing was done on serving files from ZFS.

-- 
Totus tuus, Glebius.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: project-sendfile.diff
Type: text/x-diff
Size: 118246 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20140529/3ab92e5b/attachment-0001.diff>

