[CFT] new sendfile(2)

Gleb Smirnoff glebius at FreeBSD.org
Mon Feb 17 11:16:45 UTC 2014


  At Netflix and Nginx we are experimenting with improving FreeBSD
wrt sending large amounts of static data via HTTP.

  One of the approaches we are experimenting with is a new sendfile(2)
implementation that doesn't block on the I/O done from the file.

  The problem with classic sendfile(2) is that if the request
length is large enough, and file data is not cached in VM, then
the sendfile(2) syscall does not return until it fills the socket
buffer with data. On the modern Internet socket buffers can be up
to 1 MB, thus the time taken by the syscall rises by an order of
magnitude. All that time, the nginx worker is blocked in the
syscall and doesn't process data from other clients. The best
current practice to mitigate that is known as "sendfile(2) +
aio_read(2)". This is a special mode of nginx operation on FreeBSD.
The sendfile(2) call is issued with the SF_NODISKIO flag, which
forbids the syscall to perform disk I/O, so it sends only data that
is already cached by VM. If sendfile(2) reports that I/O needs to
be done (but is forbidden), then nginx does an aio_read() of a
chunk of the file. The data read is cached by VM as a side effect.
Then sendfile() is called again.
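
  A minimal sketch of one step of the "sendfile(2) + aio_read(2)" mode
described above. It is not taken from nginx. It assumes the behaviour
documented in FreeBSD's sendfile(2) manual page: with SF_NODISKIO the
call fails with EBUSY when disk I/O would be needed, and on a
non-blocking socket it fails with EAGAIN when the socket buffer is
full. The names next_step, classify() and try_send() are hypothetical,
and the FreeBSD-only part is guarded so the decision logic can be
compiled elsewhere.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* What the caller should do after one sendfile(2) attempt (hypothetical). */
enum next_step {
    NEXT_DONE,        /* the request was handed to the socket */
    NEXT_RETRY,       /* interrupted: call sendfile() again */
    NEXT_WAIT_SOCKET, /* socket buffer full: wait until writable */
    NEXT_PRELOAD,     /* data not cached in VM: aio_read() it, then retry */
    NEXT_FAIL         /* any other error */
};

/* Map the sendfile() return value and errno to the next step. */
enum next_step
classify(int rc, int err)
{
    if (rc == 0)
        return (NEXT_DONE);
    switch (err) {
    case EBUSY:  /* only possible with SF_NODISKIO */
        return (NEXT_PRELOAD);
    case EAGAIN:
        return (NEXT_WAIT_SOCKET);
    case EINTR:
        return (NEXT_RETRY);
    default:
        return (NEXT_FAIL);
    }
}

#ifdef __FreeBSD__
#include <sys/socket.h>
#include <sys/uio.h>
#include <aio.h>
#include <string.h>

/*
 * One attempt: send what is cached; if the kernel refuses to do disk I/O,
 * start an asynchronous read of the next chunk into a scratch buffer.
 * The caller must wait for the aio request to complete (for example with
 * aio_error()/aio_return()) before calling try_send() again.
 */
enum next_step
try_send(int file_fd, int sock_fd, off_t *offset, size_t *remaining,
    struct aiocb *cb, void *scratch, size_t scratch_len)
{
    off_t sent = 0;
    enum next_step step;
    int rc, err;

    if (*remaining == 0)  /* nbytes == 0 would mean "until end of file" */
        return (NEXT_DONE);

    rc = sendfile(file_fd, sock_fd, *offset, *remaining, NULL, &sent,
        SF_NODISKIO);
    err = (rc == -1) ? errno : 0;

    /* Partial data may have been sent even when the call failed. */
    *offset += sent;
    *remaining -= (size_t)sent;

    step = classify(rc, err);
    if (step == NEXT_PRELOAD) {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = file_fd;
        cb->aio_offset = *offset;
        cb->aio_buf = scratch;
        cb->aio_nbytes = scratch_len < *remaining ? scratch_len : *remaining;
        if (aio_read(cb) == -1)
            return (NEXT_FAIL);
    }
    return (step);
}
#endif
```

The bytes read into the scratch buffer are thrown away. The read is only
there to pull the pages into VM so that the next sendfile() call finds
them cached.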

  Now for the new sendfile. The core idea is that sendfile()
schedules the I/O, but doesn't wait for it to complete. It
returns immediately to the process, and I/O completion is
processed in kernel context. Unlike aio(4), no additional
threads in kernel are created. The new sendfile is a drop-in
replacement for the old one. Applications (like nginx) don't
need a recompile or a configuration change. The SF_NODISKIO is

  At Netflix, we already see improvements with the new sendfile(2).
We can send more data using the same amount of CPU, and we can
push closer to 0% idle without experiencing short lags.

However, we have a somewhat modified VM subsystem that behaves
optimally for our task, but suboptimally for an average FreeBSD
system. I'd like someone from the community to try the new
sendfile(2) on other setups and see how it works for you.

  To be an early tester you need to check out the projects/sendfile
branch and build a kernel from it. The world from head/ will
run fine with it.

  svn co http://svn.freebsd.org/base/projects/sendfile
  cd sendfile
  ... build kernel ...

- Some subsystems that use socket buffers do not compile, namely SCTP.
- No testing was done on serving files from NFS.
- No testing was done on serving files from ZFS.
- There is an mbuf leak. The leak is very slow: it takes 3 days of serving
  up to 20 Gbit/s to deplete the cluster zone. I'm working on finding the leak.

Totus tuus, Glebius.
