[CFT/review] new sendfile(2)

Peter Holm peter at holm.cc
Fri May 30 08:58:21 UTC 2014


On Thu, May 29, 2014 at 02:20:54PM +0400, Gleb Smirnoff wrote:
>   Hello!
> 
>   At Netflix and Nginx we are experimenting with improving FreeBSD
> wrt sending large amounts of static data via HTTP.
> 
>   One of the approaches we are experimenting with is a new sendfile(2)
> implementation that doesn't block on the I/O done from the file
> descriptor.
> 
>   The problem with classic sendfile(2) is that if the request
> length is large enough and the file data is not cached in VM, the
> sendfile(2) syscall does not return until it has filled the socket
> buffer with data. On the modern Internet, socket buffers can be up
> to 1 MB, so the time taken by the syscall rises by an order of
> magnitude. All this time the nginx worker is blocked in the syscall
> and doesn't process data from other clients. The current best
> practice to mitigate this is known as "sendfile(2) + aio_read(2)".
> This is a special mode of nginx operation on FreeBSD. The
> sendfile(2) call is issued with the SF_NODISKIO flag, which forbids
> the syscall from performing disk I/O, so it sends only data that is
> already cached by VM. If sendfile(2) reports that I/O needs to be
> done (but was forbidden), nginx does an aio_read() of a chunk of the
> file. As a side effect, the data read is cached by VM. Then
> sendfile() is called again.
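A toy user-space model of that retry loop may help readers who haven't seen it. It makes no real syscalls: send_nodiskio() stands in for sendfile(2) with SF_NODISKIO and aio_read_chunk() for aio_read(2). Both names, the fixed chunk granularity, and the cached[] array are assumptions of this sketch, not nginx code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the file is split into CHUNKS equal chunks. */
#define	CHUNKS	8

struct model {
	bool	cached[CHUNKS];	/* chunk i is in the VM page cache */
	size_t	next;		/* first chunk not yet sent */
	int	nodiskio_calls;	/* stand-ins for sendfile(..., SF_NODISKIO) */
	int	aio_reads;	/* stand-ins for aio_read() */
};

/*
 * Like sendfile(2) with SF_NODISKIO: send cached chunks in order and
 * stop at the first uncached one. Returns true when everything has
 * been sent, false where the real call would report that disk I/O
 * was needed.
 */
static bool
send_nodiskio(struct model *m)
{
	m->nodiskio_calls++;
	while (m->next < CHUNKS && m->cached[m->next])
		m->next++;
	return (m->next == CHUNKS);
}

/* Like aio_read(2) of one chunk: its side effect is a warm cache. */
static void
aio_read_chunk(struct model *m, size_t chunk)
{
	m->aio_reads++;
	m->cached[chunk] = true;
}

/* The loop: try without disk I/O, prefetch on failure, retry. */
static void
serve(struct model *m)
{
	while (!send_nodiskio(m)) {
		/* nginx would serve other clients until the AIO completes. */
		aio_read_chunk(m, m->next);
	}
}
```

With chunks 2, 4 and 5 uncached, serve() issues three prefetches and four SF_NODISKIO attempts. Every uncached chunk costs the worker an extra syscall round trip, which is the overhead the new sendfile aims to remove.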
> 
>   Now for the new sendfile. The core idea is that sendfile()
> schedules the I/O but doesn't wait for it to complete. It
> returns immediately to the process, and I/O completion is
> processed in kernel context. Unlike aio(4), no additional
> threads are created in the kernel. The new sendfile is a drop-in
> replacement for the old one: applications (like nginx) need
> neither recompilation nor configuration changes. The SF_NODISKIO
> flag is ignored.
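A minimal sketch of "schedule the I/O, return immediately, finish in a completion callback", assuming a toy request queue. getpages_async(), disk_interrupt() and toy_sendfile() are hypothetical names. In the patch, the scheduling role presumably belongs to VOP_GETPAGES_ASYNC(), and completion runs from the real I/O-done path rather than from an explicit call.

```c
#include <assert.h>
#include <stddef.h>

/* Completion callback type (hypothetical, for this model only). */
typedef void (*iodone_t)(void *arg, int error);

struct ioreq {
	iodone_t	 iodone;
	void		*arg;
};

#define	MAXREQ	16

/* Toy "disk queue": reads scheduled but not yet completed. */
static struct ioreq pending[MAXREQ];
static size_t npending;

/* Hypothetical async page-in: queue the read and return at once. */
static int
getpages_async(iodone_t iodone, void *arg)
{
	if (npending == MAXREQ)
		return (-1);
	pending[npending].iodone = iodone;
	pending[npending].arg = arg;
	npending++;
	return (0);
}

/* Simulated completion path: finishes every queued read, no new threads. */
static void
disk_interrupt(void)
{
	size_t i;

	for (i = 0; i < npending; i++)
		pending[i].iodone(pending[i].arg, 0);
	npending = 0;
}

/* Per-transfer state that the callback updates after the "syscall" returned. */
struct transfer {
	int	pages_ready;
	int	errors;
};

static void
page_done(void *arg, int error)
{
	struct transfer *t = arg;

	if (error != 0)
		t->errors++;
	else
		t->pages_ready++;
}

/*
 * Toy sendfile: schedule one read per page and return without waiting.
 * Capacity is checked first so a failure never leaves a partial batch queued.
 */
static int
toy_sendfile(struct transfer *t, size_t npages)
{
	size_t i;

	if (npages > MAXREQ - npending)
		return (-1);
	for (i = 0; i < npages; i++)
		(void)getpages_async(page_done, t);
	return (0);
}
```

No thread sleeps on any request. The caller has already returned by the time page_done() fires, so everything the callback needs has to travel through its arg pointer.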
> 
>   The patch for review is available at:
> 
> https://phabric.freebsd.org/D102
> 
> And for those who prefer email attachments, it is also attached.
> The patch contains 3 logically separate changes:
> 
> 1) Split of the socket buffer sb_cc field into sb_acc and sb_ccc, where
> sb_acc stands for "available character count" and sb_ccc for "claimed
> character count". This allows us to write data that is not ready yet
> to a socket. The data sits in the socket, consumes its space, and
> keeps its proper order relative to earlier and later writes to the
> socket, but it can be sent only after it is marked as ready. This
> change is spread across many files.
> 
> 2) A new vnode operation: VOP_GETPAGES_ASYNC(). This one lives in sys/vm.
> 
> 3) The actual implementation of the new sendfile(2). This one lives in
> kern/uipc_syscalls.c.
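Here is a simplified model of how items 1 and 3 might fit together. It assumes each write is a segment that knows how many page-ins (nios) it is still waiting for. Only the counter names sb_acc and sb_ccc come from the patch. struct sockbuf_model, sb_append(), sb_iodone() and the nios field are hypothetical. sb_ccc grows on every append, while sb_acc covers only the ready bytes in front of the first not-ready segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	MAXSEG	8

/* One write into the socket; hypothetical stand-in for an mbuf chain. */
struct seg {
	size_t	len;
	int	nios;	/* page-ins still outstanding */
	bool	ready;
};

/* Hypothetical model, not the kernel's struct sockbuf. */
struct sockbuf_model {
	struct seg seg[MAXSEG];
	size_t	nseg;
	size_t	fnrdy;	/* index of first not-ready seg; == nseg if none */
	size_t	sb_acc;	/* ready bytes ahead of the first not-ready seg */
	size_t	sb_ccc;	/* all bytes, ready or not: space consumed */
};

/* Append a write; nios == 0 means its data was already cached. */
static void
sb_append(struct sockbuf_model *sb, size_t len, int nios)
{
	bool none_pending = (sb->fnrdy == sb->nseg);
	struct seg *s;

	assert(sb->nseg < MAXSEG);
	s = &sb->seg[sb->nseg++];
	s->len = len;
	s->nios = nios;
	s->ready = (nios == 0);
	sb->sb_ccc += len;
	/* Ready data is available only if nothing not-ready precedes it. */
	if (none_pending && s->ready) {
		sb->sb_acc += len;
		sb->fnrdy = sb->nseg;
	}
}

/*
 * Completion of one page-in for segment i. When the segment's last
 * page-in finishes, it becomes ready. If it was the first not-ready
 * segment, it and every ready segment queued behind it become
 * available. Returns -1 if nothing in the buffer is waiting on I/O.
 */
static int
sb_iodone(struct sockbuf_model *sb, size_t i)
{
	if (sb->fnrdy == sb->nseg || i >= sb->nseg || sb->seg[i].ready)
		return (-1);
	if (--sb->seg[i].nios > 0)
		return (0);	/* more page-ins still in flight */
	sb->seg[i].ready = true;
	while (sb->fnrdy < sb->nseg && sb->seg[sb->fnrdy].ready) {
		sb->sb_acc += sb->seg[sb->fnrdy].len;
		sb->fnrdy++;
	}
	return (0);
}
```

Suppose the buffer receives 100 ready bytes, then a 200-byte segment waiting on two page-ins, then 50 more ready bytes. At that point sb_ccc is 350 and sb_acc is 100. sb_acc jumps to 350 only when the second page-in completes. The -1 return for a call with nothing pending is only loosely analogous to the "NULL fnrdy" condition in the panic message later in this thread.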
> 
> 
> 
>   At Netflix, we already see improvements with the new sendfile(2).
> We can send more data using the same amount of CPU, and we can
> push closer to 0% idle without experiencing short lags.
> 
> However, our VM subsystem is somewhat modified: it behaves
> optimally for our task but suboptimally for an average FreeBSD
> system. I'd like someone from the community to try the new
> sendfile(2) on another setup and see how it serves them.
> 
>   To be an early tester, you need to check out the projects/sendfile
> branch and build a kernel from it. A world built from head/ should
> run fine with it.
> 
>   svn co http://svn.freebsd.org/base/projects/sendfile
>   cd sendfile
>   ... build kernel ...
> 
> Limitations:
> - No testing has been done on serving files from NFS.
> - No testing has been done on serving files from ZFS.
> 

I got this:
panic: sbready: sb 0xfffff801834219e8 NULL fnrdy

http://people.freebsd.org/~pho/stress/log/gleb007.txt

- Peter

