Update: Debox sendfile modifications

Vivek Pai vivek at CS.Princeton.EDU
Tue Nov 4 10:27:07 PST 2003


Sorry for not replying sooner - under deadline pressure right now.

Mike Silbersack wrote:
> Ok, I've reread the debox paper, looked over the patch, and talked to Alan
> Cox about his present and upcoming work on the vm system.
> 
> The debox patch does three basic things (if I'm understanding everything
> correctly):
> 
> 1.  It ensures that the header is sent in the same packet as the first
> part of the data, fixing performance with small files.
> 
> - This part of the patch needs a little cleanup, but that's easy enough.
> I will try to integrate it next week.

ok, that sounds good
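
For reference, the case being fixed is the one where the server hands its
response header to sendfile via the sf_hdtr argument, roughly like this
(userland sketch only, error handling omitted; sock and filefd stand for
an already-connected socket and an open file):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <string.h>

    static int
    send_response(int sock, int filefd)
    {
            char hdr[] = "HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n";
            struct iovec iov = { hdr, strlen(hdr) };
            struct sf_hdtr hdtr = { &iov, 1, NULL, 0 };
            off_t sent = 0;

            /* offset 0, nbytes 0: send the whole file after the header */
            return (sendfile(filefd, sock, 0, 0, &hdtr, &sent, 0));
    }

With the fix, the header above and the first chunk of file data can leave
in one packet instead of two, which is where the small-file win comes from.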

> 2.  The patch merges sendfile buffers so that when the same page is sent
> to multiple connections, kernel address space is not wasted.
> 
> - While this is the part of the patch with the widest benefit, it will be
> the most difficult to integrate.  In order to support 64-bit architectures
> better, Alan has refactored the sendfile code, meaning that the patch
> would have to be rewritten to fit this new layout.

The one other aspect of this is that sf_buf mappings are kept around for
a configurable amount of time, which reduces the number of TLB operations.
The parameter controls how long: -1 means no coalescing at all, 0 means
coalesce but free the mapping as soon as the last holder releases it, and
any positive value keeps the mapping around that long after the last
release. Obviously, anything above 0 increases the amount of wired memory
at any given point in time, but it's tunable.
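
In rough code, the policy looks like the following (just an illustration
with made-up names, not the actual patch):

    #include <stddef.h>
    #include <time.h>

    struct sfbuf {
            void   *page;       /* the physical page being sent */
            void   *kva;        /* its kernel virtual mapping */
            int     refcount;   /* connections currently sending from it */
            time_t  lastuse;    /* when the last holder released it */
    };

    static int lingertime = 0;  /* the tunable: -1, 0, or seconds to linger */

    /* hypothetical helpers: look up an existing mapping for a page,
     * create a new mapping (the part that costs TLB operations),
     * and tear a mapping down */
    struct sfbuf *sfbuf_lookup(void *page);
    struct sfbuf *sfbuf_map_new(void *page);
    void          sfbuf_unmap(struct sfbuf *sf);

    struct sfbuf *
    sfbuf_get(void *page)
    {
            struct sfbuf *sf;

            /* -1: never coalesce; otherwise reuse a live or lingering mapping */
            if (lingertime >= 0 && (sf = sfbuf_lookup(page)) != NULL) {
                    sf->refcount++;
                    return (sf);
            }
            return (sfbuf_map_new(page));
    }

    void
    sfbuf_release(struct sfbuf *sf)
    {
            if (--sf->refcount > 0)     /* other connections still using it */
                    return;
            sf->lastuse = time(NULL);
            if (lingertime <= 0) {      /* free right after the last release */
                    sfbuf_unmap(sf);
                    return;
            }
            /* lingertime > 0: keep the mapping; a periodic sweep unmaps it
             * once (now - lastuse) exceeds lingertime */
    }

A positive lingertime trades wired memory for fewer map/unmap and TLB
operations when the same page is sent again soon.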

> 3.  The patch returns a new error when sendfile realizes that it will have
> to block on disk I/O, thereby allowing Flash to have a helper do the
> blocking call.
> 
> - While this change could be made easily enough, I'm not sure that it
> would benefit anything other than Flash, so I'm not certain if we should
> do it.  However, based on what you learned with Flash, I have an alternate
> idea:
> 
> ---
> 
> Suppose that sendfile is called to send to a non-blocking socket, and that
> it detects that the page(s) required are not in memory, and that disk I/O
> will be necessary.  Instead of blocking, sendfile would call a sendfile
> helper kernel thread (either by calling kthread_create, or by having a
> preexisting pool.)  After dispatching this thread, sendfile would return
> EWOULDBLOCK to the caller.  Note that only a limited number of threads
> would exist (perhaps 8?), so, if all threads were busy, sendfile would
> have to block like it does at present.
> 
> Once the I/O was complete, the thread would call sowakeup (or whatever is
> typically called when a socket becomes ready for writing) for the socket in
> question.  The application would call sendfile, like normal, but this time
> everything would succeed because the page would be in memory.
> 
> ---
> 
> If such a feature were implemented, it might have the same increased
> performance effect that your new return value does, except that it would
> require no modification for a non-blocking sendfile based application to
> take advantage of it.
> 
> Alan, would this be possible from the VM system's perspective?  Is it safe
> to assume that once the page in question was in the page cache that it
> would hang around long enough for the second sendfile call to access it
> before it is paged back out again?
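
Just to make sure I'm reading the proposal right, the dispatch side would
look roughly like this (userland sketch with made-up names; the fault-in
step and the sowakeup equivalent are stubbed out, and the real thing would
live in the kernel around kthread_create):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define NHELPERS 8                      /* "perhaps 8" in the proposal */

    /* hypothetical: blocking read that faults the file's pages into memory */
    void fault_in_pages(int fd, long off, size_t len);

    struct io_req {
            int     filefd;
            long    offset;
            size_t  len;
            void  (*done)(struct io_req *); /* stands in for sowakeup() */
    };

    static int busy_helpers;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *
    helper_main(void *arg)
    {
            struct io_req *req = arg;

            fault_in_pages(req->filefd, req->offset, req->len);
            req->done(req);                 /* socket is "ready" again */

            pthread_mutex_lock(&pool_lock);
            busy_helpers--;
            pthread_mutex_unlock(&pool_lock);
            return (NULL);
    }

    /*
     * Returns true if the request was handed off (sendfile would then
     * return EWOULDBLOCK to the caller), false if all helpers are busy
     * (sendfile would have to block as it does today).
     */
    static bool
    try_dispatch(struct io_req *req)
    {
            pthread_t tid;
            bool handed_off = false;

            pthread_mutex_lock(&pool_lock);
            if (busy_helpers < NHELPERS &&
                pthread_create(&tid, NULL, helper_main, req) == 0) {
                    pthread_detach(tid);
                    busy_helpers++;
                    handed_off = true;
            }
            pthread_mutex_unlock(&pool_lock);
            return (handed_off);
    }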

There are two problems I see with this approach:
a) sendfile still has to return some value to the caller while the async
    operation is in progress, because there is no non-intrusive way of
    delivering the real return value at any later time
b) once that small pool of kernel threads is exhausted, you lose the
    opportunity to keep serving other requests out of main memory

Part (a) means that this change can't be made completely transparent
(applications would still need some re-coding), and part (b) means that
more sophisticated applications could lose performance.

How about something that lets you choose what happens - sendfile already
has a flags argument, so why not add options to control the behavior on
missing pages? Applications like Flash might just want the error return
so that they can handle the miss themselves, while other applications
may be happy with the default behavior you're suggesting.
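
As a sketch of what I mean (SF_NODISKIO and EBUSY are just stand-ins for
whatever the flag and error would end up being called), the Flash-style
usage would be:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <errno.h>
    #include <stddef.h>

    /* hypothetical flag name: "fail instead of blocking on disk I/O" */
    #ifndef SF_NODISKIO
    #define SF_NODISKIO 0x0001
    #endif

    /* hypothetical: hand the blocking read-ahead to a helper process,
     * which re-arms the request once the pages are resident */
    void queue_to_read_helper(int fd, off_t off, size_t len);

    static int
    send_chunk(int sock, int filefd, off_t off, size_t len)
    {
            off_t sent = 0;

            /* EAGAIN / partial-write handling omitted for brevity */
            if (sendfile(filefd, sock, off, len, NULL, &sent, SF_NODISKIO) == 0)
                    return (0);
            if (errno == EBUSY) {   /* pages not resident, didn't block */
                    queue_to_read_helper(filefd, off, len);
                    return (0);
            }
            return (-1);            /* real error */
    }

With no flag set, the default could then be whatever behavior you pick,
blocking as today or handing off to a helper thread.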

-Vivek



