sendfile(2) SF_NOPUSH flag proposal

Tue May 27 08:55:11 PDT 2003

Igor Sysoev wrote:
> > I would be really surprised if you were able to demonstrate a
> > measurable performance difference which was above the noise.
> 
> I hope I will demonstrate at least CPU usage in near future.

See other post: that's the only place I expect there to be a
potential win; however, unless your CPU power is relatively
low, compared to memory and PCI bus bandwidth, I expect the
limiting factor to be PCI bus bandwidth first, memory second,
and CPU overhead a distant third.  That changes if you are
doing crypto, but then IPSEC changes all your assumptions.

> > You were talking about the file and the header living in the
> > same packet.
> 
> I mean that if you have 230 bytes header then sendfile() will send it
> in separate packet nevertheless the size of header and of the file.
> Something like this - 230, 1460, 1460, ...

Again, see other post: this is arguably a sendfile(2) bug,
though a really minor one; one which should be addressed in
the sendfile(2) implementation, and doesn't need options
added to the API in order to address it.

> > > it will return me 230 bytes:
> >
> > The "HEAD" is atypical, compared to the "GET"; the full Google
> > front page is larger than that, and consists of multiple files;
> > assuming you support HTTP/1.1 and pipelining, it's going to be
> > a back-to-back transfer involving multiple sendfile() calls.
> 
> I use HEAD to show you the size of the HTTP header.
> The HEAD is atypical but such small HTTP header is typical.

Here is my problem: you are arguing both amortized cost and
total cost, depending on which is more supportive of your
main thesis.  These arguments are separate and orthogonal to
each other: they don't support each other.  You can argue
tiny files, and a relatively high total cost, or you can argue
large files and pipelining, and a relatively high amortized
cost, but you can't argue both tiny and large files, and both
many connections and one connection, at the same time.

Personally, I'd step back and get the arguments straight,
and get an implementation that demonstrates statistically
significant performance differences, and then come back, if
I wanted to press the case for additional option flags.  I
have done this several times in the past, e.g. with my soft
interrupt coalescing implementation that's now part of most
of the ethernet drivers people care about.

Actually, in this case, I'd just try to fix sendfile(2) to
do the packet coalescing I'd expect, given the relative
state of the TCP_NODELAY and TCP_NOPUSH option flags.
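
For reference, the application-side pattern those flags imply looks
roughly like the sketch below.  It is only an illustration: the helper
names nopush_option() and send_response() are hypothetical, it uses
plain writes rather than sendfile(2), and it assumes the platform's
Python socket module exposes TCP_NOPUSH (BSD) or TCP_CORK (the closest
Linux analogue); if neither is present it just sends normally.

```python
import socket


def nopush_option():
    """Return the platform's 'hold partial segments' option, or None.

    TCP_NOPUSH is the BSD name and TCP_CORK the closest Linux analogue;
    which one (if either) exists depends on the platform.
    """
    for name in ("TCP_NOPUSH", "TCP_CORK"):
        opt = getattr(socket, name, None)
        if opt is not None:
            return opt
    return None


def send_response(sock, header, body_chunks):
    """Send header and body so they are allowed to share segments."""
    opt = nopush_option()
    if opt is not None:
        # Ask the stack to hold back partial segments while we queue data.
        sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
    try:
        sock.sendall(header)
        for chunk in body_chunks:
            sock.sendall(chunk)
    finally:
        if opt is not None:
            # Clear the option so the trailing partial segment is no
            # longer held back.
            sock.setsockopt(socket.IPPROTO_TCP, opt, 0)
```

With the option set before the headers are queued, the stack has no
reason to emit a short header-only segment; whether sendfile(2) itself
behaves that way for its own headers is the point in dispute here.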

> > 3 packets vs. 6.  And using HTTP/1.0, there's also the three
> > handshake packets, SYN/SYN-ACK/ACK, and the three teardown
> > packets, FIN/FIN-ACK/ACK (or 4), plus the ACKs for
> > the packets you sent (should be one ACK, since that's below
> > the TCP window size).
> 
> Actually 6 vs. 6 for this 8K file. But I said about another thing.
> Let's see 48K file and 250 bytes header. sendfile() usually sends
> it as 4K or 8K hunks so there are 48/8 * 6 + 1 (header) = 37 packets.
> But (48K + 250) / 1460 = 33 * 1460 + 1270 i.e. 34 packets.
> It's 8% decrease of data packets.

Which may or may not be a possible win; it depends on how
close to the bandwidth limit you are capable of driving
your hardware.  The bandwidth delay product between you and
the other end of the connection is probably going to be much
more significant a factor, when moving barely enough data to
trigger one window framing event (forced ACK).
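
To make the packet arithmetic in the quoted example easy to check, here
is a small sketch.  The function names are hypothetical, and it assumes
a 1460-byte segment payload, 48K = 49152 bytes, and a header that goes
out in a segment of its own in the chunked case.

```python
MSS = 1460  # payload bytes per segment, as in the quoted figures


def segments(nbytes, mss=MSS):
    """Number of segments needed to carry nbytes (ceiling division)."""
    return -(-nbytes // mss)


def packets_chunked(header, file_size, chunk):
    """Header pushed on its own, then each chunk segmented independently."""
    full, rest = divmod(file_size, chunk)
    return segments(header) + full * segments(chunk) + segments(rest)


def packets_coalesced(header, file_size):
    """Header and file treated as one contiguous byte stream."""
    return segments(header + file_size)


header, file_size = 250, 48 * 1024
chunked = packets_chunked(header, file_size, 8 * 1024)
coalesced = packets_coalesced(header, file_size)
saving = round(100 * (chunked - coalesced) / chunked, 1)
print(chunked, coalesced, saving)  # prints: 37 34 8.1
```

Under these assumptions the counts match the quoted 37 and 34 (4K hunks
also give 37), though the last coalesced segment carries 1222 bytes
rather than 1270.  The saving is three data packets on one 48K
transfer.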

> Add here the possible retransmissions.

Retransmissions are probably irrelevant; when you talk about
a retransmit, you are talking about data which is persisting
in your send sockbuf because it is outstanding unacknowledged
data.  At that point, the mbuf chains are assembled.  The
internal fragmentation you are complaining about here happens
because of the initial lack of a TF_NOPUSH flag on tcpcb when
the tcp_output() is called on it after the headers have been
enqueued, but before any file data has been enqueued.

So when a retransmit, if any, is necessary, the packet stream
will not have the same decoalesced state: it will retransmit
exactly as you wanted it to transmit in the first place.
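
A toy model of that difference, for illustration only.  It is not the
FreeBSD code path: the function names are hypothetical, and it ignores
windows, SACK, and the fact that normally only the missing data is
resent.  It assumes a 1460-byte segment payload and that, without
TF_NOPUSH, output runs after each write.

```python
MSS = 1460  # assumed payload bytes per segment


def cut(nbytes, mss=MSS):
    """Cut a contiguous run of nbytes into segment sizes."""
    out = []
    while nbytes > 0:
        seg = min(nbytes, mss)
        out.append(seg)
        nbytes -= seg
    return out


def first_transmission(writes, mss=MSS):
    """Output runs after every write, so each write is cut on its own
    and short tails go out as separate segments."""
    out = []
    for n in writes:
        out.extend(cut(n, mss))
    return out


def retransmission(writes, acked=0, mss=MSS):
    """Resend from the send buffer: the unacknowledged bytes form one
    contiguous stream, cut at MSS boundaries."""
    return cut(sum(writes) - acked, mss)


writes = [230, 8192]  # header, then one 8K hunk of file data
print(first_transmission(writes))  # [230, 1460, 1460, 1460, 1460, 1460, 892]
print(retransmission(writes))      # [1460, 1460, 1460, 1460, 1460, 1122]
```

In this model the short header segment exists only on the first pass;
anything resent from the buffer comes out as full-sized segments.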

BTW: I'm still wary of the initial fault on the file data, if
it's not already in cache: arguably, it's better to start
sending the headers, and avoid the startup latency of delaying
sending the headers until the fault is satisfied: part of the
thing that's going to be eating your PCI bandwidth is the
disk I/O, and your disks are going to be the slowest data
sources/sinks in the whole equation.

> > Really: it's in the noise.  Unless you are paying by packet
> > count, you probably shouldn't care.
> 
> So do you consider that IP fragmentation is the good thing ?

Depends; can I go end-to-end without any fragmentation at all,
or am I required to use frags to get packets through?  If I
have to use frags to get packets through, fragged data is
*much* better than no data.  8-) 8-).

In any case, I expect that this should be handled in the
context of TCP_NODELAY and TCP_NOPUSH, rather than by adding
options to work around an arguably broken sendfile(2).

-- Terry