svn commit: r275326 - in head: sys/dev/cxgbe/tom sys/kern sys/netinet sys/sys usr.bin/bluetooth/btsockstat usr.bin/netstat usr.bin/systat

Gleb Smirnoff <glebius@FreeBSD.org>
Sun Nov 30 23:21:33 UTC 2014


  Robert,

  thanks for the longer reply!

On Sun, Nov 30, 2014 at 09:33:50PM +0000, Robert Watson wrote:
R> I do have a substantial worry about 'compatibility' -- which is to say, the 
R> risk of not only running into trouble with more obscure bits of the network 
R> stack, but also third-party extensions in local code bases that may make 
R> previously reasonable assumptions about mbuf behaviour.

From the viewpoint of outer code that deals with the socket buffer via APIs
like sbappendstream(), sbdrop(), etc., the change is backward compatible. We
still accept mbufs as we did before. The new behaviour applies only when we
are explicitly called with PRUS_NOTREADY.
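
To illustrate what "explicitly called" means, here is a rough sketch of the
opt-in pattern (simplified and from memory, not the exact tree code; error
handling is dropped and npages is just a placeholder for the count of mbufs
being marked ready):

	/*
	 * A subsystem that wants not-yet-filled data sitting in the send
	 * buffer marks the mbufs and tells the protocol not to transmit
	 * them yet.
	 */
	m->m_flags |= M_NOTREADY;	/* backing pages still in flight */
	error = (*so->so_proto->pr_usrreqs->pru_send)(so, PRUS_NOTREADY,
	    m, NULL, NULL, td);

	/*
	 * Later, from its I/O completion handler, the same subsystem
	 * marks the chain ready, and only then may the protocol send it.
	 */
	error = (*so->so_proto->pr_usrreqs->pru_ready)(so, m, npages);

A caller that never passes PRUS_NOTREADY gets exactly the old behaviour.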

Those who would suffer incompatibilities are those who touch struct sockbuf
itself, like SCTP does. Well, that isn't the right thing to do! Of course,
the BSD stack never provided a proper API for sockbufs, and our current
stack doesn't provide an excellent one either. Ideally, struct sockbuf would
be private to uipc_sockbuf.c. We are very far from that, but we should head
in that direction.

R> One way to bound the 
R> scope of that issue is to ensure, very firmly, that these flags can never leak 
R> out of the socket-buffer/protocol components that explicitly support it.  I 
R> think I might feel more comfortable if these were globally visible mbuf flags, 
R> which we could assert never passed through certain types of interfaces -- 
R> e.g., made it to ip_output(); it might also be useful to assert that these 
R> flags are never set when an mbuf is freed back to the mbuf allocator.  I also 
R> worry that leakage of M_PROTO flags from other components could lead to quite 
R> hard-to-debug problems, and so it would be good to ensure that subsystems that 
R> will use these flags assert that they are never set as they arrive in the 
R> subsystem (e.g., tcp_input()).  I'm not sure how we're doing on mbuf flags, 
R> but we might soon find we require another field in struct mbuf for them. 
R> Anywhere we can use assertions to help us with this kind of problem, I think 
R> it's a big win.

Andre did a lot of work on asserting that M_PROTO* flags do not leak across
protocol boundaries and subsystems. We can probably add more there. I don't
like the idea of declaring these flags at global scope, although we did use
that during debugging. We should actually emphasize that they are local, and
hide them.

The assertions can and should use the M_PROTO* defines. If we used the
M_NOTREADY define in assertions outside the socket buffer code, that could
be very misleading once such an assertion fires: a developer would look for
the bug in the socket buffer code, while the culprit could be any other code
that uses M_PROTO1.
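
For example, a boundary assertion would look something like this (just a
sketch; M_PROTOFLAGS is the existing mask covering all the M_PROTO* bits):

	KASSERT((m->m_flags & M_PROTOFLAGS) == 0,
	    ("%s: M_PROTO* flags leaked in on mbuf %p", __func__, m));

When it fires, it says "some M_PROTO* flag leaked in here", which points the
developer at the subsystem that handed us the mbuf, not at the socket buffer
code.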

If we move M_NOTREADY out of the M_PROTO* pool of bits, we could run out of
bits very soon.
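
To make the point concrete, the define stays a private alias of an M_PROTO
bit next to the socket buffer code instead of becoming a first-class flag in
mbuf.h, roughly (illustrative only):

	/* Visible only to socket buffer code and its direct consumers. */
	#define	M_NOTREADY	M_PROTO1	/* mbuf data not ready to be sent */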

R> Although these changes have clearly been refined over an extended period, and 
R> were developed in a Subversion project branch, I think it would have been nice 
R> to be able to have more "Reviewed by:" entries in the commit message.  I've 
R> been really happy to see a gradual change over the last few years towards a 
R> large percentage of network-stack commits having that tag, and think it's 
R> worth pursuing this pretty vigorously.  A phabricator round to networks@ (or 
R> whatever the recently created alias/group was) would not go amiss in the 
R> future.  While everyone is busy, creating an environment of mutual review 
R> (I'll scratch your back if you scratch mine) is very valuable -- especially 
R> for subsystems as subtle as the network stack.  Looking back in the e-mail 
R> archives, I see that you did post a phabricator link to an earlier version of 
R> this work in May -- I suspect more armtwisting was required, but would in 
R> general be worth it.  (I've been very tuned out the last few months due to 
R> ongoing things outside of the FreeBSD world -- but please do feel free to ping 
R> me if there's a critical patch like this in flight where I can at least lend a 
R> hand with a quick read.)

Yes, I'm very sorry for not pushing people hard enough, but please understand
that the guilt there is shared between me and the potential reviewers. I
didn't sweep the upcoming patch under the rug; I posted it several times.

If you can do a post-commit review, that would be much appreciated. I've
split the changes into chunks as small and self-contained as possible to
make that easier.

R> Although not an immediate issue, it may be worth looking at what happens to 
R> performance with these changes in the presence of lower-latency storage 
R> devices, such as nvme.  Where there is substantial latency due to a spinning 
R> disk, or even a congested SATA-attached SSD, it's easy to see how these 
R> changes would improve performance.  I do worry that with lower latency, you 
R> might see increases in lock contention -- or at least shifts, which would be 
R> worth understanding.  I'm not sure if your testing environment is suitable for 
R> that (nor, I think, the FreeBSD Project's netperf cluster -- we've just 
R> received our first nvme device at the Computer Laboratory for network-stack 
R> performance work .. and it will be interesting to see how that plays out. 
R> With <30us RTT on I/O, multiqueue support, etc, the dynamics are changing 
R> quite a lot.)  The concern would be that we might now be doing many more 
R> socket-buffer locking/wakeup-style events than before, and that if they are 
R> paced too quickly the threads will bump into each other.

While I was working on the new sendfile, everyone who observed the work
(scottl@, Igor Sysoev, others, including myself) predicted that HDD-based
servers would benefit from the new sendfile much more than SSD-based servers.
And now, according to Netflix production, we were mistaken :) I don't have a
good explanation for that phenomenon right now.

Regarding lock contention and wakeups: no, there wouldn't be many more, there
would be fewer. Let me count. Note that any software that wants to push above
10 Gig with heavy disk I/O via sendfile() already utilizes SF_NODISKIO. What
happens with the old code:

app -> sendfile()
	SOCKBUF_LOCK() at preamble for sbspace();
    <- EBUSY
app -> aio_read()
app -> kevent
    <- aio done
app -> sendfile()
	SOCKBUF_LOCK() at preamble for sbspace();
	SOCKBUF_LOCK() at pru_send

With the new code:

app -> sendfile()  
	SOCKBUF_LOCK() at preamble for sbspace();
	SOCKBUF_LOCK() at pru_send
    <- 0
cam -> sf_iodone()
	SOCKBUF_LOCK() at pru_ready

That is even less locking of the sockbuf, and the comparison above is for a
small request that can be brought into the VM cache with a single aio_read().
If the request is larger, the app can bounce between sendfile() returning
EBUSY and aio_read() several times.
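
For reference, the old dance from the application's point of view is roughly
the following (a simplified sketch, not production code: the chunk size,
scratch buffer and single-shot kevent() call are my own choices, and the
partial-send bookkeeping is reduced to the bare minimum):

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/event.h>
#include <aio.h>
#include <err.h>
#include <errno.h>
#include <string.h>

#define	PRELOAD_CHUNK	(128 * 1024)	/* arbitrary read-ahead size */

static void
send_range(int kq, int fd, int s, off_t off, size_t len)
{
	static char scratch[PRELOAD_CHUNK];
	struct aiocb cb;
	struct kevent ev;
	off_t sbytes;

	while (len > 0) {
		/* Send only what is already resident in the VM cache. */
		if (sendfile(fd, s, off, len, NULL, &sbytes, SF_NODISKIO) == 0)
			return;
		if (errno != EBUSY)
			err(1, "sendfile");
		/* Part of the range may have gone out before EBUSY. */
		off += sbytes;
		len -= sbytes;

		/*
		 * Read a chunk into a scratch buffer purely to populate
		 * the VM cache, wait for the aio completion on the kqueue,
		 * then come back and retry sendfile().
		 */
		memset(&cb, 0, sizeof(cb));
		cb.aio_fildes = fd;
		cb.aio_offset = off;
		cb.aio_buf = scratch;
		cb.aio_nbytes = len < sizeof(scratch) ? len : sizeof(scratch);
		cb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
		cb.aio_sigevent.sigev_notify_kqueue = kq;
		if (aio_read(&cb) != 0)
			err(1, "aio_read");
		if (kevent(kq, NULL, 0, &ev, 1, NULL) != 1)
			err(1, "kevent");
		(void)aio_return(&cb);
	}
}

With the new sendfile() the whole loop collapses into the single call at the
top, and the readiness bookkeeping moves into the kernel.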

Anyway, all these acquisitions of the socket buffer lock are quite short. We
profiled contention on socket buffer locks at Netflix and found that only
sbdrop() held the lock for long. That yielded the sbcut() optimization in
r256185.

R> FWIW, I think it would be worth pondering adding an aio_sendfile(), if only 
R> because it would make it easier to expose this kernel asynchrony to userspace 
R> concurrency frameworks like libdispatch.

We decided not to go for aio_sendfile() for the following two reasons:

1) The aio(4) framework is quite heavy. It has kernel daemons that carry
   context, etc., and all AIO is serialized through queues. We don't need
   all this heaviness for such a simple task as sf_iodone(). We actually
   tried to implement aio_sendfile() -- I tried and Adrian tried -- and it
   was not going to yield good code.

2) We aren't the most popular OS. If we introduce a new API, no matter how
   cool it is, we also need all the software around to accept and adopt it.
   The new sendfile(2), which does not block on I/O, is not a new API: it is
   a drop-in replacement for the old implementation. Software such as nginx,
   varnish, lighttpd, etc. needs neither recompilation nor code changes.
   Just upgrade to FreeBSD 11 and get the benefits.


-- 
Totus tuus, Glebius.

