svn commit: r194672 - in head/sys: kern netinet sys
Andre Oppermann
andre at freebsd.org
Tue Jun 23 08:29:45 UTC 2009
Robert Watson wrote:
>
> On Mon, 22 Jun 2009, Andre Oppermann wrote:
>
>> Add soreceive_stream(), an optimized version of soreceive() for
>> stream (TCP) sockets.
>
> While this sounds like very interesting work, I was struck by the lack of:
>
> Reviewed by: ?
> Tested by: ?
>
> Have you tested this change with some of our TCP consumer edge cases,
> especially in-kernel consumers:
>
> - Accept filters
> - NFS client
> - NFS server
> - smbfs
> - iscsi initiator
No, yes, yes, no, no. Plus a large amount of SSH copying (it trips over
every corrupted byte) and fetching the source code of half the ports
collection and comparing the checksums.
> I also assume that the plan is that this will not be enabled by default
> in 8.0? With soreceive_dgram, it took three months to shake out the
> nits, and the datagram case is far simpler than the stream case. Given
> that experience, I'd guess it will take longer for the new stream code
> to shake out, and probably involve painfully tracking subtle bugs in
> blocking code, data corruption in NFS, etc. I've identified one such
> possible bug for iscsi.
No, it is not supposed to be enabled by default in 8.0.
> To provide easier access for early adopters, we shipped with
> soreceive_dgram available for UDP, but optionally enabled by a loader
> tunable, for at least one minor rev. I would recommend doing something
> similar here.
A good hint, thank you.
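To sketch Robert's suggestion: an early adopter could opt in to the alternate
receive path via /boot/loader.conf. The tunable name below is hypothetical at
this point; whatever name is chosen would be read at boot before the protocol
switch tables are initialized:

```
# Hypothetical loader tunable to opt in to the optimized stream receive path
net.inet.tcp.soreceive_stream="1"
```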
> A few inline comments below, including one serious bug (assuming my
> reading is right). This doesn't constitute a full review because it's
> too early in the morning to reason fully about the socket buffer code.
OK. Thanks. I will look into the specifics later today.
--
Andre
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
>> It is functionally identical to generic soreceive() but has a
>> number of stream-specific optimizations:
>> o does only one sockbuf unlock/lock per receive independent of
>> the length of data to be moved into the uio compared to
>> soreceive() which unlocks/locks per *mbuf*.
>
> Hmm. I count four locks and unlocks per receive, assuming no blocking
> and I/O to user memory:
>
> - sblock/sbunlock to stabilize sb_mb and prevent I/O interlacing.
> - start/stop socket buffer mutex lock/unlock at beginning/end of function
> - unlock/relock around m_mbuftouio()
> - unlock/relock around pru_rcvd
>
> One idea I've futzed with a little is whether or not we could drop
> sblock() in certain cases for receive and transmit. As far as I'm
> aware, it is used for at least four purposes in receive:
>
> - Prevent I/O interlacing from concurrent system calls
> - Provide consistency for blocking logic during concurrent system calls
> - Stabilize non-NULL sb_mb while the socket buffer mutex is dropped
> - Minimize contention on the socket buffer mutex during concurrent system
> calls
>
> Some of these are more important than others -- in particular, the
> function of preventing I/O interlacing for stream system calls seems
> something we might drop, as it's not a guarantee provided by other OS's,
> so no portable application should depend on it. Have you looked at all
> at this?
>
>> o uses m_mbuftouio() instead of its own copy(out) variant.
>> o much more compact code flow as a large number of special
>> cases is removed.
>> o much improved readability.
>>
>> It offers significantly reduced CPU usage and lock contention
>> when receiving fast TCP streams. Additional gains are obtained
>> when the receiving application is using SO_RCVLOWAT to batch up
>> some data before a read (and wakeup) is done.
>>
>> This function was written by "reverse engineering" and is not
>> just a stripped down variant of soreceive().
>>
>> It is not yet enabled by default on TCP sockets. Instead it is
>> commented out in the protocol initialization in tcp_usrreq.c
>> until more widespread testing has been done.
>
> Have you looked at using this with SCTP and UNIX domain sockets? UNIX
> domain socket performance is quite important for database workloads, and
> SCTP is going to be increasingly important for Internet applications. In the
> UNIX domain socket case, not supporting control mbufs will be a potential
> problem (see comments below about either implementing it or falling
> back, and if falling back, m_mbuftouio needs to know how to properly free
> them).
>
>> Testers, especially with 10GigE gear, are welcome.
>>
>> MFP4: r164817 //depot/user/andre/soreceive_stream/
>>
>> Modified:
>> head/sys/kern/uipc_socket.c
>> head/sys/netinet/tcp_usrreq.c
>> head/sys/sys/socketvar.h
>>
>> Modified: head/sys/kern/uipc_socket.c
>> ==============================================================================
>>
>> --- head/sys/kern/uipc_socket.c Mon Jun 22 22:54:44 2009 (r194671)
>> +++ head/sys/kern/uipc_socket.c Mon Jun 22 23:08:05 2009 (r194672)
>> @@ -1857,6 +1857,202 @@ release:
>> }
>>
>> /*
>> + * Optimized version of soreceive() for stream (TCP) sockets.
>> + */
>> +int
>> +soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio *uio,
>> + struct mbuf **mp0, struct mbuf **controlp, int *flagsp)
>> +{
>> + int len = 0, error = 0, flags, oresid;
>> + struct sockbuf *sb;
>> + struct mbuf *m, *n = NULL;
>> +
>> + /* We only do stream sockets. */
>> + if (so->so_type != SOCK_STREAM)
>> + return (EINVAL);
>
> This should be a KASSERT(). You should also assert !PR_ADDR and
> !PR_ATOMIC, and possibly a few other things to ensure that
> soreceive_stream is not used with protocols that require features it
> does not implement.
>
> Much better to discover on sanity-checking the kernel than through
> subtle application bugs later when people start throwing switches
> without giving things full consideration.
>
>> + if (psa != NULL)
>> + *psa = NULL;
>> + if (controlp != NULL)
>> + return (EINVAL);
>
> This would seem to preclude easy future use of ancillary data with TCP
> -- one type I've been considering adding is TCP_INFO updates as part of
> the stream. How would you feel about making this call into
> soreceive_generic() instead?
>
> My reading of the control code was that it actually was fairly simple
> and I did include it in soreceive_dgram() (especially given that it's
> popular for UDP so you can get the receive IP address per-datagram).
>
>> + if (flagsp != NULL)
>> + flags = *flagsp &~ MSG_EOR;
>> + else
>> + flags = 0;
>> + if (flags & MSG_OOB)
>> + return (soreceive_rcvoob(so, uio, flags));
>> + if (mp0 != NULL)
>> + *mp0 = NULL;
>> +
>> + sb = &so->so_rcv;
>> +
>> + /* Prevent other readers from entering the socket. */
>> + error = sblock(sb, SBLOCKWAIT(flags));
>> + if (error)
>> + goto out;
>> + SOCKBUF_LOCK(sb);
>> +
>> + /* Easy one, no space to copyout anything. */
>> + if (uio->uio_resid == 0) {
>> + error = EINVAL;
>> + goto out;
>> + }
>
> Didn't we previously allow a resid of 0 to be used to probe for socket
> errors without receiving data? I've never used this semantic but it
> seems to be supported, which undoubtedly means someone is using it :-).
> I notice that previously we didn't allow receiving EWOULDBLOCK using a
> 0-byte buffer, but perhaps we should have.
>
>> + oresid = uio->uio_resid;
>> +
>> + /* We will never ever get anything unless we are connected. */
>> + if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) {
>> + /* When disconnecting there may be still some data left. */
>> + if (sb->sb_cc > 0)
>> + goto deliver;
>> + if (!(so->so_state & SS_ISDISCONNECTED))
>> + error = ENOTCONN;
>> + goto out;
>> + }
>> +
>> + /* Socket buffer is empty and we shall not block. */
>> + if (sb->sb_cc == 0 &&
>> + ((sb->sb_flags & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) {
>> + error = EAGAIN;
>> + goto out;
>> + }
>
> I think you've changed the error number reporting here in subtle ways,
> but it's early in the morning. It appears we now prefer ENOTCONN to
> so_error once the socket has entered SS_ISDISCONNECTED. We need to
> report and clear so_error if non-0 so that ECONNRESET, EPERM, etc, can
> be reported properly.
>
> The new so_state logic is markedly different from what came before, but
> I'm not sure what the impact of that is. Certainly the comment needs to
> change to "... are or have been connected". The previous logic appears
> to have handled SS_ISCONNECTING differently, and possibly
> SS_ISDISCONNECTING.
>
>> +restart:
>> + SOCKBUF_LOCK_ASSERT(&so->so_rcv);
>> +
>> + /* Abort if socket has reported problems. */
>> + if (so->so_error) {
>> + if (sb->sb_cc > 0)
>> + goto deliver;
>> + if (oresid > uio->uio_resid)
>> + goto out;
>> + error = so->so_error;
>> + if (!(flags & MSG_PEEK))
>> + so->so_error = 0;
>> + goto out;
>> + }
>> +
>> + /* Door is closed. Deliver what is left, if any. */
>> + if (sb->sb_state & SBS_CANTRCVMORE) {
>> + if (sb->sb_cc > 0)
>> + goto deliver;
>> + else
>> + goto out;
>> + }
>> +
>> + /* Socket buffer got some data that we shall deliver now. */
>> + if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) &&
>> + ((sb->sb_flags & SS_NBIO) ||
>> + (flags & (MSG_DONTWAIT|MSG_NBIO)) ||
>> + sb->sb_cc >= sb->sb_lowat ||
>> + sb->sb_cc >= uio->uio_resid ||
>> + sb->sb_cc >= sb->sb_hiwat) ) {
>> + goto deliver;
>> + }
>> +
>> + /* On MSG_WAITALL we must wait until all data or error arrives. */
>> + if ((flags & MSG_WAITALL) &&
>> + (sb->sb_cc >= uio->uio_resid || sb->sb_cc >= sb->sb_lowat))
>> + goto deliver;
>> +
>> + /*
>> + * Wait and block until (more) data comes in.
>> + * NB: Drops the sockbuf lock during wait.
>> + */
>> + error = sbwait(sb);
>> + if (error)
>> + goto out;
>> + goto restart;
>> +
>> +deliver:
>> + SOCKBUF_LOCK_ASSERT(&so->so_rcv);
>> + KASSERT(sb->sb_cc > 0, ("%s: sockbuf empty", __func__));
>> + KASSERT(sb->sb_mb != NULL, ("%s: sb_mb == NULL", __func__));
>> +
>> + /* Statistics. */
>> + if (uio->uio_td)
>> + uio->uio_td->td_ru.ru_msgrcv++;
>> +
>> + /* Fill uio until full or current end of socket buffer is reached. */
>> + len = min(uio->uio_resid, sb->sb_cc);
>> + if (mp0 != NULL) {
>> + /* Dequeue as many mbufs as possible. */
>> + if (!(flags & MSG_PEEK) && len >= sb->sb_mb->m_len) {
>> + for (*mp0 = m = sb->sb_mb;
>> + m != NULL && m->m_len <= len;
>> + m = m->m_next) {
>> + len -= m->m_len;
>> + uio->uio_resid -= m->m_len;
>> + sbfree(sb, m);
>> + n = m;
>> + }
>
> Since control mbufs aren't supported, you should assert !MT_CONTROL. Or
> add control mbuf support per above.
>
> Again, early in the morning, but: does (mp0 != NULL) still work with
> (flags & MSG_WAITALL)? In soreceive_generic(), mp0 is updated to point
> to &m->m_next as the loop iterates, so that as you go back to 'repeat:',
> the chain pointed to by mp0 is appended to rather than replaced, whereas
> your code appears to replace it in each loop (which seems wrong).
> Perhaps it's just too early in the morning.
>
> I think the only consumer of (mp0 != NULL) with MSG_WAITALL is the iscsi
> initiator, but breaking that would be bad. smbfs probably *should* use
> it... :-)
>
>> + sb->sb_mb = m;
>> + if (sb->sb_mb == NULL)
>> + SB_EMPTY_FIXUP(sb);
>> + n->m_next = NULL;
>> + }
>> + /* Copy the remainder. */
>> + if (len > 0) {
>> + KASSERT(sb->sb_mb != NULL,
>> + ("%s: len > 0 && sb->sb_mb empty", __func__));
>> +
>> + m = m_copym(sb->sb_mb, 0, len, M_DONTWAIT);
>> + if (m == NULL)
>> + len = 0; /* Don't flush data from sockbuf. */
>> + else
>> + uio->uio_resid -= m->m_len;
>> + if (*mp0 != NULL)
>> + n->m_next = m;
>> + else
>> + *mp0 = m;
>> + if (*mp0 == NULL) {
>> + error = ENOBUFS;
>> + goto out;
>> + }
>> + }
>> + } else {
>> + /* NB: Must unlock socket buffer as uiomove may sleep. */
>> + SOCKBUF_UNLOCK(sb);
>> + error = m_mbuftouio(uio, sb->sb_mb, len);
>
> m_mbuftouio() should assert !MT_CONTROL on each mbuf if it doesn't
> support control mbufs. See also above. :-)
>
>> + SOCKBUF_LOCK(sb);
>> + if (error)
>> + goto out;
>> + }
>> + SBLASTRECORDCHK(sb);
>> + SBLASTMBUFCHK(sb);
>> +
>> + /*
>> + * Remove the delivered data from the socket buffer unless we
>> + * were only peeking.
>> + */
>> + if (!(flags & MSG_PEEK)) {
>> + if (len > 0)
>> + sbdrop_locked(sb, len);
>> +
>> + /* Notify protocol that we drained some data. */
>> + if ((so->so_proto->pr_flags & PR_WANTRCVD) &&
>> + (((flags & MSG_WAITALL) && uio->uio_resid > 0) ||
>> + !(flags & MSG_SOCALLBCK))) {
>> + SOCKBUF_UNLOCK(sb);
>> + (*so->so_proto->pr_usrreqs->pru_rcvd)(so, flags);
>> + SOCKBUF_LOCK(sb);
>> + }
>> + }
>> +
>> + /*
>> + * For MSG_WAITALL we may have to loop again and wait for
>> + * more data to come in.
>> + */
>> + if ((flags & MSG_WAITALL) && uio->uio_resid > 0)
>> + goto restart;
>> +out:
>> + SOCKBUF_LOCK_ASSERT(sb);
>> + SBLASTRECORDCHK(sb);
>> + SBLASTMBUFCHK(sb);
>> + SOCKBUF_UNLOCK(sb);
>> + sbunlock(sb);
>> + return (error);
>> +}
>> +
>> +/*
>> * Optimized version of soreceive() for simple datagram cases from userspace.
>> * Unlike in the stream case, we're able to drop a datagram if copyout()
>> * fails, and because we handle datagrams atomically, we don't need to use a
>>
>> Modified: head/sys/netinet/tcp_usrreq.c
>> ==============================================================================
>>
>> --- head/sys/netinet/tcp_usrreq.c Mon Jun 22 22:54:44 2009 (r194671)
>> +++ head/sys/netinet/tcp_usrreq.c Mon Jun 22 23:08:05 2009 (r194672)
>> @@ -1032,6 +1032,9 @@ struct pr_usrreqs tcp_usrreqs = {
>> .pru_send = tcp_usr_send,
>> .pru_shutdown = tcp_usr_shutdown,
>> .pru_sockaddr = in_getsockaddr,
>> +#if 0
>> + .pru_soreceive = soreceive_stream,
>> +#endif
>> .pru_sosetlabel = in_pcbsosetlabel,
>> .pru_close = tcp_usr_close,
>> };
>> @@ -1053,6 +1056,9 @@ struct pr_usrreqs tcp6_usrreqs = {
>> .pru_send = tcp_usr_send,
>> .pru_shutdown = tcp_usr_shutdown,
>> .pru_sockaddr = in6_mapped_sockaddr,
>> +#if 0
>> + .pru_soreceive = soreceive_stream,
>> +#endif
>> .pru_sosetlabel = in_pcbsosetlabel,
>> .pru_close = tcp_usr_close,
>> };
>>
>> Modified: head/sys/sys/socketvar.h
>> ==============================================================================
>>
>> --- head/sys/sys/socketvar.h Mon Jun 22 22:54:44 2009 (r194671)
>> +++ head/sys/sys/socketvar.h Mon Jun 22 23:08:05 2009 (r194672)
>> @@ -345,6 +345,9 @@ int sopoll_generic(struct socket *so, in
>> struct ucred *active_cred, struct thread *td);
>> int soreceive(struct socket *so, struct sockaddr **paddr, struct uio *uio,
>> struct mbuf **mp0, struct mbuf **controlp, int *flagsp);
>> +int soreceive_stream(struct socket *so, struct sockaddr **paddr,
>> + struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
>> + int *flagsp);
>> int soreceive_dgram(struct socket *so, struct sockaddr **paddr,
>> struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
>> int *flagsp);
>>
>
>