svn commit: r194672 - in head/sys: kern netinet sys

Robert Watson rwatson at FreeBSD.org
Tue Jun 23 07:07:27 UTC 2009


On Mon, 22 Jun 2009, Andre Oppermann wrote:

>  Add soreceive_stream(), an optimized version of soreceive() for
>  stream (TCP) sockets.

While this sounds like very interesting work, I was struck by the lack of:

   Reviewed by:	?
   Tested by:	?

Have you tested this change with some of our TCP consumer edge cases, 
especially in-kernel consumers:

- Accept filters
- NFS client
- NFS server
- smbfs
- iscsi initiator

I also assume that the plan is that this will not be enabled by default in 
8.0?  With soreceive_dgram, it took three months to shake out the nits, and 
the datagram case is far simpler than the stream case.  Given that experience, 
I'd guess it will take longer for the new stream code to shake out, and 
probably involve painfully tracking down subtle bugs in blocking code, data 
corruption in NFS, etc.  I've identified one such possible bug for iscsi.

To provide easier access for early adopters, we shipped with soreceive_dgram 
available for UDP, but enabled only via a loader tunable, for at least one 
minor rev.  I would recommend doing something similar here.

A few inline comments below, including one serious bug (assuming my reading is 
right).  This doesn't constitute a full review because it's too early in the 
morning to reason fully about the socket buffer code.

Robert N M Watson
Computer Laboratory
University of Cambridge

>  It is functionally identical to generic soreceive() but has a
>  number stream specific optimizations:
>  o does only one sockbuf unlock/lock per receive independent of
>    the length of data to be moved into the uio compared to
>    soreceive() which unlocks/locks per *mbuf*.

Hmm.  I count four locks and unlocks per receive, assuming no blocking and an 
I/O to user memory:

- sblock/sbunlock to stabilize sb_mb and prevent I/O interlacing.
- start/stop socket buffer mutex lock/unlock at beginning/end of function
- unlock/relock around m_mbuftouio()
- unlock/relock around pru_rcvd

One idea I've futzed with a little is whether or not we could drop sblock() in 
certain cases for receive and transmit.  As far as I'm aware, it is used for 
at least four purposes in receive:

- Prevent I/O interlacing from concurrent system calls
- Provide consistency for blocking logic during concurrent system calls
- Stabilize non-NULL sb_mb while the socket buffer mutex is dropped
- Minimize contention on the socket buffer mutex during concurrent system
   calls

Some of these are more important than others -- in particular, the function of 
preventing I/O interlacing for stream system calls seems like something we 
might drop, as it's not a guarantee provided by other OSes, so no portable 
application should depend on it.  Have you looked at all at this?

>  o uses m_mbuftouio() instead of its own copy(out) variant.
>  o much more compact code flow as a large number of special
>    cases is removed.
>  o much improved reability.
>
>  It offers significantly reduced CPU usage and lock contention
>  when receiving fast TCP streams.  Additional gains are obtained
>  when the receiving application is using SO_RCVLOWAT to batch up
>  some data before a read (and wakeup) is done.
>
>  This function was written by "reverse engineering" and is not
>  just a stripped down variant of soreceive().
>
>  It is not yet enabled by default on TCP sockets.  Instead it is
>  commented out in the protocol initialization in tcp_usrreq.c
>  until more widespread testing has been done.

Have you looked at using this with SCTP and UNIX domain sockets?  UNIX domain 
socket performance is quite important for database workloads, and SCTP is 
going to be increasingly important for Internet applications.  In the UNIX 
domain socket case, not supporting control mbufs will be a potential problem 
(see comments below about either implementing them or falling back; if falling 
back, m_mbuftouio() needs to know how to free them properly).

>  Testers, especially with 10GigE gear, are welcome.
>
>  MFP4:	r164817 //depot/user/andre/soreceive_stream/
>
> Modified:
>  head/sys/kern/uipc_socket.c
>  head/sys/netinet/tcp_usrreq.c
>  head/sys/sys/socketvar.h
>
> Modified: head/sys/kern/uipc_socket.c
> ==============================================================================
> --- head/sys/kern/uipc_socket.c	Mon Jun 22 22:54:44 2009	(r194671)
> +++ head/sys/kern/uipc_socket.c	Mon Jun 22 23:08:05 2009	(r194672)
> @@ -1857,6 +1857,202 @@ release:
> }
>
> /*
> + * Optimized version of soreceive() for stream (TCP) sockets.
> + */
> +int
> +soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio *uio,
> +    struct mbuf **mp0, struct mbuf **controlp, int *flagsp)
> +{
> +	int len = 0, error = 0, flags, oresid;
> +	struct sockbuf *sb;
> +	struct mbuf *m, *n = NULL;
> +
> +	/* We only do stream sockets. */
> +	if (so->so_type != SOCK_STREAM)
> +		return (EINVAL);

This should be a KASSERT().  You should also assert !PR_ADDR and !PR_ATOMIC, 
and possibly a few other things to ensure that soreceive_stream is not used 
with protocols that require features it does not implement.

Much better to discover such problems via kernel sanity checks than through 
subtle application bugs later, when people start throwing switches without 
giving things full consideration.

> +	if (psa != NULL)
> +		*psa = NULL;
> +	if (controlp != NULL)
> +		return (EINVAL);

This would seem to preclude easy future use of ancillary data with TCP -- one 
type I've been considering adding is TCP_INFO updates as part of the stream. 
How would you feel about making this call into soreceive_generic() instead?

My reading of the control code was that it actually was fairly simple and I 
did include it in soreceive_dgram() (especially given that it's popular for 
UDP so you can get the receive IP address per-datagram).

> +	if (flagsp != NULL)
> +		flags = *flagsp &~ MSG_EOR;
> +	else
> +		flags = 0;
> +	if (flags & MSG_OOB)
> +		return (soreceive_rcvoob(so, uio, flags));
> +	if (mp0 != NULL)
> +		*mp0 = NULL;
> +
> +	sb = &so->so_rcv;
> +
> +	/* Prevent other readers from entering the socket. */
> +	error = sblock(sb, SBLOCKWAIT(flags));
> +	if (error)
> +		goto out;
> +	SOCKBUF_LOCK(sb);
> +
> +	/* Easy one, no space to copyout anything. */
> +	if (uio->uio_resid == 0) {
> +		error = EINVAL;
> +		goto out;
> +	}

Didn't we previously allow a resid of 0 to be used to probe for socket errors 
without receiving data?  I've never used this semantic but it seems to be 
supported, which undoubtedly means someone is using it :-).  I notice that 
previously we didn't allow receiving EWOULDBLOCK using a 0-byte buffer, but 
perhaps we should have.

> +	oresid = uio->uio_resid;
> +
> +	/* We will never ever get anything unless we are connected. */
> +	if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) {
> +		/* When disconnecting there may be still some data left. */
> +		if (sb->sb_cc > 0)
> +			goto deliver;
> +		if (!(so->so_state & SS_ISDISCONNECTED))
> +			error = ENOTCONN;
> +		goto out;
> +	}
> +
> +	/* Socket buffer is empty and we shall not block. */
> +	if (sb->sb_cc == 0 &&
> +	    ((sb->sb_flags & SS_NBIO) || (flags & (MSG_DONTWAIT|MSG_NBIO)))) {
> +		error = EAGAIN;
> +		goto out;
> +	}

I think you've changed the error number reporting here in subtle ways, but 
it's early in the morning.  It appears we now prefer ENOTCONN to so_error once 
the socket has entered SS_ISDISCONNECTED.  We need to report and clear 
so_error if non-0 so that ECONNRESET, EPERM, etc, can be reported properly.

The new so_state logic is markedly different from what came before, but I'm 
not sure what the impact of that is.  Certainly the comment needs to change to 
"... are or have been connected".  The previous logic appears to have handled 
SS_ISCONNECTING differently, and possibly SS_ISDISCONNECTING.

> +restart:
> +	SOCKBUF_LOCK_ASSERT(&so->so_rcv);
> +
> +	/* Abort if socket has reported problems. */
> +	if (so->so_error) {
> +		if (sb->sb_cc > 0)
> +			goto deliver;
> +		if (oresid > uio->uio_resid)
> +			goto out;
> +		error = so->so_error;
> +		if (!(flags & MSG_PEEK))
> +			so->so_error = 0;
> +		goto out;
> +	}
> +
> +	/* Door is closed.  Deliver what is left, if any. */
> +	if (sb->sb_state & SBS_CANTRCVMORE) {
> +		if (sb->sb_cc > 0)
> +			goto deliver;
> +		else
> +			goto out;
> +	}
> +
> +	/* Socket buffer got some data that we shall deliver now. */
> +	if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) &&
> +	    ((sb->sb_flags & SS_NBIO) ||
> +	     (flags & (MSG_DONTWAIT|MSG_NBIO)) ||
> +	     sb->sb_cc >= sb->sb_lowat ||
> +	     sb->sb_cc >= uio->uio_resid ||
> +	     sb->sb_cc >= sb->sb_hiwat) ) {
> +		goto deliver;
> +	}
> +
> +	/* On MSG_WAITALL we must wait until all data or error arrives. */
> +	if ((flags & MSG_WAITALL) &&
> +	    (sb->sb_cc >= uio->uio_resid || sb->sb_cc >= sb->sb_lowat))
> +		goto deliver;
> +
> +	/*
> +	 * Wait and block until (more) data comes in.
> +	 * NB: Drops the sockbuf lock during wait.
> +	 */
> +	error = sbwait(sb);
> +	if (error)
> +		goto out;
> +	goto restart;
> +
> +deliver:
> +	SOCKBUF_LOCK_ASSERT(&so->so_rcv);
> +	KASSERT(sb->sb_cc > 0, ("%s: sockbuf empty", __func__));
> +	KASSERT(sb->sb_mb != NULL, ("%s: sb_mb == NULL", __func__));
> +
> +	/* Statistics. */
> +	if (uio->uio_td)
> +		uio->uio_td->td_ru.ru_msgrcv++;
> +
> +	/* Fill uio until full or current end of socket buffer is reached. */
> +	len = min(uio->uio_resid, sb->sb_cc);
> +	if (mp0 != NULL) {
> +		/* Dequeue as many mbufs as possible. */
> +		if (!(flags & MSG_PEEK) && len >= sb->sb_mb->m_len) {
> +			for (*mp0 = m = sb->sb_mb;
> +			     m != NULL && m->m_len <= len;
> +			     m = m->m_next) {
> +				len -= m->m_len;
> +				uio->uio_resid -= m->m_len;
> +				sbfree(sb, m);
> +				n = m;
> +			}

Since control mbufs aren't supported, you should assert !MT_CONTROL.  Or add 
control mbuf support per above.

Again, early in the morning, but: does (mp0 != NULL) still work with (flags & 
MSG_WAITALL)?  In soreceive_generic(), mp0 is updated to point to &m->m_next 
as the loop iterates, so that as you go back to 'repeat:', the chain pointed 
to by mp0 is appended to rather than replaced, whereas your code appears to 
replace it in each loop (which seems wrong).  Perhaps it's just too early in 
the morning.

I think the only consumer of (mp0 != NULL) with MSG_WAITALL is the iscsi 
initiator, but breaking that would be bad.  smbfs probably *should* use it... 
:-)

> +			sb->sb_mb = m;
> +			if (sb->sb_mb == NULL)
> +				SB_EMPTY_FIXUP(sb);
> +			n->m_next = NULL;
> +		}
> +		/* Copy the remainder. */
> +		if (len > 0) {
> +			KASSERT(sb->sb_mb != NULL,
> +			    ("%s: len > 0 && sb->sb_mb empty", __func__));
> +
> +			m = m_copym(sb->sb_mb, 0, len, M_DONTWAIT);
> +			if (m == NULL)
> +				len = 0;	/* Don't flush data from sockbuf. */
> +			else
> +				uio->uio_resid -= m->m_len;
> +			if (*mp0 != NULL)
> +				n->m_next = m;
> +			else
> +				*mp0 = m;
> +			if (*mp0 == NULL) {
> +				error = ENOBUFS;
> +				goto out;
> +			}
> +		}
> +	} else {
> +		/* NB: Must unlock socket buffer as uiomove may sleep. */
> +		SOCKBUF_UNLOCK(sb);
> +		error = m_mbuftouio(uio, sb->sb_mb, len);

m_mbuftouio() should assert !MT_CONTROL on each mbuf if it doesn't support 
control mbufs.  See also above. :-)

> +		SOCKBUF_LOCK(sb);
> +		if (error)
> +			goto out;
> +	}
> +	SBLASTRECORDCHK(sb);
> +	SBLASTMBUFCHK(sb);
> +
> +	/*
> +	 * Remove the delivered data from the socket buffer unless we
> +	 * were only peeking.
> +	 */
> +	if (!(flags & MSG_PEEK)) {
> +		if (len > 0)
> +			sbdrop_locked(sb, len);
> +
> +		/* Notify protocol that we drained some data. */
> +		if ((so->so_proto->pr_flags & PR_WANTRCVD) &&
> +		    (((flags & MSG_WAITALL) && uio->uio_resid > 0) ||
> +		     !(flags & MSG_SOCALLBCK))) {
> +			SOCKBUF_UNLOCK(sb);
> +			(*so->so_proto->pr_usrreqs->pru_rcvd)(so, flags);
> +			SOCKBUF_LOCK(sb);
> +		}
> +	}
> +
> +	/*
> +	 * For MSG_WAITALL we may have to loop again and wait for
> +	 * more data to come in.
> +	 */
> +	if ((flags & MSG_WAITALL) && uio->uio_resid > 0)
> +		goto restart;
> +out:
> +	SOCKBUF_LOCK_ASSERT(sb);
> +	SBLASTRECORDCHK(sb);
> +	SBLASTMBUFCHK(sb);
> +	SOCKBUF_UNLOCK(sb);
> +	sbunlock(sb);
> +	return (error);
> +}
> +
> +/*
>  * Optimized version of soreceive() for simple datagram cases from userspace.
>  * Unlike in the stream case, we're able to drop a datagram if copyout()
>  * fails, and because we handle datagrams atomically, we don't need to use a
>
> Modified: head/sys/netinet/tcp_usrreq.c
> ==============================================================================
> --- head/sys/netinet/tcp_usrreq.c	Mon Jun 22 22:54:44 2009	(r194671)
> +++ head/sys/netinet/tcp_usrreq.c	Mon Jun 22 23:08:05 2009	(r194672)
> @@ -1032,6 +1032,9 @@ struct pr_usrreqs tcp_usrreqs = {
> 	.pru_send =		tcp_usr_send,
> 	.pru_shutdown =		tcp_usr_shutdown,
> 	.pru_sockaddr =		in_getsockaddr,
> +#if 0
> +	.pru_soreceive =	soreceive_stream,
> +#endif
> 	.pru_sosetlabel =	in_pcbsosetlabel,
> 	.pru_close =		tcp_usr_close,
> };
> @@ -1053,6 +1056,9 @@ struct pr_usrreqs tcp6_usrreqs = {
> 	.pru_send =		tcp_usr_send,
> 	.pru_shutdown =		tcp_usr_shutdown,
> 	.pru_sockaddr =		in6_mapped_sockaddr,
> +#if 0
> +	.pru_soreceive =	soreceive_stream,
> +#endif
>  	.pru_sosetlabel =	in_pcbsosetlabel,
> 	.pru_close =		tcp_usr_close,
> };
>
> Modified: head/sys/sys/socketvar.h
> ==============================================================================
> --- head/sys/sys/socketvar.h	Mon Jun 22 22:54:44 2009	(r194671)
> +++ head/sys/sys/socketvar.h	Mon Jun 22 23:08:05 2009	(r194672)
> @@ -345,6 +345,9 @@ int	sopoll_generic(struct socket *so, in
> 	    struct ucred *active_cred, struct thread *td);
> int	soreceive(struct socket *so, struct sockaddr **paddr, struct uio *uio,
> 	    struct mbuf **mp0, struct mbuf **controlp, int *flagsp);
> +int	soreceive_stream(struct socket *so, struct sockaddr **paddr,
> +	    struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
> +	    int *flagsp);
> int	soreceive_dgram(struct socket *so, struct sockaddr **paddr,
> 	    struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
> 	    int *flagsp);
>

