kern/174087: Problems with ephemeral port selection

Mon Dec 3 15:40:00 UTC 2012

>Number:         174087
>Category:       kern
>Synopsis:       Problems with ephemeral port selection
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Dec 03 15:40:00 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Keith Arner
>Release:        7.2
>Organization:
Panasas
>Environment:
FreeBSD pa-twin-19a 7.2-RELEASE FreeBSD 7.2-RELEASE #0: Mon Apr 19 16:24:09 EDT 2010     root at perf-x3:/usr/obj/usr0/jimz/freebsd-c-rack/sys/PANASAS  amd64
>Description:
Date:      Fri, 30 Nov 2012 09:09:08 -0500
From:      Keith Arner <vornum at gmail.com>
To:        freebsd-net at freebsd.org
Subject:   Problems with ephemeral port selection
Message-ID:  <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg at mail.gmail.com>

I've noticed some issues with ephemeral port number selection from
tcp_connect(), which limit the number of concurrent, outgoing connections
that can be established (connect(), rather than accept()).  Sifting through
the source code, I believe the issues stem from two problems in the
tcp_connect() code path.  Specifically:

 1) The wrong function gets called to determine if a given ephemeral
    port number is currently usable.
 2) The ephemeral port number gets selected without considering the
    foreign addr/port.

Curiously, the effect of #1 mostly cancels the effect of #2, such that
the common calling convention gives you a correct result so long as you
only have a small number of outgoing connections.  However, once you get to
a large number of outgoing connections, things start to break down.  (I'll
define large and small later.)

As a side note, I have been working with FreeBSD 7.2.  The implementations
of several of the relevant functions have been refactored somewhere between
7.2-RELEASE and 9-STABLE, but the core problems in the logic seem to be
the same between versions.

For problem #1, the code path that selects the ephemeral port number is:
 tcp_connect() ->
   in_pcbbind() ->
     in_pcbbind_setup() ->
       in_pcb_lport() [not in FreeBSD 7.2] ->
         in_pcblookup_local()

There is a loop in in_pcb_lport() [or directly in in_pcbbind_setup() in
earlier releases] that considers candidate ephemeral port numbers and
calls in_pcblookup_local() to determine if a given candidate is suitable.
The default behaviour (if the caller has not set either SO_REUSEADDR or
SO_REUSEPORT) is to pick a local port number that is not in use by
*any* local TCP socket.

So long as the number of concurrent, outgoing connections is less than the
range configured by `sysctl net.inet.ip.portrange.*`, selecting a totally
unique ephemeral port number works OK.  However, you cannot exceed that
limit, even if each outgoing connection has a unique faddr/fport.  This
does not limit the number of connections that can be accept()'ed, only the
number of connections that can be connect()'ed.

In this particular path, I think the code should call in_pcblookup_hash(),
rather than in_pcblookup_local().  The criteria in in_pcblookup_hash() only
match if the full 5-tuple matches, rather than just the local port number.
The complication, of course, comes from the fact that in_pcbbind() is
called from both bind() and for the implicit bind that happens for a
connect().  The matching criteria in in_pcblookup_local() make sense for
the former but not quite for the latter.
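
To make the difference between the two criteria concrete, here is a
simplified user-space model.  It is only a sketch: struct conn,
port_in_use() and tuple_in_use() are names made up for illustration, not
kernel symbols, and wildcard binds and the protocol field are ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a connected pcb; not the kernel's struct inpcb. */
struct conn {
    uint32_t laddr, faddr;   /* local / foreign IPv4 address */
    uint16_t lport, fport;   /* local / foreign port */
};

/* Port-only check, in the spirit of the in_pcblookup_local() behaviour
 * described above (no SO_REUSEADDR): any existing socket that uses lport
 * makes the candidate unusable. */
static int port_in_use(const struct conn *tab, size_t n, uint16_t lport)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].lport == lport)
            return 1;
    return 0;
}

/* Full-tuple check, in the spirit of in_pcblookup_hash(): only an exact
 * match on laddr, lport, faddr and fport counts as a conflict. */
static int tuple_in_use(const struct conn *tab, size_t n, const struct conn *c)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].laddr == c->laddr && tab[i].lport == c->lport &&
            tab[i].faddr == c->faddr && tab[i].fport == c->fport)
            return 1;
    return 0;
}
```

With one existing connection from local port 50000 to foreign port 80, the
port-only check rejects 50000 for every destination, while the full-tuple
check rejects it only for that same destination.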

I mentioned that the above is the default behaviour you get when you don't
specify SO_REUSEADDR or SO_REUSEPORT.  Setting SO_REUSEADDR
before calling connect() has some surprising consequences (surprising in the
sense that I don't believe SO_REUSEADDR is supposed to have any effect
on connect()).  In this case, when in_pcblookup_local() is called, wild_okay
is set to false.  This changes the matching criteria to (in effect) allow
tcp_connect() to use the full 5-tuple space.  However, this brings us to the
second problem.

Problem #2 is that the ephemeral port number is chosen before the
fport/faddr gets set on the pcb; that is, tcp_connect() calls in_pcbbind() to
select the ephemeral port number, *then* calls in_pcbconnect_setup() to
populate the fport/faddr.  With SO_REUSEADDR, in_pcbbind() can select
an in-use local port.  If the local port is used by a socket with a different
laddr/fport/faddr, all is good.  However, if the local port selection
results in a full conflict, it will get rejected by the call to
in_pcblookup_hash() inside
in_pcbconnect_setup().  This happens *after* the loop inside
in_pcbbind(), so the call to tcp_connect() fails with EADDRINUSE.  Thus,
with SO_REUSEADDR, connect() can fail with EADDRINUSE long before
the ephemeral port space has been exhausted.  The application could re-try
the call to connect() and likely succeed, as a new local port would be
selected.
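
As a rough sketch of that application-side workaround (not a fix for the
underlying problem), something like the following could be used.  The
helper name connect_retry() and its max_tries parameter are hypothetical.
It opens a fresh socket for each attempt, on the conservative assumption
that a socket is not reusable after a failed connect() (POSIX leaves its
state unspecified).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: open a TCP socket with SO_REUSEADDR set and connect
 * it to dst, retrying with a new socket while connect() fails with
 * EADDRINUSE.  Returns the connected descriptor, or -1 with errno set. */
int connect_retry(const struct sockaddr *dst, socklen_t dstlen, int max_tries)
{
    int saved = EADDRINUSE;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        int one = 1;
        int s = socket(dst->sa_family, SOCK_STREAM, 0);
        if (s < 0)
            return -1;              /* errno set by socket() */

        if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) == 0 &&
            connect(s, dst, dstlen) == 0)
            return s;               /* connected */

        saved = errno;              /* remember why this attempt failed */
        close(s);
        if (saved != EADDRINUSE)
            break;                  /* only a port conflict is worth a retry */
    }
    errno = saved;
    return -1;
}
```

A caller would pass the same sockaddr_in it would otherwise hand to
connect(), with a small max_tries; any error other than EADDRINUSE is
reported immediately.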

Overall, this behaviour hinders the ability to open a large number of
outbound connections:
 * If you don't specify SO_REUSEADDR, you have a fairly limited maximum
   number of outbound connections.
 * If you do specify SO_REUSEADDR, you are able to open a much larger
   number of outbound connections, but must retry on EADDRINUSE.

I believe that the logic under tcp_connect() should be modified to:

 - behave uniformly whether or not SO_REUSEADDR has been set
 - allow outgoing connection requests to re-use a local port number, so
   long as the remaining elements of the tuple (laddr, fport, faddr) are
   unique
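
A toy model of the selection I have in mind is sketched below.  The names
(struct conn, pick_lport()) are hypothetical and this is not kernel code:
it scans the range linearly rather than using the stack's real candidate
order, and it ignores listening sockets and wildcard binds.  The point is
only that the foreign address and port are known when the local port is
chosen, so the uniqueness test can cover the whole tuple.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a connected pcb. */
struct conn {
    uint32_t laddr, faddr;   /* local / foreign IPv4 address */
    uint16_t lport, fport;   /* local / foreign port */
};

/* Return a local port in [first, last] such that
 * (laddr, lport, faddr, fport) is not already present in tab,
 * or 0 if every port in the range conflicts for this destination. */
static uint16_t pick_lport(const struct conn *tab, size_t n,
                           uint32_t laddr, uint32_t faddr, uint16_t fport,
                           uint16_t first, uint16_t last)
{
    /* 32-bit counter so the loop terminates when last == 65535. */
    for (uint32_t p = first; p <= last; p++) {
        size_t i;
        for (i = 0; i < n; i++)
            if (tab[i].lport == p && tab[i].laddr == laddr &&
                tab[i].faddr == faddr && tab[i].fport == fport)
                break;              /* exact tuple already in use */
        if (i == n)
            return (uint16_t)p;     /* no conflict for this candidate */
    }
    return 0;
}
```

Under this model the port range limits the number of connections per
(laddr, faddr, fport) combination, not the total number of outgoing
connections.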

==========
Follow-up from the freebsd-net mailing list:

Date:      Sat, 01 Dec 2012 11:31:31 -0300
From:      Fernando Gont <fernando at gont.com.ar>
To:        Keith Arner <vornum at gmail.com>
Cc:        freebsd-net at freebsd.org
Subject:   Re: Problems with ephemeral port selection
Message-ID:  <50BA14C3.4070601 at gont.com.ar>
In-Reply-To: <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg at mail.gmail.com>
References:  <CAEo_tUH9LPzPFP-O=317rYEQ3nT66b4biQshV_8=L8hReO_BLg at mail.gmail.com>

Hi, Keith,

On 11/30/2012 11:09 AM, Keith Arner wrote:
>
>  - behave uniformly whether or not SO_REUSEADDR has been set
>  - allow outgoing connection requests to re-use a local port number, so
>    long as the remaining elements of the tuple (laddr, fport, faddr) are
>    unique

Please take a look at the discussion on how to "steal" incoming
connections in Section 3.1 of RFC 6056.

Cheers,
-- 
Fernando Gont
e-mail: fernando at gont.com.ar || fgont at si6networks.com
PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1

>How-To-Repeat:
connect() a large number of sockets, specifying SO_REUSEADDR before
calling connect().  Note that the call to connect() fails with
EADDRINUSE long before we run into any resource exhaustion.

Then connect() a large number of sockets, without specifying
SO_REUSEADDR (while all the previous sockets are still open).  Note
that connect() then fails with EADDRNOTAVAIL;  this occurs as soon
as the total number of outgoing connections equals the ephemeral
port range.

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <arpa/inet.h>

int last_child = -1;

#define complain(exit_val)                      \
    do {                                        \
        return (exit_val);                      \
    } while (0)

int SockOpt(int s, int level, int opt)
{
    int opt_val = 1;
    int ret = setsockopt(s, level, opt, &opt_val, sizeof(opt_val));
    if (ret) {
        perror("Could not setsockopt() on socket");
        complain(-1);
    }
    return 0;
}

int open_server(int port)
{
    int ret;
    /* Zero-initialize so unused fields (e.g. sin_zero) hold no garbage. */
    struct sockaddr_in sin = { 0 };

    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_ANY);
    sin.sin_port = htons(port);

    int server = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (server < 0) {
        perror("Could not open server socket");
        complain(-1);
    }

    SockOpt(server, SOL_SOCKET, SO_REUSEADDR);

    ret = bind(server, (struct sockaddr *)&sin, sizeof(sin));
    if (ret) {
        perror("Could not bind() server socket");
        complain(-1);
    }

    ret = listen(server, 5);
    if (ret) {
        perror("Could not listen() server socket");
        complain(-1);
    }

    return server;
}

int cycle_client(int server, int iteration, int port, int reuse)
{
    int ret;
    /* Zero-initialize so unused fields (e.g. sin_zero) hold no garbage. */
    struct sockaddr_in sin = { 0 };

    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = htons(port);

    int client = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (client < 0) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not open client socket");
        complain(-1);
    }

    if (reuse) {
        SockOpt(client, SOL_SOCKET, SO_REUSEADDR);
    }

    ret = connect(client, (struct sockaddr *)&sin, sizeof(sin));
    if (ret) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not connect() client socket");
        complain(-1);
    }

    /* accept() reads len on entry, so it must hold the buffer size. */
    socklen_t len = sizeof(sin);
    int child = accept(server, (struct sockaddr *)&sin, &len);
    if (child < 0) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not accept() child socket");
        complain(-1);
    }

    /* Why are we not closing the sockets?
     *
     * The point of this program is to illustrate the behaviour of the
     *  network stack when we open (or, rather connect()) a large number of
     *  outgoing sockets.  Thus, we want the sockets to linger around, to
     *  consume ephemeral port numbers.  Note that we could get largely
     *  similar behaviour by closing the sockets (if we close the client
     *  socket first), as the pcbs would linger in the TIME_WAIT state,
     *  consuming ephemeral port numbers.
     *
     * Note that because TIME_WAIT connections count against us, the
     *  behaviour being illustrated does not rely on a large number of
     *  concurrent connections, just a large number of outgoing connections
     *  established over a short time period.  But it is easier to understand
     *  the operation of this program if we leave the sockets open.
     */
    /*
    ret = close(client);
    if (ret) {
        fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
        perror("Could not close() client");
        complain(-1);
    }
    */

    /*
    if (last_child) {
        ret = close(child);
        if (ret) {
            fprintf(stderr, "Iteration %d, errno %d: ", iteration, errno);
            perror("Could not close() child");
            complain(-1);
        }
    }
    */

    last_child = child;

    return 0;
}

/* Main loop to illustrate ephemeral port number behaviour.*/
int main(int argc, char **argv)
{
    /* num_iterations: How many sockets do we want to try to open per remote
     *  port number?  Should be set higher than the number of unique
     *  ephemeral port numbers that the stack can choose from.  With the
     *  default FreeBSD settings, that works out to:
     *
     *  net.inet.ip.portrange.last: 65535
     *  net.inet.ip.portrange.first: 49152
     *
     *  65535 - 49152 + 1 = 16384 (the range is inclusive)
     */
    int num_iterations = 20 * 1000;

    /* num_ports: How many distinct remote ports do we want to connect to? */
    int num_ports = 2;

    /* port: base, remote port number to connect to */
    int port = 12345;

    /* reuse: Should we set SO_REUSEADDR before calling connect()?
     *  Note that we alternate this value for each remote port, to
     *  illustrate the differences in behaviour between setting it or not. */
    int reuse = 1;

    int port_loop;

    for (port_loop=0; port_loop<num_ports; port_loop++) {
        /* Set up a listening socket on the next remote port number. */
        int server = open_server(port);

        int i=0;
        for(; i<num_iterations; i++) {
            /* Open a bunch of sockets; and bail out on the first failure. */
            if (cycle_client(server, i, port, reuse)) {
                break;
            }
        }
        /* How many connections did we manage to establish on this port
         *  number (and with this "reuse" setting)?  If all is working,
         *  we ought to be able to establish as many connections as there
         *  are ephemeral ports, and we ought to be able to do so for each
         *  remote port number (barring memory exhaustion problems). */
        fprintf(stderr, "port %d; reuse %d; opened %d\n",
               port, reuse, i);

        /* Advance to the next remote port, and toggle whether we set
         *  SO_REUSEADDR. */
        port++;
        reuse = !reuse;
    }
    return 0;
}

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: