svn commit: r294536 - head/sys/netinet

Sat Jan 23 07:10:14 UTC 2016

On 01/23/16 05:04, Gleb Smirnoff wrote:
> On Fri, Jan 22, 2016 at 05:19:34PM +1100, Lawrence Stewart wrote:
> L> On 01/22/16 09:53, Gleb Smirnoff wrote:
> L> > Author: glebius
> L> > Date: Thu Jan 21 22:53:12 2016
> L> > New Revision: 294536
> L> > URL: https://svnweb.freebsd.org/changeset/base/294536
> L> > 
> L> > Log:
> L> >   Refactor TCP_CONGESTION setsockopt handling:
> L> >   - Use M_TEMP instead of stack variable.
> L> >   - Unroll error handling, removing several levels of indentation.
> L> 
> L> As noted privately elsewhere, this change introduces races with respect
> L> to CC algorithm module unloading which will need to be fixed. Just
> L> mentioning it again publicly so others are aware of it.
> 
> In the code checked in the check for (CC_ALGO(tp)->ctl_output being
> not NULL and the actual call are under the same lock.
> 
> Is that the race you refer to?

No, the TCP_CONGESTION refactoring i.e. this commit, introduced races in
the get and set cases. I guess I didn't provide enough context in
Crucible, so here goes...

The post refactoring get code is now:

    case TCP_CONGESTION:
        INP_WUNLOCK(inp);
        error = sooptcopyout(sopt, CC_ALGO(tp)->name, TCP_CA_NAME_MAX);
        break;

Consider that tp is using cc_blah and that the cc_blah module is
unloaded as the copy out is happening. By not holding the INP lock, the
CC module unload code is able to walk the list of active connections,
find this connection is using cc_blah, acquire the INP lock, switch this
connection to cc_newreno, release the lock and finally unload the
cc_blah module, rendering the pointer passed in to sooptcopyout garbage.
See cc_deregister_algo() in cc.c and tcp_ccalgounload() in tcp_subr.c
for details related to CC module unload.

The post refactoring set code is now:

    case TCP_CONGESTION:
        INP_WUNLOCK(inp);
        buf = malloc(TCP_CA_NAME_MAX, M_TEMP, M_WAITOK|M_ZERO);
        error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX, 1);
        if (error) {
	    free(buf, M_TEMP);
            break;
        }
        CC_LIST_RLOCK();
        STAILQ_FOREACH(algo, &cc_list, entries)
            if (strncmp(buf, algo->name, TCP_CA_NAME_MAX) == 0)
                break;
        CC_LIST_RUNLOCK();
        free(buf, M_TEMP);
        if (algo == NULL) {
            error = EINVAL;
            break;
        }
        INP_WLOCK_RECHECK(inp);
        /*
         * We hold a write lock over the tcb so it's safe to
         * do these things without ordering concerns.
         */
        if (CC_ALGO(tp)->cb_destroy != NULL)
            CC_ALGO(tp)->cb_destroy(tp->ccv);
        CC_ALGO(tp) = algo;
        /*
         * If something goes pear shaped initialising the new
         * algo, fall back to newreno (which does not
         * require initialisation).
         */
        if (algo->cb_init != NULL && algo->cb_init(tp->ccv) != 0) {
            CC_ALGO(tp) = &newreno_cc_algo;
            /*
             * The only reason init should fail is
             * because of malloc.
             */
            error = ENOMEM;
       }
       INP_WUNLOCK(inp);
       break;

Consider that the application has requested to switch to "blah"
congestion control and that the cc_blah module is unloaded as the
request is being processed. We hold no locks, copy in the string "blah",
grab the CC list lock, find cc_blah's algo struct in the list leaving
its pointer stored in the "algo" variable and drop the CC list lock.

Crucially at this point, we don't hold the INP lock, so once we drop the
CC list lock, the CC module unload code is free to pull cc_blah from the
CC module list, walk the list of active connections and switch any using
cc_blah to cc_newreno and to unload the cc_blah module. The tp we're
modifying hasn't switched to cc_blah yet so it won't be reverted to
cc_newreno by the CC module unload code, which means the "algo" variable
will be a garbage pointer by the time we get around to grabbing the INP
lock at the INP_WLOCK_RECHECK() line of code.

You have to ensure that the algo pointer is valid and that the INP lock
is held at the time of assigning to the tp, as that will ensure that if
you're racing with the unload code, the unload code will correctly
switch the connection's algo back to cc_newreno once it's able to
acquire the INP lock.

Cheers,
Lawrence