RFC: NUMA mods for SO_REUSEPORT_LB

Drew Gallatin gallatin at netflix.com
Fri May 3 15:11:32 UTC 2019


The next patch in my NUMA patchset affinitizes SO_REUSEPORT_LB
sockets.  I have to admit that I'm not super happy with it, and I'm
looking for constructive feedback.

In our (Netflix) workload, an nginx master process creates N different
listen sockets when SO_REUSEPORT_LB is in use.  It forks off the
workers, which then affinitize themselves as directed in nginx.conf
(worker N might not be bound to CPU N).  They then take over
the listen sockets and start serving.

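Because the listen sockets already exist before the workers bind to their
CPUs, the NUMA domain cannot be picked when a socket is created.  To deal
with this, I made a TCP_REUSPORT_LB_NUMA socket option.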

The inpcblbgroup struct has been modified to add an il_numa_domain field.
When a group is created, this is set to M_NODOM ("numa wildcard").  On
lookup, when an mbuf has a non-M_NODOM m_numa_domain field set, only groups
with a matching numa domain are considered.  If no group matches, the
lookup falls back to a numa wildcard match.
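The lookup rule above can be sketched in plain userland C.  This is only
a model, not the code from the patch: lbgroup_sketch, NODOM_SKETCH and
lbgroup_lookup_sketch are hypothetical stand-ins for inpcblbgroup,
M_NODOM and the real lookup routine, and a port number stands in for the
full address/port match.  It also assumes that a packet with no domain
tag simply lands in the wildcard group, which the text above does not
spell out.

```c
#include <stddef.h>

#define NODOM_SKETCH (-1)	/* hypothetical stand-in for M_NODOM */

/* Hypothetical, cut-down model of a load-balance group. */
struct lbgroup_sketch {
	int numa_domain;	/* models il_numa_domain */
	int port;		/* models the address/port the group matches */
};

/*
 * Prefer a group whose domain equals the packet's domain; remember the
 * first wildcard group on the same port and use it only if no exact
 * domain match exists.  Returns NULL if nothing matches.
 */
static const struct lbgroup_sketch *
lbgroup_lookup_sketch(const struct lbgroup_sketch *groups, size_t n,
    int port, int mbuf_domain)
{
	const struct lbgroup_sketch *wild = NULL;

	for (size_t i = 0; i < n; i++) {
		if (groups[i].port != port)
			continue;
		if (groups[i].numa_domain == mbuf_domain)
			return (&groups[i]);	/* exact domain match */
		if (groups[i].numa_domain == NODOM_SKETCH && wild == NULL)
			wild = &groups[i];	/* fallback candidate */
	}
	return (wild);
}
```

With one wildcard group and one domain-1 group on the same port, a packet
tagged with domain 1 picks the domain-1 group, and a packet tagged with
any other domain falls back to the wildcard group.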

When nginx wants to use this, it calls setsockopt(...
TCP_REUSEPORT_LB_NUMA...) on the existing listen socket.

This sockopt:

- gets the CPU affinity mask of the calling thread
- finds the current NUMA domain for the calling thread
- looks up the inp and removes it from the numa-wildcard (M_NODOM) group
and inserts it into a new group specific to that numa domain.
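The three steps above can be modelled roughly as follows.  This is a
userland sketch under stated assumptions, not the kernel code from the
patch: every name ending in _sketch is hypothetical, integer socket ids
stand in for inpcbs, and the affinity-mask step is reduced to a single
current CPU plus a CPU-to-domain table.

```c
#include <stddef.h>

#define NODOM_SKETCH		(-1)	/* hypothetical stand-in for M_NODOM */
#define MAX_MEMBERS_SKETCH	8

/* Hypothetical model of a group: a domain plus its member sockets. */
struct lbgroup_sketch {
	int numa_domain;
	size_t count;
	int members[MAX_MEMBERS_SKETCH];
};

/* Remove a socket id from a group; returns 1 if it was present. */
static int
lbgroup_remove_sketch(struct lbgroup_sketch *g, int sock)
{
	for (size_t i = 0; i < g->count; i++) {
		if (g->members[i] == sock) {
			g->members[i] = g->members[g->count - 1];
			g->count--;
			return (1);
		}
	}
	return (0);
}

/* Append a socket id to a group; returns 0 if the group is full. */
static int
lbgroup_insert_sketch(struct lbgroup_sketch *g, int sock)
{
	if (g->count >= MAX_MEMBERS_SKETCH)
		return (0);
	g->members[g->count++] = sock;
	return (1);
}

/*
 * Model of the sockopt: find the caller's domain, then move the socket
 * from the wildcard group into that domain's group.  Returns the domain
 * on success, -1 on failure (leaving both groups untouched).
 */
static int
reuseport_lb_numa_sketch(struct lbgroup_sketch *wild,
    struct lbgroup_sketch *domgroups, size_t ndomains,
    const int *cpu_to_domain, int cur_cpu, int sock)
{
	int dom = cpu_to_domain[cur_cpu];

	if (dom < 0 || (size_t)dom >= ndomains)
		return (-1);
	if (domgroups[dom].count >= MAX_MEMBERS_SKETCH)
		return (-1);	/* check space before removing */
	if (!lbgroup_remove_sketch(wild, sock))
		return (-1);	/* socket was not in the wildcard group */
	domgroups[dom].numa_domain = dom;
	(void)lbgroup_insert_sketch(&domgroups[dom], sock);
	return (dom);
}
```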

This actually works quite well for me, but I don't think it is ready for
prime-time.  The sockopt API was admittedly done to satisfy my particular
use case, and I'm looking for feedback on how to improve it.

Specifically:

1) Is it OK to add a new option that modifies an existing listen socket?
- This was the right choice for my application.  Is it too awkward in
general?

2) Should the sockopt put the job of selecting the appropriate numa domain
onto the caller?

Right now, everything is automatic.  Should it instead take an argument
which corresponds to a NUMA domain (or -1 to remove the NUMA domain
affinity)?  Or should it take an argument that corresponds to a CPUSET?
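To make the explicit-domain alternative concrete, here is a minimal
sketch of how such an argument might be interpreted.  It is purely
illustrative: numa_arg_to_domain_sketch and NODOM_SKETCH are hypothetical
names, and the domain count is passed in rather than queried from the
system.

```c
#define NODOM_SKETCH (-1)	/* hypothetical stand-in for M_NODOM */

/*
 * Interpret a caller-supplied sockopt argument:
 *   -1             -> drop the NUMA affinity (back to the wildcard)
 *   0..ndomains-1  -> bind to that NUMA domain
 * Returns 0 and stores the result in *out, or -1 if arg is invalid
 * (in which case *out is left unchanged).
 */
static int
numa_arg_to_domain_sketch(int arg, int ndomains, int *out)
{
	if (arg == -1) {
		*out = NODOM_SKETCH;
		return (0);
	}
	if (arg < 0 || arg >= ndomains)
		return (-1);
	*out = arg;
	return (0);
}
```

The trade-off is that the caller has to know which domain it is running
in, where the automatic version works that out from the calling thread.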

Any feedback is welcome.

Thanks,

Drew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reuse_numa.diff
Type: text/x-patch
Size: 16043 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-transport/attachments/20190503/90bd37aa/attachment.bin>

