sysctl kern.ipc.somaxconn limit 65535 why?

Xin Li delphij at
Thu Jan 5 01:32:48 UTC 2012


On 01/04/12 16:09, Dan The Man wrote:
> On Wed, 4 Jan 2012, Chuck Swiger wrote:
>> Hi--
>> On Jan 4, 2012, at 2:23 PM, Dan The Man wrote:
>>>> It is not arbitrary.  Systems ought to provide sensible
>>>> limits, which can be adjusted if needed and appropriate.  The
>>>> fact that a system might have 50,000 file descriptors
>>>> globally available does not mean that it would be OK for any
>>>> random process to consume half of them, even if there is
>>>> still adequate room left for other tasks. It's common for
>>>> "ulimit -n" to be set to 256 or 1024.
>>> Sensible limits means a sensible stock default, not an OS-imposed
>>> cap on what an admin/developer can set on his own hardware.
>> In point of fact, protocols like TCP/IP impose limits on what is 
>> possible.  It is in fact the job of the OS to say "no" when a 
>> developer asks for a TTL of a million via setsockopt(), because 
>> RFC-791 limits the maximum value of the "time to live" field to
>> 255.
>>> With the new IBM developments underway of 16-core Atom
>>> processors and hundreds of gigabytes of memory, surely a
>>> backlog of 100k is manageable. And what about the future: on a
>>> 500-core system with a terabyte of memory, a 100k listen queue
>>> could be processed instantly.
>> Um.  I gather you don't have much background in operating system 
>> design or massively parallelized systems?
>> Due to locking constraints imposed by whatever synchronization 
>> mechanism and communications topology is employed between cores,
>> you simply cannot just add more processors to a system and expect
>> it to go faster in a linear fashion.  Having 500 cores contending
>> over a single queue is almost certain to result in horrible
>> performance.  Even though the problem of a bunch of independent
>> requests is "embarrassingly parallelizable", you do that by
>> partitioning the queue into multiple pieces that are fed to
>> different groups or pools of processors to minimize contention
>> over a single data structure.
> I guess you're calling me out to talk about what I'm doing based on
> that statement:
> The first framework I was working on a few weeks back had a parent
> bind the socket, then spawn a certain number of children to do the
> accept on it, so the parent could focus on dealing with SIGCHLD and
> so on. I had issues with this design for some reason: all the
> sockets were set to non-blocking, and I was using kqueue to monitor
> the socket, but at times I would randomly see a 1-2 second delay
> before a child's accept completed. I was horrified and changed the
> design quickly.
> In the new design, the parent does all the accepts and passes
> blocking work to children via socketpairs it created when forking.
> As for scaling on multiple cores: each child can have its own core
> to do its blocking I/O on, and each gets its own processor time.
> That isn't parallelism, but I never said it was. The better part of
> this design is that you have one process utilizing a processor
> efficiently instead of paging the system with useless processes.
> Other machines could also connect in to the parent, and it could
> treat them the same way it treats children, via a socket, so in my
> opinion it's more scalable and centralizes everything in one spot.
> Obviously there are some cons to this design: you are passing data
> via socketpairs instead of the child writing directly to the
> client.
> To stress test this new design I simply wrote an asynchronous
> client counterpart to create 100k connections to the parent's
> listen queue, which would then go off writing to each socket. Of
> course, as soon as I reached 60k or so, the client would get
> numerous failures due to OS limits. My intention was to see how
> long it would take the children to process the requests and send
> responses back to the client. Starting from a listen queue with
> 100k fd's ready to go would, I thought, have been a really nice
> test, not only of the application's speed but also of CPU usage,
> I/O usage, etc., with the parent processing a client trying to
> talk to it 100k times at once, to really see how kqueue does.
> Without being able to increase simple limits like these, how are
> we ever going to find where we can burn the system down and make
> it outperform epoll() one day?
> What is so bad about seeing how many fd's I could toss at kqueue
> before it croaked? At 60k it was still handling it like a champ,
> with about 50 children getting handed work in my tests.
>>>> Yes.  If the system doesn't handle connectivity problems via 
>>>> something like exponential backoff, then the weak point is
>>>> poor software design and not FreeBSD being unwilling to set
>>>> the socket listen queue to a value in the hundreds of
>>>> thousands.
>>> I think what Arnaud and I are trying to say here is: let
>>> FreeBSD use a sensible default value, but let the admin dictate
>>> the actual policy if he chooses to change it for stress
>>> testing, future-proofing or anything else.
>> FreeBSD does provide a sensible default value for the listen
>> queue size.  It's tunable to a factor of about 1000 times larger,
>> and is a value which is sufficiently large to hold a backlog of
>> several minutes worth of connections, assuming you can process
>> the requests at a very high rate to keep draining the queue.
>> There probably isn't a reasonable use-case for queuing
>> unprocessed requests for longer than MAXTTL, which is about 4
>> minutes.  So, it's conceivable in theory for a high-volume server
>> to want to set the listen queue to, say 1000 req/s * 255 (ie,
>> MAXTTL), but I manage high volume servers for a living, and
>> practical experience including measurements of latency and
>> service performance suggests that tuning the listen queue up to
>> on the order of a thousand or so is the inflection point after
>> which it is better/necessary for the software to recognize and
>> start doing overload mitigation than it is for the OS to blindly
>> queue more requests.
>> Put more simply, there comes a point where saying "no", ie,
>> dropping the connection with a reset, works better.
> I agree; the listen queue will of course go back to something
> reasonable when I am done with testing.

Here is a patch which I have not tested, only compiled to validate,
that changes the u_short's to u_int's.

I am personally not quite convinced that FreeBSD should make such a
change, though -- having more than 64K outstanding, unhandled
connections does not sound like a great idea (it is not a connection
limit, after all, but a limit on connections pending acceptance.  If
my math is right, 64K connections would require about 1 Gbps of
bandwidth in and out if they all arrived within the same second.)
But I agree this would make a good stress test, which might expose
some bugs we don't know about today.

-- 
Xin LI <delphij at>
FreeBSD - The Power to Serve!		Live free or die

diff --git a/sys/kern/uipc_socket.c b/sys/kern/uipc_socket.c
index 2a1bf7f..5977428 100644
--- a/sys/kern/uipc_socket.c
+++ b/sys/kern/uipc_socket.c
@@ -187,7 +187,6 @@ MALLOC_DEFINE(M_PCB, "pcb", "protocol control block");
 static int somaxconn = SOMAXCONN;
 static int sysctl_somaxconn(SYSCTL_HANDLER_ARGS);
-/* XXX: we dont have SYSCTL_USHORT */
 SYSCTL_PROC(_kern_ipc, KIPC_SOMAXCONN, somaxconn, CTLTYPE_UINT | CTLFLAG_RW,
     0, sizeof(int), sysctl_somaxconn, "I", "Maximum pending socket connection "
     "queue size");
 static int numopensockets;
@@ -3280,7 +3279,7 @@ sysctl_somaxconn(SYSCTL_HANDLER_ARGS)
 	error = sysctl_handle_int(oidp, &val, 0, req);
 	if (error || !req->newptr )
 		return (error);
 
-	if (val < 1 || val > USHRT_MAX)
+	if (val < 1 || val > UINT_MAX)
 		return (EINVAL);
 	somaxconn = val;
diff --git a/sys/sys/socketvar.h b/sys/sys/socketvar.h
index 94c3b24..51c1c5d 100644
--- a/sys/sys/socketvar.h
+++ b/sys/sys/socketvar.h
@@ -93,10 +93,10 @@ struct socket {
 	TAILQ_HEAD(, socket) so_incomp;	/* (e) queue of partial unaccepted connections */
 	TAILQ_HEAD(, socket) so_comp;	/* (e) queue of complete unaccepted connections */
 	TAILQ_ENTRY(socket) so_list;	/* (e) list of unaccepted connections */
-	u_short	so_qlen;		/* (e) number of unaccepted connections */
-	u_short	so_incqlen;		/* (e) number of unaccepted incomplete
+	u_int	so_qlen;		/* (e) number of unaccepted connections */
+	u_int	so_incqlen;		/* (e) number of unaccepted incomplete
 					   connections */
-	u_short	so_qlimit;		/* (e) max number queued connections */
+	u_int	so_qlimit;		/* (e) max number queued connections */
 	short	so_timeo;		/* (g) connection timeout */
 	u_short	so_error;		/* (f) error affecting connection */
 	struct	sigio *so_sigio;	/* [sg] information for async I/O or
@@ -169,9 +169,9 @@ struct xsocket {
 	caddr_t	so_pcb;		/* another convenient handle */
 	int	xso_protocol;
 	int	xso_family;
-	u_short	so_qlen;
-	u_short	so_incqlen;
-	u_short	so_qlimit;
+	u_int	so_qlen;
+	u_int	so_incqlen;
+	u_int	so_qlimit;
 	short	so_timeo;
 	u_short	so_error;
 	pid_t	so_pgid;
