Re: AF_UNIX socketpair dgram queue sizes

From: Mark Johnston <markj_at_freebsd.org>
Date: Wed, 10 Nov 2021 02:57:47 UTC
On Tue, Nov 09, 2021 at 08:57:20PM -0500, Jan Schaumann via freebsd-net wrote:
> Hello,
> 
> I'm trying to wrap my head around the buffer sizes
> relevant to AF_UNIX/PF_LOCAL dgram socketpairs.
> 
> On a FreeBSD/amd64 13.0 system, creating a socketpair
> and simply writing a single byte in a loop to the
> non-blocking write end without reading the data, I can
> perform 64 writes before causing EAGAIN, yielding 1088
> bytes in FIONREAD on the read end (indicating 16 bytes
> per datagram overhead).

When transmitting on a unix dgram socket, each message will include a
copy of the sender's address, represented by a dummy 16-byte sockaddr in
this case.  This is stripped by the kernel when receiving, but still
incurs overhead with respect to socket buffer accounting.
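
For concreteness, here's roughly the sort of test I assume you're
running (a minimal sketch, error handling mostly omitted):

/*
 * Write 1-byte datagrams to a non-blocking AF_UNIX socketpair until
 * EAGAIN, then check FIONREAD on the receiving end.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int fds[2], nread, writes;
	char b = 'x';

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0)
		err(1, "socketpair");
	(void)fcntl(fds[0], F_SETFL, O_NONBLOCK);

	for (writes = 0; write(fds[0], &b, 1) == 1; writes++)
		;
	if (errno != EAGAIN)
		err(1, "write");
	(void)ioctl(fds[1], FIONREAD, &nread);
	printf("%d writes, FIONREAD %d\n", writes, nread);
	return (0);
}

On a 13.0 system with the default tunables I'd expect that to report
the 64 writes and 1088 bytes you're seeing.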

> This is well below the total net.local.dgram.recvspace
> = 4096 bytes.  I would have expected to be able to
> perform 240 1 byte writes (240 + 240*16 = 4080).
> 
> Now if I try to write SO_SNDBUF = 2048 bytes on each
> iteration (or subsequently as many as I can until
> EAGAIN), then I can send one datagram with 2048 bytes
> and one datagram with 2016 bytes, filling recvspace as
> (2 * 16) + (2048 + 2016) = 4096.
> 
> But at smaller sizes, it looks like the recvspace is
> not filled completely: writes in chunks of > 803 bytes
> will fill recvspace up to 4096 bytes, but below 803
> bytes, recvspace is not maxed out.
> 
> Does anybody know why smaller datagrams can't fill
> recvspace?  Or what I'm missing / misunderstanding
> about the recvspace here?

There is an additional factor: wasted space.  When writing data to a
socket, the kernel buffers that data in mbufs.  Each mbuf has some
amount of embedded storage, and the kernel accounts for that storage
whether or not it is actually used.  With small datagrams this overhead
can be substantial; for stream sockets it is mitigated somewhat by
compressing appended data into existing mbufs, but we can't do that for
datagrams without losing message boundaries.
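
If I have the accounting right, the 1-byte case works out as follows:
each plain mbuf is charged MSIZE (256 bytes on amd64) against the
buffer, the copied sender address lands in a second mbuf, and the
storage limit is recvspace times the waste factor (the numbers below
are just my reading of the defaults):

/*
 * Back-of-the-envelope for 1-byte datagrams.  Assumes each datagram
 * consumes two plain mbufs (data + copied sender address), each
 * charged MSIZE bytes of storage, and that the storage limit is
 * recvspace times the waste factor.
 */
#include <stdio.h>

int
main(void)
{
	const long msize = 256;		/* MSIZE on amd64 */
	const long recvspace = 4096;	/* net.local.dgram.recvspace */
	const long waste = 8;		/* kern.ipc.sockbuf_waste_factor */
	const long mbmax = recvspace * waste;	/* storage limit: 32768 */
	const long per_dgram = 2 * msize;	/* data mbuf + address mbuf */

	printf("1-byte datagrams until EAGAIN: %ld\n", mbmax / per_dgram);
	printf("FIONREAD at that point: %ld\n",
	    (mbmax / per_dgram) * (1 + 16));
	return (0);
}

That gives 64 datagrams and 1088 bytes, which matches what you're
seeing; the storage limit is hit long before recvspace itself is.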

The kern.ipc.sockbuf_waste_factor sysctl controls the upper limit on
the total mbuf storage (used or not) that may be enqueued in a socket
buffer.  The default value of 8 means that we'll waste up to 7 bytes
per byte of data, I think.  Setting it higher should let you enqueue
more messages.  As far as I know this limit can't be modified directly;
it's a function of the waste factor and the socket buffer size.
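
If you want to experiment with it, something along these lines (or
just sysctl(8)) will read and bump the knob; the new value of 16 is
arbitrary:

/*
 * Read and (with sufficient privilege) raise
 * kern.ipc.sockbuf_waste_factor.  The value 16 is just an example.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	int cur, newval = 16;
	size_t len = sizeof(cur);

	if (sysctlbyname("kern.ipc.sockbuf_waste_factor", &cur, &len,
	    &newval, sizeof(newval)) != 0)
		err(1, "sysctlbyname");
	printf("waste factor was %d, set to %d\n", cur, newval);
	return (0);
}

Note that, if I recall correctly, the storage limit is computed when a
socket buffer is reserved, so raising the sysctl only affects sockets
created (or resized with SO_SNDBUF/SO_RCVBUF) afterwards.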