[Bug 287818] libcasper incorrectly uses unix(4) socket
Date: Tue, 02 Sep 2025 20:13:46 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=287818
--- Comment #6 from Gleb Smirnoff <glebius@FreeBSD.org> ---
Here is a detailed diagnosis of the problem, but no fix. I'm releasing the bug
report back to the pool and changing the title.
libcasper runs socketpair(PF_LOCAL, SOCK_STREAM | SOCK_NONBLOCK, ...) and
forks helpers. Note the SOCK_NONBLOCK flag! libcasper then reads and writes on
these sockets via libnv. To be more specific: via the FreeBSD shim of libnv
that lives in lib/libnv, while the bulk of libnv lives in sys/contrib/libnv.
Here is a typical send(2) trace:
libsys.so.7`_sendto+0xa
libnv.so.1`FreeBSD_nvlist_send+0x96
libcasper.so.1`service_message+0x1e1
libcasper.so.1`service_start+0x405
libcasper.so.1`service_execute+0xa7
libcasper.so.1`0x2ab1b9f85a88
libcasper.so.1`0x2ab1b9f858d9
libcasper.so.1`casper_main_loop+0x1b
libcasper.so.1`cap_init+0xad
libcap_fileargs.so.1`fileargs_init+0x52
fileargs_test`atfu_fileargs__open_read_body+0x10e
This FreeBSD_nvlist_send() lives in lib/libnv/msgio.c. It is an inlined version
of the call chain fd_send() -> fd_package_send() -> msg_send(). It does not
handle EAGAIN. But since we set the SOCK_NONBLOCK flag, we can get EAGAIN if
we are sending a lot of data (this is what the test does!) and our peer reads
slowly. The race between the test body and the casper helpers triggers only
under certain conditions. Apparently, before d15792780760 the race was not hit
for some reason. I'd claim that's because the socket was slow :) After this
revision the test was intermittently broken on some machines/VMs. Then, after
72ddb6de1028 the test was "fixed", because with a fat buffer it is virtually
impossible to reproduce the race.
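To make the EAGAIN condition concrete, here is a minimal sketch, independent of
libcasper and libnv, of what happens on a non-blocking PF_LOCAL stream
socketpair when the peer does not read at all. The function name
fill_until_error() is made up for this illustration; only socketpair(2),
send(2) and close(2) are real interfaces.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: fill one end of a non-blocking socketpair
 * without ever reading the other end. Returns the number of bytes the
 * kernel accepted before send(2) failed, and stores the errno of the
 * failing call in *errp. Returns -1 if the socketpair cannot be created.
 */
static long
fill_until_error(int *errp)
{
	int sv[2];
	char chunk[1024];
	long total = 0;
	ssize_t n;

	if (socketpair(PF_LOCAL, SOCK_STREAM | SOCK_NONBLOCK, 0, sv) == -1)
		return (-1);
	memset(chunk, 'x', sizeof(chunk));

	/* Nobody reads sv[1], so the socket buffer eventually fills up. */
	while ((n = send(sv[0], chunk, sizeof(chunk), 0)) > 0)
		total += n;
	*errp = errno;		/* Save before close(2) can change it. */

	close(sv[0]);
	close(sv[1]);
	return (total);
}
```

The number of bytes accepted depends on the socket buffer sizes, which is why
the buffer size change matters for reproducing the failure. A sender that
treats the failing call as fatal, or ignores it, is exposed to exactly this
case whenever the reader lags behind.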
How to reproduce the libcasper bug with CURRENT after 72ddb6de1028:
./bricoler run freebsd-src-regression-suite \
    --param freebsd-src-regression-suite:hypervisor=bhyve \
    --param freebsd-src:url=/usr/src/FreeBSD \
    --param freebsd-src:branch=main \
    --param freebsd-src-regression-suite:tests='lib/libcasper/services/cap_fileargs/fileargs_test' \
    --param freebsd-src-regression-suite:count=15 \
    --param freebsd-src-regression-suite:parallelism=1 \
    --param freebsd-src-regression-suite-vm-image:sysctls="net.local.stream.recvspace=8192 net.local.stream.sendspace=8192"
This sets the buffer sizes to their pre-72ddb6de1028 values. I would suggest
running the test interactively: enter the bricoler VM and run in there:
while true; do /usr/tests/lib/libcasper/services/cap_fileargs/fileargs_test fileargs__open_read; done
In parallel you can watch whether EAGAIN (errno 35 on FreeBSD) is ever returned:
dtrace -n 'fbt::uipc_sosend_stream_or_seqpacket:return /args[1] == 35/ { ustack(); }'
An interesting fact is that EAGAIN is hit much more often than the test case
actually fails! I guess we aren't losing any data between the test body and the
casper helper thanks to internal buffering in libnv. My theory is that the rest
of a truncated message is written when the next message is sent. The test
fails only when the test body is quick enough to recv() zero bytes.
Ideas on fixing that? I would really try just removing SOCK_NONBLOCK from
libcasper! We are using a library for I/O that is not ready for non-blocking
I/O. But such a bold move requires more understanding of casper than I have.
And a lot of testing, of course.
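For comparison with dropping SOCK_NONBLOCK, the other generic way out is a
send loop that tolerates EAGAIN by waiting for the socket to become writable.
The sketch below is not libnv or libcasper code and is not a proposed patch;
send_all_retry() and send_all_retry_demo() are hypothetical names, and the
demo simply forks a reader to exercise the loop over a non-blocking
socketpair.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: send the whole buffer on a possibly non-blocking
 * socket, sleeping in poll(2) whenever send(2) reports EAGAIN.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
send_all_retry(int sock, const void *buf, size_t size)
{
	const unsigned char *ptr = buf;
	struct pollfd pfd;
	ssize_t done;

	while (size > 0) {
		done = send(sock, ptr, size, 0);
		if (done >= 0) {
			/* A short write is normal; advance and continue. */
			ptr += done;
			size -= (size_t)done;
			continue;
		}
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return (-1);
		/* Buffer full: wait until the peer drains some data. */
		pfd.fd = sock;
		pfd.events = POLLOUT;
		pfd.revents = 0;
		if (poll(&pfd, 1, -1) == -1 && errno != EINTR)
			return (-1);
	}
	return (0);
}

/*
 * Usage sketch: the parent pushes `size` bytes through a non-blocking
 * socketpair while a forked child drains it. Returns 0 if the child
 * received exactly `size` bytes, -1 otherwise.
 */
static int
send_all_retry_demo(size_t size)
{
	int sv[2], status, rc = -1;
	char *buf;
	pid_t pid;

	if (socketpair(PF_LOCAL, SOCK_STREAM | SOCK_NONBLOCK, 0, sv) == -1)
		return (-1);
	if ((pid = fork()) == -1) {
		close(sv[0]);
		close(sv[1]);
		return (-1);
	}
	if (pid == 0) {
		/* Child: read until EOF, waiting in poll(2) on EAGAIN. */
		struct pollfd pfd = { .fd = sv[1], .events = POLLIN };
		char rbuf[4096];
		size_t got = 0;
		ssize_t n;

		close(sv[0]);
		for (;;) {
			n = recv(sv[1], rbuf, sizeof(rbuf), 0);
			if (n > 0)
				got += (size_t)n;
			else if (n == 0)
				break;		/* EOF: writer closed. */
			else if (errno == EAGAIN || errno == EWOULDBLOCK)
				(void)poll(&pfd, 1, -1);
			else if (errno != EINTR)
				_exit(2);
		}
		_exit(got == size ? 0 : 1);
	}
	close(sv[1]);
	buf = calloc(1, size);
	if (buf != NULL)
		rc = send_all_retry(sv[0], buf, size);
	free(buf);
	close(sv[0]);		/* Signals EOF to the child. */
	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	if (rc == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return (-1);
	return (0);
}
```

Calling send_all_retry_demo(1 << 20) pushes 1 MB through the pair, which is
more than the 8192-byte buffers used in the reproduction above. Whether
libcasper can afford to sleep in poll(2) at these call sites is a separate
question that this sketch does not answer.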
P.S. IMHO, libcasper would benefit from SOCK_SEQPACKET, which works correctly
in 15.0-CURRENT. Then the library wouldn't need to care about message
boundaries; the kernel would guarantee them.
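As a small illustration of the boundary guarantee mentioned in the P.S., the
following sketch sends two messages back to back over a SOCK_SEQPACKET
socketpair and records what each recv(2) returns. The function name
seqpacket_boundaries() and the message contents are invented for the example.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: send two records over a SOCK_SEQPACKET socketpair
 * and store the size returned by each recv(2) in lens[0] and lens[1].
 * Returns 0 on success, -1 on failure.
 */
static int
seqpacket_boundaries(ssize_t lens[2])
{
	static const char first[] = "hello";	/* 6 bytes with the NUL */
	static const char second[] = "casper!";	/* 8 bytes with the NUL */
	char buf[64];
	int sv[2], rc = -1;

	if (socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sv) == -1)
		return (-1);
	if (send(sv[0], first, sizeof(first), 0) == (ssize_t)sizeof(first) &&
	    send(sv[0], second, sizeof(second), 0) == (ssize_t)sizeof(second)) {
		/* buf could hold both, yet each recv(2) returns one record. */
		lens[0] = recv(sv[1], buf, sizeof(buf), 0);
		lens[1] = recv(sv[1], buf, sizeof(buf), 0);
		rc = 0;
	}
	close(sv[0]);
	close(sv[1]);
	return (rc);
}
```

With SOCK_STREAM the first recv(2) would be free to return both messages
merged together, and the receiver would have to find the boundary itself.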