Re: Strange network/socket anomalies since about a month
Date: Wed, 24 Apr 2024 10:01:26 UTC
Am 2024-04-22 18:12, schrieb Gleb Smirnoff:
> There were several preparatory commits that were not reverted and one
> of them
> had a bug. The bug manifested itself as failure to send(2) zero bytes
> over
> unix/stream. It was fixed with
> e6a4b57239dafc6c944473326891d46d966c0264. Can
> you please check you have this revision? Other than that there are no
> known
> bugs left.
Yes, I have this fix in my running kernel.
> A> Any ideas how to track this down more easily than running the entire
> A> poudriere in ktrace (e.g. a hint/script which dtrace probes to use)?
>
> I don't have any better idea than ktrace over failing application.
> Yep, I
> understand that poudriere will produce a lot. But first we need to
> determine
Yes, it does. 4.4 GB just for the start of poudriere until the first
package build fails due to a failed sccache start (luckily in the first
builder, but I had at least 2 builders automatically spin up by
poudriere at the time when I validated the failure in the logs and
disabled the tracing).
> what syscall fails and on what type of socket. After that we can scope
> down to
> using dtrace on very particular functions.
I'm not sure I manage to find the cause of the failure... the only thing
which remotely looks like an issue is "Resource temporarily
unavailable", but this is from the process which waits for the server to
have started:
---snip---
58406 sccache 1713947887.504834367 RET __sysctl 0
58406 sccache 1713947887.505521884 CALL
rfork(0x80000000<><invalid>2147483648)
58406 sccache 1713947887.505757777 CAP system call not allowed:
rfork
58406 sccache 1713947887.505774176 RET rfork 58426/0xe43a
58406 sccache 1713947887.507304865 CALL
compat11.kevent(0x3,0x371d360f89e8,0x2,0x371d360f89e8,0x2,0)
58406 sccache 1713947887.507657906 STRU struct freebsd11_kevent[] = {
{ ident=11, filter=EVFILT_READ, flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>,
fflags=0, data=0, udata=0x0 }
{ ident=11, filter=EVFILT_WRITE,
flags=0x61<EV_ADD|EV_CLEAR|EV_RECEIPT>, fflags=0, data=0, udata=0x0 } }
58406 sccache 1713947887.507689980 STRU struct freebsd11_kevent[] = {
{ ident=11, filter=EVFILT_READ, flags=0x4000<EV_ERROR>, fflags=0,
data=0, udata=0x0 }
{ ident=11, filter=EVFILT_WRITE, flags=0x4000<EV_ERROR>,
fflags=0, data=0, udata=0x0 } }
58406 sccache 1713947887.507977155 RET compat11.kevent 2
58406 sccache 1713947887.508015751 CALL write(0x5,0x371515685bcc,0x1)
58406 sccache 1713947887.508086434 GIO fd 5 wrote 1 byte
0x0000 01 |.|
58406 sccache 1713947887.508145930 RET write 1
58406 sccache 1713947887.508183140 CALL
compat11.kevent(0x7,0,0,0x5a5689ab0c40,0x400,0)
58406 sccache 1713947887.508396614 STRU struct freebsd11_kevent[] = {
}
58406 sccache 1713947887.508156537 STRU struct freebsd11_kevent[] = {
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>,
fflags=0, data=0x1, udata=0xffffffffffffffff } }
58406 sccache 1713947887.508530888 RET compat11.kevent 1
58406 sccache 1713947887.508563736 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947887.508729102 GIO fd 4 read 1 byte
0x0000 01 |.|
58406 sccache 1713947887.508967661 RET read 1
58406 sccache 1713947887.508996753 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947887.509028311 RET read -1 errno 35 Resource
temporarily unavailable
58406 sccache 1713947887.509068838 CALL
compat11.kevent(0x3,0,0,0x5a5689a97540,0x400,0x371d3a2887c8)
...
58406 sccache 1713947897.514352552 CALL
_umtx_op(0x5a5689a3d290,0x10<UMTX_OP_WAKE_PRIVATE>,0x7fffffff,0,0)
58406 sccache 1713947897.514383653 RET _umtx_op 0
58406 sccache 1713947897.514421273 CALL write(0x5,0x371515685bcc,0x1)
58406 sccache 1713947897.515050967 STRU struct freebsd11_kevent[] = {
{ ident=4, filter=EVFILT_READ, flags=0x60<EV_CLEAR|EV_RECEIPT>,
fflags=0, data=0x1, udata=0xffffffffffffffff } }
58406 sccache 1713947897.515146151 RET compat11.kevent 1
58406 sccache 1713947897.515178978 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947897.515368070 GIO fd 4 read 1 byte
0x0000 01 |.|
58406 sccache 1713947897.515396600 RET read 1
58406 sccache 1713947897.515426523 CALL read(0x4,0x371d3a2887c0,0x80)
58406 sccache 1713947897.515457073 RET read -1 errno 35 Resource
temporarily unavailable
58406 sccache 1713947897.515004494 GIO fd 5 wrote 1 byte
0x0000 01 |.|
---snip---
https://www.leidinger.net/test/sccache.tar.bz2 contains the parts of the
trace of the sccache processes (in case someone wants to have a look).
The poudriere run had several builders in parallel, at least 2 were
running at that point in time. What the overlay does is to startup
(sccache --start-server) the sccache server process (forks to return
back on the command line) which creates a file system socket, and then
it queries the stats (sccache --show-stats). So some of the traces in
the tarball are the server start (those with "Timed out waiting for
server startup" are maybe the processes which fork to start the server
and wait for it to be started), some are the stat-query, and some seem
to be a successful start in another poudriere-builder (those with a
successful /root/.ccache/sccache/5/4/<hashvalue> access look like from a
successful start in another jail). Maybe there is also a --stop-server
from poudriere somewhere.
What I noticed (except that printing the new CAP stuff for non-CAP
enabled processes by default is disturbing) is, that compat11 stuff is
called (seems the rust ecosystem is not keeping up with our speed of
development...). Not sure if it matters here that some compat stuff is
called.
Bye,
Alexander.
--
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org netchild@FreeBSD.org : PGP 0x8F31830F9F2772BF