[Bug 288983] sysutils/slurm-wlm: slurmd and slurmstepd crash due to missing sockaddr length handling in bind() / connect()

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 20 Aug 2025 23:25:34 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=288983

            Bug ID: 288983
           Summary: sysutils/slurm-wlm: slurmd and slurmstepd crash due to
                    missing sockaddr length handling in bind() / connect()
           Product: Ports & Packages
           Version: Latest
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: Individual Port(s)
          Assignee: ports-bugs@FreeBSD.org
          Reporter: rikka.goering@outlook.de

After applying the patches for bugs #288617, #288668, and #288880, both
slurmctld and slurmd start successfully and connect initially. After some
time, however, the daemons lose their connection. Submitting tasks via srun
then fails, and slurmd eventually crashes with a segmentation fault.

The root cause appears to be that several bind() and connect() calls do not
set the sockaddr length fields (sun_len, sin_len, sin6_len) correctly on
FreeBSD. Without correct lengths, the sockets are not set up properly, which
leads to runtime errors.
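
A minimal sketch of the UNIX-domain case, to illustrate the kind of handling
meant here. This is not the Slurm code and not the pending patch. The helper
names and the EXAMPLE_HAVE_SUN_LEN macro are hypothetical, and the list of
platforms that carry sun_len is an assumption. The point is that sun_len and
the length passed to bind() are computed from the same value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Assumption: these platforms have the BSD-style sun_len member. */
#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#define EXAMPLE_HAVE_SUN_LEN 1
#endif

/* Hypothetical helper: fill *sa for `path` and return the length to pass
 * to bind() / connect(), or 0 if the path does not fit in sun_path. */
socklen_t example_fill_sockaddr_un(struct sockaddr_un *sa, const char *path)
{
    assert(sa != NULL && path != NULL);

    size_t n = strlen(path);
    if (n >= sizeof(sa->sun_path))
        return 0;

    memset(sa, 0, sizeof(*sa));
    sa->sun_family = AF_UNIX;
    memcpy(sa->sun_path, path, n + 1); /* copy including the NUL */

    /* Header bytes plus the path and its terminating NUL. */
    socklen_t len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + n + 1);
#ifdef EXAMPLE_HAVE_SUN_LEN
    sa->sun_len = (unsigned char)len; /* keep the field and the argument in sync */
#endif
    return len;
}

/* Hypothetical helper: create a stream socket and bind it to `path`.
 * Returns the descriptor, or -1 on failure. */
int example_bind_unix(const char *path)
{
    struct sockaddr_un sa;
    socklen_t len = example_fill_sockaddr_un(&sa, path);
    if (len == 0)
        return -1;

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    if (bind(fd, (struct sockaddr *)&sa, len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The same length value would be passed to connect() on the client side.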

How to reproduce:
srun -N1 -w Torch -t1 /bin/hostname

Actual result:
srun: error: unable to initialize step launch listening socket: Invalid
argument
srun: Required node not available (down, drained or reserved)
srun: job 3 queued and waiting for resources

slurmd eventually segfaults.

Expected result:
Command runs successfully and prints the hostname of the worker node (here:
Torch).

Workaround:
None known, other than manually setting the sockaddr length fields
(sun_len, sin_len, sin6_len) and passing the matching length to bind() /
connect().
I am currently preparing patches for this and will upload a unified git diff
once they are finished and tested.
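
For the IPv4 / IPv6 case, a manual fix could look roughly like the sketch
below. It is only an illustration, not the patch itself. It assumes the
address is kept in a struct sockaddr_storage and that the call site would
otherwise pass a length that does not match the address family, which this
report does not confirm. The helper names and the EXAMPLE_HAVE_SA_LEN macro
are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Assumption: these platforms have sin_len / sin6_len members. */
#if defined(__FreeBSD__) || defined(__NetBSD__) || defined(__OpenBSD__) || \
    defined(__DragonFly__) || defined(__APPLE__)
#define EXAMPLE_HAVE_SA_LEN 1
#endif

/* Hypothetical helper: return the exact address length for the family
 * stored in *ss and, where the field exists, store it in sin_len or
 * sin6_len as well. Returns 0 for families it does not handle. */
socklen_t example_sockaddr_len(struct sockaddr_storage *ss)
{
    assert(ss != NULL);

    switch (ss->ss_family) {
    case AF_INET:
#ifdef EXAMPLE_HAVE_SA_LEN
        ((struct sockaddr_in *)ss)->sin_len = sizeof(struct sockaddr_in);
#endif
        return (socklen_t)sizeof(struct sockaddr_in);
    case AF_INET6:
#ifdef EXAMPLE_HAVE_SA_LEN
        ((struct sockaddr_in6 *)ss)->sin6_len = sizeof(struct sockaddr_in6);
#endif
        return (socklen_t)sizeof(struct sockaddr_in6);
    default:
        return 0;
    }
}

/* Hypothetical wrapper: bind `fd` using the family-specific length
 * instead of sizeof(struct sockaddr_storage). Returns 0 or -1. */
int example_bind_exact(int fd, struct sockaddr_storage *ss)
{
    socklen_t len = example_sockaddr_len(ss);
    if (len == 0) {
        errno = EAFNOSUPPORT;
        return -1;
    }
    return bind(fd, (struct sockaddr *)ss, len);
}
```

A connect() wrapper would follow the same pattern, with connect() in place
of bind().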
