Re: Odd "swp_pager_getswapspace(??): failed"s happen during bulk -Ca for RAM+SWAP=704 GiBytes

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sun, 27 Jul 2025 07:33:16 UTC

On Jul 23, 2025, at 01:42, Mark Millard <marklmi@yahoo.com> wrote:

> In a context with RAM+SWAP = 704 GiBytes (192 GiBytes being RAM,
> 512 GiBytes being SWAP) doing poudriere bulk -Ca builds at some
> point ends up with reports like:
> 
> swp_pager_getswapspace(22): failed
> 
> and:
> 
> was killed: failed to reclaim memory
> 
> for 12 builders, MAKE_JOBS_NUMBER=3 , TMPFS_BLACKLIST
> in use, 32 FreeBSD cpus, etc.
> 
> For example:
> 
> . . .
> Jul 22 10:17:27 7950X3D-ZFS kernel: pid 62915 (scc_16815), jid 780, uid 0: exited on signal 11 (core dumped)
> Jul 22 21:38:10 7950X3D-ZFS kernel: ue0: link state changed to DOWN
> Jul 22 21:38:10 7950X3D-ZFS kernel: ue0: link state changed to UP
> Jul 22 21:38:29 7950X3D-ZFS kernel: swap_pager: out of swap space
> Jul 22 21:38:29 7950X3D-ZFS kernel: swp_pager_getswapspace(22): failed
> Jul 22 21:39:11 7950X3D-ZFS kernel: pid 15059 (dot), jid 780, uid 0, was killed: failed to reclaim memory
> Jul 22 21:43:38 7950X3D-ZFS kernel: swap_pager: out of swap space
> Jul 22 21:43:38 7950X3D-ZFS kernel: swp_pager_getswapspace(14): failed
> Jul 22 21:44:04 7950X3D-ZFS kernel: pid 15049 (dot), jid 780, uid 0, was killed: failed to reclaim memory
> Jul 22 21:56:39 7950X3D-ZFS kernel: swap_pager: out of swap space
> Jul 22 21:56:39 7950X3D-ZFS kernel: swp_pager_getswapspace(15): failed
> Jul 22 21:57:12 7950X3D-ZFS kernel: pid 15045 (dot), jid 780, uid 0, was killed: failed to reclaim memory
> 
> I've not figured out a way to track down such messages
> back to the relevant log file for the builds that were
> killed. Neither the pid, nor the jid appear in
> the log files. Similarly, nothing in /var/log/messages
> identifies the poudriere Job Id or other such.
> 
> (I've never happened to be actively monitoring when
> the issue happened. So I've always ended up looking at
> it after the fact.)
> 
> It would be nice to be able to identify what specific
> packages to try to rebuild for these --and to investigate
> why the SWAP usage that had stayed under 2 GiByte ended
> up reaching 512 GiBytes during that period.

A panic during another bulk -Ca test led to a dump that
provided enough context to track down the package that
was being built when the issue occurred, and what it was
running that, in turn, had the problematic memory usage:

[2D:01:22:29] [06] [00:00:00] Building   graphics/sdl2_gpu | sdl2_gpu-0.12.0

was using:

UID   PID  PPID  C PRI NI       VSZ      RSS MWCHAN   STAT TT          TIME COMMAND
. . .
 0 79229 40923  4  59  0     23524     4148 wait     D     -       0:00.00 [sh]
 0 79230 79229  5  59  0     14208      172 wait     Ds    -       0:00.01 [make]
 0 79233 79230  4  59  0     14668      176 wait     D     -       0:00.00 [sh]
 0 79234 79233  5  59  0     14668      176 wait     D     -       0:00.00 [sh]
 0 79235 79234 12   0  0     16284      356 select   D     -       0:00.01 [ninja]
 0 79236 79235 28  59  0    223048     1052 uwait    D     -       0:00.44 [doxygen]
 0 79272 79236 25  59  0 157589964 41424308 pfault   D     -       3:25.33 [dot]
 0 79279 79236 31  59  0 157601740 41513520 pfault   D     -       3:23.41 [dot]
 0 79289 79236 14  59  0 157589964 41361600 pfault   D     -       3:22.72 [dot]
 0 79301 79236 18  49  0 157667276 41208476 pfault   D     -       3:24.32 [dot]
. . .
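
To put the listing above in terms of the 192 GiBytes of RAM and
704 GiBytes of RAM+SWAP, the VSZ and RSS columns can be summed per
command. The following is only a sketch: it assumes the VSZ and RSS
columns are in KiBytes (as FreeBSD's ps(1) reports them) and that every
column in a row is populated. The names PS_ROWS and totals_by_command
are hypothetical helpers, not part of any tool.

```python
# Rows copied from the ps-style listing in the dump (only the rows shown).
PS_ROWS = """\
 0 79236 79235 28  59  0    223048     1052 uwait    D     -       0:00.44 [doxygen]
 0 79272 79236 25  59  0 157589964 41424308 pfault   D     -       3:25.33 [dot]
 0 79279 79236 31  59  0 157601740 41513520 pfault   D     -       3:23.41 [dot]
 0 79289 79236 14  59  0 157589964 41361600 pfault   D     -       3:22.72 [dot]
 0 79301 79236 18  49  0 157667276 41208476 pfault   D     -       3:24.32 [dot]
"""

# Assumption: VSZ and RSS are reported in KiBytes.
KIB_PER_GIB = 1024 * 1024


def totals_by_command(text):
    """Return {command: (total_vsz_kib, total_rss_kib)} for the given rows."""
    totals = {}
    for row in text.splitlines():
        fields = row.split()
        # Expect: UID PID PPID C PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
        if len(fields) != 13 or not fields[1].isdigit():
            continue  # skip headers, ". . ." lines, and anything unexpected
        command = fields[12].strip("[]")
        vsz, rss = int(fields[6]), int(fields[7])
        old_vsz, old_rss = totals.get(command, (0, 0))
        totals[command] = (old_vsz + vsz, old_rss + rss)
    return totals


TOTALS = totals_by_command(PS_ROWS)

if __name__ == "__main__":
    for command, (vsz, rss) in sorted(TOTALS.items()):
        print(f"{command}: VSZ {vsz / KIB_PER_GIB:.2f} GiB, "
              f"RSS {rss / KIB_PER_GIB:.2f} GiB")
```

Under those assumptions, the four dot processes together account for
about 601.24 GiBytes of VSZ and about 157.84 GiBytes of RSS.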

Part of the context was the /06/ text in:
. . .
root     dot        79301    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
root     dot        79289    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
. . .
root     dot        79279    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
. . .
root     dot        79272    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
. . .
root     doxygen    79236    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
. . .
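
Picking the builder number out of such lines can be automated. The
sketch below assumes the column layout shown above (user, command, pid,
fd, path, ...) and that the builder's directory is the all-digits path
component directly under the .m/<name>/ directory, as in these lines.
FSTAT_ROWS and builders_by_pid are hypothetical names used only for
illustration.

```python
import re

# Rows copied from the open-file listing in the dump (only the rows shown).
FSTAT_ROWS = """\
root     dot        79301    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
root     dot        79289    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
root     dot        79279    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
root     dot        79272    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
root     doxygen    79236    0 /usr/local/poudriere/data/.m/main-ZNV4-bulk_a-alt/06/dev     20 crw-rw-rw-    null  r
"""

# Assumption: the path looks like .../.m/<name>/<NN>/... with <NN> all digits.
BUILDER_PATH = re.compile(r"/\.m/(?P<master>[^/]+)/(?P<builder>\d+)(?=/|$)")


def builders_by_pid(text):
    """Return {pid: (command, master_name, builder_id)} for matching rows."""
    result = {}
    for row in text.splitlines():
        fields = row.split()
        if len(fields) < 5 or not fields[2].isdigit():
            continue  # skip ". . ." lines and anything unexpected
        match = BUILDER_PATH.search(fields[4])
        if match:
            result[int(fields[2])] = (
                fields[1],
                match.group("master"),
                match.group("builder"),
            )
    return result


BUILDERS = builders_by_pid(FSTAT_ROWS)

if __name__ == "__main__":
    for pid, (command, master, builder) in sorted(BUILDERS.items()):
        print(f"pid {pid} ({command}): builder [{builder}] of {master}")
```

The builder id is kept as a string so that the leading zero matches the
[06] form used in the poudriere output.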

That identifies the [06] builder, and the "Building" notice had made it
to disk before the panic happened. I could then check the port's Makefile
for whether doxygen was used, and it was. Historical graphics/sdl2_gpu
build logs suggest that problems exist.
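
The last step, going from a builder number to the package it was
building, can be sketched as a scan for the most recent "Building"
notice per builder. This assumes the notice format shown earlier
([elapsed] [builder] [job time] Building origin | pkgname). The names
LOG_TEXT and last_building are hypothetical, and the input here is just
the one notice quoted above.

```python
import re

# The one "Building" notice quoted in this message.
LOG_TEXT = """\
[2D:01:22:29] [06] [00:00:00] Building   graphics/sdl2_gpu | sdl2_gpu-0.12.0
"""

# Assumption: notices look like "[elapsed] [NN] [time] Building origin | pkg".
BUILDING = re.compile(
    r"^\[(?P<elapsed>[^\]]+)\] \[(?P<builder>\d+)\] \[[^\]]+\] "
    r"Building\s+(?P<origin>\S+) \| (?P<pkg>\S+)"
)


def last_building(text):
    """Return {builder_id: (origin, pkgname)}, keeping the latest notice."""
    latest = {}
    for line in text.splitlines():
        match = BUILDING.match(line)
        if match:
            # Later lines overwrite earlier ones for the same builder.
            latest[match.group("builder")] = (
                match.group("origin"),
                match.group("pkg"),
            )
    return latest


LATEST = last_building(LOG_TEXT)

if __name__ == "__main__":
    origin, pkg = LATEST["06"]
    print(f"builder [06] was building {origin} ({pkg})")
```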

===
Mark Millard
marklmi at yahoo.com