[Bug 275594] High CPU usage by arc_prune; analysis and fix

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 23 Feb 2024 19:25:52 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #67 from Peter Much <pmc@citylink.dinoex.sub.org> ---
So, now I have read all the material here. Great work!

I had upgraded my deploy engine from 13.2-RELEASE to 13.3-BETA, and found
(among some spurious messages from git) that it can no longer build gcc12.

There is apparently no problem with rust or llvm15, but a gcc12 build
reproducibly crashes (10 cores, 16081M RAM). The crash appears to happen
when gcc first brings its LTO up to full parallelism:

last pid: 37369;  load averages:  9.35,  9.93,  9.27    up 0+03:15:25  07:21:42
417 threads:   14 running, 379 sleeping, 24 waiting
CPU: 55.4% user,  0.0% nice, 35.6% system,  0.1% interrupt,  8.8% idle
Mem: 7047M Active, 6121M Inact, 2392M Wired, 984M Buf, 60M Free
ARC: 518M Total, 45M MFU, 451M MRU, 128K Anon, 3990K Header, 17M Other
     467M Compressed, 997M Uncompressed, 2.14:1 Ratio
Swap: 15G Total, 15G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root         -8    -     0B  2432K CPU4     4   3:14  99.79% kernel{arc_p
    7 root        -16    -     0B    48K CPU6     6   2:45  99.79% pagedaemon{d
   15 root         52    -     0B    16K CPU0     0   3:00  99.70% vnlru
37334 root         52    0   891M   789M pfault   1   0:37  89.24% lto1
37270 root         52    0  1017M   915M pfault   3   0:43  88.63% lto1
37324 root         52    0   831M   770M pfault   8   0:39  88.59% lto1
37338 root         52    0   843M   785M pfault   2   0:36  88.50% lto1
37333 root         52    0   889M   788M pfault   7   0:37  82.76% lto1
37269 root         52    0  1001M   882M pfault   5   0:42  82.09% lto1
37274 root         52    0  1004M   885M pfault   9   0:42  80.24% lto1
    5 root         20    -     0B  1568K t->zth   9   0:02   1.02% zfskern{arc_
37360 root         20    0    14M  4940K CPU9     9   0:00   0.87% top

This is the last output; at this point the system becomes unresponsive and,
when allowed neither to OOM-kill nor to panic, continues to consume 300% CPU.
Apparently these are the three visible riders of the apocalypse (arc_prune,
pagedaemon, vnlru) entertaining themselves. :/
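
For reference, the 300% figure can be checked against the three kernel-thread
rows in the snapshot above. The following is only a small sketch: the rows are
copied verbatim from that output, and the helper name wcpu_by_command is made
up for illustration, not part of any tool.

```python
import re

# The three kernel-thread rows, copied from the first top(1) snapshot.
TOP_ROWS = """\
    0 root         -8    -     0B  2432K CPU4     4   3:14  99.79% kernel{arc_p
    7 root        -16    -     0B    48K CPU6     6   2:45  99.79% pagedaemon{d
   15 root         52    -     0B    16K CPU0     0   3:00  99.70% vnlru
"""


def wcpu_by_command(rows):
    """Map each (truncated) COMMAND column to its WCPU percentage.

    Hypothetical helper: it assumes WCPU and COMMAND are the last two
    columns, as in the snapshot above.
    """
    usage = {}
    for line in rows.splitlines():
        match = re.search(r"(\d+\.\d+)%\s+(\S+)\s*$", line)
        if match:
            usage[match.group(2)] = float(match.group(1))
    return usage


usage = wcpu_by_command(TOP_ROWS)
total = sum(usage.values())
print(f"{total:.2f}% across {len(usage)} kernel threads")
```

So three of the ten cores are fully occupied by arc_prune, pagedaemon and
vnlru alone.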

Applying the patch (i.e. the five new git commits from the GitHub repo) solves
the issue; afterwards it looks like this:

last pid: 11944;  load averages:  7.13,  5.29,  5.77    up 0+03:48:45  16:12:46
424 threads:   19 running, 381 sleeping, 24 waiting
CPU: 67.9% user,  0.0% nice,  5.1% system,  0.0% interrupt, 27.0% idle
Mem: 9308M Active, 2285M Inact, 20M Laundry, 3643M Wired, 865M Buf, 336M Free
ARC: 1638M Total, 855M MFU, 575M MRU, 128K Anon, 11M Header, 198M Other
     1305M Compressed, 2980M Uncompressed, 2.28:1 Ratio
Swap: 15G Total, 15G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
11579 root        103    0  1269M  1066M CPU6     6   4:09 100.00% lto1
11605 root        103    0  1263M  1052M CPU3     3   4:08  99.87% lto1
11589 root        103    0  1295M  1091M CPU8     8   4:09  99.87% lto1
11599 root        103    0  1259M  1027M CPU9     9   4:08  99.87% lto1
11588 root        103    0  1263M  1035M CPU7     7   4:09  99.87% lto1
11590 root        103    0  1287M  1058M CPU5     5   4:08  99.87% lto1
11598 root        103    0  1311M  1082M CPU1     1   4:08  99.74% lto1
    0 root         -8    -     0B  2448K -        6   0:03   6.83% kernel{arc_p
    5 root         -8    -     0B  1568K RUN      9   0:03   5.80% zfskern{arc_
    7 root        -16    -     0B    48K psleep   2   0:37   3.11% pagedaemon{d

I'm a bit worried that the system is still reluctant to page out (swap remains
entirely unused), but otherwise this looks good.
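
To put the two snapshots side by side, here is a rough sketch that parses the
"Mem:" lines quoted above and prints the change per field. The lines are
copied from the two top(1) outputs; parse_mem is a hypothetical helper and
assumes every value on the line is given in megabytes with an "M" suffix.

```python
import re

# "Mem:" lines copied from the two top(1) snapshots (before / after the patch).
MEM_BEFORE = "Mem: 7047M Active, 6121M Inact, 2392M Wired, 984M Buf, 60M Free"
MEM_AFTER = ("Mem: 9308M Active, 2285M Inact, 20M Laundry, 3643M Wired, "
             "865M Buf, 336M Free")


def parse_mem(line):
    """Return {field: megabytes} for a 'Mem:' line whose values all end in M."""
    return {name: int(size) for size, name in re.findall(r"(\d+)M (\w+)", line)}


before, after = parse_mem(MEM_BEFORE), parse_mem(MEM_AFTER)
for field in ("Active", "Inact", "Wired", "Free"):
    delta = after[field] - before[field]
    print(f"{field:6s} {before[field]:5d}M -> {after[field]:5d}M ({delta:+d}M)")
```

With the patch, Wired and Free both grow while Inact shrinks; the Laundry
field appears only in the second snapshot.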
