stable/13: ARC no longer self-tuning?

From: Peter <pmc_at_citylink.dinoex.sub.org>
Date: Wed, 30 Mar 2022 13:07:48 UTC
Hi,

  while up to Rel. 12 the ZFS ARC adjusted its size to the demand, in
Rel. 13 it appears to be locked to a fixed minimum of about 100M compressed.
 
Consequently I just got a machine stall/freeze under moderate load:
no cmdline reaction (except in the guests), no login possible, all
processes in "D" state. Reset button needed, all guests and jails
destroyed:

38378  -  DJ       0:03.36 find -sx / /ext /var /usr/local /usr/ports /usr/obj 
39414  -  DJ       0:00.00 sendmail: running queue: /var/spool/mqueue (sendmail
39415  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39416  -  DJ       0:00.00 /usr/local/www/cgit/cgit.cgi
39417  -  D<       0:00.00 /usr/local/bin/ruby /ext/libexec/heatctl.rb (ruby27)
39418  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39419  -  DJ       0:00.00 sendmail: running queue: /var/spool/mqueue (sendmail
39420  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39421  -  DJ       0:00.00 sendmail: accepting connections (sendmail)
39426  -  D        0:00.00 sendmail: running queue: /var/spool/mqueue (sendmail
39427  -  D        0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39428  -  DJ       0:00.00 sendmail: Queue runner@00:03:00 for /var/spool/clien
39429  -  DJ       0:00.00 sendmail: accepting connections (sendmail)
39430  -  DJ       0:00.00 sendmail: running queue: /var/spool/clientmqueue (se
39465  -  Ds       0:00.01 newsyslog
39466  -  Ds       0:00.01 /bin/sh /usr/libexec/save-entropy
59365  -  DsJ      0:00.09 /usr/sbin/cron -s

"top", apparently the only process still running, shows this:

last pid: 39657;  load averages:  0.27,  1.24,  4.55    up 0+04:05:42  04:11:54
805 processes: 1 running, 804 sleeping
CPU:  0.1% user,  0.0% nice,  0.9% system,  0.0% interrupt, 99.0% idle
Mem: 16G Active, 5118M Inact, 1985M Laundry, 7144M Wired, 462M Buf, 905M Free
ARC: 1417M Total, 326M MFU, 347M MRU, 8216K Anon, 30M Header, 706M Other
     119M Compressed, 546M Uncompressed, 4.57:1 Ratio
Swap: 36G Total, 995M Used, 35G Free, 2% Inuse, 76K In

This is different from 12.3: there I would expect the ARC near 6G, Wired
near 11G, and swap near 5G.

The last message in the log was 20 minutes earlier:
Mar 30 03:45:17 <ntp.warn> edge ntpd[7768]: no peer for too long,
    server running free now

So, strangely, networking has also stalled. I thought networking used
device drivers separate from the disk drivers?

The effect appeared slowly: the machine became increasingly unresponsive
and laggy (in all regards of I/O) during the "periodic daily". On the
first night it runs find over a million files in all the jails, as these
are not yet in the L2ARC. Apparently that is what killed it; the finds
were still running in every jail:
35944  -  DJ       0:04.71 find -sx / /var /ext /usr/local /usr/obj /usr/ports 
36186  -  DJ       0:04.75 find -sx / /var /usr/local /usr/obj /usr/ports /dev/
37599  -  DJ       0:04.14 find -sx / /var /ext /usr/local /ext/rapp /usr/ports
38378  -  DJ       0:03.36 find -sx / /ext /var /usr/local /usr/ports /usr/obj 
...

This needs a *lot* of inodes in the cache, and the ARC seems much too
small for that.
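
For what it's worth, the metadata share of the ARC could be checked with
something like this (assuming the arcstats counters still carry these
names on 13; I could not capture them before the reset):

  # how much of the ARC is metadata (dnodes etc.), and what is the limit?
  sysctl kstat.zfs.misc.arcstats.arc_meta_used \
         kstat.zfs.misc.arcstats.arc_meta_limit \
         kstat.zfs.misc.arcstats.dnode_size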

I've not seen such behaviour before: I had ZFS running in ~2007 with
384 MB of RAM installed; now there are 32G here (which I wouldn't have
bought, I got them by accident), and that doesn't work well.

The ARC is configured in loader.conf:
# kenv
vfs.zfs.arc_max="10240M"
vfs.zfs.arc_min="1024M"

However, sysctl shows that only the maximum was carried over, while the
minimum is back at 0:
vfs.zfs.arc.max: 10737418240
vfs.zfs.arc.min: 0
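
As a workaround I will probably try to force the floor by hand, either
at runtime or via the new dotted name in loader.conf (this is just a
guess that the new-style OpenZFS parameter is writable/tunable; I give
the value in bytes to be safe):

  # force a 1G floor at runtime
  sysctl vfs.zfs.arc.min=1073741824

  # or in /boot/loader.conf, with the new-style name
  vfs.zfs.arc.min="1073741824"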

Observing the behaviour, the ARC wants to stay at or even below 1G:

last pid: 38718;  load averages:  2.12,  2.93,  2.88    up 0+01:09:08  05:30:25
625 processes: 1 running, 624 sleeping
CPU:  0.0% user,  0.1% nice,  6.3% system,  0.0% interrupt, 93.6% idle
Mem: 12G Active, 1433M Inact, 9987M Wired, 50M Buf, 8237M Free
ARC: 749M Total, 116M MFU, 254M MRU, 2457K Anon, 42M Header, 334M Other
     84M Compressed, 396M Uncompressed, 4.70:1 Ratio
Swap: 36G Total, 36G Free

There are 3 bhyve guests with 16G + 7G + 2G; these naturally create a
lot of dirty memory. The point is that this should go to swap, that's
what the SSDs are for.

The ARC only grows when there is not much activity on the system. That
may be nice for desktops, but it is no good for a solid workload. I need
it to grow under workload (which it did before, but now doesn't) and
even at the cost of paging (which doesn't even occur here).
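
My guess is that the ARC now yields to the page daemon much earlier than
before. If the old knob still exists under the same name on 13,
comparing the two targets might show how aggressively it backs off (the
names here are what I remember from 12, so take them with a grain of
salt):

  # how many free pages the ARC tries to keep vs. what the VM itself wants
  sysctl vfs.zfs.arc_free_target vm.stats.vm.v_free_target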

Do we have some new knobs to tune?
This one appears to already be zero by default:
  vfs.zfs.arc.grow_retry: 0
And what is this one doing?
  vfs.zfs.arc.p_dampener_disable=1
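
For reference, the whole new set of knobs and their one-line
descriptions can be dumped with sysctl -d, which as far as I know prints
the description instead of the value:

  # list every ARC tunable with its one-line description
  sysctl -d vfs.zfs.arc
  # and the one I am wondering about
  sysctl -d vfs.zfs.arc.p_dampener_disable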

Do I need to read all the code? There are lots of other things that
worked on 12.3 and now fail or crash, like net/dhcpcd (now crashes in
libc), or mountd not understanding the ZFS exports (the syntax changed
and doesn't match the manpage; it didn't in 12.3 either, but
differently), and I only have two eyes (and they don't get better with
age).

What the ARC would need is an affinity balance: should it prefer to try
and grow towards arc_max even under load (server use with a
well-configured arc_max), or should it shrink away as soon as there is
some serious activity on the system (gamer and bloated-browser use)?