Re: 13.3 troubles under load

From: Edward Sanford Sutton, III <mirror176_at_hotmail.com>
Date: Tue, 02 Apr 2024 07:56:38 UTC
On 4/2/24 00:20, Andrea Venturoli wrote:
 > Hello.
 >
 > Now that 13.3 is out, and given the relatively short overlap support
 > window, I started upgrading my 13.2 machines as soon as I had the chance.
 >
 > However, I'm experiencing some troubles under load (in cases where every
 > version up to 13.2 has always worked without troubles).
 >
 >
 >
 > Scenario 1:
 >
 > Box A is ZFS/SSD based, but has a UFS HD (with only specific data)
 > which is exported via NFSv4.
 > Box B mounts that NFSv4 share and backs it up to a UFS/USB disk via
 > rsync.
 > This has always worked fine until I upgraded box A to 13.3.
 > Now, while rsync does its job, box A starts crawling: Nagios reports
 > several failures (either daemons which die or daemons which are no
 > longer able to answer timely) and logging in via SSH becomes almost
 > impossible (with already open sessions almost unusable).
 >
 > System is on ZFS so it should not be affected by the load on the UFS HD;
 > besides, a single UFS HD should not be able to provide so much load to
 > halt an 8 core system with 32GiB of RAM.
 > Is it possible that such not so high network traffic (lagg with two em
 > cards) brings this box to an almost halt?
 > Unfortunately, so far I don't have any useful logs.
 >
 >
 >
 > Scenario 2:
 >
 > A box is running with several services (including two clamd instances in
 > two different jails). Once a week, it connects to a NAS via Bacula and
 > copies ~1TB of data to an external UFS HD.
 > As in the previous example, after I upgraded to 13.3 this simple
 > operation (which has worked for several years) has started to be
 > problematic, as daemons are killed all through it:
 >> Apr  1 20:01:31 xxxxxxx kernel: pid 11753 (clamd), jid 3, uid 26, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:02:18 xxxxxxx kernel: pid 11720 (clamd), jid 5, uid 26, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:03:16 xxxxxxx kernel: pid 3707 (squid), jid 3, uid 100, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:03:54 xxxxxxx kernel: pid 7400 (zeek), jid 7, uid 782, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:04:25 xxxxxxx kernel: pid 1813 (snort), jid 0, uid 0, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:05:59 xxxxxxx kernel: pid 7399 (zeek), jid 7, uid 782, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:05:59 xxxxxxx kernel: pid 1820 (snort), jid 0, uid 0, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:06:48 xxxxxxx kernel: pid 44493 (perl), jid 5, uid 26, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:07:22 xxxxxxx kernel: pid 44512 (perl), jid 5, uid 26, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:09:23 xxxxxxx kernel: pid 7254 (zeek), jid 7, uid 782, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:10:17 xxxxxxx kernel: pid 14462 (mysqld), jid 11, uid 88,
 >> was killed: a thread waited too long to allocate a page
 >> Apr  1 20:10:17 xxxxxxx kernel: pid 83231 (smbd), jid 8, uid 0, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:10:17 xxxxxxx kernel: pid 28868 (smbd), jid 8, uid 0, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:10:17 xxxxxxx kernel: pid 92611 (smbd), jid 8, uid 0, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:12:20 xxxxxxx kernel: pid 77438 (clamd), jid 3, uid 26, was
 >> killed: a thread waited too long to allocate a page
 >> Apr  1 20:13:47 xxxxxxx kernel: pid 77473 (clamd), jid 5, uid 26, was
 >> killed: a thread waited too long to allocate a page
 >
 > Again, system/swap is on an SSD ZFS RAID pool, so disk load on the UFS
 > USB HD shouldn't hamper its throughput.
 > This time network is still a lagg, but with igb cards (so a similar
 > driver).
 >
 >
 >
 > Any hint what to look for?

Look for a kernel thread named arc_prune using a lot of CPU: launch top and
press S and H to display system processes and threads. If that is the
issue, consider reverting to 13.2, upgrading to 14, or applying the
relevant patches for 13.3 as mentioned in
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594
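
If interactive top is too sluggish to use while the box is crawling, the
same check could be scripted. The following is only a rough sketch, not
something taken from the bug report: the function names are made up for
illustration, and it assumes FreeBSD's top(1) accepts -S (system
processes), -H (threads), -b (batch mode) and -d (number of displays), and
that the busy thread's line contains the string "arc_prune".

```python
import subprocess


def arc_prune_lines(top_output: str) -> list[str]:
    """Return the lines of top output that mention arc_prune."""
    return [
        line.strip()
        for line in top_output.splitlines()
        if "arc_prune" in line
    ]


def sample_top() -> str:
    """Take one batch-mode snapshot from top, including kernel threads.

    Assumption: FreeBSD top(1), where -S shows system processes,
    -H shows threads, -b selects batch mode and -d 1 prints one display.
    """
    result = subprocess.run(
        ["top", "-S", "-H", "-b", "-d", "1"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Typical use on the affected box while the backup is running:
#     for line in arc_prune_lines(sample_top()):
#         print(line)
```

Batch mode only lists a limited number of entries, so an idle arc_prune
thread may not show up at all; one that is eating CPU should sort near the
top and be caught.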

 > Is there some known problem with LAGG, if_em/if_igb, USB, UFS, other?
 >
 >   bye & Thanks
 >      av.
 >