13.3 troubles under load

From: Andrea Venturoli <ml_at_netfence.it>
Date: Tue, 02 Apr 2024 07:20:42 UTC
Hello.

Now that 13.3 is out, and given the relatively short support overlap 
window, I started upgrading my 13.2 machines as soon as I had the chance.

However, I'm experiencing some trouble under load, in workloads where 
every version up to 13.2 has always worked fine.



Scenario 1:

Box A is ZFS/SSD based, but has a UFS HD (holding only specific data) 
which is exported via NFSv4.
Box B mounts that NFSv4 share and backs it up to a UFS/USB disk via rsync.
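
For concreteness, the moving parts boil down to something like this 
(hostnames and paths are placeholders, not the real ones):

  # box A, /etc/exports -- the UFS HD is mounted at /data (placeholder path)
  V4: / -sec=sys
  /data -maproot=root boxB

  # box B, the weekly backup job, roughly:
  mount -t nfs -o nfsv4 boxA:/data /mnt/data
  rsync -a --delete /mnt/data/ /mnt/usb-backup/
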
This has always worked fine until I upgraded box A to 13.3.
Now, while rsync does its job, box A slows to a crawl: Nagios reports 
several failures (daemons that die, or daemons that no longer answer in 
time), and logging in via SSH becomes almost impossible (already open 
sessions are nearly unusable).

The system is on ZFS, so it should not be affected by the load on the 
UFS HD; besides, a single UFS HD should not be able to generate enough 
load to grind an 8-core system with 32 GiB of RAM to a halt.
Could such moderate network traffic (a lagg over two em cards) really 
bring this box to a near standstill?
Unfortunately, so far I don't have any useful logs.
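
Next time it happens I'll try to collect some numbers on box A while the 
rsync runs; I was planning on something like:

  vmstat 1                # paging and memory pressure over time
  gstat -p                # per-disk I/O load
  top -SH                 # busiest processes and kernel threads
  netstat -w 1 -I lagg0   # traffic through the lagg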



Scenario 2:

A box runs several services (including two clamd instances in two 
different jails). Once a week, it connects to a NAS via Bacula and 
copies ~1 TB of data to an external UFS HD.
As in the previous scenario, after I upgraded to 13.3 this simple 
operation (which has worked for several years) has become problematic, 
with daemons being killed all through it:
> Apr  1 20:01:31 xxxxxxx kernel: pid 11753 (clamd), jid 3, uid 26, was killed: a thread waited too long to allocate a page
> Apr  1 20:02:18 xxxxxxx kernel: pid 11720 (clamd), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr  1 20:03:16 xxxxxxx kernel: pid 3707 (squid), jid 3, uid 100, was killed: a thread waited too long to allocate a page
> Apr  1 20:03:54 xxxxxxx kernel: pid 7400 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr  1 20:04:25 xxxxxxx kernel: pid 1813 (snort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
> Apr  1 20:05:59 xxxxxxx kernel: pid 7399 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr  1 20:05:59 xxxxxxx kernel: pid 1820 (snort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
> Apr  1 20:06:48 xxxxxxx kernel: pid 44493 (perl), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr  1 20:07:22 xxxxxxx kernel: pid 44512 (perl), jid 5, uid 26, was killed: a thread waited too long to allocate a page
> Apr  1 20:09:23 xxxxxxx kernel: pid 7254 (zeek), jid 7, uid 782, was killed: a thread waited too long to allocate a page
> Apr  1 20:10:17 xxxxxxx kernel: pid 14462 (mysqld), jid 11, uid 88, was killed: a thread waited too long to allocate a page
> Apr  1 20:10:17 xxxxxxx kernel: pid 83231 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr  1 20:10:17 xxxxxxx kernel: pid 28868 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr  1 20:10:17 xxxxxxx kernel: pid 92611 (smbd), jid 8, uid 0, was killed: a thread waited too long to allocate a page
> Apr  1 20:12:20 xxxxxxx kernel: pid 77438 (clamd), jid 3, uid 26, was killed: a thread waited too long to allocate a page
> Apr  1 20:13:47 xxxxxxx kernel: pid 77473 (clamd), jid 5, uid 26, was killed: a thread waited too long to allocate a page
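
If it helps, this kill reason seems to come from the page-fault side of 
the OOM logic rather than from actually running out of swap; the sysctls 
below look related, but that's just my reading, so please correct me if 
tuning them is the wrong direction:

  # knobs that appear to govern this OOM path
  sysctl vm.pfault_oom_attempts vm.pfault_oom_wait vm.pageout_oom_seq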

Again, system and swap are on an SSD ZFS RAID pool, so I/O load on the 
UFS USB HD shouldn't starve the rest of the system.
This time the network is still a lagg, but over igb cards (so a similar driver).
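
For completeness, both laggs are plain rc.conf setups along these lines 
(interface names, laggproto and address are placeholders):

  ifconfig_igb0="up"
  ifconfig_igb1="up"
  cloned_interfaces="lagg0"
  ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1 192.0.2.10/24"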



Any hint on what to look for?
Is there some known problem with lagg, if_em/if_igb, USB, UFS, or something else?

  bye & Thanks
	av.