Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task

From: <vester.thacker_at_fastmail.fm>
Date: Tue, 27 Feb 2024 01:59:59 UTC
Given the complexity of your situation, a combination of adjustments might be necessary to find a balance that works for your specific needs. It's often a process of trial and error, tweaking one setting at a time and observing the impact on your system's performance and responsiveness.

The issues you're experiencing with system responsiveness while running port builds with Poudriere on FreeBSD, especially with high CPU load conditions and specific system configurations, can be complex and may involve multiple factors. Here are several areas to consider and investigate further:

1. System Configuration and Resource Allocation:
   - With only 2 cores plus hyperthreading enabled on your i7-3820, there's a notable limit on how much concurrent work the CPU can handle. Hyperthreading helps some workloads, but the two threads on a core share its execution units, so CPU-bound compiles such as port builds gain far less from it than from additional physical cores.
   - Consider enabling more cores in BIOS if possible to distribute the workload more effectively across the CPU.

2. Usage of `nice` and `idprio`:
   - Using `nice` and `idprio` to adjust the priority of the build processes is a good strategy to maintain system responsiveness. However, if the system is already under heavy load, even lower-priority processes can contribute to responsiveness issues, especially if they are I/O bound or involve significant memory usage.
   - Experiment with different priority levels and observe the system's behavior. Also, consider limiting the number of concurrent jobs (`-J` option in Poudriere) to a number that's more manageable for your system's capabilities.

3. Monitoring Tools:
   - Use tools like `top`, `vmstat`, `iostat`, and `zpool iostat` (or `zfs-stats` from the sysutils/zfs-stats port) to monitor system performance in real time. Pay attention to disk busy time, memory usage, and CPU idle time. These metrics can help identify bottlenecks in your system.
   - For detailed ZFS performance analysis, consider using `DTrace` scripts that can provide insights into how ZFS operations are performing under load.

4. Software and Security Updates:
   - Ensure that your system is up-to-date with the latest patches and updates. While the advisory you mentioned (FreeBSD-EN-23:18.openzfs) may not apply, staying current can help avoid known issues and may include performance improvements or bug fixes that could indirectly address your concerns.

5. Hardware Considerations:
   - Although you're already using an SSD, it's essential to make sure it has enough free space and is not near its write endurance limit. SSD performance can degrade as the drive fills up or ages, which affects overall system performance.
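
To make the job-count suggestion in item 2 concrete, here is a minimal sketch. The helper names `total_jobs` and `max_builders` are made up for this example and are not part of Poudriere; the numbers are the ones from your message (2 builders, MAKE_JOBS_NUMBER=4, 4 logical CPUs).

```sh
#!/bin/sh
# Hypothetical helpers for sizing a bulk build; not part of poudriere.

# Worst-case simultaneous compile processes:
# number of builders times make jobs per builder.
total_jobs() {
    echo $(( $1 * $2 ))
}

# Largest builder count that keeps builders * make_jobs at or below
# the number of logical CPUs (never less than 1).
max_builders() {
    ncpu=$1
    make_jobs=$2
    n=$(( ncpu / make_jobs ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

total_jobs 2 4      # 2 builders x MAKE_JOBS_NUMBER=4, prints 8
max_builders 4 4    # 4 logical CPUs, 4 make jobs each, prints 1
```

On FreeBSD the logical CPU count is available from `sysctl -n hw.ncpu`, so something like `poudriere bulk -J "$(max_builders "$(sysctl -n hw.ncpu)" 4)" -j local -p local -f /root/prime-origins` would keep builders times make jobs at or below the CPU count. Whether that is the right ceiling for your box is something to test rather than assume.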

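For item 3, a small wrapper can capture a snapshot while the lag is actually happening. This is only a sketch: `run_if_present` and `snapshot` are hypothetical names, the flags are the FreeBSD ones as I understand them, and the sample counts are arbitrary.

```sh
#!/bin/sh
# Run a command only if it is installed, so the script degrades
# gracefully on a box that lacks one of the tools.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found"
    fi
}

# One pass over the tools named above. Intended for FreeBSD; the
# flags mean different things on other systems.
snapshot() {
    # Batch mode, include system processes and threads, one display,
    # top 20 entries (this is where kernel{arc_prune} would show up).
    run_if_present top -b -SH -d 1 20
    # Three one-second samples of memory, paging and CPU counters.
    run_if_present vmstat -w 1 -c 3
    # Three one-second samples of extended per-device disk statistics.
    run_if_present iostat -x -w 1 -c 3
    # Three one-second samples of per-vdev pool I/O.
    run_if_present zpool iostat -v 1 3
}
```

Running `snapshot > /tmp/snapshot.txt` once when the system is smooth and once when it is laggy gives two files to compare.
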
If using ZFS, here are other considerations:

1. High Kernel Activity with `arc_prune`:
   - The `arc_prune` kernel thread belongs to the ZFS Adaptive Replacement Cache (ARC). High CPU usage by `arc_prune` suggests that the system is working hard to reclaim memory from the ARC, possibly because of memory pressure or intensive disk I/O. Port builds stress both, and compressing packages with zstd at its highest level adds further CPU load on top of that.
   - Investigating ZFS and ARC settings may provide some insights. Consider adjusting ARC limits (`vfs.zfs.arc_max` and `vfs.zfs.arc_min`) to optimize memory usage. However, be cautious with adjustments to ensure they fit your workload and system capabilities.

2. OpenZFS from Ports:
   - Trying OpenZFS from ports might offer newer features or performance improvements, but it's crucial to ensure compatibility with your FreeBSD version and to have a reliable backup before making significant changes. Changes in ZFS versions can affect performance and behavior, so this could be a double-edged sword.
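
For the ARC limits mentioned in item 1, here is a rough sketch of how one might inspect and cap the ARC. The 8 GiB figure is an arbitrary example, not a recommendation for your 32GB machine, and `gib_to_bytes` is a hypothetical helper. The FreeBSD-specific commands are left as comments because they need root and change system behavior.

```sh
#!/bin/sh
# Hypothetical helper: convert whole GiB to the byte count that
# vfs.zfs.arc_max expects.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8    # prints 8589934592

# On the FreeBSD box (assumed usage, run as root):
#
# Current ARC size and its current ceiling, in bytes:
#   sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
#
# Try a cap at runtime:
#   sysctl vfs.zfs.arc_max="$(gib_to_bytes 8)"
#
# Keep it across reboots:
#   printf 'vfs.zfs.arc_max="%s"\n' "$(gib_to_bytes 8)" >> /boot/loader.conf
```

Change one value at a time and watch whether the `arc_prune` CPU time in `top -SH` moves before adjusting anything else.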


-vester


On Tue, Feb 27, 2024, at 09:24, Edward Sanford Sutton, III wrote:
> Currently trying to do port builds with 
> poudriere-devel-3.4.99.20240122 on 13.3-STABLE FreeBSD 13.3-STABLE 
> stable/13-n257396-134580c103b4 GENERIC amd64 and have had poor system 
> responsiveness on this and a -STABLE that was likely at least 2 months 
> before it with `/usr/bin/nice -n 18 /usr/sbin/idprio 31 poudriere bulk 
> -J2:12 -j local -p local -f /root/installed-port-list -f 
> /root/prime-origins` and also launching it with no nice change and just 
> starting with 'idprio 31' (in case it would react different with the 
> builtin idprio). It used to be that under just idprio 31 I could continue 
> to use the machine during builds with minor impact on responsiveness 
> with MAKE_JOBS_NUMBER=4 on a 2 core + hyperthreading i7 which led to an 
> oversaturation of 8 build processes across the 4 virtual cores yet 
> stayed mostly smooth.
>    Responsiveness seems to have strange effects of laggy mouse/keyboard 
> input and thought I recall logs saying system couldn't keep up with the 
> mouse communication, programs intermittently freezing/unfreezing, and 
> even building becomes unstable with electron28 failing as runaway 
> process after over an hour when killed as runaway process during extract 
> phase. During the lag, programs seem to take many times as long to 
> respond while also getting high cpu use during that time such as tmux 
> switching between windows; ctrl b+ctrl+p getting almost 100% cpu core 
> time for about 5 seconds added to it under top. blacklistd has taken 
> about 24 minutes of CPU time for < 3 days uptime per top while managing 
> a usual slow bot break-in attempt on ssh.
>    More recently looked and see top showing threads+system processes 
> shows I have one core getting 100% cpu for kernel{arc_prune} which has 
> 21.2 hours over a 2 hour 23 minute uptime. Previously I know I have seen 
> higher system % time than I'd expected but not always sure when it is 
> justified or not. I started looking to see if 
> https://www.freebsd.org/security/advisories/FreeBSD-EN-23:18.openzfs.asc 
> was available as a fix for 13 but it is not (and doesn't quite sound 
> like it was supposed to apply to this issue). Would a kernel thread time 
> at 100% cpu for only 1 core explain the system becoming unusually 
> unresponsive? cron was stopped after last boot so shouldn't be throwing 
> up any unexpected background work.
>    System has 32GB ddr3 RAM with i7-3820 processor with only 2 enabled 
> cores + hyperthreading in BIOS. Issue appears specifically when CPU load 
> is high and idle% in top is 0.0% but it is only sometimes present under 
> that condition. I usually use ccache and WITH_META_MODE to try to speed 
> up compiling base and ports and have zstd at best compression for 
> packages in hopes of faster extraction at the tradeoff of more disk space.
>    I haven't tried yet but considered trying OpenZFS from ports. Any 
> suggestions of what else to look at or watch for?
> Thanks,
> Edward Sanford Sutton, III