Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sun, 30 Apr 2023 07:50:02 UTC
On Apr 29, 2023, at 19:44, Mark Millard <marklmi@yahoo.com> wrote:

> This is based on: main-n262658-b347c2284603-dirty, b347c2284603
> being from late Apr 28, 2023 UTC. (The "-dirty" is from some
> historical patches that I use.) The build is a non-debug build
> (but with symbols not stripped). World or Kernel had been built
> using:
> 
> -mcpu=cortex-a78c+flagm+nofp16fml
> 
> just for testing purposes. (Worked nicely for -j8 buildworld
> buildkernel testing for the 4 cortex-a78c's plus 4 cortex-x1c's
> present.)
> 
> Monitoring poudriere bulk related activity via top and gstat -spod
> I see a lot of the odd result of one process doing something
> like:
> 
> CPU4     4   1:39  99.12% /usr/local/sbin/pkg-static create -f tzst -r /wrkdirs/usr/ports/devel/cmake-core/work/stage
> 
> while other processes sit in the likes of:
> 
> tx->tx
> zcq->z
> zcw->z
> zilog-
> select
> wait
> 
> But sometimes there is no CPU bound process and the top CPU process is
> the likes of:
> 
> 1.24% [usb{usbus0}]
> 
> "gstat -spod" basically  shows da0 dedicated to write activity most
> of the time.
> 
> After: sysctl kern.tty_info_kstacks=1
> Then using ^T at various times, I see a lot of:
> 
> load: 0.48  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7534.91r 11.06u 22.66s 0% 3800k
> #0 0xffff0000004fd564 at mi_switch+0x104
> #1 0xffff000000463f40 at _cv_wait+0x120
> #2 0xffff00000153fa34 at txg_wait_open+0xf4
> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
> #4 0xffff000001448254 at zfs_rmnode+0x64
> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
> #10 0xffff0000005fc430 at vinactivef+0x180
> #11 0xffff0000005fba50 at vput_final+0x200
> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
> #14 0xffff0000008f8514 at do_el0_sync+0x594
> #15 0xffff0000008d4904 at handle_el0_sync+0x40
> 
> load: 0.34  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7566.69r 11.06u 22.66s 0% 3800k
> #0 0xffff0000004fd564 at mi_switch+0x104
> #1 0xffff000000463f40 at _cv_wait+0x120
> #2 0xffff00000153fa34 at txg_wait_open+0xf4
> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
> #4 0xffff000001448254 at zfs_rmnode+0x64
> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
> #7 0xffff0000005fd6c0 at vgonel+0x450
> #8 0xffff0000005fde7c at vrecycle+0x9c
> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
> #10 0xffff0000005fc430 at vinactivef+0x180
> #11 0xffff0000005fba50 at vput_final+0x200
> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
> #14 0xffff0000008f8514 at do_el0_sync+0x594
> #15 0xffff0000008d4904 at handle_el0_sync+0x40
> 
> load: 0.44  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7693.52r 11.24u 23.08s 0% 3800k
> #0 0xffff0000004fd564 at mi_switch+0x104
> #1 0xffff000000463f40 at _cv_wait+0x120
> #2 0xffff00000153fa34 at txg_wait_open+0xf4
> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
> #4 0xffff000001448254 at zfs_rmnode+0x64
> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
> #7 0xffff0000005fd6c0 at vgonel+0x450
> #8 0xffff0000005fde7c at vrecycle+0x9c
> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
> #10 0xffff0000005fc430 at vinactivef+0x180
> #11 0xffff0000005fba50 at vput_final+0x200
> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
> #14 0xffff0000008f8514 at do_el0_sync+0x594
> #15 0xffff0000008d4904 at handle_el0_sync+0x40
> 
> 
> The system (Windows Dev Kit 2023) has 32 GiBytes of RAM. Example
> output from a top modified to show some "Max[imum]Obs[erved]"
> information:
> 
> last pid: 17198;  load averages:   0.33,   0.58,   1.06 MaxObs:  15.49,   8.73,   5.75                                                                                          up 0+20:48:10  19:14:49
> 426 threads:   9 running, 394 sleeping, 1 stopped, 22 waiting, 50 MaxObsRunning
> CPU:  0.0% user,  0.0% nice,  0.2% system,  0.1% interrupt, 99.7% idle
> Mem: 282760Ki Active, 7716Mi Inact, 23192Ki Laundry, 22444Mi Wired, 2780Ki Buf, 848840Ki Free, 2278Mi MaxObsActive, 22444Mi MaxObsWired, 22752Mi MaxObs(Act+Wir+Lndry)
> ARC: 11359Mi Total, 3375Mi MFU, 5900Mi MRU, 993Mi Anon, 93076Ki Header, 992Mi Other
>     8276Mi Compressed, 19727Mi Uncompressed, 2.38:1 Ratio
> Swap: 120832Mi Total, 120832Mi Free, 2301Mi MaxObs(Act+Lndry+SwapUsed), 22752Mi MaxObs(Act+Wir+Lndry+SwapUsed)
> 
> 
> The poudriere bulk has 8 builders but has ALLOW_MAKE_JOBS=yes
> without any explicit settings for the likes of MAKE_JOBS_NUMBER .
> So it is a configuration that allows a high load average compared
> to the number of hardware threads (here: cores) in the system.
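[For reference, that configuration is roughly the following under the
usual poudriere file layout; the exact contents shown here are an
illustration, not a paste from my setup:

# /usr/local/etc/poudriere.conf
PARALLEL_JOBS=8         # 8 builders

# /usr/local/etc/poudriere.d/make.conf
ALLOW_MAKE_JOBS=yes     # each builder may itself run parallel make jobs
# MAKE_JOBS_NUMBER deliberately left unset, so each builder can use up
# to the hardware thread count, allowing load averages well above the
# core count.]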
> 
> 
> I've rebooted to do a test with filemon not loaded at the time
> (here it was loaded from prior buildworld buildkernel activity).
> We will see if it still ends up with such problems.

It still ends up with things waiting, but the detailed STATE
values generally listed are somewhat different.

I also tried a chroot into a world built with -mcpu=cortex-a72
and got similar results, suggesting that just the
-mcpu=cortex-a78c+flagm+nofp16fml kernel is sufficient to see the
issues. This even produced some examples like:

load: 1.20  cmd: sh 95560 [tx->tx_quiesce_done_cv] 2022.55r 3.21u 9.24s 0% 3852k
#0 0xffff0000004fd564 at mi_switch+0x104
#1 0xffff000000463f40 at _cv_wait+0x120
#2 0xffff000001518a34 at txg_wait_open+0xf4
#3 0xffff00000147d0bc at dmu_free_long_range+0x17c
#4 0xffff000001421254 at zfs_rmnode+0x64
#5 0xffff00000142e7c4 at zfs_freebsd_reclaim+0x34
#6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
#7 0xffff0000005fd6c0 at vgonel+0x450
#8 0xffff0000005fde7c at vrecycle+0x9c
#9 0xffff00000142e678 at zfs_freebsd_inactive+0x48
#10 0xffff0000005fc430 at vinactivef+0x180
#11 0xffff0000005fba50 at vput_final+0x200
#12 0xffff00015d8ceab4 at null_reclaim+0x154
#13 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
#14 0xffff0000005fd6c0 at vgonel+0x450
#15 0xffff0000005fde7c at vrecycle+0x9c
#16 0xffff00015d8ce8e8 at null_inactive+0x38
#17 0xffff0000005fc430 at vinactivef+0x180

(The chroot use involves null mounts.)
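For concreteness, the null-mount/chroot setup is along these lines
(the paths are illustrative, not the exact ones used):

mkdir -p /mnt/a72-world/usr/ports /mnt/a72-world/dev
mount_nullfs /usr/ports /mnt/a72-world/usr/ports
mount -t devfs devfs /mnt/a72-world/dev
chroot /mnt/a72-world /bin/sh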

That backtrace is sort of analogous to the filemon-related
backtraces I showed earlier. The common part across the
examples looks to be frames #0..#11:

#0 0xffff0000004fd564 at mi_switch+0x104
#1 0xffff000000463f40 at _cv_wait+0x120
#2 0xffff00000153fa34 at txg_wait_open+0xf4
#3 0xffff0000014a40bc at dmu_free_long_range+0x17c
#4 0xffff000001448254 at zfs_rmnode+0x64
#5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
#6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
#7 0xffff0000005fd6c0 at vgonel+0x450
#8 0xffff0000005fde7c at vrecycle+0x9c
#9 0xffff000001455678 at zfs_freebsd_inactive+0x48
#10 0xffff0000005fc430 at vinactivef+0x180
#11 0xffff0000005fba50 at vput_final+0x200

There were a lot more nanslp examples in all
the later testing (i.e., the runs that avoided having
filemon.ko loaded).

Starting from having pkgclean -A'd the ports, the
experiments had built about the same number of ports
by the end of the 1st hour.



UFS vs. ZFS? Different media types? . . .

So I decided to create and try a UFS test context
instead of a ZFS one. But the media that was best
suited for setting this up was a U.2 960GB Optane in
a USB3 adapter, something that performs noticeably
better than my normal USB3 NVMe drives, even with
USB involved.

This combination maintained reasonable load averages
(instead of having long periods of <1) and finished
building 172 ports in the 1st hour, far more than the
roughly 83 each time I tried the other device/ZFS
combination. There was NO evidence of the earlier
reported oddities.
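For reference, creating the UFS test context amounted to something
like the following; the device name, label, and mount point are
illustrative rather than the exact ones used:

gpart create -s gpt da1
gpart add -t freebsd-ufs -l optane960 da1
newfs -U /dev/gpt/optane960
mount /dev/gpt/optane960 /mnt/ufs-test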

I should also time a from-scratch buildworld
buildkernel.

I'll look into setting up another U.2 960GB Optane
for use in the USB3 adapter, but with ZFS. That
should help isolate media vs. file system
contributions to the varying behaviors.
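The ZFS counterpart would be prepared with something like this
(pool name and device are again illustrative):

zpool create optane960zfs da2
zfs create optane960zfs/poudriere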

===
Mark Millard
marklmi at yahoo.com