Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 01 May 2023 00:33:36 UTC
As the evidence this time is largely independent of the details
reported previously, I'm top posting this.

The previous ZFS on USB3 results were based on poudriere using
"USE_TMPFS=data", meaning that almost all file I/O was via ZFS
to the USB3 media.

The UFS on U2 960GB Optane via USB3 adapter results did not
suffer the reported problems, despite "USE_TMPFS=data". (I let
it run to completion.) But that test differed from the original
setup in both the media and the file system.

This time the results are for changing only the poudriere
setting, to "USE_TMPFS=all", while going back to the original
ZFS on USB3 media. The implication is that the vast majority
of the file I/O no longer goes via ZFS to the USB3 media.
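
For reference, the poudriere.conf difference between the two
setups is just this one knob; a rough sketch, based on my
reading of the sample poudriere.conf comments:

# In /usr/local/etc/poudriere.conf (relevant line only):
# earlier runs: tmpfs only for poudriere's temporary build data, so
# the port work directories still did their file I/O via ZFS on USB3:
#USE_TMPFS=data
# this run: the whole build (wrkdir, data, builder jails) in tmpfs:
USE_TMPFS=all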

The context has 32 GiBytes of RAM and about 118 GiBytes of
swap/paging space. It would need to page if let run to
completion.

Observing it, the load average is behaving normally: things are
not stuck waiting. "gstat -spod" indicates the ZFS I/O is
not sustained (no swap space is in use yet, so no paging I/O).
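
(The observation setup is the same as in the earlier notes; for
completeness, roughly:)

# per-provider I/O activity, including delete and "other" (flush) ops:
gstat -spod
# let ^T (SIGINFO) also print the kernel stack of the blocked process:
sysctl kern.tty_info_kstacks=1
# plus the locally modified top for the MaxObs* figures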

First 1 hr: 262 ports built.

But, relative to the UFS on Optane run, this again had both a
media and a file system difference.

I'm stopping after this, in order to set up the next just-
ZFS experiments.


Next experiments:

I expect to move the ZFS context to U2 960GB Optane media
used with the USB3 adapter and to test both "USE_TMPFS=data"
and "USE_TMPSF=all", probably letting USE_TMPFS=all run to
completion.

If the Optane-based USE_TMPFS=data context still has the
problem, then even the more performant media would not have
been enough to avoid what would then appear to be a ZFS
problem, given that two other file systems do not show the
problem.

The Optane-based USE_TMPFS=all context should just handle the
paging and the rarer ZFS I/O more quickly. I do not expect
problems for this combination, given the UFS on Optane USB3
results and the partial USE_TMPFS=all non-Optane USB3 results.
Even with ZFS working, this is likely the more performant type
of context for the Windows Dev Kit 2023, given that I'm leaving
Windows 11 Pro in place on the internal media.


Hypothesis for the original problem:

I wonder if ZFS write activity to the USB3 media was largely
blocking the ZFS read activity to the same media, causing
lots of things to have to spend much time waiting for data
instead of making progress, leading to long periods of low
load averages.
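
If I get back to poking at this on the ZFS side, the first
things I'd look at are the OpenZFS txg/write-throttle knobs.
A sketch of the inspection only (sysctl names as I believe
OpenZFS exposes them on FreeBSD; worth verifying against
"sysctl -a" before relying on them):

# how often a txg is forced to sync (seconds):
sysctl vfs.zfs.txg.timeout
# dirty-data limit that drives write throttling (bytes):
sysctl vfs.zfs.dirty_data_max
# per-vdev queue depths, including the async write min/max active values:
sysctl vfs.zfs.vdev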


Older material:

On Apr 30, 2023, at 00:50, Mark Millard <marklmi@yahoo.com> wrote:

> On Apr 29, 2023, at 19:44, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> This is based on: main-n262658-b347c2284603-dirty, b347c2284603
>> being from late Apr 28, 2023 UTC. (The "-dirty" is from some
>> historical patches that I use.) The build is a non-debug build
>> (but with symbols not stripped). World or Kernel had been built
>> using:
>> 
>> -mcpu=cortex-a78C+flagm+nofp16fml
>> 
>> just for testing purposes. (Worked nicely for -j8 buildworld
>> buildkernel testing for the 4 cortex-a78c's plus 4 cortex-x1c's
>> present.)
>> 
>> Monitoring poudriere bulk related activity via top and gstat -spod
>> I see a lot of the odd result of one process doing something
>> like:
>> 
>> CPU4     4   1:39  99.12% /usr/local/sbin/pkg-static create -f tzst -r /wrkdirs/usr/ports/devel/cmake-core/work/stage
>> 
>> while other processes sit in the likes of:
>> 
>> tx->tx
>> zcq->z
>> zcw->z
>> zilog-
>> select
>> wait
>> 
>> But sometimes there is no CPU bound process and the top CPU process is
>> the likes of:
>> 
>> 1.24% [usb{usbus0}]
>> 
>> "gstat -spod" basically  shows da0 dedicated to write activity most
>> of the time.
>> 
>> After: sysctl kern.tty_info_kstacks=1
>> Then using ^T at various times, I see a lot of:
>> 
>> load: 0.48  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7534.91r 11.06u 22.66s 0% 3800k
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>> #4 0xffff000001448254 at zfs_rmnode+0x64
>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>> 
>> load: 0.34  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7566.69r 11.06u 22.66s 0% 3800k
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>> #4 0xffff000001448254 at zfs_rmnode+0x64
>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>> 
>> load: 0.44  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7693.52r 11.24u 23.08s 0% 3800k
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>> #4 0xffff000001448254 at zfs_rmnode+0x64
>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>> 
>> 
>> The system (Windows Dev Kit 2023) has 32 GiBytes of RAM. Example
>> output from a top modified to show some "Max[imum]Obs[erved]"
>> information:
>> 
>> last pid: 17198;  load averages:   0.33,   0.58,   1.06 MaxObs:  15.49,   8.73,   5.75                                                                                          up 0+20:48:10  19:14:49
>> 426 threads:   9 running, 394 sleeping, 1 stopped, 22 waiting, 50 MaxObsRunning
>> CPU:  0.0% user,  0.0% nice,  0.2% system,  0.1% interrupt, 99.7% idle
>> Mem: 282760Ki Active, 7716Mi Inact, 23192Ki Laundry, 22444Mi Wired, 2780Ki Buf, 848840Ki Free, 2278Mi MaxObsActive, 22444Mi MaxObsWired, 22752Mi MaxObs(Act+Wir+Lndry)
>> ARC: 11359Mi Total, 3375Mi MFU, 5900Mi MRU, 993Mi Anon, 93076Ki Header, 992Mi Other
>>    8276Mi Compressed, 19727Mi Uncompressed, 2.38:1 Ratio
>> Swap: 120832Mi Total, 120832Mi Free, 2301Mi MaxObs(Act+Lndry+SwapUsed), 22752Mi MaxObs(Act+Wir+Lndry+SwapUsed)
>> 
>> 
>> The poudriere bulk has 8 builders but has ALLOW_MAKE_JOBS=yes
>> without any explicit settings for the likes of MAKE_JOBS_NUMBER.
>> So it is a configuration that allows a high load average compared
>> to the number of hardware threads (here: cores) in the system.
>> 
>> 
>> I've rebooted to do a test with filemon not loaded at the time
>> (here it was loaded from prior buildworld buildkernel activity).
>> We will see if it still ends up with such problems.
> 
> It still ends up with things waiting, but the detailed STATE
> values generally listed are somewhat different.
> 
> I also tried a chroot into a world built with -mcpu=cortex-a72
> and got similar results, suggesting that only the
> -mcpu=cortex-a78c+flagm+nofp16fml kernel is required to see the
> issues. This even got some examples like:
> 
> load: 1.20  cmd: sh 95560 [tx->tx_quiesce_done_cv] 2022.55r 3.21u 9.24s 0% 3852k
> #0 0xffff0000004fd564 at mi_switch+0x104
> #1 0xffff000000463f40 at _cv_wait+0x120
> #2 0xffff000001518a34 at txg_wait_open+0xf4
> #3 0xffff00000147d0bc at dmu_free_long_range+0x17c
> #4 0xffff000001421254 at zfs_rmnode+0x64
> #5 0xffff00000142e7c4 at zfs_freebsd_reclaim+0x34
> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
> #7 0xffff0000005fd6c0 at vgonel+0x450
> #8 0xffff0000005fde7c at vrecycle+0x9c
> #9 0xffff00000142e678 at zfs_freebsd_inactive+0x48
> #10 0xffff0000005fc430 at vinactivef+0x180
> #11 0xffff0000005fba50 at vput_final+0x200
> #12 0xffff00015d8ceab4 at null_reclaim+0x154
> #13 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
> #14 0xffff0000005fd6c0 at vgonel+0x450
> #15 0xffff0000005fde7c at vrecycle+0x9c
> #16 0xffff00015d8ce8e8 at null_inactive+0x38
> #17 0xffff0000005fc430 at vinactivef+0x180
> 
> (The chroot use involves null mounts.)
> 
> which is sort of analogous to the filemon-related
> backtraces I showed earlier. The common part
> across the examples looks to be #0..#11:
> 
> #0 0xffff0000004fd564 at mi_switch+0x104
> #1 0xffff000000463f40 at _cv_wait+0x120
> #2 0xffff00000153fa34 at txg_wait_open+0xf4
> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
> #4 0xffff000001448254 at zfs_rmnode+0x64
> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
> #7 0xffff0000005fd6c0 at vgonel+0x450
> #8 0xffff0000005fde7c at vrecycle+0x9c
> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
> #10 0xffff0000005fc430 at vinactivef+0x180
> #11 0xffff0000005fba50 at vput_final+0x200
> 
> There were a lot more nanslp examples in all
> the later testing (i.e., the runs that avoided having
> filemon.ko loaded).
> 
> Starting from having pkgclean -A'd the ports, the
> experiments got about the same number of ports built
> as of the end of the 1st hour.
> 
> 
> 
> UFS vs. ZFS? Different media types? . . .
> 
> So I decided to create and try a UFS test context
> instead of a ZFS one. But the media that was best
> to update was a U2 960GB Optane in a USB3
> adapter, something that would perform noticeably
> better than my normal USB3 NVMe drives, even with
> USB involved.
> 
> This combination maintained reasonable load averages
> (instead of having long periods of <1) and finished
> building 172 ports in the 1st hour, far more than the
> roughly 83 each time I tried the other device/ZFS
> combination. No evidence of the earlier reported
> oddities.
> 
> I should also time a from-scratch buildworld
> buildkernel.
> 
> I'll look into setting up another U2 960GB Optane
> for use in the USB3 adapter, but with ZFS. That
> should help isolate media vs. file system
> contributions to the varying behaviors.



===
Mark Millard
marklmi at yahoo.com