Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Mon, 01 May 2023 00:44:10 UTC
Can you redo the ZFS test with:
sysctl vfs.zfs.per_txg_dirty_frees_percent=5
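
(For reference, a minimal sketch of applying and verifying that, assuming a
root shell:)

# show the current value first
sysctl vfs.zfs.per_txg_dirty_frees_percent
# lower it for the retest; the change takes effect immediately
sysctl vfs.zfs.per_txg_dirty_frees_percent=5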

On 5/1/23, Mark Millard <marklmi@yahoo.com> wrote:
> As the evidence this time is largely independent of the details
> reported previously, I'm top posting this.
>
> The previous ZFS on USB3 results were based on poudriere using
> "USE_TMPFS=data", meaning that almost all file I/O was via ZFS
> to the USB3 media.
>
> The UFS on U2 960GB Optane via USB3 adapter results did not
> suffer the reported problems, despite "USE_TMPFS=data". (I let
> it run to completion.) But this had both a media and a file
> system difference.
>
> This time the results are for just changing poudriere to use
> "USE_TMPFS=all" instead but back on the original ZFS on
> USB3 media. The implication is that the vast majority of the
> file I/O is not via ZFS to the USB3 media.
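>
> (For reference, a sketch of the poudriere.conf settings being
> compared here; the path is the usual default and an assumption
> about this setup:)
>
> # /usr/local/etc/poudriere.conf
> # Only poudriere's data dir (logs, etc.) on tmpfs; wrkdirs and
> # localbase still hit ZFS on the USB3 media:
> #USE_TMPFS=data
> # wrkdir, data, and localbase all on tmpfs, so most build I/O
> # stays off ZFS:
> USE_TMPFS=all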
>
> The context has 32 GiBytes of RAM and about 118 GiBytes of
> swap/paging space. It would need to page if let run to
> completion.
>
> Observing it, the load average is behaving normally: things are
> not stuck waiting. "gstat -spod" indicates the ZFS I/O is
> not sustained (and there is no paging to swap space yet).
>
> First 1 hr: 262 ports built.
>
> But this had both a media and a file system difference again.
>
> I'm stopping after this, in order to set up the next just-
> ZFS experiments.
>
>
> Next experiments:
>
> I expect to move the ZFS context to U2 960GB Optane media
> used with the USB3 adapter and to test both "USE_TMPFS=data"
> and "USE_TMPSF=all", probably letting USE_TMPFS=all run to
> completion.
>
> If the Optane-based USE_TMPFS=data context still has the problem,
> then even the more performant media was not enough to avoid what
> would then appear to be a ZFS problem, given that two other
> file systems have not shown the problem.
>
> The Optane-based USE_TMPFS=all context should just handle the
> paging and the rarer ZFS I/O more quickly. I do not expect problems
> for this combination, given the UFS on Optane USB3 results
> and the partial USE_TMPFS=all non-Optane USB3 results. Even
> with ZFS working, this likely is the more performant type of
> context for the Windows Dev Kit 2023, given that I'm leaving
> Windows 11 Pro in place on the internal media.
>
>
> Hypothesis for the original problem:
>
> I wonder if ZFS write activity to the USB3 media was largely
> blocking the ZFS read activity to the same media, causing
> lots of things to have to spend much time waiting for data
> instead of making progress, leading to long periods of low
> load averages.
>
>
> Older material:
>
> On Apr 30, 2023, at 00:50, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Apr 29, 2023, at 19:44, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> This is based on: main-n262658-b347c2284603-dirty, b347c2284603
>>> being from late Apr 28, 2023 UTC. (The "-dirty" is from some
>>> historical patches that I use.) The build is a non-debug build
>>> (but with symbols not stripped). World or Kernel had been built
>>> using:
>>>
>>> -mcpu=cortex-a78c+flagm+nofp16fml
>>>
>>> just for testing purposes. (Worked nicely for -j8 buildworld
>>> buildkernel testing for the 4 cortex-a78c's plus 4 cortex-x1c's
>>> present.)
>>>
>>> Monitoring poudriere bulk related activity via top and gstat -spod,
>>> I see a lot of the odd result of one process doing something
>>> like:
>>>
>>> CPU4     4   1:39  99.12% /usr/local/sbin/pkg-static create -f tzst -r
>>> /wrkdirs/usr/ports/devel/cmake-core/work/stage
>>>
>>> while other processes sit in the likes of:
>>>
>>> tx->tx
>>> zcq->z
>>> zcw->z
>>> zilog-
>>> select
>>> wait
>>>
>>> But sometimes there is no CPU-bound process, and the top CPU process is
>>> the likes of:
>>>
>>> 1.24% [usb{usbus0}]
>>>
>>> "gstat -spod" basically  shows da0 dedicated to write activity most
>>> of the time.
>>>
>>> After: sysctl kern.tty_info_kstacks=1
>>> Then using ^T at various times, I see a lot of:
>>>
>>> load: 0.48  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7534.91r 11.06u 22.66s
>>> 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>> load: 0.34  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7566.69r 11.06u 22.66s
>>> 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>>> #7 0xffff0000005fd6c0 at vgonel+0x450
>>> #8 0xffff0000005fde7c at vrecycle+0x9c
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>> load: 0.44  cmd: sh 93914 [tx->tx_quiesce_done_cv] 7693.52r 11.24u 23.08s
>>> 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>>> #7 0xffff0000005fd6c0 at vgonel+0x450
>>> #8 0xffff0000005fde7c at vrecycle+0x9c
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
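>>>
>>> (For completeness, a minimal sketch of how these were captured;
>>> the pid is just the example one from above, and procstat is an
>>> alternative way to get the same kind of kernel stack:)
>>>
>>> sysctl kern.tty_info_kstacks=1   # make the ^T status line include a kernel stack
>>> # press ^T in the controlling terminal of the stuck process, or dump
>>> # its kernel stack directly:
>>> procstat -kk 93914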
>>>
>>>
>>> The system (Windows Dev Kit 2023) has 32 GiBytes of RAM. Example
>>> output from a top modified to show some "Max[imum]Obs[erved]"
>>> information:
>>>
>>> last pid: 17198;  load averages:   0.33,   0.58,   1.06 MaxObs:  15.49,
>>> 8.73,   5.75
>>>                             up 0+20:48:10  19:14:49
>>> 426 threads:   9 running, 394 sleeping, 1 stopped, 22 waiting, 50
>>> MaxObsRunning
>>> CPU:  0.0% user,  0.0% nice,  0.2% system,  0.1% interrupt, 99.7% idle
>>> Mem: 282760Ki Active, 7716Mi Inact, 23192Ki Laundry, 22444Mi Wired,
>>> 2780Ki Buf, 848840Ki Free, 2278Mi MaxObsActive, 22444Mi MaxObsWired,
>>> 22752Mi MaxObs(Act+Wir+Lndry)
>>> ARC: 11359Mi Total, 3375Mi MFU, 5900Mi MRU, 993Mi Anon, 93076Ki Header,
>>> 992Mi Other
>>>    8276Mi Compressed, 19727Mi Uncompressed, 2.38:1 Ratio
>>> Swap: 120832Mi Total, 120832Mi Free, 2301Mi MaxObs(Act+Lndry+SwapUsed),
>>> 22752Mi MaxObs(Act+Wir+Lndry+SwapUsed)
>>>
>>>
>>> The poudriere bulk has 8 builders but has ALLOW_MAKE_JOBS=yes
>>> without any explicit settings for the likes of MAKE_JOBS_NUMBER .
>>> So it is a configuration that allows a high load average compared
>>> to the number of hardware threads (here: cores) in the system.
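>>>
>>> (As a sketch, the relevant bits of the builders' make.conf; the
>>> poudriere.d path is the usual location and an assumption here:)
>>>
>>> # /usr/local/etc/poudriere.d/make.conf
>>> ALLOW_MAKE_JOBS=yes
>>> # MAKE_JOBS_NUMBER is deliberately left unset, so each of the 8
>>> # builders can use up to the full hardware-thread count of make
>>> # jobs, letting the load average climb well above 8.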
>>>
>>>
>>> I've rebooted to do a test with filemon not loaded at the time
>>> (here it was loaded from prior buildworld buildkernel activity).
>>> We will see if it still ends up with such problems.
>>
>> It still ends up with things waiting, but the detailed STATE
>> values generally listed are somewhat different.
>>
>> I also tried a chroot into a world built with -mcpu=cortex-a72
>> and got similar results, suggesting that only the
>> -mcpu=cortex-a78c+flagm+nofp16fml kernel is required to see the
>> issues. This even produced some examples like:
>>
>> load: 1.20  cmd: sh 95560 [tx->tx_quiesce_done_cv] 2022.55r 3.21u 9.24s 0%
>> 3852k
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff000001518a34 at txg_wait_open+0xf4
>> #3 0xffff00000147d0bc at dmu_free_long_range+0x17c
>> #4 0xffff000001421254 at zfs_rmnode+0x64
>> #5 0xffff00000142e7c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff00000142e678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>> #12 0xffff00015d8ceab4 at null_reclaim+0x154
>> #13 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #14 0xffff0000005fd6c0 at vgonel+0x450
>> #15 0xffff0000005fde7c at vrecycle+0x9c
>> #16 0xffff00015d8ce8e8 at null_inactive+0x38
>> #17 0xffff0000005fc430 at vinactivef+0x180
>>
>> (The chroot use involves null mounts.)
>>
>> which is sort of analogous to the filemon related
>> backtraces I showed earlier. The common part
>> across the examples looks to be #0..#11:
>>
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>> #4 0xffff000001448254 at zfs_rmnode+0x64
>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>>
>> There were a lot more nanslp examples in all
>> the later testing (i.e., the tests that avoided having
>> filemon.ko loaded).
>>
>> Starting from having pkgclean -A'd the ports, the
>> experiments got about the same number of ports built
>> as of the end of the 1st hour.
>>
>>
>>
>> UFS vs. ZFS? Different media types? . . .
>>
>> So I decided to create and try a UFS test context
>> instead of a ZFS one. But the media that was best
>> to update was a U2 960GB Optane in a USB3
>> adapter, something that would perform noticeably
>> better than my normal USB3 NVMe drives, even with
>> USB involved.
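>>
>> (A rough sketch of that kind of UFS setup; the device name,
>> label, and mount point here are all assumptions:)
>>
>> gpart create -s gpt da1
>> gpart add -t freebsd-ufs -l optane-ufs da1
>> newfs -U /dev/gpt/optane-ufs
>> mount /dev/gpt/optane-ufs /usr/local/poudriere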
>>
>> This combination maintained reasonable load averages
>> (instead of having long periods of <1) and finished
>> building 172 ports in the 1st hour, far more than the
>> roughly 83 each time I tried the other device/ZFS
>> combination. No evidence of the earlier reported
>> oddities.
>>
>> I should also time a from-scratch buildworld
>> buildkernel.
>>
>> I'll look into setting up another U2 960GB Optane
>> for use in the USB3 adapter, but with ZFS. That
>> should help isolate media vs. file system
>> contributions to the varying behaviors.
>
>
>
> ===
> Mark Millard
> marklmi at yahoo.com
>
>
>


-- 
Mateusz Guzik <mjguzik gmail.com>