Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 05 Sep 2023 07:00:51 UTC
On Sep 4, 2023, at 22:06, Mark Millard <marklmi@yahoo.com> wrote:

> On Sep 4, 2023, at 18:39, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On Sep 4, 2023, at 10:05, Alexander Motin <mav@FreeBSD.org> wrote:
>> 
>>> On 04.09.2023 11:45, Mark Millard wrote:
>>>> On Sep 4, 2023, at 06:09, Alexander Motin <mav@FreeBSD.org> wrote:
>>>>> per_txg_dirty_frees_percent is directly related to the delete delays we see here.  You are forcing ZFS to commit transactions each 5% of dirty ARC limit, which is 5% of 10% of memory size.  I haven't looked at that code recently, but I guess setting it too low can make ZFS commit transactions too often, increasing write inflation for the underlying storage.  I would propose you restore the default and try again.
>>>> While this machine is different, the original problem was worse than
>>>> the issue here: the load average was less than 1 for most of the
>>>> parallel bulk build when 30 was used. The fraction of time waiting
>>>> was much longer than with 5. If I understand right, both too high and
>>>> too low for a type of context can lead to increased elapsed time, and
>>>> getting it set to a near-optimal value is a non-obvious exploration.
>>> 
>>> IIRC this limit was modified several times since originally implemented.  Maybe it could benefit from another look, if the default 30% is not good.  It would be good if generic ZFS issues like this were reported to OpenZFS upstream to be visible to a wider public.  Unfortunately I have several other projects I must work on, so if it is not a regression I can't promise I'll take it right now, so anybody else is welcome.
>> 
>> As I understand, there are contexts where 5 is inappropriate
>> and 30 works fairly well: no good single answer as to what
>> value range will avoid problems.
>> 
>>>> An overall point for the goal of my activity is: what makes a
>>>> good test context for checking if ZFS is again safe to use?
>>>> Maybe other tradeoffs make, say, 4 hardware threads more
>>>> reasonable than 32.
>>> 
>>> Thank you for your testing.  The best test is one that nobody else runs. It also correlates with the topic of "safe to use", which also depends on what it is used for. :)
>> 
>> Looks like use of an M.2 Samsung SSD 960 PRO 1TB with a
>> non-debug FreeBSD build is suitable for the bulk -a -J128
>> test (no ALLOW_MAKE_JOBS variants enabled, USE_TMPFS=no in
>> use) on the 32 hardware thread system. (The swap partition
>> in use is from the normal environment's PCIe Optane media.)
>> The %idle and the load averages and %user stayed reasonable
>> in a preliminary test. One thing it does introduce is trim
>> management (both available and potentially useful). (Optane
>> media does not support or need it.) No
>> per_txg_dirty_frees_percent adjustment involved (still 5).
>> 
>> I've learned to not use ^T for fear of /bin/sh aborting
>> and messing up poudriere's context. So I now monitor with:
>> 
>> # poudriere status -b
>> 
>> in a separate ssh session.
>> 
>> I'll note that I doubt I'd try for a complete bulk -a .
>> I'd probably stop it if I notice that the number of
>> active builders drops off for a notable time (normal
>> waiting for prerequisites appearing to be why).
>> 
>> 
> 
> So much for that idea. It has reached a state of staying
> under 3500 w/s and up to 4.5ms/w (normally above 2ms/w).
> %busy wandering in the range 85% to 101%. Lots of processes
> in top showing tx->tx. There is some read and other activity as
> well. Of course the kBps figures are larger than the
> earlier USB3 context (larger kB figures).
> 
> It reached about 1350 port->package builds over the first
> hour after the 2nd "Builder started".
> 
> autotrim is off. Doing a "zpool trim -w zamd64" leads to
> . . . larger w/s figures during the process. So
> more exploring to do at some point. Possibly:
> 
> autotrim
> per_txg_dirty_frees_percent
> 
> For now, I'm just running "zpool trim -w zamd64" once
> in a while so the process continues better.
> 
> Still no evidence of deadlocks. No evidence of builds
> failing for corruptions.
> 
> . . . At around the end of the 2nd hour: 2920 or so built.
> 
> Still no evidence of deadlocks. No evidence of builds
> failing for corruptions.
> 
> . . . I've turned on autotrim without stopping the bulk
> build.
> 
> . . . At around the end of the 3rd hour: 4080 or so built.
> 
> Still no evidence of deadlocks. No evidence of builds
> failing for corruptions.
> 
> Looks like the % idle has been high for a significant
> time. I think I'll stop this specific test and clean
> things out.
> 
> Looks like lang/guile* are examples of not respecting
> the lack of ALLOW_MAKE_JOBS use.
> 
> Hmm. The ^C ended up with:
> 
> ^C[03:41:07] Error: Signal SIGINT caught, cleaning up and exiting
> [main-amd64-bulk_a-default] [2023-09-04_17h49m28s] [sigint:] Queued: 34588 Built: 4331  Failed: 12    Skipped: 303   Ignored: 335   Fetched: 0     Tobuild: 29607  Time: 03:41:05
> [03:41:07] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-04_17h49m28s
> [03:41:19] [95] [00:06:59] Finished databases/php80-pdo_sqlite | php80-pdo_sqlite-8.0.30: Failed: build-depends
> [03:41:20] [60] [00:10:53] Finished www/php82-session | php82-session-8.2.9: Success
> [03:41:20] [19] [00:06:42] Finished databases/pecl-memcache@php80 | php80-pecl-memcache-8.2: Failed: build-depends
> [03:41:23] [62] [00:06:40] Finished net-mgmt/icingaweb2-module-incubator@php81 | icingaweb2-module-incubator-php81-0.20.0: Success
> [03:41:24] Cleaning up
> exit: cannot open ./var/run/116_nohang.pid: No such file or directory
> exit: cannot open ./var/run/93_nohang.pid: No such file or directory
> 
> and JID 253's processes are stuck in the STOP state, along with the
> related JID 0 processes. These are for x11-toolkits/libXaw . I do
> not know how long it might have been stuck.
> 
> /usr/local/libexec/poudriere/sh -e /usr/local/share/poudriere/bulk.sh -J128 -jmain-amd64-bulk_a -a
> 
> is stuck in killpg state:
> 
> 3896 101784 sh                  -                   mi_switch+0xbb _sx_xlock_hard+0x3e1 killpg1+0x193 kern_kill+0x23c amd64_syscall+0x109 fast_syscall_common+0xf8 
> 
> Most recent process birth time listed-first order for the STOP's:
> 
> 86826 634820 sh                  -                   mi_switch+0xbb thread_suspend_switch+0xc3 cursig+0x7b9 ast_sig+0x1df ast_handler+0x88 ast+0x20 doreti_ast+0x1c 
> 86825 321240 sh                  -                   mi_switch+0xbb thread_suspend_check+0x260 sig_intr+0x4c fork1+0x186 sys_fork+0x54 amd64_syscall+0x109 fast_syscall_common+0xf8 
> 86550 308525 sh                  -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 pipe_read+0x420 dofileread+0x86 sys_read+0xbe amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3998 637224 gmake               -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3976 298566 sh                  -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3974 309336 gmake               -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3962 577230 gmake               -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3958 635598 sh                  -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3915 668148 make                -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf8 
> 3912 638108 sh                  -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_wait_sig+0x9 _sleep+0x1e3 kern_wait6+0x51c sys_wait4+0x7d amd64_syscall+0x109 fast_syscall_common+0xf
> 46142 212631 sh                  -                   mi_switch+0xbb sleepq_catch_signals+0x2ab sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x10c seltdwait+0x75 kern_select+0x95e sys_select+0x57 amd64_syscall+0x109 fast_syscall_common+0xf8 

So I tried 30 for per_txg_dirty_frees_percent for 2 contexts:
autotrim on
vs.
autotrim off

where there was 100 GiByte+ to delete after a poudriere
bulk run.
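
(For reference, a sketch of the knobs involved, assuming the
stock OpenZFS sysctl name and my pool name zamd64; the command
sequence here is illustrative rather than a transcript:

# sysctl vfs.zfs.per_txg_dirty_frees_percent=30
# zpool set autotrim=on zamd64    # or: autotrim=off for the other context

Both take effect on a live system, so no reboot or pool
re-import is needed between the two contexts.)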

autotrim on: takes a fair time to delete even 1 GiByte of the 100+ GiByte
vs.
autotrim off: takes less time to delete more.

The difference is very visible via "gstat -spod" use.
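
(In case anyone wants to watch the same thing: as I understand
the gstat flags, -d adds the delete (BIO_DELETE) columns, -o
adds other operations (BIO_FLUSH), -p limits the display to
physical providers, and -s adds size statistics, so:

# gstat -spod

shows the trim/free traffic separately from the normal read
and write traffic.)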

autotrim on likely makes things less concurrent, somewhat
like FreeBSD's handling of USB3 storage, where only one
command goes to the device at a time. autotrim on seems to
prevent the larger unit of work (30 instead of 5) from being
an effective way to decrease the time required, and possibly
increases the time required instead.
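
(One can also check whether a pool is actively trimming, and
roughly how far along it is, via the trim status display, e.g.:

# zpool status -t zamd64

which reports per-vdev trim state, if I remember the output
correctly.)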

That may be an example of the context dependency of what
value of per_txg_dirty_frees_percent to use in order to
avoid wasting much time.


===
Mark Millard
marklmi at yahoo.com