An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 04 Sep 2023 02:54:05 UTC
ThreadRipper 1950X (32 hardware threads) doing bulk -J128
with USE_TMPFS=no , no ALLOW_MAKE_JOBS , no
ALLOW_MAKE_JOBS_PACKAGES , USB3 NVMe SSD storage/ZFS-boot-media,
debug system build in use :

[00:03:44] Building 34214 packages using up to 128 builders
[00:03:44] Hit CTRL+t at any time to see build progress and stats
[00:03:44] [01] [00:00:00] Builder starting
[00:04:37] [01] [00:00:53] Builder started
[00:04:37] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.6
[00:05:53] [01] [00:01:16] Finished ports-mgmt/pkg | pkg-1.20.6: Success
[00:06:15] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:06:15] [02] [00:00:00] Builder starting
. . .
[00:06:18] [128] [00:00:00] Builder starting
[00:07:42] [01] [00:01:27] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:07:45] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22_1
[00:18:45] [01] [00:11:00] Finished devel/gettext-runtime | gettext-runtime-0.22_1: Success
[00:19:06] [01] [00:00:00] Building devel/gmake | gmake-4.3_2
[00:24:13] [01] [00:05:07] Finished devel/gmake | gmake-4.3_2: Success
[00:24:39] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22
[00:31:08] [125] [00:24:50] Builder started
[00:31:08] [125] [00:00:00] Building print/t1utils | t1utils-1.32
[00:31:15] [33] [00:25:00] Builder started
[00:31:15] [81] [00:24:59] Builder started
[00:31:15] [33] [00:00:00] Building databases/xapian-core | xapian-core-1.4.23,1
[00:31:15] [13] [00:25:00] Builder started
[00:31:15] [81] [00:00:00] Building devel/bmake | bmake-20230723
[00:31:15] [13] [00:00:00] Building devel/evdev-proto | evdev-proto-5.8
[00:31:16] [41] [00:25:00] Builder started
[00:31:16] [41] [00:00:00] Building devel/pcre | pcre-8.45_3
. . .

(Looks like lang/go120 ignores the lack of ALLOW_MAKE_JOBS .
There may be others that still have signficant parallel
activity.)

[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [parallel_build:] Queued: 34588 Built: 727   Failed: 1     Skipped: 40    Ignored: 335   Fetched: 0     Tobuild: 33485  Time: 01:36:51

(So about 1 hr after the last "Builder starting" it had
built 727.)

The vast majority of the time: lots of cpdup's with tx->tx
showing most of the time for STATE but showing having some
CPU time.

^T commonly showed various Builders in starting PHASE for
3min..6min.

Around 66% mean Idle time (guess from watching top).

After ^C "gstat -spod" reports it is almost always writing
2200 to 2500 writes per second or so for *hours* (still
going on).

ztop reports 1500 to 3200 d/s or so almost always for
Dataset zamd64/poudriere/data/.m instead (also still going
on). Normally no other Dataset is shown.

With all the disk I/O activity, this is definitely "live"
in some sense. But I've no clue if it is just repeating
itself over and over vs. if it making some sort of progress.

For reference for the ^C and after:

^C[01:39:00] [20] [00:00:03] Building sysutils/linux-c7-dosfstools | linux-c7-dosfstools-3.0.20
[01:39:00] [93] [00:07:12] Finished science/dimod | dimod-0.12.11: Success
[01:39:00] Error: Signal SIGINT caught, cleaning up and exiting
[01:39:02] [63] [00:06:34] Finished archivers/unarj | unarj-2.65_2: Success
[01:39:03] [128] [00:07:47] Finished sysutils/shuf | shuf-3.0: Success
[01:39:04] [113] [00:07:06] Finished devel/bsddialog | bsddialog-0.4.1: Success
[main-amd64-bulk_a-default] [2023-09-03_13h48m45s] [sigint:] Queued: 34588 Built: 752   Failed: 1     Skipped: 40    Ignored: 335   Fetched: 0     Tobuild: 33460  Time: 01:38:56
[01:39:06] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-03_13h48m45s
[01:39:14] [12] [00:09:07] Finished archivers/rzip | rzip-2.1_1: Success
[01:39:14] Cleaning up
exit: cannot open ./var/run/49_nohang.pid: No such file or directory
exit: cannot open ./var/run/87_nohang.pid: No such file or directory

After that ^t produced the likes of:

load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
#0 0xffffffff80b6f103 at mi_switch+0x173
#1 0xffffffff80bc0f24 at sleepq_switch+0x104
#2 0xffffffff80aec4c5 at _cv_wait+0x165
#3 0xffffffff82aba365 at txg_wait_open+0xf5
#4 0xffffffff82a11b81 at dmu_free_long_range+0x151
#5 0xffffffff829a87d2 at zfs_rmnode+0x72
#6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
#7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
#8 0xffffffff80c5a7d9 at vgonel+0x3a9
#9 0xffffffff80c5af7f at vrecycle+0x3f
#10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
#11 0xffffffff80c598cf at vinactivef+0xbf
#12 0xffffffff80c590da at vput_final+0x2aa
#13 0xffffffff80c68886 at kern_funlinkat+0x2f6
#14 0xffffffff80c68588 at sys_unlink+0x28
#15 0xffffffff8106323f at amd64_syscall+0x14f
#16 0xffffffff8103512b at fast_syscall_common+0xf8

The console/logs do report "witness exhausted":

. . .
Sep  3 13:41:08 amd64-ZFS login[1751]: ROOT LOGIN (root) ON ttyv0
Sep  3 13:51:35 amd64-ZFS kernel: witness_lock_list_get: witness exhausted
Sep  3 14:26:38 amd64-ZFS kernel: pid 27418 (conftest), jid 245, uid 0: exited on signal 11 (core dumped)
. . .

So it did not take long for the "witness exhausted" to
happen.

# uname -apKU
FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 1500000 #74 main-n265143-525bc87f54f2-dirty: Sun Sep  3 13:35:04 PDT 2023     root@amd64_ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 1500000 1500000


Looks like I'll be forcing the machine to reboot or
to power off. The media was deliberately set up for
doing risky tests. It is not my normal environment.

===
Mark Millard
marklmi at yahoo.com