Re: ZFS deadlock in 14 [USE_TMPFS=no poudriere messed up from the start, lots of "vlruwk"]
Date: Sat, 19 Aug 2023 23:27:26 UTC
On Aug 19, 2023, at 15:41, Mark Millard <marklmi@yahoo.com> wrote:

> On Aug 19, 2023, at 13:41, Mark Millard <marklmi@yahoo.com> wrote:
>
>> [I forgot to adjust USE_TMPFS for the purpose of the test.
>> So I'll later be starting over.]
>>
>> . . .
>
> I finally got around to starting a from-scratch bulk -a
> again (based on USE_TMPFS=no this time). This is with
> 15107.patch and 15122.patch applied. This is a non-debug
> kernel experiment.
>
> Interestingly it got:
>
> [00:01:34] [01] [00:00:00] Builder starting
> [00:01:57] [01] [00:00:23] Builder started
> [00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4
> [00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: Success
> [00:03:21] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
> [00:03:21] [02] [00:00:00] Builder starting
> [00:03:21] [03] [00:00:00] Builder starting
> [00:03:21] [04] [00:00:00] Builder starting
> [00:03:21] [05] [00:00:00] Builder starting
> [00:03:21] [06] [00:00:00] Builder starting
> [00:03:21] [07] [00:00:00] Builder starting
> [00:03:22] [08] [00:00:00] Builder starting
> [00:03:22] [09] [00:00:00] Builder starting
> [00:03:22] [10] [00:00:00] Builder starting
> [00:03:22] [11] [00:00:00] Builder starting
> [00:03:22] [12] [00:00:00] Builder starting
> [00:03:22] [13] [00:00:00] Builder starting
> [00:03:22] [14] [00:00:00] Builder starting
> [00:03:22] [15] [00:00:00] Builder starting
> [00:03:22] [16] [00:00:00] Builder starting
> [00:03:22] [17] [00:00:00] Builder starting
> [00:03:22] [18] [00:00:00] Builder starting
> [00:03:22] [19] [00:00:00] Builder starting
> [00:03:22] [20] [00:00:00] Builder starting
> [00:03:22] [21] [00:00:00] Builder starting
> [00:03:22] [22] [00:00:00] Builder starting
> [00:03:22] [23] [00:00:00] Builder starting
> [00:03:22] [24] [00:00:00] Builder starting
> [00:03:22] [25] [00:00:00] Builder starting
> [00:03:22] [26] [00:00:00] Builder starting
> [00:03:22] [27] [00:00:00] Builder starting
> [00:03:22] [28] [00:00:00] Builder starting
> [00:03:22] [29] [00:00:00] Builder starting
> [00:03:22] [30] [00:00:00] Builder starting
> [00:03:22] [31] [00:00:00] Builder starting
> [00:03:22] [32] [00:00:00] Builder starting
> [00:03:30] [01] [00:00:09] Finished print/indexinfo | indexinfo-0.3.1: Success
> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
>
> and is still that way minutes later.
>
> ^T shows:
>
> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
> load: 13.02  cmd: sh 2187 [vlruwk] 570.19r 0.62u 38.60s 9% 3948k
> #0 0xffffffff80b7701b at mi_switch+0xbb
> #1 0xffffffff80bc976f at sleepq_timedwait+0x2f
> #2 0xffffffff80b76770 at _sleep+0x1d0
> #3 0xffffffff80c5b435 at vn_alloc_hard+0x2a5
> #4 0xffffffff80c50b72 at getnewvnode_reserve+0x92
> #5 0xffffffff829b9b12 at zfs_zget+0x22
> #6 0xffffffff829a6a8d at zfs_dirent_lookup+0x16d
> #7 0xffffffff829a6b5f at zfs_dirlook+0x7f
> #8 0xffffffff829b6410 at zfs_lookup+0x350
> #9 0xffffffff829b182a at zfs_freebsd_cachedlookup+0x6a
> #10 0xffffffff80c36a0d at vfs_cache_lookup+0xad
> #11 0xffffffff80c45141 at vfs_lookup+0x581
> #12 0xffffffff80c44238 at namei+0x238
> #13 0xffffffff80c63b5e at kern_statat+0xee
> #14 0xffffffff80c64237 at sys_fstatat+0x27
> #15 0xffffffff81049a79 at amd64_syscall+0x109
> #16 0xffffffff8101f11b at fast_syscall_common+0xf8
> [main-amd64-bulk_a-default] [2023-08-19_15h14m10s] [parallel_build:] Queued: 34435 Built: 2 Failed: 0 Skipped: 35 Ignored: 358 Fetched: 0 Tobuild: 34040  Time: 00:10:52
> ID   TOTAL    ORIGIN                | PKGNAME               PHASE  PHASE    TMPFS  CPU%  MEM%
> [01] 00:07:29 devel/gettext-runtime | gettext-runtime-0.22  build  00:06:32        25.4% 0%
> [00:11:25] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-08-19_15h14m10s
>
> Note the 3:31->11:25 .
>
> Top is showing lots of "vlruwk". For example:
>
>  362   0 root  40  0  27076Ki 13776Ki CPU19   19  4:23 0.00% cpdup -i0 -o ref 32
>  349   0 root  53  0  27076Ki 13776Ki vlruwk  22  4:20 0.01% cpdup -i0 -o ref 31
>  328   0 root  68  0  27076Ki 13804Ki vlruwk   8  4:30 0.01% cpdup -i0 -o ref 30
>  304   0 root  37  0  27076Ki 13792Ki vlruwk   6  4:18 0.01% cpdup -i0 -o ref 29
>  282   0 root  42  0  33220Ki 13956Ki vlruwk   8  4:33 0.01% cpdup -i0 -o ref 28
>  242   0 root  56  0  27076Ki 13796Ki vlruwk   4  4:28 0.00% cpdup -i0 -o ref 27
>
> In other words, it is messed up from the start, not
> just later.
>
> It does suggest that the dbg kernel should not end up with
> resource problems: not that much gets very far. So I'll
> probably stop it and substitute the debug kernel, reboot
> and try again.

Still for the nodbg kernel . . .

The "vlruwk" processes do occasionally show a CPU?? state
instead. Nothing seems stuck in only one STATE. (Live lock?)

As for using the dbg kernel instead . . .

Most of the time that processes are showing CPU?? , more
progress is made in building, but basically via only one
builder. vlruwk does show up, gradually taking a larger
fraction of the time. ref 02 .. ref 32 are still in
cpdup -i0 -o . *vnode is showing up some as well. No process
looks to be stuck in just one of those states. (Live lock?)

The debug kernel is not reporting anything during this so far.

(some time goes by)

At this point vlruwk is fairly commonly what displays for the
cpdup's that are not finishing -- but none are stuck in vlruwk .

Looks like I should try without the 2 patches (15107 and 15122).

===
Mark Millard
marklmi at yahoo.com
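[For anyone following along: the kernel stack in the ^T output above and the
vnode pressure behind the vlruwk waits (the stack is sitting in vn_alloc_hard)
can be inspected with stock FreeBSD tools. A minimal sketch, assuming a recent
FreeBSD; the PID 362 is just an example taken from the top listing above.]

# Have ^T (SIGINFO) include the kernel stack of the foreground process:
sysctl kern.tty_info_kstacks=1

# Or snapshot the kernel stack / wait channel of a specific stuck cpdup
# without touching its terminal (PID here is illustrative):
procstat -kk 362

# Watch overall vnode counts while builders sit in vlruwk:
sysctl kern.maxvnodes vfs.numvnodes vfs.freevnodes vfs.wantfreevnodes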