Re: main aarch64: poudriere-devel [UFS context] cpdup stuck in pgnslp state [any more useful information to extract via kgdb?]

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 22 Mar 2024 17:51:29 UTC
[Adding some information from kgdb. Switch to freebsd-hackers and
(still) freebsd-arm. I also ask if there is more useful information
to extract via kgdb.]

On Mar 21, 2024, at 17:34, Mark Millard <marklmi@yahoo.com> wrote:

> Note, more recent process creations towards top, older ones towards bottom:
> 
>  PID   JID USERNAME    PRI NICE     SIZE       RES STATE    C   TIME     CPU COMMAND
> . . .
> 33693    19 root         68    0   6524Ki    3252Ki wait     3   0:00   0.00% /usr/bin/make -C /usr/ports/lang/gcc13 build
> 33692     0 root         68    0  15728Ki    3552Ki wait     0   0:00   0.00% sh: poudriere[main-CA7-default][02]: build_pkg (gcc13-13.2.0_4) (sh)
> 30174     0 root         68    0  15728Ki    3564Ki select   3   0:00   0.00% sh: poudriere[main-CA7-default][02]: build_pkg (gcc13-13.2.0_4) (sh)
> 26338     0 root         66    0  17740Ki    5044Ki pgnslp   0   0:01   0.00% cpdup -i0 -s0 -f -x ref 01
> 26308     0 root         68    0  15728Ki    3556Ki wait     0   0:00   0.00% sh: poudriere[main-CA7-default][01]: build_pkg (boost-libs-1.84.0) (sh)
> 33592     0 root         26    0  15728Ki    3388Ki piperd   2   0:01   0.00% sh: poudriere[main-CA7-default]: pkg_cacher_main (sh)
> 29205     0 root         68    0  15728Ki    3392Ki nanslp   2   1:52   0.14% sh: poudriere[main-CA7-default]: html_json_main (sh)
> 28834     0 root         20    0  15728Ki    3548Ki select   3   0:01   0.00% /usr/local/libexec/poudriere/sh -e /usr/local/share/poudriere/bulk.sh -jmain-CA7 -c -f /root/origins/CA7-origins.txt
> 28833     0 root         20    0  13560Ki    1924Ki wait     3   0:00   0.00% /bin/sh /root/build-ports-main-CA7.sh -c
> . . .
> 
> pgnslp seems to be from: vm_page_acquire_unlocked in sys/vm/vm_page.c .
> That in turn looks to be using vm_page_grab_sleep :
> 
>                if (!vm_page_grab_sleep(object, m, pindex, "pgnslp",
>                    allocflags, false))
>                        return (false);
> 
> and:
> 
> /*
> *      vm_page_grab_sleep
> *
> *      Sleep for busy according to VM_ALLOC_ parameters.  Returns true
> *      if the caller should retry and false otherwise.
> *
> *      If the object is locked on entry the object will be unlocked with
> *      false returns and still locked but possibly having been dropped
> *      with true returns.
> */
> static bool
> vm_page_grab_sleep(vm_object_t object, vm_page_t m, vm_pindex_t pindex,
>    const char *wmesg, int allocflags, bool locked)
> {
> 
>        if ((allocflags & VM_ALLOC_NOWAIT) != 0)
>                return (false);
> 
>        /*
>         * Reference the page before unlocking and sleeping so that
>         * the page daemon is less likely to reclaim it.
>         */
>        if (locked && (allocflags & VM_ALLOC_NOCREAT) == 0)
>                vm_page_reference(m);
> 
>        if (_vm_page_busy_sleep(object, m, pindex, wmesg, allocflags, locked) &&
>            locked)
>                VM_OBJECT_WLOCK(object);
>        if ((allocflags & VM_ALLOC_WAITFAIL) != 0)
>                return (false);
> 
>        return (true);
> }
> 
> . . .
> [10:08:06] [01] [00:00:00] Building devel/boost-libs | boost-libs-1.84.0
> . . .
> 
> # poudriere status -b
> [main-CA7-default] [2024-03-21_06h23m31s] [parallel_build] Queued: 265 Built: 213 Failed: 0   Skipped: 0   Ignored: 0   Fetched: 0   Tobuild: 52   Time: 10:50:40
> ID  TOTAL                              ORIGIN   PKGNAME                            PHASE TIME     TMPFS      CPU% MEM%
> [01] 00:42:40                 devel/boost-libs | boost-libs-1.84.0               starting 00:42:40 951.54 MiB          
> . . .
> 
> Unfortunately:
> 
> A) The booted kernel is my personal build based on -mcpu=cortex-a76
>   and LSE_ATOMICS . (It is in use on a RPi5 booted via EDK2.)
> 
> B) The booted world is a PkgBase world.
> 
> C) The poudriere jail's world directory tree is my personal armv7
>   world build based on -mcpu=cortex-a7 .
> 
> All are based on: main-n268827-75464941dc17 . (Well, PkgBase
> commit identification/verification for world does not exist.
> I happened to update PkgBase during a long lull for commits
> to main. In the context, the boot-world seems unlikely to be
> involved here.)
> 
> The boot media is a U2 Optane 960 GB used via a USB3 adaptor.
> 
> I've done bunches of builds in the (A)-(C) context on the RPi5
> and have not seen this before, so: does not look to be readily
> repeatable.
> 
> (Unfortunately, the purpose of the build was to find out how long
> the particular build configuration took to finish building the
> 265 packages from scratch, for comparison to other builds.)
> 
> I may wait for the system to become fairly idle and then see about
> forcing a crash dump. It may be a while before the poudriere bulk
> runs out of packages it can build, absent building boost-libs .
> 
> 
> Side note:
> As far as I can tell, how to identify a context that allows
> identification of what commit vintage a PkgBase world is based on
> is unspecified so far. For a PkgBase kernel uname -apKU may well
> report the kernel-commit identification well. (Hard to verify.)

The cpdup kernel thread backtrace follows. The kernel build was
NODEBUG and optimatized, but with debug information still produced.
This can lead to somewhat messy information but is better than
not having the information.

(kgdb) thread 145
[Switching to thread 145 (Thread 361925)]
#0  sched_switch (td=td@entry=0xffff0000dfaf8c80, flags=flags@entry=259) at /usr/main-src/sys/kern/sched_ule.c:2290
2290 cpuid = td->td_oncpu = PCPU_GET(cpuid);
(kgdb) bt
#0  sched_switch (td=td@entry=0xffff0000dfaf8c80, flags=flags@entry=259) at /usr/main-src/sys/kern/sched_ule.c:2290
#1  0xffff0000004f5f2c in mi_switch (flags=flags@entry=259) at /usr/main-src/sys/kern/kern_synch.c:546
#2  0xffff00000054ea28 in sleepq_switch (wchan=<optimized out>, pri=<optimized out>) at /usr/main-src/sys/kern/subr_sleepqueue.c:607
#3  0xffff00000054e8f8 in sleepq_wait (wchan=<unavailable>, wchan@entry=0xffffa001fae26630, pri=<unavailable>, pri@entry=52) at /usr/main-src/sys/kern/subr_sleepqueue.c:658
#4  0xffff00000084e768 in _vm_page_busy_sleep (obj=<optimized out>, obj@entry=0xffffa0012aab9528, m=m@entry=0xffffa001fae26630, pindex=pindex@entry=0, wmesg=0xffff0000009c6297 "pgnslp",      allocflags=allocflags@entry=21504, locked=<optimized out>) at /usr/main-src/sys/vm/vm_page.c:1100
#5  0xffff0000008548e4 in vm_page_grab_sleep (object=0xffffa0012aab9528, m=0xffffa001fae26630, pindex=0, allocflags=21504, locked=false, wmesg=<optimized out>) at /usr/main-src/sys/vm/vm_page.c:4547
#6  vm_page_acquire_unlocked (object=object@entry=0xffffa0012aab9528, pindex=pindex@entry=0, prev=<optimized out>, prev@entry=0xffff0000e04c5570, mp=mp@entry=0xffff0000e04c55b8, allocflags=21504)
    at /usr/main-src/sys/vm/vm_page.c:4690
#7  0xffff0000008552c4 in vm_page_grab_valid_unlocked (mp=mp@entry=0xffff0000e04c55b8, object=object@entry=0xffffa0012aab9528, pindex=pindex@entry=0, allocflags=allocflags@entry=21504)
    at /usr/main-src/sys/vm/vm_page.c:4871
#8  0xffff0000005949e8 in uiomove_object_page (obj=0xffffa0012aab9528, len=<unavailable>, uio=0xffff0000e04c57d8) at /usr/main-src/sys/kern/uipc_shm.c:207
#9  uiomove_object (obj=0xffffa0012aab9528, obj@entry=<error reading variable: value is not available>, obj_size=29352, obj_size@entry=<error reading variable: value is not available>,      uio=0xffff0000e04c57d8, uio@entry=<error reading variable: value is not available>) at /usr/main-src/sys/kern/uipc_shm.c:268
#10 0xffff0000009b8a54 in VOP_READ_APV (vop=0xffff000000c8c038 <tmpfs_vnodeop_entries>, a=a@entry=0xffff0000e04c5658) at vnode_if.c:1048
#11 0xffff0000005fd630 in VOP_READ (vp=0xffffa00129280540, uio=0xffff0000e04c57d8, ioflag=327680, cred=<unavailable>) at ./vnode_if.h:537
#12 vn_read (fp=fp@entry=0xffffa0007492bdc0, uio=uio@entry=0xffff0000e04c57d8, active_cred=active_cred@entry=0xffffa0010ab5f400, flags=<optimized out>, td=<optimized out>)
    at /usr/main-src/sys/kern/vfs_vnops.c:1123
#13 0xffff0000005f5598 in vn_io_fault (fp=0xffffa0007492bdc0, fp@entry=<error reading variable: value is not available>, uio=0xffff0000e04c57d8,      uio@entry=<error reading variable: value is not available>, active_cred=0xffffa0010ab5f400, active_cred@entry=<error reading variable: value is not available>, flags=0,      flags@entry=<error reading variable: value is not available>, td=0xffff0000dfaf8c80, td@entry=<error reading variable: value is not available>) at /usr/main-src/sys/kern/vfs_vnops.c:1504
#14 0xffff000000562ed4 in fo_read (fp=0xffffa0007492bdc0, uio=0xffff0000e04c57d8, active_cred=<unavailable>, flags=0, td=0xffff0000dfaf8c80) at /usr/main-src/sys/sys/file.h:344
#15 dofileread (td=td@entry=0xffff0000dfaf8c80, fd=fd@entry=3, fp=0xffffa0007492bdc0, auio=auio@entry=0xffff0000e04c57d8, offset=<optimized out>, flags=0, flags@entry=-531867712)
    at /usr/main-src/sys/kern/sys_generic.c:366
#16 0xffff000000562a00 in kern_readv (td=0xffff0000dfaf8c80, fd=3, auio=0xffff0000e04c57d8) at /usr/main-src/sys/kern/sys_generic.c:287
#17 sys_read (td=0xffff0000dfaf8c80, td@entry=<error reading variable: value is not available>, uap=<unavailable>, uap@entry=<error reading variable: value is not available>)
    at /usr/main-src/sys/kern/sys_generic.c:203
#18 0xffff00000089f9b4 in syscallenter (td=0xffff0000dfaf8c80) at /usr/main-src/sys/arm64/arm64/../../kern/subr_syscall.c:186
#19 svc_handler (td=0xffff0000dfaf8c80, frame=<optimized out>) at /usr/main-src/sys/arm64/arm64/trap.c:198
#20 do_el0_sync (td=0xffff0000dfaf8c80, frame=<optimized out>) at /usr/main-src/sys/arm64/arm64/trap.c:640
#21 <signal handler called>
#22 0x00000000271dd394 in ?? ()
#23 0x0000000026e3071c in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(kgdb) frame 4
#4  0xffff00000084e768 in _vm_page_busy_sleep (obj=<optimized out>, obj@entry=0xffffa0012aab9528, m=m@entry=0xffffa001fae26630, pindex=pindex@entry=0, wmesg=0xffff0000009c6297 "pgnslp",      allocflags=allocflags@entry=21504, locked=<optimized out>) at /usr/main-src/sys/vm/vm_page.c:1100
1100 sleepq_wait(m, PVM);

FYI:

. . .
       } while (!atomic_fcmpset_int(&m->busy_lock, &x, x | VPB_BIT_WAITERS));
        if (locked)
                VM_OBJECT_DROP(obj);
        DROP_GIANT();
        sleepq_add(m, NULL, wmesg, 0, 0);
        sleepq_wait(m, PVM);
        PICKUP_GIANT();
        return (true);
}
. . .

(kgdb) print /x *m
$2 = {plinks = {q = {tqe_next = 0xffffa001fc7c31c0, tqe_prev = 0xffffa001ff85d2d0}, s = {ss = {sle_next = 0xffffa001fc7c31c0}}, memguard = {p = 0xffffa001fc7c31c0, v = 0xffffa001ff85d2d0}, uma = {
      slab = 0xffffa001fc7c31c0, zone = 0xffffa001ff85d2d0}}, listq = {tqe_next = 0xffffa001ff85d2d0, tqe_prev = 0xffffa0012aab9570}, object = 0xffffa0012aab9528, pindex = 0x0,    phys_addr = 0x12a816000, md = {pv_list = {tqh_first = 0x0, tqh_last = 0xffffa001fae26668}, pv_gen = 0x15fd, pv_memattr = 0x2}, ref_count = 0x80000000, busy_lock = 0x1, a = {{flags = 0x208,        queue = 0x1, act_count = 0x1}, _bits = 0x1010208}, order = 0xd, pool = 0x0, flags = 0x1, oflags = 0x0, psind = 0x0, segind = 0x8, valid = 0xff, dirty = 0x0}

Any more potentially useful information to extract via kgdb?


===
Mark Millard
marklmi at yahoo.com