Re: regression: memory issues on main/arm64 over sched/runq changes

From: Bjoern A. Zeeb <bzeeb-lists_at_lists.zabbadoz.net>
Date: Fri, 27 Jun 2025 15:02:35 UTC
On Wed, 25 Jun 2025, Zhenlei Huang wrote:

Hi,

I applied olce's change from the review but it didn't make a difference
on my arm64 and now on a tree with local changes (wifi bits, user space
bits, etc).

Now I netbooted that tree on x86 hardware (an old Lenovo laptop) and ran
into something else (the same tree boots in a bhyve instance on a
different machine from a local disk image).

At the end of if_addgroup() I had added the following for local
debugging (really crude, sorry):

...

+       atomic_thread_fence_seq_cst();
         IF_ADDR_WLOCK(ifp);
         CK_STAILQ_INSERT_TAIL(&ifg->ifg_members, ifgm, ifgm_next);
         CK_STAILQ_INSERT_TAIL(&ifp->if_groups, ifgl, ifgl_next);
         IF_ADDR_WUNLOCK(ifp);

         IFNET_WUNLOCK();	// excl unlock

         if (new)
                 EVENTHANDLER_INVOKE(group_attach_event, ifg);
         EVENTHANDLER_INVOKE(group_change_event, groupname);

+       IFNET_RLOCK();  // shared, panic
+       CK_STAILQ_FOREACH(ifgl, &ifp->if_groups, ifgl_next) {
+               if (bz_debug_groups) if_printf(ifp, "XXXXXXXXXXXXXXXXXXXXXXXXXXX-BZ %s:%d: ifgl %p, ifgl_group %p, ifg_group %p\n", __func__, __LINE__, ifgl, (ifgl != NULL) ? ifgl->ifgl_group : NULL, (ifgl != NULL && ifgl->ifgl_group != NULL) ? ifgl->ifgl_group->ifg_group : NULL);
+       }
+       IFNET_RUNLOCK();
+
         return (0);
  }



You see the annotation "// shared, panic"?

I got a panic: excl->share with that.

The excl. is the
         IFNET_WLOCK();          // excl
at the top of the function after the groupname check.
But that gets unlocked before the event handlers above,
so how can this happen?
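
To spell out the expectation, here is a minimal userspace sketch of the
bookkeeping a lock checker might do. It is a toy model under stated
assumptions, not the kernel's code: struct model_lock, model_acquire()
and model_release() are hypothetical names, and it tracks only how one
thread holds one lock. In this model a shared acquire after the
exclusive release succeeds, so an "excl->share" result would mean the
exclusive hold was still recorded at that point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: how the current thread holds a single lock. */
enum held_mode { HELD_NONE, HELD_SHARED, HELD_EXCL };

enum acquire_result {
	ACQ_OK,
	ACQ_EXCL_TO_SHARE,	/* shared request while held exclusively */
	ACQ_SHARE_TO_EXCL,	/* exclusive request while held shared */
	ACQ_RECURSED		/* same mode requested again */
};

struct model_lock {
	const char *name;
	enum held_mode held;
};

/* Record an acquire, or report why the checker would complain. */
static enum acquire_result
model_acquire(struct model_lock *lk, bool exclusive)
{
	if (lk->held == HELD_EXCL && !exclusive)
		return (ACQ_EXCL_TO_SHARE);
	if (lk->held == HELD_SHARED && exclusive)
		return (ACQ_SHARE_TO_EXCL);
	if (lk->held != HELD_NONE)
		return (ACQ_RECURSED);
	lk->held = exclusive ? HELD_EXCL : HELD_SHARED;
	return (ACQ_OK);
}

/* Forget the hold; releasing a lock that is not held is a bug. */
static void
model_release(struct model_lock *lk)
{
	assert(lk->held != HELD_NONE);
	lk->held = HELD_NONE;
}
```

With this model, acquire(excl), release, acquire(shared) returns ACQ_OK;
only acquire(excl) followed directly by acquire(shared) returns
ACQ_EXCL_TO_SHARE.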

Sadly I cannot even dump or anything as the keyboard is as dead
as the rest of the laptop.  Have to power cycle it hard.

Apart from the debugging I added I have no local changes in sys/net
in that tree.  sys/kern seems to have no relevant changes either
(added a bus func, toggle link_elf_leak_locals default, and a printf
got an extra argument to print %d error when modules fail to load).


I'll try a plain main (hopefully tonight) on that machine too, but I am
really at a loss here now that it's also happening on x86, only for me,
and always around the same code there...

I'll also try to boot this tree from a USB pen drive or something, in
case my problem comes from netbooting...

I'll keep you posted...
/bz

-- 
Bjoern A. Zeeb                                                     r15:7