Re: regression: memory issues on main/arm64 over sched/runq changes
Date: Wed, 25 Jun 2025 10:20:58 UTC
> On Jun 21, 2025, at 11:49 PM, Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net> wrote:
>
> Hi,
>
> it's too early for stab-week but ...
>
> I had interface groups ("all") disappear from the interface between
> interface creation and ifconfig prints during rc stage:
>
> if7: XXXXXXXXXXXXXXXXXXXXXXXXXXX-BZ if_getgroup:1647: ifgl 0xffffa080011aec90, ifgl_group 0, ifg_group 0
>
> panic: vm_fault failed: 0xffff0000005e19c8 error 1
> cpuid = 0
> time = 8
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x38
> vpanic() at vpanic+0x1a0
> panic() at panic+0x48
> data_abort() at data_abort+0x28c
> handle_el1h_sync() at handle_el1h_sync+0x18
> --- exception, esr 0x96000004
> strlcpy() at strlcpy+0x20
> ifhwioctl() at ifhwioctl+0x998
> ifioctl() at ifioctl+0x8bc
> kern_ioctl() at kern_ioctl+0x2e4
> sys_ioctl() at sys_ioctl+0x140
> do_el0_sync() at do_el0_sync+0x618
> handle_el0_sync() at handle_el0_sync+0x4c
> --- exception, esr 0x56000000
> KDB: enter: panic
> [ thread pid 635 tid 100249 ]
> Stopped at kdb_enter+0x48: str xzr, [x19, #2432]
>
>
> I instrumented the kernel and could not find any deletions. It was more
> strange given the machine has 10 physical interfaces + lo and only for
> #7 and #8 it happened.

Does that happen every time, or only sometimes? What is the driver for interfaces #7 and #8?

>
> I added guards to the struct and that did not reveal any memory
> corruption.
>
> Added a loop right at the end of if_addgroup() to make sure the list was
> coherent and it was (incl. lo which has two groups).
>
> Then I started over-allocating the structs (size * 3) for ifgl and ifg
> and put the actual value in the middle. That worked and the two guard
> structs showed no sign of memory corruptions. So the larger allocation
> apparently helped or changed timing (which the printfs had not).

So the arch is aarch64, which has a much weaker memory model.

I recently overhauled the attach / detach process of interfaces, and it relies heavily on synchronization. More precisely, I'd expect this order:

    All writes to softc / ifnet (including if_addgroup()) > if_link_ifnet() > ifunit()

You can read the > as 'happens before'.

Best regards,
Zhenlei

>
>
> Then I undid the changes and backed out to b93161a7e38d and that works
> just fine.
>
> Went to c29459f901dc which shows the problem and panics again.
> Reduced it to eebc148f25c3.
>
> So it's in the range of:
>
> % git log --oneline b93161a7e38d..eebc148f25c3
> eebc148f25c3 sched_4bsd: ESTCPULIM(): Allow any value in the timeshare range
> 51a4ae05abe6 sched_4bsd: Remove RQ_PPQ from ESTCPULIM()'s formula
> a454ff6b0440 sched_4bsd: Move ESTCPULIM() after its macro dependencies
> a33225efb4bc sched_ule: Sanitize CPU's use and priority computations, and ticks storage
> 6792f3411f6d sched_ule: Recover previous nice and anti-starvation behaviors
> dee257c28d93 sched: Internal priority ranges: Reduce kernel, increase timeshare
> d710acecc00f runq: Add copyright
> 055b5b5f850d runq: Restrict <sys/runq.h> to kernel only
> a2d1c3bc2bb4 epoch_test: Assign different priorities using offset 1
> b2a9ee2a72ea runq: Remove userland references to RQ_PPQ in rtprio contexts
> e3a4b989d7f7 runq: Bump __FreeBSD_version after switching to 256 levels
> af8de65ef23e runq: Switch to 256 levels
> fd141584cf89 zfs: spa: ZIO_TASKQ_ISSUE: Use symbolic priority
> 8ecc41918066 Internal scheduling priorities: Always use symbolic ones
> baecdea10eb5 sched_ule: Use a single runqueue per CPU
> fdf31d274769 sched_ule: runq_steal_from(): Suppress first thread special case
> f4be333bc567 sched_ule: Re-implement stealing on top of runq common-code
> 9c3f4682bb90 runq: New runq_findq(), common low-level search implementation
> a31193172cb9 runq: New function runq_is_queue_empty(); Use it in ULE
> 757bab06fb59 runq: Tidy up and rename runq_setbit() and runq_clrbit()
> de78657a3aef runq: runq_check(): Re-implement on top of runq_findq()
> 439dc920f2d8 runq: Revamp runq_find*(), new runq_find_range()
> 200fc93dace7 runq: Re-order functions more logically
> 7e2502e3dec9 runq: More macros; Better and more consistent naming
> 57540a0666f6 runq: Clarity and style pass
> a11926f2a5f0 runq: API tidy up: 'pri' => 'idx', 'idx' as int, remove runq_remove_idx()
> 28b54827f5c1 runq: Hide function prototypes under _KERNEL
> c21c24adde98 runq: More selective includes of <sys/runq.h> to reduce pollution
> 2fefe2c88b31 runq: Deduce most parameters, remove machine headers
>
>
> I do not know if it's feasible or doable to bisect those changes further?
>
> /bz
>
>
> --
> Bjoern A. Zeeb r15:7
>
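
For illustration only, the 'happens before' ordering described in the reply can be sketched in userland C11 as a release/acquire publication. This is not the kernel code: struct fake_ifnet, published_ifp, fake_attach() and fake_ifunit() are hypothetical stand-ins for the ifnet, the global interface list, the attach path ending in if_link_ifnet(), and ifunit(). The only assumption is that the lookup side must never observe the interface before all of its fields, including the group, have been written.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct ifnet; not the kernel's layout. */
struct fake_ifnet {
	char name[16];
	char group[16];
};

/* Hypothetical stand-in for the global interface list: a single slot. */
static _Atomic(struct fake_ifnet *) published_ifp;

/*
 * Attach path: complete every write to the object first, then make it
 * visible with a release store.  The release store plays the role that
 * linking the interface plays in the ordering described above.
 */
static void
fake_attach(struct fake_ifnet *ifp, const char *name, const char *group)
{
	snprintf(ifp->name, sizeof(ifp->name), "%s", name);
	snprintf(ifp->group, sizeof(ifp->group), "%s", group);
	atomic_store_explicit(&published_ifp, ifp, memory_order_release);
}

/*
 * Lookup path: the acquire load pairs with the release store, so a
 * caller that gets a non-NULL pointer also sees the name and group
 * written before publication, even on a weakly ordered CPU.
 */
static struct fake_ifnet *
fake_ifunit(const char *name)
{
	struct fake_ifnet *ifp;

	ifp = atomic_load_explicit(&published_ifp, memory_order_acquire);
	if (ifp != NULL && strcmp(ifp->name, name) == 0)
		return (ifp);
	return (NULL);
}
```

If the store or the load were relaxed instead, a weakly ordered machine would be allowed to make the pointer visible before the group field, which is the kind of symptom a reader could then see as a missing group. Whether that is what happens in the reported panic is not established here.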