regression: memory issues on main/arm64 over sched/runq changes
Date: Sat, 21 Jun 2025 15:49:13 UTC
Hi,
it's too early for stab-week but ...
I had interface groups ("all") disappear from the interface between
interface creation and ifconfig prints during rc stage:
if7: XXXXXXXXXXXXXXXXXXXXXXXXXXX-BZ if_getgroup:1647: ifgl 0xffffa080011aec90, ifgl_group 0, ifg_group 0
panic: vm_fault failed: 0xffff0000005e19c8 error 1
cpuid = 0
time = 8
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x38
vpanic() at vpanic+0x1a0
panic() at panic+0x48
data_abort() at data_abort+0x28c
handle_el1h_sync() at handle_el1h_sync+0x18
--- exception, esr 0x96000004
strlcpy() at strlcpy+0x20
ifhwioctl() at ifhwioctl+0x998
ifioctl() at ifioctl+0x8bc
kern_ioctl() at kern_ioctl+0x2e4
sys_ioctl() at sys_ioctl+0x140
do_el0_sync() at do_el0_sync+0x618
handle_el0_sync() at handle_el0_sync+0x4c
--- exception, esr 0x56000000
KDB: enter: panic
[ thread pid 635 tid 100249 ]
Stopped at kdb_enter+0x48: str xzr, [x19, #2432]
I instrumented the kernel and could not find any deletions. It was even
stranger given that the machine has 10 physical interfaces + lo and it
only happened for #7 and #8.
I added guards to the struct and that did not reveal any memory
corruption.
Added a loop right at the end of if_addgroup() to make sure the list was
coherent and it was (incl. lo which has two groups).
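A check of that kind could look roughly like the following. This is only
a userland sketch, not the instrumentation that was actually added to
if_addgroup(): struct grp, struct grp_entry, GRP_NAMSIZ and list_check()
are hypothetical stand-ins for the kernel structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GRP_NAMSIZ 16			/* hypothetical name buffer size */

struct grp {				/* stand-in for a group record */
	char name[GRP_NAMSIZ];
};

struct grp_entry {			/* stand-in for a per-interface list entry */
	struct grp *group;
	struct grp_entry *next;
};

/*
 * Walk the list and return the number of entries if every entry points
 * at a group with a non-empty, NUL-terminated name.  Return -1 on the
 * first inconsistency, or if the walk exceeds max_entries (which would
 * suggest a cycle or a runaway list).
 */
static int
list_check(const struct grp_entry *head, int max_entries)
{
	const struct grp_entry *e;
	int n = 0;

	assert(max_entries >= 0);
	for (e = head; e != NULL; e = e->next) {
		if (++n > max_entries)
			return (-1);
		if (e->group == NULL)
			return (-1);
		if (e->group->name[0] == '\0' ||
		    memchr(e->group->name, '\0', GRP_NAMSIZ) == NULL)
			return (-1);
	}
	return (n);
}
```

A NULL group pointer, as in the "ifgl_group 0" debug line above, is
exactly the case the second test in the loop would catch.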
Then I started over-allocating the structs (size * 3) for ifgl and ifg
and put the actual value in the middle. That worked and the two guard
structs showed no sign of memory corruption. So the larger allocation
apparently helped or changed timing (which the printfs had not).
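For illustration, the over-allocation idea can be sketched in userland C
as below. This is an assumption-laden sketch, not the kernel change
itself: it uses malloc(3) rather than the kernel allocator, and
GUARD_BYTE, guarded_alloc(), guards_intact() and guarded_free() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTE 0xA5			/* arbitrary fill pattern (assumption) */

/*
 * Allocate 3 * size bytes, fill the first and last thirds with a known
 * pattern, zero the middle third and return a pointer to it.
 */
static void *
guarded_alloc(size_t size)
{
	unsigned char *base;

	assert(size > 0);
	base = malloc(3 * size);
	if (base == NULL)
		return (NULL);
	memset(base, GUARD_BYTE, 3 * size);
	memset(base + size, 0, size);
	return (base + size);
}

/* Return 1 if both guard regions still hold the pattern, 0 otherwise. */
static int
guards_intact(const void *obj, size_t size)
{
	const unsigned char *base = (const unsigned char *)obj - size;
	size_t i;

	for (i = 0; i < size; i++) {
		if (base[i] != GUARD_BYTE || base[2 * size + i] != GUARD_BYTE)
			return (0);
	}
	return (1);
}

/* Free an object obtained from guarded_alloc() with the same size. */
static void
guarded_free(void *obj, size_t size)
{
	if (obj != NULL)
		free((unsigned char *)obj - size);
}
```

Intact guards only rule out linear overruns and underruns next to the
object. They say nothing about stray writes from elsewhere into the
middle slot, so a clean result here does not rule out other kinds of
corruption.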
Then I undid the changes and backed out to b93161a7e38d and that works
just fine.
Went to c29459f901dc which shows the problem and panics again.
Reduced it to eebc148f25c3.
So it's in the range of:
% git log --oneline b93161a7e38d..eebc148f25c3
eebc148f25c3 sched_4bsd: ESTCPULIM(): Allow any value in the timeshare range
51a4ae05abe6 sched_4bsd: Remove RQ_PPQ from ESTCPULIM()'s formula
a454ff6b0440 sched_4bsd: Move ESTCPULIM() after its macro dependencies
a33225efb4bc sched_ule: Sanitize CPU's use and priority computations, and ticks storage
6792f3411f6d sched_ule: Recover previous nice and anti-starvation behaviors
dee257c28d93 sched: Internal priority ranges: Reduce kernel, increase timeshare
d710acecc00f runq: Add copyright
055b5b5f850d runq: Restrict <sys/runq.h> to kernel only
a2d1c3bc2bb4 epoch_test: Assign different priorities using offset 1
b2a9ee2a72ea runq: Remove userland references to RQ_PPQ in rtprio contexts
e3a4b989d7f7 runq: Bump __FreeBSD_version after switching to 256 levels
af8de65ef23e runq: Switch to 256 levels
fd141584cf89 zfs: spa: ZIO_TASKQ_ISSUE: Use symbolic priority
8ecc41918066 Internal scheduling priorities: Always use symbolic ones
baecdea10eb5 sched_ule: Use a single runqueue per CPU
fdf31d274769 sched_ule: runq_steal_from(): Suppress first thread special case
f4be333bc567 sched_ule: Re-implement stealing on top of runq common-code
9c3f4682bb90 runq: New runq_findq(), common low-level search implementation
a31193172cb9 runq: New function runq_is_queue_empty(); Use it in ULE
757bab06fb59 runq: Tidy up and rename runq_setbit() and runq_clrbit()
de78657a3aef runq: runq_check(): Re-implement on top of runq_findq()
439dc920f2d8 runq: Revamp runq_find*(), new runq_find_range()
200fc93dace7 runq: Re-order functions more logically
7e2502e3dec9 runq: More macros; Better and more consistent naming
57540a0666f6 runq: Clarity and style pass
a11926f2a5f0 runq: API tidy up: 'pri' => 'idx', 'idx' as int, remove runq_remove_idx()
28b54827f5c1 runq: Hide function prototypes under _KERNEL
c21c24adde98 runq: More selective includes of <sys/runq.h> to reduce pollution
2fefe2c88b31 runq: Deduce most parameters, remove machine headers
I do not know if it's feasible or doable to bisect those changes further?
/bz
--
Bjoern A. Zeeb r15:7