From: Zhenlei Huang
Date: Wed, 25 Jun 2025 18:20:58 +0800
Subject: Re: regression: memory issues on main/arm64 over sched/runq changes
To: "Bjoern A. Zeeb"
Cc: FreeBSD Current, Olivier Certner
Message-Id: <0A01B9F5-C49C-41D8-BAB7-4378DEDBF647@FreeBSD.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-current

> On Jun 21, 2025, at 11:49 PM, Bjoern A.
Zeeb wrote:
>
> Hi,
>
> it's too early for stab-week but ...
>
> I had interface groups ("all") disappear from the interface between
> interface creation and ifconfig prints during rc stage:
>
> if7: XXXXXXXXXXXXXXXXXXXXXXXXXXX-BZ if_getgroup:1647: ifgl 0xffffa080011aec90, ifgl_group 0, ifg_group 0
>
> panic: vm_fault failed: 0xffff0000005e19c8 error 1
> cpuid = 0
> time = 8
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x38
> vpanic() at vpanic+0x1a0
> panic() at panic+0x48
> data_abort() at data_abort+0x28c
> handle_el1h_sync() at handle_el1h_sync+0x18
> --- exception, esr 0x96000004
> strlcpy() at strlcpy+0x20
> ifhwioctl() at ifhwioctl+0x998
> ifioctl() at ifioctl+0x8bc
> kern_ioctl() at kern_ioctl+0x2e4
> sys_ioctl() at sys_ioctl+0x140
> do_el0_sync() at do_el0_sync+0x618
> handle_el0_sync() at handle_el0_sync+0x4c
> --- exception, esr 0x56000000
> KDB: enter: panic
> [ thread pid 635 tid 100249 ]
> Stopped at kdb_enter+0x48: str xzr, [x19, #2432]
>
>
> I instrumented the kernel and could not find any deletions. It was more
> strange given the machine has 10 physical interfaces + lo and only for
> #7 and #8 it happened.

Does that happen every time, or only sometimes? What driver do
interfaces #7 and #8 use?

>
> I added guards to the struct and that did not reveal any memory
> corruption.
>
> Added a loop right at the end of if_addgroup() to make sure the list was
> coherent and it was (incl. lo which has two groups).
>
> Then I started over-allocating the structs (size * 3) for ifgl and ifg
> and put the actual value in the middle. That worked and the two guard
> structs showed no sign of memory corruptions. So the larger allocation
> apparently helped or changed timing (which the printfs had not).

So the arch is aarch64, which has a much weaker memory model.
I have recently been overhauling the attach / detach process of
interfaces, and I rely heavily on synchronization there. More precisely,
I'd expect this order:

    all writes to softc / ifnet (including if_addgroup()) > if_link_ifnet() > ifunit()

You can read the > as 'happens before'.

Best regards,
Zhenlei

>
>
> Then I undid the changes and backed out to b93161a7e38d and that works
> just fine.
>
> Went to c29459f901dc which shows the problem and panics again.
> Reduced it to eebc148f25c3.
>
> So it's in the range of:
>
> % git log --oneline b93161a7e38d..eebc148f25c3
> eebc148f25c3 sched_4bsd: ESTCPULIM(): Allow any value in the timeshare range
> 51a4ae05abe6 sched_4bsd: Remove RQ_PPQ from ESTCPULIM()'s formula
> a454ff6b0440 sched_4bsd: Move ESTCPULIM() after its macro dependencies
> a33225efb4bc sched_ule: Sanitize CPU's use and priority computations, and ticks storage
> 6792f3411f6d sched_ule: Recover previous nice and anti-starvation behaviors
> dee257c28d93 sched: Internal priority ranges: Reduce kernel, increase timeshare
> d710acecc00f runq: Add copyright
> 055b5b5f850d runq: Restrict to kernel only
> a2d1c3bc2bb4 epoch_test: Assign different priorities using offset 1
> b2a9ee2a72ea runq: Remove userland references to RQ_PPQ in rtprio contexts
> e3a4b989d7f7 runq: Bump __FreeBSD_version after switching to 256 levels
> af8de65ef23e runq: Switch to 256 levels
> fd141584cf89 zfs: spa: ZIO_TASKQ_ISSUE: Use symbolic priority
> 8ecc41918066 Internal scheduling priorities: Always use symbolic ones
> baecdea10eb5 sched_ule: Use a single runqueue per CPU
> fdf31d274769 sched_ule: runq_steal_from(): Suppress first thread special case
> f4be333bc567 sched_ule: Re-implement stealing on top of runq common-code
> 9c3f4682bb90 runq: New runq_findq(), common low-level search implementation
> a31193172cb9 runq: New function runq_is_queue_empty(); Use it in ULE
> 757bab06fb59 runq: Tidy up and rename runq_setbit() and runq_clrbit()
> de78657a3aef runq: runq_check(): Re-implement on top of runq_findq()
> 439dc920f2d8 runq: Revamp runq_find*(), new runq_find_range()
> 200fc93dace7 runq: Re-order functions more logically
> 7e2502e3dec9 runq: More macros; Better and more consistent naming
> 57540a0666f6 runq: Clarity and style pass
> a11926f2a5f0 runq: API tidy up: 'pri' => 'idx', 'idx' as int, remove runq_remove_idx()
> 28b54827f5c1 runq: Hide function prototypes under _KERNEL
> c21c24adde98 runq: More selective includes of to reduce pollution
> 2fefe2c88b31 runq: Deduce most parameters, remove machine headers
>
>
> I do not know if it's feasible or doable to bisect those changes further?
>
> /bz
>
>
> --
> Bjoern A. Zeeb r15:7
>
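
P.S. To illustrate the 'happens before' ordering mentioned in the reply, here is a minimal userland sketch using C11 atomics rather than the kernel's own primitives. All names in it (fake_ifnet, fake_attach, fake_lookup, fake_has_group, published_ifp) are hypothetical and only stand in for the attach / lookup roles. It is not how if_link_ifnet() or ifunit() are implemented. The release store publishes the object only after its fields are written, and the acquire load on the reader side pairs with it, which is what matters on a weakly ordered machine such as aarch64.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an interface object; not the real struct ifnet. */
struct fake_ifnet {
	char	name[16];
	char	group[16];
};

/* Hypothetical stand-in for the global structure that lookups consult. */
static _Atomic(struct fake_ifnet *) published_ifp;

/*
 * Writer side: complete every plain write to the object first, then
 * publish the pointer with a release store.  The release ordering keeps
 * the earlier writes from becoming visible after the pointer itself.
 */
static void
fake_attach(struct fake_ifnet *ifp, const char *name, const char *group)
{
	assert(ifp != NULL);
	snprintf(ifp->name, sizeof(ifp->name), "%s", name);
	snprintf(ifp->group, sizeof(ifp->group), "%s", group);
	atomic_store_explicit(&published_ifp, ifp, memory_order_release);
}

/*
 * Reader side: the acquire load pairs with the release store above, so a
 * reader that observes the pointer also observes the initialized fields.
 */
static struct fake_ifnet *
fake_lookup(void)
{
	return (atomic_load_explicit(&published_ifp, memory_order_acquire));
}

/* Returns 1 if a published interface exists and carries the given group. */
static int
fake_has_group(const char *group)
{
	struct fake_ifnet *ifp = fake_lookup();

	return (ifp != NULL && strcmp(ifp->group, group) == 0);
}
```

If the store or the load were relaxed instead, a reader on another CPU could in principle see the pointer before the group name, which would look much like a group that "disappeared". Whether that is what happens in the reported panic is not established by this sketch.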