Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit
Date: Mon, 26 Jun 2023 15:59:14 UTC
On Jun 26, 2023, at 07:29, John F Carr <jfc@mit.edu> wrote:
>
>
>> On Jun 26, 2023, at 04:32, Mark Millard <marklmi@yahoo.com> wrote:
>>
>> On Jun 24, 2023, at 17:25, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On Jun 24, 2023, at 14:26, John F Carr <jfc@mit.edu> wrote:
>>>
>>>>
>>>>> On Jun 24, 2023, at 13:00, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>
>>>>> The running system build is a non-debug build (but
>>>>> with symbols not stripped).
>>>>>
>>>>> The HoneyComb's console log shows:
>>>>>
>>>>> . . .
>>>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>>>> GEOM_NOP: Device md0.nop created.
>>>>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>>>>> GEOM_NOP: Device md0.nop removed.
>>>>> GEOM_NOP: Device md0.nop created.
>>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>>> GEOM_NOP: Device md0.nop removed.
>>>>> GEOM_NOP: Device md0.nop created.
>>>>> GEOM_NOP: Device md0.nop removed.
>>>>> Fatal data abort:
>>>>>   x0: ffffa02506e64400
>>>>>   x1: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>>>   x2: 4b
>>>>>   x3: a343932b0b22fb30
>>>>>   x4: 0
>>>>>   x5: 3310b0d062d0e1d
>>>>>   x6: 1d0e2d060d0b3103
>>>>>   x7: 0
>>>>>   x8: ea325df8
>>>>>   x9: ffff0001eec946d0 ($d.6 + 0)
>>>>>  x10: ffff0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>>>  x11: 0
>>>>>  x12: 0
>>>>>  x13: ffff000000cd8960 (lock_class_mtx_sleep + 0)
>>>>>  x14: 0
>>>>>  x15: ffffa02506e64405
>>>>>  x16: ffff0001eec94860 (_DYNAMIC + 160)
>>>>>  x17: ffff00000063a450 (ifc_attach_cloner + 0)
>>>>>  x18: ffff0001eb290400 (g_raid3_post_sync + 48a3178)
>>>>>  x19: ffff0001eec94600 (vnet_epair_init_vnet_init + 0)
>>>>>  x20: ffff000000fa5b68 (vnet_sysinit_sxlock + 18)
>>>>>  x21: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>>>  x22: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>>>  x23: ffffa0000042e500
>>>>>  x24: ffffa0000042e500
>>>>>  x25: ffff000000ce0788 (linker_lookup_set_desc + 0)
>>>>>  x26: ffffa0203cdef780
>>>>>  x27: ffff0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>>>>  x28: ffff000000d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>>>  x29: ffff0001eb290430 (g_raid3_post_sync + 48a31a8)
>>>>>   sp: ffff0001eb290400
>>>>>   lr: ffff0001eec82a4c ($x.1 + 3c)
>>>>>  elr: ffff0001eec82a60 ($x.1 + 50)
>>>>> spsr: 60000045
>>>>>  far: ffff0002d8fba4c8
>>>>>  esr: 96000046
>>>>> panic: vm_fault failed: ffff0001eec82a60 error 1
>>>>> cpuid = 14
>>>>> time = 1687625470
>>>>> KDB: stack backtrace:
>>>>> db_trace_self() at db_trace_self
>>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>>>> vpanic() at vpanic+0x13c
>>>>> panic() at panic+0x44
>>>>> data_abort() at data_abort+0x2fc
>>>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>>>> --- exception, esr 0x96000046
>>>>> $x.1() at $x.1+0x50
>>>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>>>> linker_load_module() at linker_load_module+0xae4
>>>>> kern_kldload() at kern_kldload+0xfc
>>>>> sys_kldload() at sys_kldload+0x60
>>>>> do_el0_sync() at do_el0_sync+0x608
>>>>> handle_el0_sync() at handle_el0_sync+0x44
>>>>> --- exception, esr 0x56000000
>>>>> KDB: enter: panic
>>>>> [ thread pid 70419 tid 101003 ]
>>>>> Stopped at kdb_enter+0x44: str xzr, [x19, #3200]
>>>>> db>
>>>>
>>>> The failure appears to be initializing module if_epair.
>>>
>>> Yep: trying:
>>>
>>> # kldload if_epair.ko
>>>
>>> was enough to cause the crash. (Just a HoneyComb context at
>>> that point.)
>>>
>>> I tried media dd'd from the recent main snapshot, booting the
>>> same system. No crash. I moved my build boot media to some
>>> other systems and tested them: crashes. I tried my boot media
>>> built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
>>> instead of Cortex-A72: no crashes. (But only one system can
>>> use the X1C/A78C code in that build.)
>>>
>>> So variation testing only gets the crashes for my builds
>>> that are code-optimized for Cortex-A72s. The same source
>>> tree vintage built for Cortex-A53 or Cortex-X1C/Cortex-A78C
>>> optimization does not get the crashes. But I also
>>> demonstrated an optimized-for-Cortex-A72 build from 2023-Mar
>>> that gets the crash.
>>>
>>> The last time I ran into one of these "crashes tied to
>>> Cortex-A72 code optimization" examples, it turned out to be
>>> some missing memory-model management code in FreeBSD's USB
>>> code. But being lucky enough to help identify a FreeBSD
>>> source code problem again seems not that likely. It could
>>> easily be a code generation error by clang for all I know.
>>>
>>> So, unless at some point I produce fairly solid evidence
>>> that the code actually running is messed up by FreeBSD
>>> source code, this should likely be treated as "blame the
>>> operator" and should likely be largely ignored as things
>>> are. (Just My Problem, as I want the Cortex-A72 optimized
>>> builds.)
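
[Sanity check, not part of the original thread: the trap frame above can be decoded mechanically. The sketch below assumes the standard ESR_EL1 field layout from the Arm Architecture Reference Manual, and uses the x8/x9 values printed above, which the analysis further down-thread identifies as the base and index of the faulting "str x0, [x8, x9]".]

```python
# Decode the esr value from the trap frame (field layout per the
# Arm Architecture Reference Manual, ESR_EL1).
esr = 0x96000046

ec   = (esr >> 26) & 0x3F   # Exception Class
wnr  = (esr >> 6)  & 0x1    # Write-not-Read bit (valid for data aborts)
dfsc =  esr        & 0x3F   # Data Fault Status Code

print(hex(ec))    # 0x25: data abort from the current EL (kernel mode)
print(wnr)        # 1: the faulting access was a write (matches a str)
print(hex(dfsc))  # 0x6: translation fault, level 2

# Cross-check the faulting address: base + index of "str x0, [x8, x9]",
# wrapped to 64 bits, should equal the reported far.
x8 = 0xEA325DF8
x9 = 0xFFFF0001EEC946D0
far = (x8 + x9) & 0xFFFFFFFFFFFFFFFF
print(hex(far))   # 0xffff0002d8fba4c8, matching far in the trap frame
```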
>>
>> Turns out that the source code in question is the
>> assignment to V_epair_cloner below:
>>
>> static void
>> vnet_epair_init(const void *unused __unused)
>> {
>>     struct if_clone_addreq req = {
>>         .match_f = epair_clone_match,
>>         .create_f = epair_clone_create,
>>         .destroy_f = epair_clone_destroy,
>>     };
>>     V_epair_cloner = ifc_attach_cloner(epairname, &req);
>> }
>> VNET_SYSINIT(vnet_epair_init, SI_SUB_PSEUDO, SI_ORDER_ANY,
>>     vnet_epair_init, NULL);
>>
>> Example code when not optimizing for the Cortex-A72:
>>
>> 11a4c: d0000089  adrp x9, 0x23000
>> 11a50: f9400248  ldr  x8, [x18]
>> 11a54: f942c508  ldr  x8, [x8, #1416]
>> 11a58: f943d929  ldr  x9, [x9, #1968]
>> 11a5c: a9437bfd  ldp  x29, x30, [sp, #48]
>> 11a60: f9401508  ldr  x8, [x8, #40]
>> 11a64: f8296900  str  x0, [x8, x9]
>>
>> The code when optimizing for the Cortex-A72:
>>
>> 11a4c: f9400248  ldr  x8, [x18]
>> 11a50: f942c508  ldr  x8, [x8, #1416]
>> 11a54: d503201f  nop
>> 11a58: 1008e3c9  adr  x9, #72824
>> 11a5c: f9401508  ldr  x8, [x8, #40]
>> 11a60: f8296900  str  x0, [x8, x9]
>> 11a64: a9437bfd  ldp  x29, x30, [sp, #48]
>>
>> It is the "str x0, [x8, x9]" that vm_faults for
>> the optimized code.
>>
>> So:
>>
>> 11a4c: d0000089  adrp x9, 0x23000
>> 11a58: f943d929  ldr  x9, [x9, #1968]
>>
>> was optimized via replacement by:
>>
>> 11a58: 1008e3c9  adr  x9, #72824
>>
>> I.e., the replacement assumes that the target stays at a
>> fixed offset from the instruction: the adr computes x9 from
>> the instruction's own address, so it produces the wrong value
>> once the instruction is relocated.
>>
>> This resulted in the specific x9 value shown in
>> the x8/x9 pair:
>>
>> x8: ea325df8
>> x9: ffff0001eec946d0
>>
>> which totals to the fault address (the value
>> in far):
>>
>> far: ffff0002d8fba4c8
>>
>
> Is this the same as bug 264094?
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264094

Well, the not-Cortex-A72-optimized .o stage code vs. the
Cortex-A72-optimized .o stage code looks like:

(not Cortex-A72 optimized)

3c: 90000009  adrp x9, 0x0 <vnet_epair_init+0x3c>
40: f9400248  ldr  x8, [x18]
44: f942c508  ldr  x8, [x8, #1416]
48: f9400129  ldr  x9, [x9]
4c: a9437bfd  ldp  x29, x30, [sp, #48]
50: f9401508  ldr  x8, [x8, #40]
54: f8296900  str  x0, [x8, x9]

vs. (Cortex-A72 optimized)

3c: f9400248  ldr  x8, [x18]
40: f942c508  ldr  x8, [x8, #1416]
44: 90000009  adrp x9, 0x0 <vnet_epair_init+0x44>
48: f9400129  ldr  x9, [x9]
4c: f9401508  ldr  x8, [x8, #40]
50: f8296900  str  x0, [x8, x9]
54: a9437bfd  ldp  x29, x30, [sp, #48]

(The x29 lines have a different purpose, but I show the
sequencing as reported by objdump to show that it is
basically an ordering difference at the .o stage.)

As for if_epair.kld production, the .meta files show:

CMD ld -m aarch64elf -warn-common --build-id=sha1 -r -o if_epair.kld if_epair.o
CMD ctfmerge -L VERSION -g -o if_epair.kld if_epair.o
CMD :> export_syms
CMD awk -f /usr/main-src/sys/conf/kmod_syms.awk if_epair.kld export_syms | xargs -J% objcopy % if_epair.kld
CWD /usr/obj/BUILDs/main-CA72-nodbg-clang-alt/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72/modules/usr/main-src/sys/modules/if_epair

vs.

CMD ld -m aarch64elf -warn-common --build-id=sha1 -r -o if_epair.kld if_epair.o
CMD ctfmerge -L VERSION -g -o if_epair.kld if_epair.o
CMD :> export_syms
CMD awk -f /usr/main-src/sys/conf/kmod_syms.awk if_epair.kld export_syms | xargs -J% objcopy % if_epair.kld
CWD /usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72/modules/usr/main-src/sys/modules/if_epair

It looks to me like the instruction-ordering differences in the
.o files may be all that leads to the differing .kld results for
setting x9. If so, it is not good to be that dependent on minor
.o stage code generation differences for whether the result is
operational or not.

===
Mark Millard
marklmi at yahoo.com
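
[Illustration appended for context, not part of the original post: the fragility described above comes down to two different ways of producing the value in x9. The adrp+ldr pair loads it from a data slot that load-time relocation processing can patch; the relaxed adr computes it as pc plus an offset frozen before the module's final placement is known. A toy model with invented numbers (not the actual FreeBSD linker machinery):]

```python
# Toy model (invented numbers) of the two x9-producing sequences.

# Way 1: adrp+ldr -- x9 is LOADED from a slot whose contents the
# module loader rewrites (via a relocation) when the module is placed.
slot = {"value": None}

def loader_relocate(correct_value):
    slot["value"] = correct_value            # load-time fix-up

def x9_via_adrp_ldr():
    return slot["value"]

# Way 2: relaxed adr -- x9 is COMPUTED as pc + a fixed offset that was
# baked in before the final placement was known.
BAKED_IN_OFFSET = 72824                      # offset from the disassembly

def x9_via_adr(pc):
    return (pc + BAKED_IN_OFFSET) & (2**64 - 1)

correct_value = 0x28                         # invented: the wanted x9
loader_relocate(correct_value)

runtime_pc = 0xFFFF0001EEC82A58              # invented final placement

print(x9_via_adrp_ldr() == correct_value)    # True: tracks the fix-up
print(x9_via_adr(runtime_pc) == correct_value)
# False: stale pc-relative value, so the later str targets a bad address
```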