Re: I could use some help

From: Mark Johnston <markj_at_freebsd.org>
Date: Sun, 13 Nov 2022 19:20:30 UTC
On Wed, Nov 09, 2022 at 10:18:02PM -0800, Christopher Bowman wrote:
> OK I’m really confused and I could use some help:
> 
> 13.0 runs fine on my Xilinx Zynq based board (DIgilent Arty Z20).  However when I compile 13.1 release it doesn’t boot.  The kernel stops during boot as follows:
> 
> Using DTB from loaded file '/boot/dtb/zynq-artyz7.dtb'.
> 
> 
> Loading DTB overlays: 'artyz7_ssd_overlay.dtb'
> 
> 
> /boot/dtb/overlays/artyz7_ssd_overlay.dtb size=0x1a1
> 
> 
> Kernel entry at 0x17a00200...
> 
> 
> Kernel args: (null)
> 
> 
> applying DTB overlay '/boot/dtb/overlays/artyz7_ssd_overlay.dtb'
> 
> 
> WARNING: Cannot find freebsd,dts-version property, cannot check DTB compliance
> ---<<BOOT>>---
> Copyright (c) 1992-2021 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
> .The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 13.0-STABLE #22 n248064-ce9c3848ff3: Wed Nov  9 22:04:45 PST 2022
>     crb@eclipse.ChrisBowman.com:/usr/obj/usr/src/arm.armv7/sys/ARTYZ7 arm
> FreeBSD clang version 11.0.1 (git@github.com:llvm/llvm-project.git llvmorg-11.0.1-0-g43ff75f2c3fe)
> CPU: ARM Cortex-A9 r3p0 (ECO: 0x00000000)
> CPU Features: 
>   Multiprocessing, Thumb2, Security, VMSAv7, Coherent Walk
> Optional instructions: 
>   UMULL, SMULL, SIMD(ext)
> LoUU:2 LoC:2 LoUIS:2 
> Cache level 1:
>  32KB/32B 4-way data cache WB Read-Alloc Write-Alloc
>  32KB/32B 4-way instruction cache Read-Alloc
> real memory  = 536866816 (511 MB)
> avail memory = 515162112 (491 MB)
> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
> random: unblocking device.
> random: entropy device external interface
> ofwbus0: <Open Firmware Device Tree>
> simplebus0: <Flattened device tree simple bus> on ofwbus0
> simplebus1: <Flattened device tree simple bus> on ofwbus0
> l2cache0: <PL310 L2 cache controller> mem 0xf8f02000-0xf8f02fff irq 8 on simplebus1
> l2cache0: cannot allocate IRQ, not using interrupt
> l2cache0: Part number: 0x3, release: 0x8
> l2cache0: L2 Cache enabled: 512KB/32B 8 ways
> gic0: <ARM Generic Interrupt Controller> mem 0xf8f01000-0xf8f01fff,0xf8f00100-0xf8f001ff on simplebus1
> gic0: pn 0x39, arch 0x1, rev 0x2, implementer 0x43b irqs 96
> mp_tmr0: <ARM MPCore Timers> mem 0xf8f00200-0xf8f0021f irq 29 on simplebus1
> Timecounter "MPCore" frequency 50000000 Hz quality 800
> mp_tmr1: <ARM MPCore Timers> mem 0xf8f00600-0xf8f0061f irq 36 on simplebus1
> Event timer "MPCore" frequency 50000000 Hz quality 1000
> cpulist0: <Open Firmware CPU Group> on ofwbus0
> cpu0: <Open Firmware CPU> on cpulist0
> cpu1: <Open Firmware CPU> on cpulist0
> uart0: <Cadence UART> mem 0xe0000000-0xe0000fff irq 9 on simplebus1
> uart0: console (-1,n,8,1)
> zy7_qspi0: <Zynq Quad-SPI Flash Controller> mem 0xe000d000-0xe000dfff irq 13 on simplebus1
> zy7_qspi0: must have ref-clock property
> device_attach: zy7_qspi0 attach returned 6
> cgem0: <Cadence CGEM Gigabit Ethernet Interface> mem 0xe000b000-0xe000bfff irq 15 on simplebus1
> miibus0: <MII bus> on cgem0
> rgephy0: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 0 on miibus0
> rgephy0:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, 1000baseT-FDX-master, auto
> rgephy1: <RTL8169S/8110S/8211 1000BASE-T media interface> PHY 1 on miibus0
> rgephy1:  none, 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT-FDX, 1000baseT-FDX-master, auto
> cgem0: Ethernet address: 56:99:3e:50:9a:8e
> sdhci_fdt0: <Zynq-7000 generic fdt SDHCI controller> mem 0xe0100000-0xe0100fff irq 17 on simplebus1
> sdhci_fdt0: 1 slot(s) allocated
> mmc0: <MMC/SD bus> on sdhci_fdt0
> zy7_devcfg0: <Zynq devcfg block> mem 0xf8007000-0xf80070ff irq 28 on simplebus1
> Timecounters tick every 1.000 msec
> sdhci_fdt0-slot0: Controller timeout
> sdhci_fdt0-slot0: ============== REGISTER DUMP ==============
> sdhci_fdt0-slot0: Sys addr: 0x00060000 | Version:  0x00008901
> sdhci_fdt0-slot0: Blk size: 0x00005008 | Blk cnt:  0x00000001
> sdhci_fdt0-slot0: Argument: 0x00000000 | Trn mode: 0x00000013
> sdhci_fdt0-slot0: Present:  0x01ff0202 | Host ctl: 0x00000001
> sdhci_fdt0-slot0: Power:    0x0000000f | Blk gap:  0x00000000
> sdhci_fdt0-slot0: Wake-up:  0x00000000 | Clock:    0x00004007
> sdhci_fdt0-slot0: Timeout:  0x00000006 | Int stat: 0x00000001
> sdhci_fdt0-slot0: Int enab: 0x01ff00fb | Sig enab: 0x01ff00fa
> sdhci_fdt0-slot0: AC12 err: 0x00000000 | Host ctl2:0x00000000
> sdhci_fdt0-slot0: Caps:     0x69ec0080 | Caps2:    0x00000000
> sdhci_fdt0-slot0: Max curr: 0x00000001 | ADMA err: 0x00000000
> sdhci_fdt0-slot0: ADMA addr:0x00000000 | Slot int: 0x00000000
> sdhci_fdt0-slot0: ===========================================
> sdhci_fdt0-slot0: Controller timeout
> sdhci_fdt0-slot0: ============== REGISTER DUMP ==============
> sdhci_fdt0-slot0: Sys addr: 0x00000000 | Version:  0x00008901
> sdhci_fdt0-slot0: Blk size: 0x00005008 | Blk cnt:  0x00000001
> sdhci_fdt0-slot0: Argument: 0xaaaa0000 | Trn mode: 0x00000013
> sdhci_fdt0-slot0: Present:  0x01ff0000 | Host ctl: 0x00000001
> sdhci_fdt0-slot0: Power:    0x0000000f | Blk gap:  0x00000000
> sdhci_fdt0-slot0: Wake-up:  0x00000000 | Clock:    0x00004007
> sdhci_fdt0-slot0: Timeout:  0x00000006 | Int stat: 0x00000001
> sdhci_fdt0-slot0: Int enab: 0x01ff00fb | Sig enab: 0x01ff00fa
> sdhci_fdt0-slot0: AC12 err: 0x00000000 | Host ctl2:0x00000000
> sdhci_fdt0-slot0: Caps:     0x69ec0080 | Caps2:    0x00000000
> sdhci_fdt0-slot0: Max curr: 0x00000001 | ADMA err: 0x00000000
> sdhci_fdt0-slot0: ADMA addr:0x00000000 | Slot int: 0x00000000
> sdhci_fdt0-slot0: ===========================================
> 
> The controller register dumps continue for a while and then I get:
> 
> mmc0: CMD7 failed, RESULT: 1
> Release APs
> Trying to mount root from ufs:/dev/mmcsd0s2a [rw]...
> mountroot: waiting for device /dev/mmcsd0s2a...
> Mounting from ufs:/dev/mmcsd0s2a failed with error 19.
> Trying to mount root from ufs:mmcsd0s2a []...
> mountroot: waiting for device mmcsd0s2a...
> Mounting from ufs:mmcsd0s2a failed with error 19.
> 
> Loader variables:
>   vfs.root.mountfrom=ufs:/dev/mmcsd0s2a
>   vfs.root.mountfrom.options=rw
> 
> Manual root filesystem specification:
>   <fstype>:<device> [options]
>       Mount <device> using filesystem <fstype>
>       and with the specified (optional) option list.
> 
>     eg. ufs:/dev/da0s1a
>         zfs:zroot/ROOT/default
>         cd9660:/dev/cd0 ro
>           (which is equivalent to: mount -t cd9660 -o ro /dev/cd0 /)
> 
>   ?               List valid disk boot devices
>   .               Yield 1 second (for background tasks)
>   <empty line>    Abort manual input
> 
> mountroot> 
> 
> If I replace the kernel with a 13.0 kernel, it boots just fine.
> 
> So I go and do a git bisect between release/13.0.0 and release/13.1.0
> 
> and I get the following:
> 
> git bisect start '--first-parent'
> # status: waiting for both good and bad commits
> # good: [ea31abc261ffc01b6ff5671bffb15cf910a07f4b] 13.0: update to RELEASE
> git bisect good ea31abc261ffc01b6ff5671bffb15cf910a07f4b
> # status: waiting for bad commit, 1 good commit known
> # bad: [fc952ac2212b121aa6eefc273f5960ec3e0a466d] Update in preparation of 13.1-RELEASE
> git bisect bad fc952ac2212b121aa6eefc273f5960ec3e0a466d
> # skip: [4c44dbde5491516eba8725dc51d39c1dcc817472] MFC jail: Handle a parent jail when a child is added to it
> git bisect skip 4c44dbde5491516eba8725dc51d39c1dcc817472
> # good: [476f87219f408343846254743c7189076be80c04] wpi: Fix a lock leak in an error path in wpi_run()
> git bisect good 476f87219f408343846254743c7189076be80c04
> # bad: [05bf7d68c56830e52dee14dc87c07d6716e8195e] aesni: Fix an out-of-bounds read in AES_GCM_decrypt()
> git bisect bad 05bf7d68c56830e52dee14dc87c07d6716e8195e
> # good: [014ae00ef6edca2687d618e0bda138086a1e1230] date: Capitalize seconds string in synopses
> git bisect good 014ae00ef6edca2687d618e0bda138086a1e1230
> # bad: [08d995ca8f6f1008a10e4bf4d924824c040f842a] swapoff_one(): only check free pages count manually turning swap off
> git bisect bad 08d995ca8f6f1008a10e4bf4d924824c040f842a
> # bad: [81b6dba1a08b031bdf7463c1704d27ae1e0daa0f] ktls: Fix assertion for TLS 1.0 CBC when using non-zero starting seqno.
> git bisect bad 81b6dba1a08b031bdf7463c1704d27ae1e0daa0f
> # bad: [67efa8b29930f12dae2bf237fa7c2ce1dafbd6b1] net80211: add a driver-private pointer to struct ieee80211_node
> git bisect bad 67efa8b29930f12dae2bf237fa7c2ce1dafbd6b1
> # good: [109330155000bfec215ee39148254d2a0b628798] module(9): Document that evhand can be NULL
> git bisect good 109330155000bfec215ee39148254d2a0b628798
> # bad: [4c8e29637456bbbe709425f691f637914658009f] LinuxKPI: add module_pci_driver() and pci_alloc_irq_vectors()
> git bisect bad 4c8e29637456bbbe709425f691f637914658009f
> # bad: [4a03ae8d17ddf3d3b57ca281000fd98e200b92cc] nfscl: Fix use after free for forced dismount
> git bisect bad 4a03ae8d17ddf3d3b57ca281000fd98e200b92cc
> # bad: [de957de097857fabb69a59a9ba36276c5e735de5] bhyve: Fix the WITH_BHYVE_SNAPSHOT build
> git bisect bad de957de097857fabb69a59a9ba36276c5e735de5
> # bad: [5c2e6d9610f1b3f1d7c5d69b925212a7b1fd9391] hwpmc: initialize arm64 counter/interrupt state
> git bisect bad 5c2e6d9610f1b3f1d7c5d69b925212a7b1fd9391
> # bad: [ce9c3848ff369467749f59fd24f8b9f1241e725c] uma: Fix handling of reserves in zone_import()
> git bisect bad ce9c3848ff369467749f59fd24f8b9f1241e725c
> # good: [d5ebaa6f8f850bb6f6273f01386832efcb295827] uma: Improve M_USE_RESERVE handling in keg_fetch_slab()
> git bisect good d5ebaa6f8f850bb6f6273f01386832efcb295827
> # first bad commit: [ce9c3848ff369467749f59fd24f8b9f1241e725c] uma: Fix handling of reserves in zone_import()
> 
> If I do git log ce9c3848ff369467749f59fd24f8b9f1241e725c it does seem that d5ebaa6f8f850bb6f6273f01386832efcb295827 is the previous commit and that it works just fine but ce9c3848ff369467749f59fd24f8b9f1241e725c doesn’t boot.  It’s literally the same file system and DTB, the only difference is the kernel installed.
> 
> What’s confusing to me is that looking at ce9c3848ff369467749f59fd24f8b9f1241e725c I can’t see how that commit would result in my kernel hanging while booting.

It's indeed a bit hard to see why commit
ce9c3848ff369467749f59fd24f8b9f1241e725c might cause an sdhci command
timeout.  I suspect the commit itself is innocent, but happens to
uncover a problem elsewhere by changing the order in which page
allocations are distributed to different consumers.  At the time of the
command timeouts, the kernel is still running on a single CPU so the
order in which pages are allocated should be fairly deterministic and
thus consistent across multiple boots.

One thing you can try is to take a releng/13.1 tree and revert just
ce9c3848ff369467749f59fd24f8b9f1241e725c, and/or take releng/13.0 and
apply just that commit.  It would be useful to know whether either of
the resulting kernels can boot successfully.

Could you also please post your kernel configuration (ARTYZ7) somewhere,
together with a dmesg of a successful boot?