Re: MMCCAM hang

From: Bjoern A. Zeeb <bzeeb-lists_at_lists.zabbadoz.net>
Date: Tue, 09 Jan 2024 16:12:09 UTC
On Tue, 9 Jan 2024, Emmanuel Vadot wrote:

> On Tue, 9 Jan 2024 11:36:32 +0100
> Søren Schmidt <soren.schmidt@gmail.com> wrote:
>
>>> On 28 Dec 2023, at 02.08, Warner Losh <imp@bsdimp.com> wrote:
>>> On Wed, Dec 27, 2023, 4:55 PM Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net> wrote:
>>>> Hi,
>>>>
>>>> sdhci_fsl_fdt0: Desired SD/MMC freq: 50000000, actual: 50000000; base 700000000 prescale 1 divisor 14
>>>> GEOM: new disk sdda0
>>>> sdda0 at sdhci_slot0 bus 0 scbus0 target 0 lun 0
>>>> sdda0: Relative addr: 00000002
>>>> Card features: <MMC Memory High-Capacity>
>>>> Card random: unblocking device.
>>>> GEOM: new disk sdda0boot0
>>>> memory OCR: 00ff8080
>>>> sdda0: Serial Number .......
>>>> sdda0: MMCHC .................................. by 17 0x0000
>>>> GEOM: new disk sdda0boot1
>>>> uhub0: 2 ports with 2 removable, self powered
>>>>
>>>> at which point basically anything hangs.  In auto-boot it is
>>>> before/during file-system checks.
>>>> In single user mode camcontrol devlist will show sdda0
>>>> but
>>>>
>>>> root@:/ # gpart show sdda0
>>>> load: 6.06  cmd: gpart 24 [g_waitfor_event] 1.28r 0.00u 0.00s 0% 2088k
>>>> {forever}
>>>>
>>>>
>>>> Unclear at which point I broke to debugger and this is where it seems to
>>>> hang:
>>>>
>>>> db> trace 100088
>>>> Tracing pid 4 tid 100088 td 0xffff0000dc527000
>>>> ipi_stop() at ipi_stop+0x34
>>>> arm_gic_v3_intr() at arm_gic_v3_intr+0xe4
>>>> intr_irq_handler() at intr_irq_handler+0x80
>>>> handle_el1h_irq() at handle_el1h_irq+0x14
>>>> --- interrupt
>>>> spinlock_exit() at spinlock_exit+0x44
>>>> callout_reset_sbt_on() at callout_reset_sbt_on+0x210
>>>> sdhci_cam_action() at sdhci_cam_action+0x284
>>>> xpt_run_devq() at xpt_run_devq+0x4c8
>>>> xpt_action_default() at xpt_action_default+0x470
>>>> sddastart() at sddastart+0x1bc
>>>> xpt_run_allocq() at xpt_run_allocq+0xa8
>>>> xpt_done_process() at xpt_done_process+0x610
>>>> xpt_done_td() at xpt_done_td+0x1a8
>>>> fork_exit() at fork_exit+0x8c
>>>> fork_trampoline() at fork_trampoline+0x18
>>>>
>>>>
>>>> Anyone an idea?
>>>
>>>
>>>
>>> Looks like deadlock with another thread. Anybody else in the time keeping / callout code?
>>
>> I think this is related to the MMC driver having issues (MMCCAM or not).
>> If I try to use a MMC sdcard on any of my rk35X8 boards as the disk device it will eventually hang on first access to the MMC controlled media.
>> I thought I had an issue here with my dev setup but clearly I'm not alone :)
>
> SDCard on RK356X don't use sdhci but dwmmc so it's not related to what
> bz@ is seeing.
> That being said I have no problem using dwmmc as the root device on my
> nanopi r5s or quartz64.

For what it's worth, my current feeling is that it is related to the
boot[01] disks on the eMMC.

I see geom tasting on boot0 but the consumer for boot1 never shows up in
  ddb> show geom
I disabled the graid and then the same observation moved on to gpart.

Also, once the error starts the fsl never recovers; eventually even the
ccb and curcmd pointers stay the same.  It seems to just roto-tile,
which makes me wonder if some error propagation is missing/gone.

If I enable kern.cam.boot_delay="30000" and have my root on an md(4)
I get to Login: -- strangely, but then the nda and the sdda show up, and
after typing gpart show or anything else geom-ish a few commands go
through and then we are in the error again.
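
For anyone wanting to try the same thing, a minimal sketch of the
loader.conf entry for the tunable mentioned above; this assumes the
value is in milliseconds, and the md(4) root setup is not shown:

```
# /boot/loader.conf
# Delay the root mount while CAM buses register (assumed to be in ms).
kern.cam.boot_delay="30000"
```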

I haven't been able to dig much further; no other locks held in debug
kernels (just a malloc WAITOK complaint early on during "attach").

I'd still be happy to hear of more possible cases; especially, are other
sdhci devices working with MMCCAM?  Sadly it kept me from doing the actual
work I wanted to do with mmccam over the holidays.


Feature request: I wish we could somehow enable/disable FDT/OFW based
devices like we do for PCI with devctl ... can we?  Like have it
disabled in FDT at boot but later enable/probe/attach...


With SD cards and dwmmc I had mostly mixed results in the past; they
worked for quite a while but after 600 days of uptime they were gone
(problem probably long fixed but I am at 900 days now for the last
running RK device and then won't bother for a long while I hope).

-- 
Bjoern A. Zeeb                                                     r15:7