Re: Rock64 configuration fails to boot for main 22c4ab6cb015 but worked for main 06bd74e1e39c (Nov 21): e.MMC mishandled?
- Reply: Mark Millard via freebsd-arm : "Re: Rock64 configuration fails to boot for main 22c4ab6cb015 but worked for main 06bd74e1e39c (Nov 21): e.MMC mishandled?"
- In reply to: Mark Millard via freebsd-arm : "Re: Rock64 configuration fails to boot for main 22c4ab6cb015 but worked for main 06bd74e1e39c (Nov 21): e.MMC mishandled?"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Fri, 17 Dec 2021 11:17:12 UTC
> >>> [I've cut out the history: just presenting some new evidence.]
> >>>
> >>> First, a little context from getting to the db> prompt.
> >>>
> >>> db> ps
> >>> pid ppid pgrp uid state wmesg wchan cmd
> >>> 18 0 0 0 DL syncer 0xffff000000eca5a8 [syncer]
> >>> 17 0 0 0 DL vlruwt 0xffffa00007d2ea60 [vnlru]
> >>> 16 0 0 0 DL (threaded) [bufdaemon]
> >>> 100089 D qsleep 0xffff000000ec9478 [bufdaemon]
> >>> 100092 D - 0xffff000000c11100 [bufspacedaemon-0]
> >>> 100093 D - 0xffff000000c21680 [bufspacedaemon-1]
> >>> 9 0 0 0 DL psleep 0xffff000000ef0650 [vmdaemon]
> >>> 8 0 0 0 DL (threaded) [pagedaemon]
> >>> 100087 D psleep 0xffff000000ee2b38 [dom0]
> >>> 100094 D launds 0xffff000000ee2b44 [laundry: dom0]
> >>> 100095 D umarcl 0xffff0000007b38d8 [uma]
> >>> 7 0 0 0 DL mmcsd d 0xffffa00007b72e00 [mmcsd0boot1: mmc/sd]
> >>> 6 0 0 0 DL mmcsd d 0xffffa00007b71300 [mmcsd0boot0: mmc/sd]
> >>> 5 0 0 0 DL mmcreq 0xffff00009b5d0710 [mmcsd0: mmc/sd card]
> >>> 4 0 0 0 DL - 0xffff000000ccc020 [rand_harvestq]
> >>> 15 0 0 0 DL (threaded) [usb]
> >>> . . .
> >>>
> >>> and "mmcreq" is from the while loop in:
> >>>
> >>> static int
> >>> mmc_wait_for_req(struct mmc_softc *sc, struct mmc_request *req)
> >>> {
> >>>
> >>> req->done = mmc_wakeup;
> >>> req->done_data = sc;
> >>> if (__predict_false(mmc_debug > 1)) {
> >>> device_printf(sc->dev, "REQUEST: CMD%d arg %#x flags %#x",
> >>> req->cmd->opcode, req->cmd->arg, req->cmd->flags);
> >>> if (req->cmd->data) {
> >>> printf(" data %d\n", (int)req->cmd->data->len);
> >>> } else
> >>> printf("\n");
> >>> }
> >>> MMCBR_REQUEST(device_get_parent(sc->dev), sc->dev, req);
> >>> MMC_LOCK(sc);
> >>> while ((req->flags & MMC_REQ_DONE) == 0)
> >>> msleep(req, &sc->sc_mtx, 0, "mmcreq", 0);
> >>> MMC_UNLOCK(sc);
> >>> if (__predict_false(mmc_debug > 2 || (mmc_debug > 0 &&
> >>> req->cmd->error != MMC_ERR_NONE)))
> >>> device_printf(sc->dev, "CMD%d RESULT: %d\n",
> >>> req->cmd->opcode, req->cmd->error);
> >>> return (0);
> >>> }
> >>>
> >>> So it appears that the error report:
> >>>
> >>> mmcsd0: Error indicated: 4 Failed
> >>>
> >>> ends up associated with (req->flags & MMC_REQ_DONE) == 0 staying
> >>> true in the above code: an unbounded loop with MMC_LOCK(sc) active.
> >>> The "4" in the error report seems to be from:
> >>>
> >>> #define MMC_ERR_FAILED 4
> >>>
> >>> It looks like there are some problems with handling errors, problems
> >>> such that it gets stuck looping (no panic, no progress).
> >>>
> >>> That seems to be separate from why the MMC_ERR_FAILED was generated
> >>> in the first place. So: 2 problems, not just one. Thus it may be a
> >>> good context for tackling the looping problem with a known example
> >>> failure to look at.
> >>>
> >>>
> >>>
> >>> Just for reference, I tried "boot -v" with debug.verbose_sysinit=1 in place,
> >>> just to capture and report the tail of the output for the boot failure.
> >>>
> >>> . . .
> >>> subsystem f000000
> >>> release_aps(0)... Release APs...done
> >>> done.
> >>> intr_irq_shuffle(0)... Trying to mount root from ufs:/dev/gpt/Rock64root []...
> >>> done.
> >>> netisr_start(0)... done.
> >>> taskqgroup_bind_softirq(0)... done.
> >>> GEOM: new disk mmcsd0
> >>> GEOM: new disk mmcsd0boot0
> >>> GEOM: new disk mmcsd0boot1
> >>> smp_after_idle_runnable(0)... done.
> >>> taskqgroup_bind_if_config_tqg(0)... done.
> >>> taskqgroup_bind_if_io_tqg(0)... done.
> >>> tmr_setup_user_access(0)... done.
> >>> subsystem f000001
> >>> mmcsd0: Error indicated: 4 Failed
> >>> epoch_init_smp(0)... done.
> >>> subsystem f100000
> >>> racctd_init(0)... done.
> >>> subsystem fffffff
> >>> start_periodic_resettodr(0)... done.
> >>> oktousecallout(0)... done.
> >>> clknode_finish(0)... Unresolved linked clock found: hdmi_phy
> >>> Unresolved linked clock found: usb480m_phy
> >>> done.
> >>> regulator_constraint(0)... done.
> >>> regulator_shutdown(0)... regulator: shutting down unused regulators
> >>> regulator: shutting down vcc_sd... busy
> >>> done.
> >>> uhub0: 1 port with 1 removable, self powered
> >>> uhub2: 2 ports with 2 removable, self powered
> >>> uhub3: 1 port with 1 removable, self powered
> >>> uhub1: 1 port with 1 removable, self powered
> >>> ugen4.2: <Samsung PSSD T7 Touch> at usbus4
> >>> umass0 on uhub2
> >>> umass0: <Samsung PSSD T7 Touch, class 0/0, rev 3.20/1.00, addr 1> on usbus4
> >>> umass0: SCSI over Bulk-Only; quirks = 0x0000
> >>> umass0:0:0: Attached to scbus0
> >>> pass0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> >>> pass0: <Samsung PSSD T7 Touch 0> Fixed Direct Access SPC-4 SCSI device
> >>> pass0: Serial Number REPLACED
> >>> pass0: 400.000MB/s transfers
> >>> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> >>> da0: <Samsung PSSD T7 Touch 0> Fixed Direct Access SPC-4 SCSI device
> >>> da0: Serial Number REPLACED
> >>> da0: 400.000MB/s transfers
> >>> da0: 953869MB (1953525168 512 byte sectors)
> >>> da0: quirks=0x2<NO_6_BYTE>
> >>> da0: Delete methods: <NONE(*),ZERO>
> >>> random: unblocking device.
> >>>
> >>> No more output after that.
> >>
> >> As for why MMC_ERR_FAILED results, the following code diff is
> >> intended to suggest what I think may be incomplete about sticking
> >> to what the device-specific code supports vs. does not support
> >> (not supporting HS200 here). The code does compile in my context
> >> but is untested.
> >
> > It is now tested (at least to be a useful hack): no longer am I
> > running an older 1400042 kernel. For reference,
> >
> > # uname -apKU
> > FreeBSD Rock64_RPi_4_3_2v1p2 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n251456-22c4ab6cb015-dirty: Sun Dec 12 00:34:53 PST 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA53-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA53 arm64 aarch64 1400043 1400043
> >
> > And it reports during the boot (other than the "REPLACED"):
> >
> > mmcsd0: 125GB <MMCHC DJNB4R 0.7 SN REPLACED MFG 06/2016 by 21 0x0000> at mmc0 52.0MHz/8bit/1016-block
> >
> > So it no longer sets up a mode that the rk3328-specific-code does not
> > actually support.
> >
> > (Nothing that I've done here deals with the looping issue when there
> > is a MMC_ERR_FAILED or the like.)
> >
> >> The email handling may mess up some leading
> >> whitespace --but, again, I'm only trying to suggest a type of
> >> change.
> >>
> >> # git -C /usr/main-src/ diff /usr/main-src/sys/dev/mmc
> >> diff --git a/sys/dev/mmc/mmc.c b/sys/dev/mmc/mmc.c
> >> index 9c73dfd57ce0..dffd1c382684 100644
> >> --- a/sys/dev/mmc/mmc.c
> >> +++ b/sys/dev/mmc/mmc.c
> >> @@ -59,6 +59,7 @@ __FBSDID("$FreeBSD$");
> >> #include <sys/param.h>
> >> #include <sys/systm.h>
> >> #include <sys/kernel.h>
> >> +#include <sys/kobj.h>
> >> #include <sys/malloc.h>
> >> #include <sys/lock.h>
> >> #include <sys/module.h>
> >> @@ -1512,6 +1513,8 @@ mmc_timing_to_string(enum mmc_bus_timing timing)
> >> static bool
> >> mmc_host_timing(device_t dev, enum mmc_bus_timing timing)
> >> {
> >> + kobjop_desc_t kobj_desc;
> >> + kobj_method_t *kobj_method;
> >> int host_caps;
> >>
> >> host_caps = mmcbr_get_caps(dev);
> >> @@ -1543,14 +1546,37 @@ mmc_host_timing(device_t dev, enum mmc_bus_timing timing)
> >> case bus_timing_mmc_ddr52:
> >> return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_DDR52));
> >> case bus_timing_mmc_hs200:
> >> - return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_120) ||
> >> - HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_180));
> >> case bus_timing_mmc_hs400:
> >> - return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_120) ||
> >> - HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_180));
> >> case bus_timing_mmc_hs400es:
> >> - return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400 |
> >> - MMC_CAP_MMC_ENH_STROBE));
> >> + /*
> >> + * Disable eMMC modes that require use of
> >> + * MMC_SEND_TUNING_BLOCK_HS200 to set things up if either the
> >> + * tune or re-tune method is the default NULL implementation.
> >> + */
> >> + kobj_desc = &mmcbr_tune_desc;
> >> + kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
> >> + kobj_desc);
> >> + if (kobj_method == &kobj_desc->deflt)
> >> + return (false);
> >> + kobj_desc = &mmcbr_retune_desc;
> >> + kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
> >> + kobj_desc);
> >> + if (kobj_method == &kobj_desc->deflt) {
> >> + return (false);
> >> + }
> >> +
> >> + /*
> >> + * Otherwise track the host capabilities.
> >> + */
> >> + if (timing == bus_timing_mmc_hs200)
> >> + return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_120) ||
> >> + HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_180));
> >> + if (timing == bus_timing_mmc_hs400)
> >> + return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_120) ||
> >> + HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_180));
> >> + if (timing == bus_timing_mmc_hs400es)
> >> + return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400 |
> >> + MMC_CAP_MMC_ENH_STROBE));
> >> }
> >>
> >> #undef HOST_TIMING_CAP
> >>
> >>
> >> In other words: have mmc_host_timing avoid returning true for some
> >> combinations that definitely do not have sufficient software support
> >> present at the time. (So far as I can tell, the rk3328's get the
> >> NULL-implementations as things are.)
> >>
> >> I expect that this sort of thing would go back to using
> >> MMC_CAP_MMC_DDR52 for the rk3328's, as an example. Working, but in a
> >> slower mode, the same mode as FreeBSD was previously using.
> >>
> >> A possible incompleteness in the suggestion is that there is also a
> >> drive-strength setting involved. If that also had "kobj" interfacing
> >> and NULL-implementation possibilities, then in the future there would
> >> be more to test for possibly forcing return-false than I did above.
> >>
> >> Hopefully this sort of thing would help, possibly more than just for
> >> rk3328's.
> >
> >
>
> As for what was happening without my patch . . .
>
> sys/dev/mmc/mmcbr_if.m defines:
>
> static int
> null_retune(device_t brdev __unused, device_t reqdev __unused,
> bool reset __unused)
> {
>
> return (0);
> }
>
> static int
> null_tune(device_t brdev __unused, device_t reqdev __unused,
> bool hs400 __unused)
> {
>
> return (0);
> }
> . . .
> #
> # Called by the mmcbus with the bridge claimed to execute initial tuning.
> #
> METHOD int tune {
> device_t brdev;
> device_t reqdev;
> bool hs400;
> } DEFAULT null_tune;
>
> #
> # Called by the mmcbus with the bridge claimed to execute re-tuning.
> #
> METHOD int retune {
> device_t brdev;
> device_t reqdev;
> bool reset;
> } DEFAULT null_retune;
> . . .
>
> It is these success-reporting no-op routines that were being
> used to attempt the tuning: so there was no tuning done.
>
> The code that I added detects that these routines would be
> used and avoids allowing contexts that would involve putting
> them to use with HS200 mode.
>
> I'll note that there is another such null_* routine that the
> code (even with my patch) does not deal with avoiding the use
> of:
>
> . . .
> static int
> null_switch_vccq(device_t brdev __unused, device_t reqdev __unused)
> {
>
> return (0);
> }
> . . .
> #
> # Called by the mmcbus to switch the signaling voltage (VCCQ).
> #
> METHOD int switch_vccq {
> device_t brdev;
> device_t reqdev;
> } DEFAULT null_switch_vccq;
> . . .
>
> /usr/main-src/sys/dev/sdhci/sdhci.c has somewhat analogous code for
> somewhat analogous null_* routines. null_set_uhs_timing for that is
> from sys/dev/sdhci/sdhci_if.m (but the other two are again the above
> null_tune and null_retune routines, so not repeated here):
>
> . . .
> static void
> null_set_uhs_timing(device_t brdev __unused,
> struct sdhci_slot *slot __unused)
> {
>
> }
> . . .
> METHOD void set_uhs_timing {
> device_t brdev;
> struct sdhci_slot *slot;
> } DEFAULT null_set_uhs_timing;
> . . .
>
> sdhci_init_slot(device_t dev, struct sdhci_slot *slot, int num)
> in sdhci.c looks like (in part):
>
> . . .
> /*
> * Disable UHS-I and eMMC modes if the set_uhs_timing method is the
> * default NULL implementation.
> */
> kobj_desc = &sdhci_set_uhs_timing_desc;
> kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
> kobj_desc);
> if (kobj_method == &kobj_desc->deflt)
> host_caps &= ~(MMC_CAP_UHS_SDR12 | MMC_CAP_UHS_SDR25 |
> MMC_CAP_UHS_SDR50 | MMC_CAP_UHS_DDR50 | MMC_CAP_UHS_SDR104 |
> MMC_CAP_MMC_DDR52 | MMC_CAP_MMC_HS200 | MMC_CAP_MMC_HS400);
>
> #define SDHCI_CAP_MODES_TUNING(caps2) \
> (((caps2) & SDHCI_TUNE_SDR50 ? MMC_CAP_UHS_SDR50 : 0) | \
> MMC_CAP_UHS_DDR50 | MMC_CAP_UHS_SDR104 | MMC_CAP_MMC_HS200 | \
> MMC_CAP_MMC_HS400)
>
> /*
> * Disable UHS-I and eMMC modes that require (re-)tuning if either
> * the tune or re-tune method is the default NULL implementation.
> */
> kobj_desc = &mmcbr_tune_desc;
> kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
> kobj_desc);
> if (kobj_method == &kobj_desc->deflt)
> goto no_tuning;
> kobj_desc = &mmcbr_retune_desc;
> kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
> kobj_desc);
> if (kobj_method == &kobj_desc->deflt) {
> no_tuning:
> host_caps &= ~(SDHCI_CAP_MODES_TUNING(caps2));
> }
> . . .
>
> What I've done in my patch is analogous to what the the code shown
> after the #define SDHCI_CAP_MODES_TUNING above does, translated to
> fit the mmc's pre-existing code structure.
Good catch. For some reason mmcbr_tune/mmcbr_retune are not
implemented in sdhci_fdt.c.
This looks like a separate bug/issue.
Note that pretty much all other SDHCI drivers (sdhci_pci.c,
sdhci_xenon.c, sdhci_fsl_fdt.c, sdhci_acpi.c) provide some
implementation of mmcbr_tune/mmcbr_retune.
I agree that the logic in sdhci.c should disable HS200 if those
methods are implemented.
Could you try hooking mmcbr_tune and mmcbr_retune to their generic
sdhci implementations?
If we're lucky this could be enough to make HS200 work on rock64,
though it's unlikely.