Re: Rock64 configuration fails to boot for main 22c4ab6cb015 but worked for main 06bd74e1e39c (Nov 21): e.MMC mishandled?

From: Mark Millard via freebsd-arm <freebsd-arm_at_freebsd.org>
Date: Sun, 12 Dec 2021 08:29:25 UTC
On 2021-Dec-11, at 16:19, Mark Millard <marklmi@yahoo.com> wrote:

> [I've cut out the history: just presenting some new evidence.]
> 
> First, a little context from getting to the db> prompt.
> 
> db> ps
>  pid  ppid  pgrp   uid  state   wmesg   wchan               cmd
>   18     0     0     0  DL      syncer  0xffff000000eca5a8  [syncer]
>   17     0     0     0  DL      vlruwt  0xffffa00007d2ea60  [vnlru]
>   16     0     0     0  DL      (threaded)                  [bufdaemon]
> 100089                   D       qsleep  0xffff000000ec9478  [bufdaemon]
> 100092                   D       -       0xffff000000c11100  [bufspacedaemon-0]
> 100093                   D       -       0xffff000000c21680  [bufspacedaemon-1]
>    9     0     0     0  DL      psleep  0xffff000000ef0650  [vmdaemon]
>    8     0     0     0  DL      (threaded)                  [pagedaemon]
> 100087                   D       psleep  0xffff000000ee2b38  [dom0]
> 100094                   D       launds  0xffff000000ee2b44  [laundry: dom0]
> 100095                   D       umarcl  0xffff0000007b38d8  [uma]
>    7     0     0     0  DL      mmcsd d 0xffffa00007b72e00  [mmcsd0boot1: mmc/sd]
>    6     0     0     0  DL      mmcsd d 0xffffa00007b71300  [mmcsd0boot0: mmc/sd]
>    5     0     0     0  DL      mmcreq  0xffff00009b5d0710  [mmcsd0: mmc/sd card]
>    4     0     0     0  DL      -       0xffff000000ccc020  [rand_harvestq]
>   15     0     0     0  DL      (threaded)                  [usb]
> . . .
> 
> and "mmcreq" is from the while loop in:
> 
> static int
> mmc_wait_for_req(struct mmc_softc *sc, struct mmc_request *req)
> {
> 
>        req->done = mmc_wakeup;
>        req->done_data = sc;
>        if (__predict_false(mmc_debug > 1)) {
>                device_printf(sc->dev, "REQUEST: CMD%d arg %#x flags %#x",
>                    req->cmd->opcode, req->cmd->arg, req->cmd->flags);     
>                if (req->cmd->data) {
>                        printf(" data %d\n", (int)req->cmd->data->len); 
>                } else
>                        printf("\n");
>        }
>        MMCBR_REQUEST(device_get_parent(sc->dev), sc->dev, req);
>        MMC_LOCK(sc);
>        while ((req->flags & MMC_REQ_DONE) == 0)
>                msleep(req, &sc->sc_mtx, 0, "mmcreq", 0);
>        MMC_UNLOCK(sc);
>        if (__predict_false(mmc_debug > 2 || (mmc_debug > 0 &&
>            req->cmd->error != MMC_ERR_NONE)))
>                device_printf(sc->dev, "CMD%d RESULT: %d\n",
>                    req->cmd->opcode, req->cmd->error);
>        return (0);
> }
> 
> So it appears that the error report:
> 
> mmcsd0: Error indicated: 4 Failed
> 
> ends up associated with (req->flags & MMC_REQ_DONE) == 0 staying
> true in the above code: an unbounded loop with MMC_LOCK(sc) active.
> The "4" in the error report seems to be from:
> 
> #define MMC_ERR_FAILED  4
> 
> It looks like there are some problems with handling errors, problems
> such that it gets stuck looping (no panic, no progress).
> 
> That seems to be separate from why the MMC_ERR_FAILED was generated
> in the first place. So: 2 problems, not just one. Thus it may be a
> good context for tackling the looping problem with a known example
> failure to look at.
> 
> 
> 
> Just for reference, I tried "boot -v" with debug.verbose_sysinit=1 in place,
> just to capture and report the tail of the output for the boot failure.
> 
> . . .
> subsystem f000000
>   release_aps(0)... Release APs...done
> done.
>   intr_irq_shuffle(0)... Trying to mount root from ufs:/dev/gpt/Rock64root []...
> done.
>   netisr_start(0)... done.
>   taskqgroup_bind_softirq(0)... done.
> GEOM: new disk mmcsd0
> GEOM: new disk mmcsd0boot0
> GEOM: new disk mmcsd0boot1
>   smp_after_idle_runnable(0)... done.
>   taskqgroup_bind_if_config_tqg(0)... done.
>   taskqgroup_bind_if_io_tqg(0)... done.
>   tmr_setup_user_access(0)... done.
> subsystem f000001
> mmcsd0: Error indicated: 4 Failed
>   epoch_init_smp(0)... done.
> subsystem f100000
>   racctd_init(0)... done.
> subsystem fffffff
>   start_periodic_resettodr(0)... done.
>   oktousecallout(0)... done.
>   clknode_finish(0)... Unresolved linked clock found: hdmi_phy
> Unresolved linked clock found: usb480m_phy
> done.
>   regulator_constraint(0)... done.
>   regulator_shutdown(0)... regulator: shutting down unused regulators
> regulator: shutting down vcc_sd... busy
> done.
> uhub0: 1 port with 1 removable, self powered
> uhub2: 2 ports with 2 removable, self powered
> uhub3: 1 port with 1 removable, self powered
> uhub1: 1 port with 1 removable, self powered
> ugen4.2: <Samsung PSSD T7 Touch> at usbus4
> umass0 on uhub2
> umass0: <Samsung PSSD T7 Touch, class 0/0, rev 3.20/1.00, addr 1> on usbus4
> umass0:  SCSI over Bulk-Only; quirks = 0x0000
> umass0:0:0: Attached to scbus0
> pass0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> pass0: <Samsung PSSD T7 Touch 0> Fixed Direct Access SPC-4 SCSI device
> pass0: Serial Number REPLACED
> pass0: 400.000MB/s transfers
> da0 at umass-sim0 bus 0 scbus0 target 0 lun 0
> da0: <Samsung PSSD T7 Touch 0> Fixed Direct Access SPC-4 SCSI device
> da0: Serial Number REPLACED
> da0: 400.000MB/s transfers
> da0: 953869MB (1953525168 512 byte sectors)
> da0: quirks=0x2<NO_6_BYTE>
> da0: Delete methods: <NONE(*),ZERO>
> random: unblocking device.
> 
> No more output after that.

As for why MMC_ERR_FAILED results, the following code diff is
intended to suggest what I think may be incomplete about sticking
to what the device-specific code supports vs. does not support
(not supporting HS200 here). The code does compile in my context
but is untested. The email handling may mess up some leading
whitespace --but, again, I'm only trying to suggest a type of
change.

# git -C /usr/main-src/ diff /usr/main-src/sys/dev/mmc
diff --git a/sys/dev/mmc/mmc.c b/sys/dev/mmc/mmc.c
index 9c73dfd57ce0..dffd1c382684 100644
--- a/sys/dev/mmc/mmc.c
+++ b/sys/dev/mmc/mmc.c
@@ -59,6 +59,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/param.h>
 #include <sys/systm.h>
 #include <sys/kernel.h>
+#include <sys/kobj.h>
 #include <sys/malloc.h>
 #include <sys/lock.h>
 #include <sys/module.h>
@@ -1512,6 +1513,8 @@ mmc_timing_to_string(enum mmc_bus_timing timing)
 static bool
 mmc_host_timing(device_t dev, enum mmc_bus_timing timing)
 {
+       kobjop_desc_t kobj_desc;
+       kobj_method_t *kobj_method;
        int host_caps;
 
        host_caps = mmcbr_get_caps(dev);
@@ -1543,14 +1546,37 @@ mmc_host_timing(device_t dev, enum mmc_bus_timing timing)
        case bus_timing_mmc_ddr52:
                return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_DDR52));
        case bus_timing_mmc_hs200:
-               return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_120) ||
-                       HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_180));
        case bus_timing_mmc_hs400:
-               return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_120) ||
-                       HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_180));
        case bus_timing_mmc_hs400es:
-               return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400 |
-                   MMC_CAP_MMC_ENH_STROBE));
+               /*
+                * Disable eMMC modes that require use of
+                * MMC_SEND_TUNING_BLOCK_HS200 to set things up if either the
+                * tune or re-tune method is the default NULL implementation.
+                */
+               kobj_desc = &mmcbr_tune_desc;
+               kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
+                   kobj_desc);
+               if (kobj_method == &kobj_desc->deflt)
+                       return (false);
+               kobj_desc = &mmcbr_retune_desc;
+               kobj_method = kobj_lookup_method(((kobj_t)dev)->ops->cls, NULL,
+                   kobj_desc);
+               if (kobj_method == &kobj_desc->deflt) {
+                       return (false);
+               }
+
+               /*
+                * Otherwise track the host capabilities.
+                */
+               if (timing == bus_timing_mmc_hs200)
+                       return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_120) ||
+                               HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS200_180));
+               if (timing == bus_timing_mmc_hs400)
+                       return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_120) ||
+                               HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400_180));
+               if (timing == bus_timing_mmc_hs400es)
+                       return (HOST_TIMING_CAP(host_caps, MMC_CAP_MMC_HS400 |
+                               MMC_CAP_MMC_ENH_STROBE));
        }
 
 #undef HOST_TIMING_CAP


In other words: have mmc_host_timing avoid returning true for some
combinations that definitely do not have sufficient software support
present at the time. (So far as I can tell, the rk3328's get the
NULL-implementations as things are.)

I expect that this sort of thing would go back to using
MMC_CAP_MMC_DDR52 for the rk3328's, as an example. Working, but in a
slower mode, the same mode as FreeBSD was previously using.

A possible incompleteness in the suggestion is that there is also a
drive-strength setting involved. If that also had "kobj" interfacing
and NULL-implementation possibilities, then in the future there would
be more to test for possibly forcing return-false than I did above.

Hopefully this sort of thing would help, possibly more than just for
rk3328's.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)