RPi4B's DMA11 (DMA4 engine example) vs. xHCI/pcie

Robert Crowston crowston at protonmail.com
Wed Sep 30 21:15:47 UTC 2020


Very interesting analysis. Certainly uncovered a few things I wasn't aware of.

By default sc->sc_bus.dma_bits in xhci_init is 64 bits; I toggle it back to 32 bits in the xhci shim I wrote for the Pi 4. You can see that output in a verbose dmesg.
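
For anyone following along, the selection boils down to the expression quoted further down from xhci_init. A minimal sketch of it, not the driver code: pick_dma_bits and its parameter names are made up for illustration, with ac64 standing in for XHCI_HCS0_AC64(temp), tunable_dma32 for xhcidma32, and caller_dma32 for the dma32 argument.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the expression in xhci_init():
 * 64 address lines only when the controller advertises 64-bit
 * addressing and neither the tunable nor the caller forces 32-bit.
 */
uint8_t
pick_dma_bits(bool ac64, int tunable_dma32, uint8_t caller_dma32)
{
	return ((ac64 && tunable_dma32 == 0 && caller_dma32 == 0) ? 64 : 32);
}
```

Passing a nonzero caller_dma32, as generic_xhci_attach does with IS_DMA_32B, always yields 32 regardless of what the controller advertises.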

    — RHC.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, 30 September 2020 19:13, Mark Millard <marklmi at yahoo.com> wrote:

>
>
> On 2020-Sep-29, at 10:35, Mark Millard <marklmi at yahoo.com> wrote:
>
> > On 2020-Sep-28, at 21:45, Mark Millard <marklmi at yahoo.com> wrote:
> >
> > > On 2020-Sep-28, at 19:04, Mark Millard <marklmi at yahoo.com> wrote:
> > >
> > > > On 2020-Sep-28, at 18:29, Mark Millard <marklmi at yahoo.com> wrote:
> > > >
> > > > > > > [Be warned that the material is not familiar so I may need
> > > > > > > educating. This is based on the example context that I
> > > > > > > happen to have around.]
> > > > > > In the u-boot fdt print / output there are 2 distinct sets of dma channel
> > > > > > information, 1 for soc and 1 for scb, where the dma_tag values for the two
> > > > > > sets should be distinct as far as I can tell:
> > > > > > U-Boot> fdt address 0x7ef1000
> > > > > > U-Boot> fdt print /
> > > > > > / {
> > > > > > . . .
> > > > > > soc {
> > > > > > dma at 7e007000 {
> > > > > > compatible = "brcm,bcm2835-dma";
> > > > > > reg = <0x7e007000 0x00000b00>;
> > > > > > interrupts = * 0x0000000007ef645c [0x00000084];
> > > > > > interrupt-names = "dma0", "dma1", "dma2", "dma3", "dma4", "dma5", "dma6", "dma7", "dma8", "dma9", "dma10";
> > > > > > #dma-cells = <0x00000001>;
> > > > > > brcm,dma-channel-mask = <0x000001f5>;
> > > > > > phandle = <0x0000000b>;
> > > > > > };
> > > > > >
> > > > > > > scb {
> > > > > > . . .
> > > > > > dma at 7e007b00 {
> > > > > > compatible = "brcm,bcm2711-dma";
> > > > > > reg = <0x00000000 0x7e007b00 0x00000000 0x00000400>;
> > > > > > interrupts = <0x00000000 0x00000059 0x00000004 0x00000000 0x0000005a 0x00000004 0x00000000 0x0000005b 0x00000004 0x00000000 0x0000005c 0x00000004>;
> > > > > > interrupt-names = "dma11", "dma12", "dma13", "dma14";
> > > > > > #dma-cells = <0x00000001>;
> > > > > > brcm,dma-channel-mask = <0x00007000>;
> > > > > > phandle = <0x0000003d>;
> > > > > > };
> > > > > > . . .
> >
> > I had presumed that the dma at 7e007b00 would be processed. But
> > I finally happened to search for "bcm2711-dma" in FreeBSD and
> > it does not occur.
> > That appears to mean that BCM_DMA_CH_MAX being 12 is depending
> > on dma at 7e007000's brcm,dma-channel-mask to avoid referencing
> > number 11 that does not exist in that bcm2835-dma context.
> > I think this makes what I wrote about DMA4 engines (the most
> > capable ones) somewhat incoherent in the details but the basic
> > not-supported-in-the-code and not-used status appears to be
> > true.
> > As for DMA0-DMA10 (bcm2835-dma), some DMA (0-6) vs. DMA LITE
> > (7-10) distinctions not being handled (for example 65536
> > maxsegsz for DMA LITE) still looks to be true to me.
>
> Looks like FreeBSD is limited to 32-bit here: usb/controller/generic_xhci.c
> has nothing explicit for other than 32 address lines (and overall the
> only alternative is 64 address lines):
>
> #define IS_DMA_32B 1
>
> int
> generic_xhci_attach(device_t dev)
> {
> . . .
> err = xhci_init(sc, dev, IS_DMA_32B);
> if (err != 0) {
> device_printf(dev, "Failed to init XHCI, with error %d\n", err);
> generic_xhci_detach(dev);
> return (ENXIO);
> }
> . . .
> /*
>  * The following structure describes the parent USB DMA tag.
>  */
> #if USB_HAVE_BUSDMA
> struct usb_dma_parent_tag {
> . . .
> uint8_t dma_bits; /* number of DMA address lines */
> . . .
> };
> #else
> struct usb_dma_parent_tag {}; /* empty struct */
> #endif
> . . .
> usb_error_t
> xhci_init(struct xhci_softc *sc, device_t self, uint8_t dma32)
> {
> . . .
> /* get DMA bits */
> sc->sc_bus.dma_bits = (XHCI_HCS0_AC64(temp) &&
>     xhcidma32 == 0 && dma32 == 0) ? 64 : 32;
> . . .
>
> Overall it looks like a bunch of places would need changes to
> support the RPi4B's 3 GiByte capability. (Probably more than
> I've discovered, ignoring things like DMA4 engine use to get
> write bursts and the like.)
>
> I will note that I found code in NetBSD that classifies "normal"
> DMA engines vs. DMA LITE engines (via testing a debug register)
> for bcm2835-dma and only requests normal DMA engines be used,
> skipping DMA LITE. (This is for DTB/fdt contexts I think. I've
> not done as well figuring out even such narrow aspects of ACPI
> handling of things.) This tends to confirm my worries over
> FreeBSD's bcm2835-dma handling of the DMA LITE engines existing
> but being less capable.
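
The classification described above might look roughly like the following. This is a sketch, not NetBSD's code: the function names are made up, and the LITE flag being bit 28 of the per-channel DEBUG register is an assumption that should be checked against the data sheet before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: position of the LITE flag in the per-channel DEBUG register. */
#define DEBUG_LITE_BIT	(UINT32_C(1) << 28)

/* True when a DEBUG register value marks the engine as DMA LITE. */
bool
debug_reg_is_lite(uint32_t debug_val)
{
	return ((debug_val & DEBUG_LITE_BIT) != 0);
}

/*
 * Hypothetical filter: clear from 'mask' every channel whose DEBUG
 * value (debug_vals[ch], for ch < nch) reports a LITE engine, so that
 * only normal engines remain available for allocation.
 */
uint32_t
mask_without_lite(uint32_t mask, const uint32_t debug_vals[], unsigned int nch)
{
	unsigned int ch;

	for (ch = 0; ch < nch && ch < 32; ch++)
		if (debug_reg_is_lite(debug_vals[ch]))
			mask &= ~(UINT32_C(1) << ch);
	return (mask);
}
```

Probing the hardware this way avoids hard-coding which channel numbers are LITE.
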
>
> > > > > > > So, 0 through 10 need the soc criteria (mix of DMA and DMA LITE engine criteria)
> > > > > > > and 11 through 14 need the scb criteria (DMA4 engine criteria). (I'm ignoring
> > > > > > > the dma-channel-mask values at this point.)
> > > > > > I'll here note the code has:
> > > > > > #define BCM_DMA_CH_MAX 12
> > > > > > for use in code like:
> > > > > >
> > > > > >     /* setup initial settings */
> > > > > >     for (i = 0; i < BCM_DMA_CH_MAX; i++) {
> > > > > >             ch = &sc->sc_dma_ch[i];
> > > > > >
> > > > > >             bzero(ch, sizeof(struct bcm_dma_ch));
> > > > > >             ch->ch = i;
> > > > > >             ch->flags = BCM_DMA_CH_UNMAP;
> > > > > >
> > > > > >             if ((bcm_dma_channel_mask & (1 << i)) == 0)
> > > > > >                     continue;
> > > > > >
> > > > > >
> > > > > > . . .
> > > > > > > It looks to me like "dma11" is the only scb/DMA4 engine covered
> > > > > > > by such loops and that the "brcm,dma-channel-mask = <0x00007000>"
> > > > > > > means that dma11 will not be used.
> > > > > > So: No scb/DMA4 engine will be used??? (That could explain the
> > > > > > 1 GiByte limit?)
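
The mask reading above can be checked mechanically. A small sketch, using the same bit test as the quoted loop; the macro and function names are made up here, and the two mask values are the ones from the fdt output.

```c
#include <stdbool.h>
#include <stdint.h>

#define SOC_DMA_MASK	0x000001f5u	/* dma at 7e007000 (bcm2835-dma) */
#define SCB_DMA_MASK	0x00007000u	/* dma at 7e007b00 (bcm2711-dma) */

/* True when bit 'ch' is set in a brcm,dma-channel-mask value. */
bool
dma_mask_has_channel(uint32_t mask, unsigned int ch)
{
	return (ch < 32 && (mask & (UINT32_C(1) << ch)) != 0);
}

/* Number of channels a mask enables. */
unsigned int
dma_mask_count(uint32_t mask)
{
	unsigned int n = 0;

	for (; mask != 0; mask >>= 1)
		n += mask & 1u;
	return (n);
}
```

0x000001f5 enables channels 0, 2 and 4 through 8; 0x00007000 enables 12 through 14. Bit 11 is clear in both, so a loop bounded by BCM_DMA_CH_MAX == 12 never picks up a DMA4 engine.
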
> > > > > > > rpi_DATA_2711_1p0.pdf reports that soc/0-10 have 2 types (0-6 vs. 7-10
> > > > > > > as it turns out) as well as the scb/DMA4-engines (11-14):
> > > > > > QUOTE (with omitted marked by ". . .")
> > > > > > . . .
> > > > > > The BCM2711 DMA Controller provides a total of 16 DMA channels. Four of these are DMA Lite channels (with reduced performance and features), and four of them are DMA4 channels (with increased performance and a wider address range).
> > > > > > . . .
> > > > > > 4.5. DMA LITE Engines
> > > > > > Several of the DMA engines are of the LITE design. This is a reduced specification engine designed to save space. The engine behaves in the same way as a normal DMA engine except for the following differences:
> > > > > > . . .
> > > > > > • The DMA length register is now 16 bits, limiting the maximum transferable length to 65536 bytes.
> > > > > > . . .
> > > > > > 4.6. DMA4 Engines
> > > > > > Several of the DMA engines are of the DMA4 design. These have higher performance due to their uncoupled read/write design and can access up to 40 address bits. Unlike the other DMA engines they are also capable of performing write bursts. Note that they directly access the full 35-bit address bus of the BCM2711 and so bypass the paging registers of the DMA and DMA Lite engines.
> > > > > > DMA channel 11 is additionally able to access the PCIe interface.
> > > > > > END QUOTE
> > > > > > The register map indicates (with some extra notes added):
> > > > > > 0-6: DMA
> > > > > > 7-10: DMA LITE (65536 bytes limit, for example)
> > > > > > 11-14: DMA4 (11 is special relative to "PCIe interface")
> > > > > > ("DMA Channel 15 is exclusively used by the VPU.")
> > > > > > Yet what I see in the head -r365932 code is:
> > > > > > #define BCM_DMA_CH_MAX 12
> > > > > > . . .
> > > > > > struct bcm_dma_softc {
> > > > > > device_t sc_dev;
> > > > > > struct mtx sc_mtx;
> > > > > > struct resource * sc_mem;
> > > > > > struct resource * sc_irq[BCM_DMA_CH_MAX];
> > > > > > void * sc_intrhand[BCM_DMA_CH_MAX];
> > > > > > struct bcm_dma_ch sc_dma_ch[BCM_DMA_CH_MAX];
> > > > > > bus_dma_tag_t sc_dma_tag;
> > > > > > };
> > > > > > . . .
> > > > > > err = bus_dma_tag_create(bus_get_dma_tag(dev),
> > > > > > 1, 0, BUS_SPACE_MAXADDR_32BIT,
> > > > > > BUS_SPACE_MAXADDR, NULL, NULL,
> > > > > > sizeof(struct bcm_dma_cb), 1,
> > > > > > sizeof(struct bcm_dma_cb),
> > > > > > BUS_DMA_ALLOCNOW, NULL, NULL,
> > > > > > &sc->sc_dma_tag);
> > > > > > As an example: does that deal with the likes of DMA LITE (so 7-10) "limiting
> > > > > > the maximum transferable length to 65536 bytes"?
> > > > > > As another example: Does it deal with the DMA4 (11-14) distinctions (if
> > > > > > such were in use anyway)?
> > > > > > For reference from the fdt print / :
> > > > > > / {
> > > > > > . . .
> > > > > > #address-cells = <0x00000002>;
> > > > > > #size-cells = <0x00000001>;
> > > > > > . . .
> > > > > > soc {
> > > > > > compatible = "simple-bus";
> > > > > > #address-cells = <0x00000001>;
> > > > > > #size-cells = <0x00000001>;
> > > > > > . . .
> > > > > > dma-ranges = <0xc0000000 0x00000000 0x00000000 0x40000000>;
> > > > > > . . .
> > > > > > firmware {
> > > > > > compatible = "raspberrypi,bcm2835-firmware", "simple-bus";
> > > > > > mboxes = <0x0000001c>;
> > > > > > dma-ranges;
> > > > > > . . .
> > > > > > emmc2bus {
> > > > > > compatible = "simple-bus";
> > > > > > #address-cells = <0x00000002>;
> > > > > > #size-cells = <0x00000001>;
> > > > > > . . .
> > > > > > dma-ranges = <0x00000000 0xc0000000 0x00000000 0x00000000 0x40000000>;
> > > > > > . . .
> > > > > > scb {
> > > > > > compatible = "simple-bus";
> > > > > > #address-cells = <0x00000002>;
> > > > > > #size-cells = <0x00000002>;
> > > > > > . . .
> > > > > > dma-ranges = <0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0xfc000000 0x00000001 0x00000000 0x00000001 0x00000000 0x00000001 0x00000000>;
> > > > > > . . .
> > > > > > pcie at 7d500000 {
> > > > > > compatible = "brcm,bcm2711-pcie";
> > > > > > . . .
> > > > > > #address-cells = <0x00000003>;
> > > > > > . . .
> > > > > > #size-cells = <0x00000002>;
> > > > > > . . .
> > > > > > dma-ranges = <0x02000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0xc0000000>;
> > > > > > . . .
> > > > > > v3dbus {
> > > > > > compatible = "simple-bus";
> > > > > > #address-cells = <0x00000001>;
> > > > > > #size-cells = <0x00000002>;
> > > > > > . . .
> > > > > > dma-ranges = <0x00000000 0x00000000 0x00000000 0x00000004 0x00000000>;
> > > > > > . . .
> > > > >
> > > > > rpi_DATA_2711_1p0.pdf reports:
> > > > > (I ignore 2D DMA transfer mode here.)
> > > > > For DMA engines 0-6: XLENGTH has bits 29:0
> > > > > bits 31:30 are write as 0, read as do not care.
> > > > > That would put maxsegsz as 2**30 == 1,073,741,824
> > > > > which matches a 1 GiByte space.
> > > > > > For DMA LITE engines 7-10: XLENGTH has bits 15:0
> > > > > bits 31:16 are write as 0, read as do not care.
> > > > > That would put maxsegsz as 2**16 == 65,536.
> > > > > For DMA4 engines 11-14: XLENGTH has bits 29:0
> > > > > bits 31:30 are write as 0, read as do not care.
> > > > > That would put maxsegsz as 2**30 == 1,073,741,824
> > > > > which is smaller than the 3 GiByte space associated
> > > > > with xHCI.
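
The arithmetic above, as a sketch. It takes 2**bits as the limit, following the reading in the text, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Limit implied by an XLENGTH field 'bits' wide, read as 2**bits as above. */
uint64_t
xlength_maxsegsz(unsigned int bits)
{
	return (UINT64_C(1) << bits);
}

/* Segments needed to cover 'len' bytes at no more than 'maxsegsz' each. */
uint64_t
segments_needed(uint64_t len, uint64_t maxsegsz)
{
	return ((len + maxsegsz - 1) / maxsegsz);
}
```

With a 30-bit XLENGTH, covering a 3 GiByte range takes at least three segments; with the 16-bit DMA LITE field, anything over 65,536 bytes has to be split.
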
> > >
> > > rpi_DATA_2711_1p0.pdf reports the following specifically for
> > > DMA11-DMA14 (so the DMA4 engines) for what goes in the CB and
> > > NEXT_CB ADDR fields:
> > > QUOTE
> > > The address must be 256-bit aligned and so the bottom 5 bits of the byte address are discarded, i.e. write cb_byte_address[39:0]>>5 into the CB
> > > END QUOTE
> > > This is not true for DMA0-DMA10 (DMA and DMA LITE).
> > > The following is extracted from various places to
> > > bring them together. I do not see evidence of handling
> > > the cb_byte_address[39:0]>>5 involved for DMA11-DMA14:
> > > #define ARMC_TO_VCBUS(pa) bcm283x_armc_to_vcbus(pa)
> > > vm_paddr_t
> > > bcm283x_armc_to_vcbus(vm_paddr_t pa)
> > > {
> > > struct bcm283x_memory_soc_cfg *cfg;
> > > struct bcm283x_memory_mapping *map, *ment;
> > >
> > >       /* Guaranteed not NULL if we haven't panicked yet. */
> > >       cfg = bcm283x_get_current_memcfg();
> > >       map = cfg->memmap;
> > >       for (ment = map; !BCM283X_MEMMAP_ISTERM(ment); ++ment) {
> > >               if (pa >= ment->armc_start &&
> > >                   pa < ment->armc_start + ment->armc_size) {
> > >                       return (pa - ment->armc_start) + ment->vcbus_start;
> > >               }
> > >       }
> > >
> > >       /*
> > >        * Assume 1:1 mapping for anything else, but complain about it on
> > >        * verbose boots.
> > >        */
> > >       if (bootverbose)
> > > >               printf("bcm283x_vcbus: No armc -> vcbus mapping found: %jx\n",
> > >                   (uintmax_t)pa);
> > >       return (pa);
> > >
> > >
> > > }
> > > static void
> > > bcm_dmamap_cb(void *arg, bus_dma_segment_t *segs,
> > > int nseg, int err)
> > > {
> > > bus_addr_t *addr;
> > >
> > >       if (err)
> > >               return;
> > >
> > >       addr = (bus_addr_t*)arg;
> > >       *addr = ARMC_TO_VCBUS(segs[0].ds_addr);
> > >
> > >
> > > }
> > > Note ds_addr assignments in:
> > > static bus_size_t
> > > _bus_dmamap_addseg(bus_dma_tag_t dmat, bus_dmamap_t map, bus_addr_t curaddr,
> > > bus_size_t sgsize, bus_dma_segment_t *segs, int *segp)
> > > {
> > > bus_addr_t baddr, bmask;
> > > int seg;
> > >
> > >       /*
> > >        * Make sure we don't cross any boundaries.
> > >        */
> > >       bmask = ~(dmat->common.boundary - 1);
> > >       if (dmat->common.boundary > 0) {
> > >               baddr = (curaddr + dmat->common.boundary) & bmask;
> > >               if (sgsize > (baddr - curaddr))
> > >                       sgsize = (baddr - curaddr);
> > >       }
> > >
> > >       /*
> > >        * Insert chunk into a segment, coalescing with
> > >        * previous segment if possible.
> > >        */
> > >       seg = *segp;
> > >       if (seg == -1) {
> > >               seg = 0;
> > >               segs[seg].ds_addr = curaddr;
> > >               segs[seg].ds_len = sgsize;
> > >       } else {
> > >               if (curaddr == segs[seg].ds_addr + segs[seg].ds_len &&
> > >                   (segs[seg].ds_len + sgsize) <= dmat->common.maxsegsz &&
> > >                   (dmat->common.boundary == 0 ||
> > >                    (segs[seg].ds_addr & bmask) == (curaddr & bmask)))
> > >                       segs[seg].ds_len += sgsize;
> > >               else {
> > >                       if (++seg >= dmat->common.nsegments)
> > >                               return (0);
> > >                       segs[seg].ds_addr = curaddr;
> > >                       segs[seg].ds_len = sgsize;
> > >               }
> > >       }
> > >       *segp = seg;
> > >       return (sgsize);
> > >
> > >
> > > }
> > > Note cb_phys and ch->vc_cb in:
> > > static int
> > > bcm_dma_init(device_t dev)
> > > {
> > > . . .
> > > /* setup initial settings */
> > > for (i = 0; i < BCM_DMA_CH_MAX; i++) {
> > > . . .
> > > err = bus_dmamap_load(sc->sc_dma_tag, ch->dma_map, cb_virt,
> > > sizeof(struct bcm_dma_cb), bcm_dmamap_cb, &cb_phys,
> > > BUS_DMA_WAITOK);
> > > if (err) {
> > > device_printf(dev, "cannot load DMA memory\n");
> > > break;
> > > }
> > >
> > >               ch->cb = cb_virt;
> > >               ch->vc_cb = cb_phys;
> > >
> > >
> > > . . .
> > > int
> > > bcm_dma_start(int ch, vm_paddr_t src, vm_paddr_t dst, int len)
> > > {
> > > struct bcm_dma_softc *sc = bcm_dma_sc;
> > > struct bcm_dma_cb *cb;
> > >
> > >       if (ch < 0 || ch >= BCM_DMA_CH_MAX)
> > >               return (-1);
> > >
> > >       if (!(sc->sc_dma_ch[ch].flags & BCM_DMA_CH_USED))
> > >               return (-1);
> > >
> > >       cb = sc->sc_dma_ch[ch].cb;
> > >       cb->src = ARMC_TO_VCBUS(src);
> > >       cb->dst = ARMC_TO_VCBUS(dst);
> > >
> > >       cb->len = len;
> > >
> > >       bus_dmamap_sync(sc->sc_dma_tag,
> > >           sc->sc_dma_ch[ch].dma_map, BUS_DMASYNC_PREWRITE);
> > >
> > >       bus_write_4(sc->sc_mem, BCM_DMA_CBADDR(ch),
> > >           sc->sc_dma_ch[ch].vc_cb);
> > >       bus_write_4(sc->sc_mem, BCM_DMA_CS(ch), CS_ACTIVE);
> > >
> > >
> > > #ifdef DEBUG
> > > bcm_dma_cb_dump(sc->sc_dma_ch[ch].cb);
> > > bcm_dma_reg_dump(ch);
> > > #endif
> > >
> > >       return (0);
> > >
> > >
> > > }
> > > It looks to me like FreeBSD is not set up to use the DMA4
> > > engines (DMA11-DMA14) and happens to not use them for the
> > > DTB that I get from u-boot.bin in my context.
> > > Of course, I may just have missed something in looking
> > > around at the unfamiliar material.
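
For comparison with the unshifted bus_write_4 of vc_cb in bcm_dma_start, the DMA4 control-block address rule quoted from the data sheet could be sketched as below. The names are made up, and how the shifted value is then packed into the register is not covered by the quoted text, so the sketch stops at the shift.

```c
#include <stdint.h>

#define DMA4_CB_ALIGN	32u				/* 256 bits */
#define DMA4_ADDR_MASK	((UINT64_C(1) << 40) - 1)	/* cb_byte_address[39:0] */
#define DMA4_CB_INVALID	UINT64_MAX			/* hypothetical error value */

/*
 * Return cb_byte_address[39:0] >> 5, or DMA4_CB_INVALID when the
 * address is not 256-bit aligned or does not fit in 40 bits.
 */
uint64_t
dma4_encode_cb_addr(uint64_t cb_byte_address)
{
	if ((cb_byte_address & (DMA4_CB_ALIGN - 1)) != 0)
		return (DMA4_CB_INVALID);
	if ((cb_byte_address & ~DMA4_ADDR_MASK) != 0)
		return (DMA4_CB_INVALID);
	return (cb_byte_address >> 5);
}
```

Nothing in the quoted FreeBSD code performs this shift or enforces 32-byte alignment of the control block, which fits the conclusion that the DMA4 engines are not supported there.
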
>
> ==
>
> Mark Millard
> marklmi at yahoo.com
> ( dsl-only.net went
> away in early 2018-Mar)




More information about the freebsd-arm mailing list