[Bug 262765] Random lockups, data loss, and poor I/O and sound quality after 95edb10b47fc1a919cd1687aaf16be9e14456c89

From: <bugzilla-noreply_at_freebsd.org>
Date: Thu, 24 Mar 2022 20:48:18 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262765

            Bug ID: 262765
           Summary: Random lockups, data loss, and poor I/O and sound
                    quality after 95edb10b47fc1a919cd1687aaf16be9e14456c89
           Product: Base System
           Version: CURRENT
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: tod.jackson@gmail.com

This is way beyond my level, and one of the reasons I didn't want to move
beyond 13.0.

Reverting commit 95edb10b47fc ("LinuxKPI: implement dma_sync_single_for_*, apply
to (un)map single/sg") fixes all sorts of problems for me, but it's a big hammer
that probably breaks things for everyone else. I have no idea who the culprit is.

I first started having troubles in Linux a few years ago, and finally narrowed
it down here. It's entirely possible my firmware is broken, but is there
anything we can do?

My drm-kmod is also devoid of panic(), but unfortunately this doesn't manifest
as panics. I had to make some stuff up or return (-ENOMEM) to accommodate these
changes, but it's nothing of interest.

This is really complicated because multiple drivers are trying to manage memory
owned by the firmware, and they don't cooperate.

I found my workaround, and it solves a mystery that has lasted several years,
but maybe we can do better.

I don't even know what kind of quirk this could be. If I had to guess, the
relevant parts are dma_sync_single_for_cpu and cache flushing.

This is from some i915 documentation:

Now the pagetables are a bit tricky. In the end, they're all in system memory,
but there are a few hoops to jump through to get at them. The GTT pagetables
has just one level, so with a 4 byte entry size we need 2MB of contiguous
pagetable space. The firmware allocates that for us from stolen memory (that
is, a part of the system memory that is not listed in the e820 map, so it's not
managed by the Linux kernel). But we write these PTEs through an alias in the
register mmio bar! The reason for that is to allow the SA to invalidate TLBs.
Note, though, that this only invalidates TLBs for cpu access. Any other access
to the GTT (such as from the GT or the display block) has its own rules for TLB
invalidation. Also, on recent generations we need to (depending upon
circumstances) manually invalidate the SA TLB by writing to a magic register.
To speed up map/unmap operations, we map that GTT PTE aliasing region in the
mmio with wc (if this is possible, which means the cpu needs to support PAT).
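
As a rough check of the figures in the quoted passage, the sketch below works
out how many entries a single-level table of 4-byte PTEs occupying 2MB holds,
and how much address space that would cover. The 4 KiB page size is an
assumption not stated in the quote, and the function names are hypothetical,
chosen only for illustration.

```python
# Figures taken from the quoted i915 text.
PTE_SIZE_BYTES = 4                   # "4 byte entry size"
GTT_TABLE_BYTES = 2 * 1024 * 1024    # "2MB of contiguous pagetable space"

# ASSUMPTION: each PTE maps one 4 KiB page; the quote does not say this.
PAGE_SIZE_BYTES = 4096


def gtt_entries(table_bytes, pte_bytes):
    """Number of PTEs that fit in a single-level table of the given size."""
    return table_bytes // pte_bytes


def gtt_address_space(table_bytes, pte_bytes, page_bytes):
    """Bytes of address space covered if every PTE maps one page."""
    return gtt_entries(table_bytes, pte_bytes) * page_bytes


if __name__ == "__main__":
    entries = gtt_entries(GTT_TABLE_BYTES, PTE_SIZE_BYTES)
    covered = gtt_address_space(GTT_TABLE_BYTES, PTE_SIZE_BYTES, PAGE_SIZE_BYTES)
    print(f"PTEs in table: {entries}")
    print(f"address space covered: {covered // (1024 ** 3)} GiB")
```

Under that page-size assumption, the 2MB table corresponds to a 2 GiB GTT
address space, which is why the table has to be one contiguous allocation.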

A lot of this is just stubbed or nonexistent right now, notably runtime PM and
the more complicated GT/engine bits. And we really have no idea what the Nvidia
driver is doing, aside from trying and failing to write in write-protected
regions. I took this upstream, but nobody really cares because they don't want
to deal with a proprietary blob.

scbus0 on ahcich1 bus 0:
<TOSHIBA MQ02ABD100H HEF01D>       at scbus0 target 0 lun 0 (pass0,ada0)
<>                                 at scbus0 target -1 lun ffffffff ()
scbus1 on ahciem0 bus 0:
<AHCI SGPIO Enclosure 2.00 0001>   at scbus1 target 0 lun 0 (pass1,ses0)
<>                                 at scbus1 target -1 lun ffffffff ()
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun ffffffff (xpt0)

I can provide any relevant information, but I don't fully understand the
problem. I'm on a few-day-old CURRENT with evadot's drm-subtree on top of it,
but I don't think my drm-kmod grabs anything of interest from there.
