Date: Thu, 24 Mar 2022 20:48:18 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262765

            Bug ID: 262765
           Summary: Random lockups, data loss, and poor I/O and sound quality after 95edb10b47fc1a919cd1687aaf16be9e14456c89
           Product: Base System
           Version: CURRENT
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: firstname.lastname@example.org

This is way beyond my level, and it is one of the reasons I didn't want to move beyond 13.0.

Reverting "LinuxKPI: implement dma_sync_single_for_*, apply to (un)map single/sg" fixes all sorts of problems for me, but it's a big hammer that probably breaks things for everyone else. I have no idea who the culprit is. I first started having trouble in Linux a few years ago, and finally narrowed it down here. It's entirely possible my firmware is broken, but is there anything we can do?

My drm-kmod is also devoid of panic(), but unfortunately this doesn't manifest as panics. I had to make some stuff up or return (-ENOMEM) to accommodate these changes, but it's nothing of interest.

This is really complicated because multiple drivers are trying to manage memory owned by the firmware, and they don't cooperate. I found my workaround, and it solves a several-year mystery, but maybe we can do better. I don't even know what kind of quirk this could be. If I had to guess, the relevant parts are dma_sync_single_for_cpu and cache flushing.

This is from some i915 documentation:

Now the pagetables are a bit tricky. In the end, they're all in system memory, but there are a few hoops to jump through to get at them. The GTT pagetables has just one level, so with a 4 byte entry size we need 2MB of contiguous pagetable space. The firmware allocates that for us from stolen memory (that is, a part of the system memory that is not listed in the e820 map, so it's not managed by the Linux kernel). But we write these PTEs through an alias in the register mmio bar! The reason for that is to allow the SA to invalidate TLBs.
Note, though, that this only invalidates TLBs for cpu access. Any other access to the GTT (such as from the GT or the display block) has its own rules for TLB invalidation. Also, on recent generations we need to (depending upon circumstances) manually invalidate the SA TLB by writing to a magic register. To speed up map/unmap operations, we map that GTT PTE aliasing region in the mmio with wc (if this is possible, which means the cpu needs to support PAT).

A lot of this is just stubbed or nonexistent right now, notably runtime PM and the more complicated GT/engine bits. And we really have no idea what the Nvidia driver is doing, aside from trying and failing to write in write-protected regions. I took this upstream, but nobody really cares because they don't want to deal with a proprietary blob.

scbus0 on ahcich1 bus 0:
<TOSHIBA MQ02ABD100H HEF01D>       at scbus0 target 0 lun 0 (pass0,ada0)
<>                                 at scbus0 target -1 lun ffffffff ()
scbus1 on ahciem0 bus 0:
<AHCI SGPIO Enclosure 2.00 0001>   at scbus1 target 0 lun 0 (pass1,ses0)
<>                                 at scbus1 target -1 lun ffffffff ()
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun ffffffff (xpt0)

I can provide any relevant information, but I don't fully understand the problem. I'm on a few-day-old CURRENT with evadot's drm-subtree on top of it, but I don't think my drm-kmod grabs anything of interest from there.

-- 
You are receiving this mail because:
You are the assignee for the bug.