Re: git: 060699e91369 - stable/13 - Merge llvm-project release/15.x llvmorg-15.0.7-0-g8dfdcc7b7bf6

From: Dimitry Andric <dim_at_FreeBSD.org>
Date: Sat, 29 Apr 2023 18:49:28 UTC
On 29 Apr 2023, at 20:33, Jason A. Harmening <jah@FreeBSD.org> wrote:
> 
> On Sun, Apr 09, 2023 at 09:35:22PM +0000, Dimitry Andric wrote:
>> The branch stable/13 has been updated by dim:
>> 
>> URL: https://cgit.FreeBSD.org/src/commit/?id=060699e9136975d51d3f726b9785bdbac9a62ba6
>> 
>> commit 060699e9136975d51d3f726b9785bdbac9a62ba6
>> Author:     Dimitry Andric <dim@FreeBSD.org>
>> AuthorDate: 2023-01-14 16:33:24 +0000
>> Commit:     Dimitry Andric <dim@FreeBSD.org>
>> CommitDate: 2023-04-09 14:54:52 +0000
>> 
>>    Merge llvm-project release/15.x llvmorg-15.0.7-0-g8dfdcc7b7bf6
>> 
>>    This updates llvm, clang, compiler-rt, libc++, libunwind, lld, lldb and
>>    openmp to llvmorg-15.0.7-0-g8dfdcc7b7bf6.
>> 
>>    PR:             265425
>>    MFC after:      2 weeks
> 
> This MFC of llvm15 appears to have completely broken the Intel IOMMU
> driver on my stable/13 machine.  After this series of commits, any
> downstream DMA seems to produce an IOMMU translation fault, which
> renders the machine completely unusable: no nvme boot disk, no usb
> keyboard, etc.
> 
> The faults I see look something like this:
> 
> DMAR4: ahci0: pci0:17:5 sid 8d fault acc 0 adt 0x0 reason 0x3 addr 26000
> 
> It's a bit surprising to see a toolchain upgrade produce breakage like
> this, but that's what git bisect clearly tells me.  I wonder if some of
> the IOMMU control structures might be defined as C bitfields and the new
> compiler is emitting them differently?  Also, was any breakage like this
> observed when -current was upgraded to llvm15 several months ago?

I haven't heard anything about such breakage, no.


> More generally, this is the second time in as many months I've had to
> deal with IOMMU breakage on -stable.  I can't imagine I'm the only
> person who sees value in running with DMA remapping enabled; do we need
> a dedicated DMAR-enabled machine in the cluster to smoke-test changes
> like this?  More generally, should we avoid MFCing high-risk changes
> like this?

Since there were very few bug reports, it was not deemed high risk.

In any case, it would be good to get to the bottom of what is causing
the problem, so is there any way you can isolate which code seems to be
going "bad"?

For example, if this problem affects code in sys/dev/iommu, is there
some way you can compile that part with -O1, or with an older version
of clang (from ports), to see if the problem goes away?
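
As for the bitfield theory, one way to check it directly might be a
small standalone test, built once with each compiler, that compares the
bytes produced through bitfield members against the same value built
with explicit shifts and masks. The following is only a rough sketch
under assumptions: struct demo_entry, its field names and widths, and
the demo_entry_* helpers are all made up for illustration and are not
the driver's actual structures, and the expected layout assumes a
little-endian target where bitfields are allocated from the least
significant bit (as on amd64).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * HYPOTHETICAL 64-bit table entry, for illustration only.  This is
 * not the real DMAR context or page table entry layout.
 */
struct demo_entry {
	uint64_t present : 1;		/* bit 0 */
	uint64_t fault_disable : 1;	/* bit 1 */
	uint64_t type : 2;		/* bits 2-3 */
	uint64_t reserved : 8;		/* bits 4-11 */
	uint64_t table_ptr : 52;	/* bits 12-63, page frame number */
};

/* A size change between compilers would already be a red flag. */
_Static_assert(sizeof(struct demo_entry) == sizeof(uint64_t),
    "demo_entry must be exactly 64 bits");

/* Copy the in-memory image of an entry into a plain integer. */
static uint64_t
demo_entry_raw(const struct demo_entry *e)
{
	uint64_t raw;

	memcpy(&raw, e, sizeof(raw));
	return (raw);
}

/* Build an entry the "suspect" way, through bitfield members. */
static struct demo_entry
demo_entry_make(uint64_t table_addr, unsigned int type)
{
	struct demo_entry e;

	memset(&e, 0, sizeof(e));
	e.present = 1;
	e.type = type & 0x3;
	e.table_ptr = table_addr >> 12;
	return (e);
}

/* Build the same value with explicit shifts and masks. */
static uint64_t
demo_entry_expected(uint64_t table_addr, unsigned int type)
{
	return (1ULL | ((uint64_t)(type & 0x3) << 2) |
	    (table_addr & ~0xfffULL));
}

/* Print both encodings; return 0 if they agree, 1 otherwise. */
static int
demo_entry_check(uint64_t table_addr, unsigned int type)
{
	struct demo_entry e = demo_entry_make(table_addr, type);
	uint64_t got = demo_entry_raw(&e);
	uint64_t want = demo_entry_expected(table_addr, type);

	printf("bitfield 0x%016llx  shift/mask 0x%016llx\n",
	    (unsigned long long)got, (unsigned long long)want);
	return (got == want ? 0 : 1);
}
```

Calling demo_entry_check() from a trivial main() with a few sample
addresses and types, and comparing the output between the base system
clang and an older clang from ports at the same optimization level,
should show whether the two compilers disagree on the layout. If they
agree, the bitfield explanation becomes less likely and the -O1
experiment above is probably the better lead.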

-Dimitry