8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?

Fri Jul 9 23:53:42 UTC 2010

Hi John

On 09.07.2010 at 22:03, John Baldwin wrote:

> On Friday, July 09, 2010 11:26:00 am Markus Gebert wrote:
>> --
>> MCA: Bank 4, Status 0xb400004000030c2b
>> MCA: Global Cap 0x0000000000000105, Status 0x0000000000000007
>> MCA: Vendor "AuthenticAMD", ID 0x40f13, APIC ID 2
>> MCA: CPU 2 UNCOR BUSLG Observer WR I/O
>> MCA: Address 0xfd00000000
> 
> Using my local port of mcelog this is what I get for this check:
> 
> CPU 2 4 northbridge 
> ADDR fd00000000 
>  Northbridge Master abort
>  link number = 4
>       bit61 = error uncorrected
>  bus error 'local node observed, request didn't time out
>             generic write mem transaction
>             i/o access, level generic'
> STATUS b400004000030c2b MCGSTATUS 7
> MCGCAP 105 APICID 2 SOCKETID 0 
> CPUID Vendor AMD Family 15 Model 65
> 
> I don't know what to tell you off hand.  Did you buy this hardware from Sun 
> directly?  If so, I would try bugging them about this, especially given the 
> error that the BIOS is logging.

Yes, this hardware comes directly from Sun, but getting Sun (/Oracle) support for this issue is going to be tough. FreeBSD is unsupported, and in a short test we couldn't reproduce the problem with a Linux kernel. While I agree that a hardware issue has always been and still is a possibility, the facts remain that we tested this on two machines and that 6.x and 7.x do not show the behavior. Another possibility, of course, is that the X4100 is prone to such issues and that 6.x and 7.x either have workarounds we're not aware of or simply do something differently, so that the issue does not get triggered.

>  It does sound like a hardware issue, but in 
> the chipset, not in the RAM, so you might need to swap out the main board 
> rather than the RAM.

Yep. The MCA report does not indicate RAM problems, and the MCE itself was not our only reason to replace the RAM. We found a Sun document about the X4200 series getting HyperTransport errors when RAM from a certain vendor is installed, so we swapped the RAM to rule that out.

We did not replace the mainboard, though; testing on a second X4100 should serve about the same purpose.
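
For anyone who wants to check the MCA record quoted at the top by hand, here is a minimal sketch that pulls the relevant fields out of the status word. It is not the FreeBSD or mcelog decoder; the function name and the dictionary keys are made up for illustration. The flag bits and the bus error code layout (0000 1PPT RRRR IILL) follow the architectural MCi_STATUS format. Treating bits 19:16 as an extended error code is an assumption for this AMD northbridge bank, so the sketch only reports the raw number.

```python
# Sketch only: decode selected fields of an x86 machine check status word.
STATUS = 0xb400004000030c2b  # Bank 4 status from the report in this thread

# Architectural flag bits in the upper part of MCi_STATUS.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Field encodings of the bus/interconnect error code.
PP = ["SRC", "RES", "OBS", "GEN"]          # participation
RRRR = ["GEN", "RD", "WR", "DRD", "DWR",   # request type
        "IRD", "PREFETCH", "EVICT", "SNOOP"]
II = ["MEM", "RESV", "IO", "GEN"]          # memory or I/O
LL = ["L0", "L1", "L2", "LG"]              # cache level


def decode_status(status):
    """Return a dict with the set flags and the decoded bus error fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xffff
    result = {"flags": flags, "error_code": code}
    # Assumption: bits 19:16 hold a bank-specific extended code (raw value only).
    result["ext_code_raw"] = (status >> 16) & 0xf
    if code & 0xf800 == 0x0800:  # pattern 0000 1PPT RRRR IILL
        rrrr = (code >> 4) & 0xf
        result["bus"] = {
            "participation": PP[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": RRRR[rrrr] if rrrr < len(RRRR) else "?",
            "space": II[(code >> 2) & 0x3],
            "level": LL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    print(decode_status(STATUS))
```

The decoded fields line up with the "UNCOR BUSLG Observer WR I/O" line in the kernel's report: an uncorrected bus error on an I/O write, observed by this node, with no timeout.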

> I'm curious if disabling USB legacy support in the BIOS causes it to still die 
> even with ehci not loaded.  If so, then the SMI# for the ehci controller must 
> somehow prevent the issue, perhaps by triggering frequently enough to slow the 
> rate of I/O requests down?

I disabled USB legacy support in the BIOS and booted a kernel with usb+ohci+ukbd+ums but without ehci. Unfortunately, I cannot reproduce the MCE in that configuration.

Just to make sure I get you right: your theory is that when we don't load the ehci driver, the EHCI controller isn't taken over during boot and is therefore handled through SMM, so that SMIs might occur often enough to throttle the system just enough to keep the problem from appearing? I'm not very familiar with USB legacy support and SMM, but why would ehci be handled through SMM when the only USB devices (the virtual keyboard and mouse) actually sit on ohci? And why would disabling legacy support help produce more SMIs to throttle the system? As I understand it, and I might be terribly wrong, legacy support is what would cause SMIs in the first place.

Markus