8.1-RC2 MCE caused by some LAPIC/clock changes?

Markus Gebert markus.gebert at hostpoint.ch
Wed Jul 21 12:25:59 UTC 2010


On 21.07.2010, at 10:33, Andriy Gapon wrote:

> on 21/07/2010 03:57 Markus Gebert said the following:
>> Another thing though: Today I compared verbose boot output from 8-stable and
>> the current box. I saw that the ioapic sets up IRQ routing differently on
>> these two systems although the hardware is the same. This seemed not so
>> interesting at first, but then I noticed that 8-stable sets up two routes (to
>> lapic0 and lapic2, or sometimes lapic3) for IRQ58 (mpt0), while current only
>> uses one route (to lapic0).
> 
> My understanding is that it's not "two routes", but re-routing.
> During early boot all interrupts are bound to BSP; later, when APs become
> online, the interrupts are re-distributed among available CPUs.

I guess you're right, misinterpretation on my side. Thanks for clarifying this.

With that in mind, it seems to me that in the machdep.lapic_allclocks=0 case there may simply be more interrupts to assign/route, because more clocks are in use. If that's true, it may be just "luck" that the mpt interrupt then gets assigned to lapic0/cpu0 and the box runs fine. I'm only guessing though, since I have no clue how exactly interrupts are assigned to lapics (round-robin? some other logic?).
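
To illustrate the guess, here is a toy model in Python. It assumes plain round-robin assignment over the available CPUs, which is only an assumption and not checked against what the kernel really does. The IRQ lists are made up as well; only IRQ58 (mpt0) is taken from the real box. It merely shows how a couple of extra interrupt sources ahead of IRQ58 could shift it onto a different CPU.

```python
from itertools import cycle


def assign_round_robin(irqs, cpus):
    """Toy model: hand each IRQ to the next CPU in turn (assumed policy)."""
    cpu_iter = cycle(cpus)
    return {irq: next(cpu_iter) for irq in irqs}


cpus = [0, 1, 2, 3]

# Hypothetical IRQ lists; only IRQ 58 (mpt0) exists on the real system.
few_sources = [16, 17, 58]
# Same list with two extra (hypothetical) clock interrupts ahead of IRQ 58.
more_sources = [0, 8, 16, 17, 58]

print("few sources:  IRQ58 -> cpu", assign_round_robin(few_sources, cpus)[58])
print("more sources: IRQ58 -> cpu", assign_round_robin(more_sources, cpus)[58])
```

In this model IRQ58 ends up on cpu2 with the short list and on cpu0 with the longer one, so under a round-robin policy the target CPU would depend only on how many sources were assigned before it.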


>> I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box behave
>> like the one running current. Indeed, this seems to have changed IRQ58 to be
>> routed to lapic0 only. And the box was running for hours without showing the
>> symptoms.
>> 
>> I just checked the verbose boot output of my 8-stable box again (booted with
>> machdep.lapic_allclocks=0 as mentioned above). And now it seems to have set
>> up IRQ routes just like the current box (one route for IRQ58 to lapic0).
> 
> Not sure how to interpret this properly.
> One possibility is a hardware problem where interrupt message route between
> ioapic2 and CPU to which lapic3 belongs is flaky.
> Perhaps this might be a FreeBSD problem: it could be that the system somehow
> tells us not to set up such routes, but we don't listen.  But this is far-fetched.


I'm not sure either. If my "theory" above proves true, it would be just luck that 6.x and 7.x (and current) run fine on the X4100M2. A (short) test on Ubuntu didn't trigger the problem, so either the Linux kernel is lucky too and selects an interrupt route that is "not flaky", or there is indeed some way to figure out that certain lapics should not be used for certain interrupts. Or we didn't test Linux thoroughly enough.


Markus



