Extremely slow boot on VMWare with Opteron 2352 (acpi?)

Wed Mar 10 00:42:05 UTC 2010

On Mar 9, 2010, at 4:27 PM, John Baldwin wrote:

> On Tuesday 09 March 2010 3:40:26 pm Kevin Day wrote:
>> 
>> 
>> If I boot up on an Opteron 2218 system, it boots normally. If I boot the
>> exact same VM moved to a 2352, I get:
>> 
>> acpi0: <INTEL 440BX> on motherboard
>> PCIe: Memory Mapped configuration base @ 0xe0000000
>>   (very long pause)
>> ioapic0: routing intpin 9 (ISA IRQ 9) to lapic 0 vector 48
>> acpi0: [MPSAFE]
>> acpi0: [ITHREAD]
>> 
>> then booting normally.
> 
> It's probably worth adding some printfs to narrow down where the pause is 
> happening.  This looks to be all during the acpi_attach() routine, so maybe 
> you can start there.

Okay, good pointer. This is what I've narrowed down:

acpi_enable_pcie() calls pcie_cfgregopen(), in this case as pcie_cfgregopen(0xe0000000, 0, 255). Inside pcie_cfgregopen(), the pause starts here:

        /* XXX: We should make sure this really fits into the direct map. */
        pcie_base = (vm_offset_t)pmap_mapdev(base, (maxbus + 1) << 20);

pmap_mapdev calls pmap_mapdev_attr, and in there this evaluates to true:

        /*
         * If the specified range of physical addresses fits within the direct
         * map window, use the direct map. 
         */
        if (pa < dmaplimit && pa + size < dmaplimit) {

so we call pmap_change_attr, which calls pmap_change_attr_locked. It's changing 0x10000000 bytes starting at 0xffffff00e0000000. The very last line before returning from pmap_change_attr_locked is:

                pmap_invalidate_cache_range(base, tmpva);

And this is where the delay is. This is calling MFENCE/CLFLUSH in a loop 8 million times. We actually had a problem with CLFLUSH causing panics on these same CPUs under Xen, which is partially why we're looking at VMware now (see kern/138863). I'm wondering whether VMware encountered the same problem and replaced CLFLUSH with a software-emulated version that is far slower; judging by the speed, each call is probably invalidating the entire cache. A quick change to pmap_invalidate_cache_range to just invalidate the entire cache if the range is 8MB or larger seems to have fixed it, i.e.:

        else if (cpu_feature & CPUID_CLFSH)  {

to

        else if ((cpu_feature & CPUID_CLFSH) && ((eva-sva) < (2<<22))) {

However, I'm a little unclear on whether everything leading up to this point is correct. It's ending up with 256MB of memory for the PCI area, which seems really excessive. Is the problem just that it wants room for 256 buses, or...? Does anyone know this code path well enough to say whether this is deviating from the norm?

-- Kevin