Extremely slow boot on VMWare with Opteron 2352 (acpi?)

John Baldwin jhb at freebsd.org
Wed Mar 10 13:49:35 UTC 2010


On Wednesday 10 March 2010 6:27:53 am Kostik Belousov wrote:
> On Tue, Mar 09, 2010 at 06:42:02PM -0600, Kevin Day wrote:
> > 
> > On Mar 9, 2010, at 4:27 PM, John Baldwin wrote:
> > 
> > > On Tuesday 09 March 2010 3:40:26 pm Kevin Day wrote:
> > >> 
> > >> 
> > >> If I boot up on an Opteron 2218 system, it boots normally. If I boot the 
> > > exact same VM moved to a 2352, I get:
> > >> 
> > >> acpi0: <INTEL 440BX> on motherboard
> > >> PCIe: Memory Mapped configuration base @ 0xe0000000
> > >>   (very long pause)
> > >> ioapic0: routing intpin 9 (ISA IRQ 9) to lapic 0 vector 48
> > >> acpi0: [MPSAFE]
> > >> acpi0: [ITHREAD]
> > >> 
> > >> then booting normally.
> > > 
> > > It's probably worth adding some printfs to narrow down where the pause is 
> > > happening.  This looks to be all during the acpi_attach() routine, so maybe 
> > > you can start there.
> > 
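
One way to do that kind of narrowing is to bracket the suspect calls with
printf() markers carrying an uptime stamp, so the slow step stands out in
the console output. This is only a rough sketch, not code from this
thread; it assumes nothing beyond the kernel's printf() and
getmicrouptime():

        struct timeval tv;

        getmicrouptime(&tv);
        printf("acpi_attach: before suspect call at %ld.%06ld\n",
            (long)tv.tv_sec, (long)tv.tv_usec);
        /* ... the call being timed, e.g. acpi_enable_pcie() ... */
        getmicrouptime(&tv);
        printf("acpi_attach: after suspect call at %ld.%06ld\n",
            (long)tv.tv_sec, (long)tv.tv_usec);
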
> > Okay, good pointer. This is what I've narrowed down:
> > 
> > acpi_enable_pcie() calls pcie_cfgregopen(). It's called here with
> > pcie_cfgregopen(0xe0000000, 0, 255). Inside pcie_cfgregopen, the pause
> > starts here:
> > 
> >         /* XXX: We should make sure this really fits into the direct map. */
> >         pcie_base = (vm_offset_t)pmap_mapdev(base, (maxbus + 1) << 20);
> > 
> > pmap_mapdev calls pmap_mapdev_attr, and in there this evaluates to true:
> > 
> >         /*
> >          * If the specified range of physical addresses fits within the direct
> >          * map window, use the direct map. 
> >          */
> >         if (pa < dmaplimit && pa + size < dmaplimit) {
> > 
> > so we call pmap_change_attr, which calls pmap_change_attr_locked. It's
> > changing 0x10000000 bytes starting at 0xffffff00e0000000.  The very last
> > line before returning from pmap_change_attr_locked is:
> > 
> >                 pmap_invalidate_cache_range(base, tmpva);
> > 
> > And this is where the delay is. This is calling MFENCE/CLFLUSH in a loop
> > 8 million times. We actually had a problem with CLFLUSH causing panics on
> > these same CPUs under Xen, which is partially why we're looking at VMware
> > now (see kern/138863). I'm wondering if VMware encountered the same
> > problem and replaced CLFLUSH with a software-emulated version that is far
> > slower... based on the speed, it's probably invalidating the entire cache.
> > A quick change to pmap_invalidate_cache_range to just invalidate the
> > entire cache if the area being flushed is over 8MB seems to have fixed
> > it, i.e. changing:
> > 
> >         else if (cpu_feature & CPUID_CLFSH)  {
> > 
> > to
> > 
> >         else if ((cpu_feature & CPUID_CLFSH) && ((eva-sva) < (2<<22))) {
> > 
> > 
> > However, I'm a little blurry on whether everything leading up to this
> > point is correct. It's ending up with 256MB of memory for the PCI config
> > area, which seems really excessive. Is the problem just that it wants
> > room for 256 busses, or...? Does anyone know this code path well enough
> > to know if this is deviating from the norm?
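
For reference on where the 256MB comes from (this is general PCIe ECAM
background, not something spelled out in the thread): each bus decodes 1MB
of memory-mapped config space (32 devices x 8 functions x 4KB per
function), so pcie_cfgregopen(0xe0000000, 0, 255) maps (255 + 1) << 20 =
256 << 20 = 0x10000000 bytes, which matches the size pmap_change_attr is
asked to change above.
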
> 
> I think that the idea of not using CLFLUSH in a loop for large regions
> is good. We do not extract the L2/L3 cache size now; I suppose that a
> 2MB estimate is good for most situations.
> 
> commit bbac1632d349d68b905df644656ce9a8e4aed094
> Author: Konstantin Belousov <kostik at pooma.home>
> Date:   Wed Mar 10 13:07:51 2010 +0200
> 
>     Fall back to wbinvd when region for CLFLUSH is >= 2MB.
>     
>     Submitted by:	Kevin Day <toasty at dragondata.com>

This looks good to me.

-- 
John Baldwin

