Extremely slow boot on VMWare with Opteron 2352 (acpi?)

Kostik Belousov kostikbel at gmail.com
Wed Mar 10 11:28:01 UTC 2010


On Tue, Mar 09, 2010 at 06:42:02PM -0600, Kevin Day wrote:
> 
> On Mar 9, 2010, at 4:27 PM, John Baldwin wrote:
> 
> > On Tuesday 09 March 2010 3:40:26 pm Kevin Day wrote:
> >> 
> >> 
> >> If I boot up on an Opteron 2218 system, it boots normally. If I boot the exact same VM moved to a 2352, I get:
> >> 
> >> acpi0: <INTEL 440BX> on motherboard
> >> PCIe: Memory Mapped configuration base @ 0xe0000000
> >>   (very long pause)
> >> ioapic0: routing intpin 9 (ISA IRQ 9) to lapic 0 vector 48
> >> acpi0: [MPSAFE]
> >> acpi0: [ITHREAD]
> >> 
> >> then booting normally.
> > 
> > It's probably worth adding some printfs to narrow down where the pause is 
> > happening.  This looks to be all during the acpi_attach() routine, so maybe 
> > you can start there.
> 
> Okay, good pointer. This is what I've narrowed down:
> 
> acpi_enable_pcie() calls pcie_cfgregopen(). It's called here with pcie_cfgregopen(0xe0000000, 0, 255). Inside pcie_cfgregopen, the pause starts here:
> 
>         /* XXX: We should make sure this really fits into the direct map. */
>         pcie_base = (vm_offset_t)pmap_mapdev(base, (maxbus + 1) << 20);
> 
> pmap_mapdev calls pmap_mapdev_attr, and in there this evaluates to true:
> 
>         /*
>          * If the specified range of physical addresses fits within the direct
>          * map window, use the direct map. 
>          */
>         if (pa < dmaplimit && pa + size < dmaplimit) {
> 
> so we call pmap_change_attr, which calls pmap_change_attr_locked. It's changing 0x10000000 bytes starting at 0xffffff00e0000000.  The very last line before returning from pmap_change_attr_locked is:
> 
>                 pmap_invalidate_cache_range(base, tmpva);
> 
> And this is where the delay is. This is calling MFENCE/CLFLUSH in a loop 8 million times. We actually had a problem with CLFLUSH causing panics on these same CPUs under Xen, which is partially why we're looking at VMware now (see kern/138863). I'm wondering whether VMware encountered the same problem and replaced CLFLUSH with a software-emulated version that is far slower; judging by the speed, it is probably invalidating the entire cache. A quick change to pmap_invalidate_cache_range to just clear the entire cache if the area being cleared is 8MB or larger seems to have fixed it, i.e.:
> 
>         else if (cpu_feature & CPUID_CLFSH)  {
> 
> to
> 
>         else if ((cpu_feature & CPUID_CLFSH) && ((eva-sva) < (2<<22))) {
> 
> 
> However, I'm a little blurry on whether everything leading to this point is correct. It's ending up with 256MB of memory for the PCI area, which seems really excessive. Is the problem just that it wants room for 256 buses, or...? Does anyone know this code path well enough to say whether this deviates from the norm?

I think that the idea of not doing CLFLUSH in a loop for large regions
is good. We do not extract the L2/L3 cache size now, so I suppose that a
2MB estimate is good for most situations.
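
To put numbers on the sizes discussed above, here is a rough userland
sketch, not kernel code. The helper names are made up for illustration,
and the 64-byte cache line used in the example figures is an assumption;
the real line size is whatever the CPU reports. PCIe memory-mapped
configuration space takes 1MB per bus (32 devices x 8 functions x 4KB),
which is where (maxbus + 1) << 20 comes from.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size in bytes of the memory-mapped config window
 * for buses 0..maxbus.  Each bus needs 1MB (32 devices x 8 functions x
 * 4KB), which is the (maxbus + 1) << 20 expression in pcie_cfgregopen().
 */
static uint64_t
pcie_cfg_window_size(unsigned maxbus)
{
	assert(maxbus <= 255);
	return ((uint64_t)(maxbus + 1) << 20);
}

/*
 * Hypothetical helper: number of cache lines a per-line flush loop has
 * to touch for a range of 'size' bytes.  line_size is supplied by the
 * caller; 64 is only an assumed, commonly seen value.
 */
static uint64_t
cache_lines_in_range(uint64_t size, unsigned line_size)
{
	assert(line_size != 0);
	return ((size + line_size - 1) / line_size);
}
```

With maxbus = 255 the window is 0x10000000 bytes (256MB), matching the
range reported above. With an assumed 64-byte line that is roughly 4
million flushes, while a 2MB range needs 32768; a smaller line size
scales both counts up proportionally.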

commit bbac1632d349d68b905df644656ce9a8e4aed094
Author: Konstantin Belousov <kostik at pooma.home>
Date:   Wed Mar 10 13:07:51 2010 +0200

    Fall back to wbinvd when region for CLFLUSH is >= 2MB.
    
    Submitted by:	Kevin Day <toasty at dragondata.com>

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 07db5d1..4361be0 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -994,7 +994,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
-	else if (cpu_feature & CPUID_CLFSH) {
+	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
+		 eva - sva < 2 * 1024 * 1024) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
@@ -1011,7 +1012,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
-		 * globally invalidate cache as a last resort.
+		 * or the supplied range is at least 2MB.
+		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
diff --git a/sys/i386/i386/pmap.c b/sys/i386/i386/pmap.c
index 4b2e34f..f448071 100644
--- a/sys/i386/i386/pmap.c
+++ b/sys/i386/i386/pmap.c
@@ -996,7 +996,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 	if (cpu_feature & CPUID_SS)
 		; /* If "Self Snoop" is supported, do nothing. */
-	else if (cpu_feature & CPUID_CLFSH) {
+	else if ((cpu_feature & CPUID_CLFSH) != 0 &&
+		 eva - sva < 2 * 1024 * 1024) {
 
 		/*
 		 * Otherwise, do per-cache line flush.  Use the mfence
@@ -1013,7 +1014,8 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
 
 		/*
 		 * No targeted cache flush methods are supported by CPU,
-		 * globally invalidate cache as a last resort.
+		 * or the supplied range is at least 2MB.
+		 * Globally invalidate cache.
 		 */
 		pmap_invalidate_cache();
 	}
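
For reference, the decision the patched function makes can be written as
a small standalone sketch. This is not the kernel code: the enum, the
HYP_* flag values and the function name are hypothetical stand-ins for
cpu_feature, CPUID_SS and CPUID_CLFSH, so that the three-way choice can
be exercised in userland.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's feature bits and threshold. */
#define HYP_FEAT_SS		0x1u	/* "Self Snoop" supported */
#define HYP_FEAT_CLFSH		0x2u	/* CLFLUSH supported */
#define HYP_CLFLUSH_LIMIT	(2u * 1024 * 1024)	/* 2MB, as in the patch */

enum flush_method {
	FLUSH_NONE,		/* self snoop: nothing to do */
	FLUSH_CLFLUSH_LOOP,	/* per-cache-line flush of the range */
	FLUSH_WBINVD		/* global cache invalidation */
};

/*
 * Mirror the if/else chain of the patched pmap_invalidate_cache_range():
 * ranges of 2MB or more fall through to the global invalidation even
 * when CLFLUSH is available.
 */
static enum flush_method
choose_flush_method(unsigned features, uint64_t sva, uint64_t eva)
{
	assert(eva >= sva);
	if (features & HYP_FEAT_SS)
		return (FLUSH_NONE);
	if ((features & HYP_FEAT_CLFSH) != 0 &&
	    eva - sva < HYP_CLFLUSH_LIMIT)
		return (FLUSH_CLFLUSH_LOOP);
	return (FLUSH_WBINVD);
}
```

For the 0x10000000-byte range at 0xffffff00e0000000 from the report,
this picks the global invalidation instead of the per-line loop, as long
as self snoop is not advertised.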