Purchasing the correct hardware: dual-core intel? Big cache?

Tue Apr 25 15:13:26 UTC 2006

On Tue, Apr 25, 2006 at 09:59:20AM -0400, Bill Moran wrote:
> On Tue, 25 Apr 2006 09:48:21 -0400
> Chuck Swiger <cswiger at mac.com> wrote:
> 
> > Bill Moran wrote:
> > [ ... ]
> > >> If you use well optimized applications, you see the larger performance 
> > >> gain.  Poor optimization causes a CPU to chug along, flushing the CPU cache 
> > >> often, and slowing things down considerably.
> > > 
> > > I know.  That's why I'm so desperately trying to find a way to determine
> > > how often the cache is being invalidated - so I can determine whether
> > > larger cache sizes (such as 8M) are worthwhile.
> > 
> > Guys, you're confusing two things:
> > "flushing the pipeline" vs. "L2 cache hit ratio".
> > 
> > The former happens when branch prediction/speculative execution goes awry and 
> > requires the CPU to clear the pipeline of partially-executed instructions and 
> > backtrack to follow the other path.  It is related to optimization quality of 
> > compilers, but is not related at all to how big your L2 cache is.
> > 
> > The size of your L2 cache affects how much data is more local to the CPU than 
> > main memory, and increasing it will improve the L2 cache hit ratio, or, 
> > equivalently, reduce L2 cache misses.  This is affected by some specific 
> > compiler optimizations (cf "loop unrolling"), but tends to reflect the specifics 
> > of the workload and how much multitasking of different programs you do more than 
> > the compiler.
> 
> Thanks, Chuck.
> 
> What I'm looking for is a way to measure this on the current machines
> we're using so I can make a prediction as to whether larger cache
> sizes will improve performance.  What I'm looking for is some sort of
> counter or the like that I can use to tell what my current L2 cache
> hit ratio _is_, so I can intelligently speculate as to whether another
> 6M of cache is worth the outrageous price.

The only way to be certain is to measure the performance of your particular
application on the different pieces of hardware and see which one is
fastest.

There are various methods available to measure the cache behaviour on
your current hardware, but none of them is exactly trivial to use correctly,
and even if you do get useful measurements it can be tricky to extrapolate
them to a larger cache size.

If you want to use the internal performance counters that most modern CPUs
have, you can read up on the hwpmc(4) or perfmon(4) virtual devices.

You could also run the code under some kind of simulator that allows you to
record the cache hits and misses for various simulated caches, but doing
that can be quite slow.
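
To give a rough idea of what such a simulator does internally, here is a
minimal sketch in Python.  It is a toy model, not any particular tool: the
class name, the replay() helper, and the default line size and associativity
are all assumptions made for illustration, and it only counts hits and misses
for a single LRU set-associative cache.  A real simulator would also have to
capture the address trace from your application, which is the slow part.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy LRU set-associative cache that only counts hits and misses."""

    def __init__(self, size_bytes, line_bytes=64, ways=8):
        # line_bytes and ways are illustrative defaults, not real CPU data.
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One OrderedDict per set, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            self.hits += 1
            lines.move_to_end(tag)  # mark as most recently used
        else:
            self.misses += 1
            if len(lines) >= self.ways:
                lines.popitem(last=False)  # evict the least recently used
            lines[tag] = True

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def replay(trace, cache):
    """Feed every address in the trace to the cache, return the hit ratio."""
    for address in trace:
        cache.access(address)
    return cache.hit_ratio()


# Synthetic trace: walk 64KB of memory twice, one access per 64-byte line.
trace = [i * 64 for i in range(1024)] * 2
for size_kb in (32, 64):
    ratio = replay(trace, SetAssociativeCache(size_kb * 1024))
    print(f"{size_kb}KB cache: hit ratio {ratio:.2f}")
```

With this synthetic trace the 64KB cache holds the whole buffer, so the second
pass hits every time, while the 32KB cache evicts each line before it is
reused.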

There are several other software based methods that have been proposed and
analyzed in various academic papers, but I suspect that most of them (maybe
even all) are currently a bit too complicated for an ordinary end-user to
apply (and definitely too complicated for me to go into any details here and
now.)

Some general thoughts:

If there are currently very few cache misses then increasing the cache size
will not give any noticeable performance increase (but I suspect you already
knew that.)
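
One way to see why is the usual textbook estimate of average memory access
time: hit time plus miss ratio times miss penalty.  The sketch below plugs in
made-up cycle counts (10 cycles for a hit, 200 for a miss) purely as
assumptions; they are not measurements of any real CPU.

```python
def average_access_time(hit_time, miss_ratio, miss_penalty):
    """Textbook estimate: every access pays hit_time, misses pay extra."""
    return hit_time + miss_ratio * miss_penalty


HIT_TIME = 10        # cycles, assumed for illustration
MISS_PENALTY = 200   # cycles, assumed for illustration

# Compare halving the miss ratio from a low and from a high starting point.
for miss_ratio in (0.01, 0.20):
    before = average_access_time(HIT_TIME, miss_ratio, MISS_PENALTY)
    after = average_access_time(HIT_TIME, miss_ratio / 2, MISS_PENALTY)
    print(f"miss ratio {miss_ratio:.2f}: {before:.1f} -> {after:.1f} cycles")
```

Under these assumed numbers, halving a 1% miss ratio saves about one cycle
per access, while halving a 20% miss ratio saves about twenty.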

If you currently have a lot of cache misses the performance would likely be
improved by a larger cache, but it is possible (though unlikely) that you
would need to increase the cache to 16MB (or even more) before you see any
improvement (it depends almost entirely on the memory access patterns of
the application.)
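
A small sketch of the unlucky case: a program that repeatedly walks over the
same working set, run against a fully associative LRU cache.  The function
name and the sizes are assumptions picked for illustration, and real caches
are neither fully associative nor strictly LRU, so treat this as a caricature
of the effect rather than a prediction.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity_lines):
    """Hit ratio of a fully associative LRU cache over a trace of line IDs."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            if len(cache) >= capacity_lines:
                cache.popitem(last=False)  # evict the least recently used
            cache[line] = True
    return hits / len(trace) if trace else 0.0


# Walk over a working set of 1000 cache lines ten times in a row.
working_set = 1000
trace = list(range(working_set)) * 10
for capacity in (250, 500, 999, 1000):
    print(capacity, lru_hit_ratio(trace, capacity))
```

With this access pattern every capacity below the working set gives a hit
ratio of zero, because each line is evicted just before it is needed again.
Only when the whole working set fits does the hit ratio jump.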

In general one usually reaches a point of diminishing returns when increasing
cache size, so unless your workload has an unusual memory access pattern I
suspect that you would not see much improvement by moving to an 8MB cache.
(But then again it might be that your particular workload would benefit
enormously from the larger cache.  Impossible to tell for certain without
actually trying it.)

It might also be worth noting that dual-core CPUs (which I believe was another
alternative you were looking at) usually have twice the L2 cache of a
corresponding single-core CPU, so you will get a larger cache this way too.
(And I believe the 8MB CPUs you were thinking of have it as an extra L3 cache
rather than as a larger L2 cache.  L3 cache is almost always slower than L2
cache, but still faster than main memory.)

How much your application would benefit from moving to a dual-core solution
depends on how well it scales with the number of cores.  If you are really
lucky it may be that its performance will be almost linear (or perhaps even
super-linear) in the number of cores (i.e. dual-core gives twice the
performance of a single-core) but it may also be that due to threads
contending for various resources performance will only improve marginally.
Usually the result is somewhere in between, but again the only way to tell
for certain is to actually try it. 
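
For a back-of-the-envelope feel for the "somewhere in between" case, one
common simplification is Amdahl's law, which assumes a fixed fraction of the
work can run in parallel and the rest cannot.  The sketch below uses that
model with assumed parallel fractions.  It cannot produce super-linear
speedup, and it does not model contention beyond that fixed serial fraction,
so it is only a rough guide.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only part of the work can use all cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# The parallel fractions here are assumptions, not measurements.
for parallel_fraction in (1.0, 0.9, 0.5, 0.1):
    speedup = amdahl_speedup(parallel_fraction, cores=2)
    print(f"{parallel_fraction:.0%} parallel: {speedup:.2f}x on two cores")
```

Under this model a workload that is only half parallel gains about a third
from the second core, and one that is 10% parallel gains about 5%.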

-- 
<Insert your favourite quote here.>
Erik Trulsson
ertr1013 at student.uu.se