4.7 vs 5.2.1 SMP/UP bridging performance
John Baldwin
jhb at FreeBSD.org
Thu May 6 11:55:32 PDT 2004
On Thursday 06 May 2004 01:18 pm, Bruce Evans wrote:
> On Thu, 6 May 2004, Bruce M Simpson wrote:
> > On Thu, May 06, 2004 at 10:15:44AM -0400, Andrew Gallatin wrote:
> > > For what it's worth, using those operations yields these results
> > > on my 2.53GHz P4 (for UP):
> > >
> > > Mutex (atomic_store_rel_int) cycles per iteration: 208
> > > Mutex (sfence) cycles per iteration: 85
> > > Mutex (lfence) cycles per iteration: 63
> > > Mutex (mfence) cycles per iteration: 169
> > > Mutex (none) cycles per iteration: 18
> > >
> > > lfence looks like a winner..
> >
> > Please be aware, though, that the different FENCE instructions are acting
> > as fences against different things. The NASM documentation has a good
> > quick reference for what each of the instructions do, but the definitive
> > reference is Intel's IA-32 programmer's reference manuals.
>
> They are also documented in amd64 manuals.
>
> Don't they all act as fences only on the same CPU, so they are no help
> for SMP? They are still almost twice as slow as full locks on Athlons,
> so hopefully they do more.
They are traditional memory barriers, like membar on SPARC or acq/rel on
ia64. A membar only has to apply to the current CPU, but you use it in
conjunction with a memory address that implements a lock. Thus, when you
acquire a lock, you want an lfence to ensure that the CPU won't move loads
past the lfence (assuming lfence is like ia64 acq and sfence is like ia64
rel). This ensures that you don't read any of the locked values until you
hold the lock. On release you would use an sfence to ensure that no stores
are reordered after the store that releases the actual lock. The fence
doesn't push the pending writes out to the other CPUs; however, it does
mean that another CPU won't see that the lock is released unless it can
also see all the other stores before the sfence. Thus, you can actually
have a CPU spin waiting for a lock that is already unlocked. I've seen
this on my test Alpha (DS20), where CPU0 unlocked sched_lock, CPU1 logged
a KTR trace saying it was starting to spin on sched_lock, and a short time
later CPU1 logged that it had gotten sched_lock. I'm not sure if *fence is
quite that weak; it might be. Note that each generation of ia32 processors
seems to have a weaker memory model than the previous generation.
--
John Baldwin <jhb at FreeBSD.org> <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org
More information about the freebsd-current mailing list