4.7 vs 5.2.1 SMP/UP bridging performance
John Baldwin
jhb at FreeBSD.org
Thu May 6 11:55:32 PDT 2004
On Thursday 06 May 2004 01:18 pm, Bruce Evans wrote:
> On Thu, 6 May 2004, Bruce M Simpson wrote:
> > On Thu, May 06, 2004 at 10:15:44AM -0400, Andrew Gallatin wrote:
> > > For what it's worth, using those operations yields these results
> > > on my 2.53GHz P4 (for UP):
> > >
> > > Mutex (atomic_store_rel_int) cycles per iteration: 208
> > > Mutex (sfence) cycles per iteration: 85
> > > Mutex (lfence) cycles per iteration: 63
> > > Mutex (mfence) cycles per iteration: 169
> > > Mutex (none) cycles per iteration: 18
> > >
> > > lfence looks like a winner..
> >
> > Please be aware, though, that the different FENCE instructions are acting
> > as fences against different things. The NASM documentation has a good
> > quick reference for what each of the instructions do, but the definitive
> > reference is Intel's IA-32 programmer's reference manuals.
>
> They are also documented in amd64 manuals.
>
> Don't they all act as fences only on the same CPU, so they are no help
> for SMP? They are still almost twice as slow as full locks on Athlons,
> so hopefully they do more.
They are traditional memory barriers, like membar on SPARC or acq/rel on
ia64. A membar only has to apply to the current CPU, but you use it in
conjunction with a memory address that implements a lock. Thus, when you
acquire a lock, you want an lfence to ensure that the CPU won't move loads
past the lfence (assuming lfence is like ia64 acq and sfence is like ia64
rel). This ensures that you don't read any of the locked values until you
hold the lock. On release you would use an sfence to ensure that no stores
are reordered after the store that releases the actual lock. The fence
doesn't push the pending writes out to the other CPUs; however, it does
mean that another CPU won't see that the lock is released unless it can
also see all the other stores before the sfence. Thus, you can actually
have a CPU spin waiting for a lock that is already unlocked. I've seen
this on my test Alpha (DS20), where CPU0 unlocked sched_lock, CPU1 logged
a KTR trace saying it was starting to spin on sched_lock, and a short time
later CPU1 logged that it had gotten sched_lock. I'm not sure if *fence is
quite that weak; it might be. Note that each generation of ia32 processors
seems to have a weaker memory model than the previous generation.
--
John Baldwin <jhb at FreeBSD.org> <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org
More information about the freebsd-current mailing list