80386 support in -current

Mon Jan 26 21:48:55 PST 2004

On Mon, 26 Jan 2004, Maxim Sobolev wrote:

> Out of curiosity I had run the bench on my good ol' P4 2GHz notebook,
> and was very surprised that it much slower than even PIII-400 in most cases:
>
>                                  212.22 cycles/call
> -DNO_MPLOCKED -DI386_CPU        117.32 cycles/call
> -DI386_CPU                      117.19 cycles/call
> -DNO_MPLOCKED                   31.39 cycles/call
>
> So, indeed, xchg is *lot* slower on p4 in non-SMP case than cmpxchgl, I

Slowness of xchg vs (unlocked) cmpxchg is normal (xchg forces a lock which
is expensive).  I'm going to change the xchg to a mov in the UP case.

More worrying and interesting is that everything is way slower than
on an Athlon-XP.  Locking seems to be much more expensive than cli/sti.

> > Athlon XP1600          NO_MPLOCKED:             2.02 cycles/call
> > Athlon XP1600:                                 18.07 cycles/call
> > Athlon XP1600 I386_CPU NO_MPLOCKED:            19.06 cycles/call
> > Athlon XP1600 I386_CPU:                        19.06 cycles/call
> > Celeron 400            NO_MPLOCKED:             5.03 cycles/call
> > Celeron 400:                                   25.36 cycles/call
> > Celeron 400   I386_CPU NO_MPLOCKED:            35.27 cycles/call
> > Celeron 400   I386_CPU:                        35.32 cycles/call

Of course the cycle counts for locked instructions are much longer on
the P4, because its CPU frequency is faster and its memory frequency
is much faster.  However, the Athlon is running at 1532 MHz (nominally
1400MHz 266MHz FSB with everything overclocked by 1532/1400), so its
frequencies are not very different from the P4.  But somehow it
is 15 times faster by cycle count for the unlocked cmpxchg.  Locking
the cmpxchg apparently takes 16 cycles on the Athlon and 181 cycles
on the P4. cli/sti locking (plus a couple of extra instructions for
the i386 case) apparently takes 17 cycles on the Athlon and 86 cycles
on the P4.
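
The overhead figures above are just differences against the unlocked
baseline in the two benchmark tables.  A minimal sketch of that
arithmetic follows; the function names lock_overhead() and
speed_ratio() are made up for illustration and are not part of the
benchmark.

```c
/*
 * Hypothetical helpers (not from the benchmark source) that redo the
 * subtraction used in the text above.
 */

/* Extra cycles per call added by a locking scheme: the measured cost
 * with the scheme minus the unlocked (NO_MPLOCKED) baseline. */
static double lock_overhead(double with_lock, double baseline)
{
    return with_lock - baseline;
}

/* How many times more cycles the slower CPU needs for the same call. */
static double speed_ratio(double slow_cycles, double fast_cycles)
{
    return slow_cycles / fast_cycles;
}
```

With the quoted numbers, lock_overhead(18.07, 2.02) is about 16 and
lock_overhead(212.22, 31.39) is about 181 for the locked cmpxchg,
lock_overhead(19.06, 2.02) is about 17 and lock_overhead(117.32, 31.39)
is about 86 for cli/sti, and speed_ratio(31.39, 2.02) is about 15.5.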

> had tried to rewrite atomic_readandclear_int() using cmpxchg - in
> non-SMP case it became more than 10 times faster than current xchg
> version (15 cycles vs. 200 cycles). However, when I've hacked all

atomic_readandclear_int() should be changed too.  It still uses
essentially my original plain-i386ish code which may have been written
without understanding that xchg has an implicit lock.  But it is not
used much.  xchg is used mainly in _release_lock_quick() for spinlocks.
Non-spin locks use _release_lock() which uses cmpxchg.
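
To make the two approaches concrete, here is a rough sketch of a
read-and-clear written both ways.  It uses C11 <stdatomic.h> rather
than the inline asm in atomic.h, and the names readandclear_xchg() and
readandclear_cas() are invented for the example.  Note that a compiler
will normally emit the locked form of cmpxchg for the C11 version, so
this shows only the logic, not the unlocked UP variant measured in the
benchmark.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the exchange-based version: swap in 0 and
 * return the old value.  On x86 this typically compiles to xchg, which
 * is implicitly locked. */
static unsigned int readandclear_xchg(atomic_uint *p)
{
    return atomic_exchange(p, 0u);
}

/* Hypothetical compare-and-swap version: retry until the value we
 * last saw is still in memory at the moment we replace it with 0. */
static unsigned int readandclear_cas(atomic_uint *p)
{
    unsigned int old = atomic_load(p);

    /* On failure, atomic_compare_exchange_weak() stores the current
     * value into 'old', so the loop retries with fresh data. */
    while (!atomic_compare_exchange_weak(p, &old, 0u))
        ;
    return old;
}
```

Both functions return the previous value and leave 0 behind; they
differ only in which instruction does the work.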

> functions in atomic.h to use cmpxchg instead of xchg, and run make world
> benchmark on kernels without this change and with it, I found that there
> was hardly any improvement in performance, despite expected decrease of
> mutex unlocking operation.

Does anyone know how many mutex calls there are for makeworld?  I've
noticed that for almost anything that you can count for makeworld,
although the count may look large it only accounts for epsilon% of
the time.  E.g., a makeworld that took 2540 seconds did about 600000
context switches.  Context switches are very expensive -- they take
between 1.1 and 87.6 usec according to lmbench2.  If they take 87.6 usec
then 600000 of them take 52.6 seconds, which is significant, but I
suspect an average one takes closer to 1.1 usec than 87.6 (87.6 is
with 16 processes writing to 64K).
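
The bounds on context switch time can be worked out with a few lines
of arithmetic.  This is only an illustrative sketch; switch_seconds()
and percent_of() are hypothetical names, and the inputs are the
figures quoted above.

```c
/* Total seconds spent in 'count' context switches that each take
 * 'usec_each' microseconds. */
static double switch_seconds(double count, double usec_each)
{
    return count * usec_each / 1e6;
}

/* 'part' expressed as a percentage of 'whole'. */
static double percent_of(double part, double whole)
{
    return 100.0 * part / whole;
}
```

switch_seconds(600000, 87.6) is about 52.56 seconds, or roughly 2% of
the 2540 second makeworld, while switch_seconds(600000, 1.1) is about
0.66 seconds, under 0.03%.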

Bruce