cvs commit: src/sys/i386/i386 apic_vector.s src/sys/i386/isa atpic_vector.s

Mon Feb 2 00:46:05 PST 2004

On Mon, 2 Feb 2004, Andy Farkas wrote:

> On Wed, 28 Jan 2004, John Baldwin wrote:
>
> >   Modified files:
> >     sys/i386/i386        apic_vector.s
> >     sys/i386/isa         atpic_vector.s
> >   Log:
> >   Optimize the i386 interrupt entry code to not reload the segment registers
> >   if they already contain the correct kernel selectors.
>
> What effect on performance does this change have? It seems to be a rather
> significant change to an important code path, or am I totally confused..?

I measured it in userland and saw about -1 cycles/interrupt on an AthlonXP
and about -22 cycles/interrupt on an old Celeron (negative means a
pessimization).

This optimization is hard to measure because it depends on branch
prediction, and interrupts give very unpredictable branches.  I tested
with a predictable pattern of (branch, !branch, branch, !branch, ...)
in userland to see the -1 and -22 cycle results.

In any case, this optimization is not worth doing on these machines,
since loading segment registers (at least with the same value) is not
slow.  It takes 3 cycles on AthlonXP's and not many more on old Celerons
(6 at most).  So a whole 9 or so cycles per interrupt is up for
optimization on these machines.  P4's are said to be much slower (100
cycles for a segreg load) but this disagrees with pentopt.pdf which
says that the load takes 6-8 cycles.

> Also, you've changed:
>
>  movl    $KDSEL, %eax ;  /* reload with kernel's data segment */
>
> and,
>
>  movl    $KPSEL, %eax ;  /* reload with per-CPU data segment */
>
> to:
>
>  mov     $KDSEL,%ax ;    /* load kernel ds, es and fs */
>
> and,
>
>  mov     $KPSEL,%ax ;
>
>
> Is this part of the optimisations? Or, could you briefly explain this
> change? Thank you.

This gives most of the -22 cycle optimization on Celerons.  It is a
small negative optimization on old machines and a relatively large
negative optimization on PentiumPro class machines (PPro and Celeron,
and probably P2 and P3 but not P4), but is harmless or a small positive
optimization on Athlons.  It gives an operand size prefix on all
machines and partial register stalls on PPros.  The partial register
stalls are due to a gas bug assembling the segment register moves in
the next instructions:

	mov	$KDSEL, %ax
	mov	%ax, %ds	# partial register stall
	mov	%ax, %es	# already stalled; probably not another one
	mov	$KPSEL, %ax
	mov	%ax, %fs	# partial register stall

Gas misassembles the apparent 16-bit moves to 32-bit ones.  See objdump
or gdb output for a correct disassembly of the generated code -- it
doesn't match the source code.  Segment registers have only 16 bits, so
the top 16 bits are thrown away by the CPU; however, this is apparently
done in a late stage of the pipeline after the stall occurs.  On old
Celerons, each partial register stall takes longer than loading all 3
segment registers without a stall.  This gives the -22 cycle optimization.

Old code avoided the stalls accidentally by being optimized to avoid the
operand size prefixes (since these just waste cycles on old CPUs; on
current CPUs they are usually free because deep pipelines optimize them
away).  There was a movl to %eax to avoid a prefix for this instruction,
and a hack to get the same result as the gas bug (no prefix and thus
a 32-bit move).  The hack was needed because gas bugs in this area used
to be larger.  The movl to %eax was already unoptimized for the !SMP
case due to wrong fixes for warnings about the hack that gas started
emitting when it started understanding operand sizes better.

The gas bug and the operand size prefixes are now easy to avoid using
32-bit moves for everything:

	movl	$KDSEL,%eax
	movl	%eax,%ds	# actually assembled correctly

Gas assembles 32-bit moves _from_ segment registers correctly.
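
For illustration only, here is a rough Python sketch of the byte encodings
involved, following the standard IA-32 forms (B8+r for a move of an immediate
to a register, 8E /r for a move to a segment register, 0x66 as the operand
size prefix).  The helper names and the selector value 0x10 are made up for
the example; 0x10 is a placeholder, not necessarily the real KDSEL.  Run
objdump on the real object file to see what gas actually emitted.

```python
import struct

# Segment register numbers as used in the ModRM reg field.
SREG = {"es": 0, "cs": 1, "ss": 2, "ds": 3, "fs": 4, "gs": 5}
EAX = 0               # %ax and %eax share register number 0
OPSIZE_PREFIX = 0x66  # operand size prefix (selects 16-bit in 32-bit mode)


def mov_imm_to_eax(imm, bits):
    """Encode 'movl $imm,%eax' (bits=32) or 'mov $imm,%ax' (bits=16)
    as assembled for 32-bit mode."""
    if bits == 32:
        return bytes([0xB8 + EAX]) + struct.pack("<I", imm)
    if bits == 16:
        return bytes([OPSIZE_PREFIX, 0xB8 + EAX]) + struct.pack("<H", imm)
    raise ValueError("bits must be 16 or 32")


def mov_eax_to_sreg(sreg, with_prefix=False):
    """Encode opcode 8E /r with a register source (mod=11).
    Without the prefix this reads all of %eax in 32-bit mode."""
    modrm = 0xC0 | (SREG[sreg] << 3) | EAX
    body = bytes([0x8E, modrm])
    return bytes([OPSIZE_PREFIX]) + body if with_prefix else body


def hexbytes(code):
    """Format bytes the way objdump shows them."""
    return " ".join(f"{b:02x}" for b in code)


SEL = 0x10  # placeholder selector value (assumption, see above)

print("mov  $SEL,%ax    ", hexbytes(mov_imm_to_eax(SEL, 16)))
print("movl $SEL,%eax   ", hexbytes(mov_imm_to_eax(SEL, 32)))
print("mov to %ds       ", hexbytes(mov_eax_to_sreg("ds")))
print("mov to %ds, 0x66 ", hexbytes(mov_eax_to_sreg("ds", with_prefix=True)))
```

The 16-bit immediate load comes out as 66 b8 10 00 and the 32-bit one as
b8 10 00 00 00, while the prefix-less segment register move is 8e d8 in both
cases.  Pairing the 16-bit write of %ax with the prefix-less 8e d8 is the
combination described above as giving the partial register stall; using
movl throughout leaves no 0x66 prefix and no 16-bit partial write.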

The gas bug is presumably the result of incomplete and confusing
documentation about this.  The i386 and i486 manuals barely mention
the effect of the prefix.  My assembler gets it wrong for both directions
by never generating a prefix, and it doesn't permit moves between
segment registers and 32-bit general registers.  This is based on a
literal reading of the opcodes in the table of mov's in the i386 manual
(there are no prefixes there).  However, with no prefix such moves are
actually 32-bit in 32-bit mode (except that for moves to segment
registers, the only visible effect of the missing prefix is a partial
register stall).  Current Intel manuals still don't mention prefixes
in the table, but have a lot of notes about them.  They warn that some
assemblers insert a useless prefix and recommend using "MOV DS,EAX"
(Intel Syntax) to avoid it.  However, a prefix for "MOV DS,AX" is not
always useless since it may avoid a partial register stall, so "MOV
DS,AX" should give the prefix unconditionally and "MOV DS,EAX" for
avoiding it should be more than a recommendation.  Current Intel manuals
also document the effect of 32-bit moves from segment registers (the
top bits are undefined for old CPUs and 0 for new ones).  See another
thread and the commit logs for the i386 cpufunc.h about undoing old
operand size prefix optimizations for this direction.

Bruce