[PATCH] Simplify in() and out() functions of AMD64 and i386

Fri Apr 10 05:54:43 PDT 2009

On Fri, 10 Apr 2009, Christoph Mallon wrote:

> Bruce Evans schrieb:
>> On Wed, 8 Apr 2009, Christoph Mallon wrote:
>>> Also the {in,out}{w,l}() get treated with this constraint. They had no 
>>> special case before, so now better code is generated for them.
>> 
>> Actually, worse code is generated for them on most cases since the
>> optimization of loading 32 bits for the usual case of a non-constant
>> operand is lost in your version -- see below.  The case of a constant
>
> If you are referring to the u_uint to u_short change, then you are
> wrong, see below.

No, you are wrong.  See below in previous and this mail.

>> kib@ already pointed out that the direction flag is not guaranteed to be
>> cleared in the kernel.  For amd64, this may be a bug in the kernel, but
>> for i386 it is a bug in gcc say that the ABI specifies that the direction
>> flag is cleared, since there is no standard i386 ABI so gcc must not
>> assume that the direction flag is clear, like it has done until recently.
>
> There *is* an ABI spec for i386 and it mandates that DF is cleared on 
> function entry and exit. Also other compilers (e.g. LLVM and
> ICC) do not have this misbehaviour of GCC < 4.3 to emit cld before every
> string operation at all. I would not trust GCC < 4.3 to do this
> correctly everywhere either.

Anyone can make up a spec but no one has to follow it.  FreeBSD has always
followed the old de-facto spec that requires any code that uses string
operations to set the direction flag for itself.  Trying to change this
after about 1990 is especially stupid since string instructions should
almost never be used due to their becoming relatively even slower without
even counting their large setup overhead.  Their setup overhead includes
the cld but should be large enough to do the cld in parallel so that it
is almost free.

>> % % -#if !defined(__GNUCLIKE_BUILTIN_CONSTANT_P) || __GNUCLIKE_ASM < 3
>> % -
>> % -#define    inb(port)        inbv(port)
>> % -#define    outb(port, data)    outbv(port, data)
>> % -
>> % -#else /* __GNUCLIKE_BUILTIN_CONSTANT_P && __GNUCLIKE_ASM >= 3 */
>> 
>> The ifdefs are ugly, and this one seems to have been slightly wrong, but
>> they provide documentation aboout the features used:
>> - __GNUCLIKE_BUILTIN_CONSTANT_P: no longer needed
>> - __GNUCLIKE_ASM >= 3: to ensure that the "i" constraint is available.
>>   This used to be __GNUC__ >= 2.  If that was correct then >= 3 is
>>   stricter than necessary; otherwise >= 2 was not as strict as it
>>   should have been, since some versions of gcc-2 certainly supported
>>   the "i" constraint.
>> 
>> A precise translation of the above would be __GNU_PREREQ__(maj, min) ||
>> defined(__INTEL_COMPILER), where (maj, min) is the version of gcc that
>> implemented the "N" constraint and you should check that the Intel
>> compiler (all version?) support the "N" constraint.
>
> This is unnecessary, even GCC 2.95 (10 years old, the oldest release you can 
> get from official sources) supports this constraint 
> (gcc-2.95/gcc/config/i386/i386.h, line 932). This shows how old and stale 
> this hack is.

I just want it pointed out (in the commit log) and in a separate commit
that you are intentionally weakening the ifdefs since old compilers are
not supported even to the extent of not generating syntax errors for
them and most other places (e.g., atomic.h) already assume a non-old
compiler and generate the syntax errors for old ones.

>> % - * The unnecessary test `(port) < 0x10000' is to generate a warning if
>> % - * the `port' has type u_short or smaller.  Such types are pessimal.
>> % - * This actually only works for signed types.  The range check is
>> % - * careful to avoid generating warnings.
>> 
>> This won't work any more, if it ever did.  I think it only worked for
>> the constant case, but for the constant case the width doesn't really
>> matter.
>
> The use of u_short for the port enables the compiler to generate a warning 
> for too large port numbers.

Only if it warns for truncation on assignment, which most compilers won't
do.

>> % -static __inline u_char
>> % -inbc(u_int port)
>> % +static inline u_char
>> % +inb(u_short port)
>> 
>> __inline should be globabally changed to inline someday, not starting here.
>> Plain inline has been Standard for 10 years now, but it still isn't
>> standard -- we're still waiting for gcc to implement C99, and haven't
>> switched to gcc-4.3(?) or later which will implement C99 extern inline.
>
> This very file already uses "inline" in other places.

Only in recent style breakage in 2 places when I wasn't looking
(18-Apr-08 for cpu_monitor() and cpu_wait()).  Other style breakage
for these functions:
- insertion sort errors.  This file is supposed to be sorted on function
   name
- namespace errors.  Functions in this file should have the same
   name as the instruction that they provided access too.  monitor()
   is a bit too generic for this and wait() is much too generic for this
   (however, the instruction name is mwait() which is less generic than
   monitor()), so some renaming is required.  The renaming should not
   be into the cpu_* namespace since the cpu_ prefix has a different,
   precise meaning.  These bugs are missing for the older interface to
   the "pause" instruction:
   - pause() is too generic, so the cpufunc is named ia32_pause() on i386,
     etc. (but I prefer mdpause()).
   - pause() was already in use, so even the MI interface couldn't be named
     pause().  It is named cpu_spinwait().
   - cpu_spinwait() is implemented as a macro that invokes ia32_pause().
- bogus semicolon at the end of the main asm string
- missing space after ": :" (2 instances)
- missing space after just 1 of the 4 constraint strings.
- missing declarations in the non-__GNUCLIKE_ASM &&
   non-__CC_SUPPORTS___INLINE case.

Hmm, your change from __inline to inline also bogotifies the
__CC_SUPPORTS___INLINE ifdef.  This ifdef is careful to a fault, but
not wrong provided __inline is used consistently.

>> Changing the type changes the API and breaks an optimization.  Callers
>
> It does not change the API, it corrects it: Passing a port number > 0xFFFF 
> never worked and the API now reflects this. So if now accidently a too large 
> constant port number is passed, the compiler can warn. It's a static inline 
> function, so there is no problem with the ABI. About optimization see below.

No, it breaks it.  I define the API so I know what it is!  The caller should
pass a u_int port so that optimal code is generated (at a tiny cost to
data space for storing 32-bit port numbers in softcs).  The hardware only
uses the low 16 bits so the upper 16 bits don't matter.  If the caller
starts with a u_short port, then the API generates sub-optimal code to
promote to a u_int.  The promotion logically occurs as part of the function
call protocol, but since the function is inline it normally occurs more
directly (e.g., as "movswl port,%edx").

>> are supposed to provide an arg of width 32 so that it can be loaded
>> as efficiently as possible (instruction prefixes and/or sign extension
>> waste code space and time, mainly on old CPUs).  The 0x10000 check
>> referred to in the above comment attempts to enforce this.  Casting
>> down in the API prevents the 32-bit loads.
>
> When specifying 16 bit input the compiler is perfectly allowed to leave
> arbitrary garbage in the upper 16 bits, e.g. do a 32bit load of the
> port. Lying about this fact leads to worse code, example:

It is only permitted to do that if it knows that the the upper 16 bits
are in an object, or if they are in a register.  While it could know that
bits in an object (say for port = sc->sc_port where sc has stuff above
sc_port so that the bits are in the object (*sc)), I've never seen gcc
do this optimization.

I didn't lie to the compiler, but told it to load all 32 bits of %edx.
Then I ignore the top 16 bits in the asm.

> static inline unsigned char inb(unsigned short port)
> /* ALT: static inline unsigned char inb(unsigned int port) */
> {
>  unsigned char data;
>  asm volatile("inb %1, %0" : "=a" (data) : "d" (port));
> /* ALT: asm volatile("inb %w1, %0" : "=a" (data) : "d" (port)); */
>  return data;
> }
>
> unsigned short in2(unsigned short port)
> {
>  unsigned short res;
>  res  = inb(port) << 8;
>  res |= inb(++port);
>  return res;
> }

Sure, this gives suboptimal code for ALT since `port' starts in a u_short.
Try starting with a matching ALT for in2: in2(unsigned port).  The types
must match at all levels including in2()'s caller to avoid promoting and
demoting back and forth.  Only some of the conversions can be free.

++port involves various conversions if port is not u_int throughout.
The compiler can usually optimize these (by keeping port in a register
and ignoring top bits).

> diff of the assember of the two versions:
> --- in_short.s	2009-04-10 07:54:10.000000000 +0200
> +++ in_int.s	2009-04-10 07:54:48.000000000 +0200
> @@ -4,17 +4,22 @@
> .globl in2
> 	.type	in2, @function
> in2:
> -	movzwl	4(%esp), %edx

With my ALT in2 version:

 	movl	4(%esp), %edx.

This is 3-4 cycles faster than on i386 and i486.  On modern CPUs, there
is probably no difference, but movzwl wastes a byte or two of instruction
space so it might cost in some cases.

> +	pushl	%ebx
> +	movzwl	8(%esp), %ebx
> +	movzwl	%bx, %eax
> +	movl	%eax, %edx

This is unnecessarily stupid code.  Why not identical code to the
u_int case?  Leaving out the inb(++port) gives identical code for the
first inb().

> #APP
> 	inb %dx, %al
> #NO_APP
> 	movl	%eax, %ecx
> -	addl	$1, %edx
> +	leal	1(%ebx), %edx

The compiler is being stupid above so that it can be stupid here too
(leal is slightly worse than addl).  It should just use %edx and not take
a trip through %ebx and %eax, and have to save %ebx.

> 	sall	$8, %ecx
> +	movzwl	%dx, %edx

This is needed for ++port since it has to demote the intermediate u_int
in ++port back to u_short.

> #APP
> 	inb %dx, %al
> #NO_APP
> 	movzbl	%al, %edx
> +	popl	%ebx
> 	orl	%edx, %ecx
> 	movzwl	%cx, %eax
> 	ret

With u_ints throughout (ALT for inb() and in2()), there are no conversions
and the generated code is optimal (movl instead of movzwl and no other
differences).  With the only conversion being a demotion in inb() (non-ALT
for inb() and ALT for in2()), the generated code is again optimal
(movl instead of movzwl).  gcc is optimizing away the conversion in this
case.  With port replaced by portarray[0] where portarray[1] exists,
gcc no longer does the optimization.

This is with gcc-4.2 with the default (usually wrong) arch.  gcc-3.3
and gcc-4.2 -march=i386 give the original pessimization of using movw
instead of movzwl to load a 16-bit object into a 16 or 32-bit register.
When only 16 bits are needed gcc-old just uses movw.  When a promotion
is needed in the ALT in2(), it uses movw to %bx where the above uses
movzwl to %ebx.  Then it must promote %bx and it uses movzwl to %eax
for this, then it moves to %edx.  This is sub-optimal but not as silly
as the above -- the above has already done the promotion into %ebx and
then it does it again into %eax.

> Compilers are way better today than back in the GCC 1.x times. Lying to
> them to trick them to show a certain behaviour usually is a bad idea and 
> tends to backfire.

Apparently gcc is still stupid when there are just a 4 not-quite null
promotions and demotions: for

 	u_short port;

 	use((u_int)port);			/* ALT inb(port); */
 	port = (u_short)((u_int)port + 1);	/* ++port; */
 	use((u_int)port);			/* ALT inb(port); */

>> %  {
>> % -    u_char    data;
>> % -
>> % -    __asm __volatile("inb %1,%0" : "=a" (data) : "id" 
>> ((u_short)(port)));
>> 
>> The port address was already cast down here in inbc().  This seems to
>> have been just because I didn't know gnu asm very well when I wrote
>> the above.  The cast has no effect in the usual case where the "i"
>> constraint is used, but when the "d" constraint is used (only when
>> compiling with -O0?) %1 would be %edx without the cast.  In inbv(), I
>> hard-code %dx, but that does't work here "i" case.  The correct method
>> seems to be to use %w1 in both places (keep u_int port everywhere and
>> use the same method for the "N" constraint).
>
> The correct way to handle this is discussed above.

No, this gives movzwl instead of movl in some cases and movw instead of
movl in some other cases.

>> % -    return (data);
>> % +    u_char data;
>> % +    __asm volatile("inb %1, %0" : "=a" (data) : "Nd" (port));
>> % +    return data;
>> %  }
>> 
>> Style breakages:
>> - lost blank line after declarations.  This file followed the rule about
>>   this except for previous breakage.  It even followed it in some cases
>>   where the declarations are null.
>> - lost parentheses around return value.  This file followed the rule about
>>   this in all cases.
>
> Anachronisms.

No, standard in FreeBSD.

>> Style not-quite-fix:
>> - lost the tab after the type in the declaration.  In KNF, such a tab is
>>   not used for auto declarations.  However, this file was fairly consistent
>>   about breaking the rule.
>> 
>> Why change __volatile to volatile?  This is similar to changing __inline
>> to inline, except plain volatile is more standard.  <sys/cdefs.h> has
>
> You answered this by yourself: volatile is a standard word - there is no
> reason to not use this.

It is for stylistic reasons.  Readers will wonder why most code uses
__volatile and yours doesn't.

>> ifdef tangles which are supposed to allow users or cdefs.h itself to
>> define away volatile so as to support old compilers, and we use __volatile
>> instead of volatile a lot to support this.  This is probably not needed
>> for kernel headers.  However, cpufunc.h is sometimes abused in userland.
>
> "Old compilers" cannot compile the FreeBSD source anyway, so what's your 
> point? This file can only be used with "new" (not older than a decade or so) 
> and very GCC-compatible compilers.

The sources should be as independent of compiler versions as possible.
cpufunc.h was independent (it is supposed contain only prototypes only
for compilers that can't handle its asms), so why break it?

>> % -static __inline u_char
>> % -inbv(u_int port)
>> % -{
>> % -    u_char    data;
>> % -    /*
>> % -     * We use %%dx and not %1 here because i/o is done at %dx and not at
>> % -     * %edx, while gcc generates inferior code (movw instead of movl)
>> % -     * if we tell it to load (u_short) port.
>> % -     */
>> 
>> Lost this documentation.  Now the magic in my version is in %w in my 
>> version.
>> The comment shouldn't blame gcc for generating movw -- that is the
>> programmer's fault.
>
> This is wrong, it's correct to specify a 16bit input. The compiler therefore 
> is perfectly allowed to leave arbitrary garbage in the upper 16 bits, e.g. do 
> a 32bit load of the port and other things, see above. Also GCC does not 
> generate movw here - for neither u_short nor u_int. GCC avoids partial 
> register updates.

No, this is correct.  Please allow me to find bugs in my own comment :-).
The comment would need to be too long, like this mail, to give all the
details and I apparently broke it trying to make it shorter.

It is correct but sub-optimal to specify a 16-bit input.

Whether gcc generates a movw is now very machine-dependent.  When the
code and comment was written, gcc always generated movw for memory to
register 16-bit loads.  Now it prefers to generate movzwl since partial
register updates are bad and movzwl is now fast, but it still generates
movw for old CPUs.  My optimization was not to avoid partial register
updates, but just to avoid operand size prefixes for loading u_shorts
and extra instructions for conversions in expressions like port+1 and
use(foo) (where use() takes a u_int)).  Your examples show that compilers
now avoid the prefixes or conversions in some but not all of the cases.

>> % Index: sys/i386/i386/machdep.c
>> % ===================================================================
>> % --- sys/i386/i386/machdep.c    (Revision 190841)
>> % +++ sys/i386/i386/machdep.c    (Arbeitskopie)
>> % @@ -3555,45 +3555,24 @@
>> %  #ifdef KDB
>> % %  /*
>> % - * Provide inb() and outb() as functions.  They are normally only
>> % - * available as macros calling inlined functions, thus cannot be
>> % - * called from the debugger.
>> % - *
>> % - * The actual code is stolen from <machine/cpufunc.h>, and de-inlined.
>> % + * Provide inb() and outb() as functions.  They are normally only 
>> available as
>> % + * inline functions, thus cannot be called from the debugger.
>> %   */
>> 
>> This should have been implemented by using cpufunc.h instead of repeating
>> it, and for everything in cpufunc.h not just inb and outb, e.g., as is
>> done for everything in atomic.h.  This gives a technical reason for
>> using __inline -- we want to #undef it and shouldn't #undef a standard
>> keyword.
>
> You can neither #undef __inline nor inline, they both are keywords, not
> macros. Actually it's the other way round: There is no way to get rid of 
> __inline, it is always a keyword to GCC. But you can tell GCC to not consider 
> inline as keyword (-std=c89).

Oops, I should have written "#define away" instead of #undef.  I think it
is legal in C to #define any identifer.  It just might things by giving
inconsistencies.  <sys/cdefs.h> already defines away __inline in some
cases (mostly for old compilers) and defines it to plain inline for the
C++ case even for new gcc.

The u_short optimizations are mostly moot for inb() etc.  The hardware
part of inb() now normally takes a few thousand times longer than the
whole 1 cycle potentially saved by these optimizations.  When inb() was
written, the hardware part took less than one hundred times longer.  I
defend the optimization mainly because losing it is a regression and
I still hate APIs that use chars and shorts (or [u]intN_t's) to give
pessimizations by maximizing conversions.

Bruce

[PATCH] Simplify in*() and out*() functions of AMD64 and i386

[PATCH] Simplify in() and out() functions of AMD64 and i386