[PATCH] Simplify in() and out() functions of AMD64 and i386

Fri Apr 10 02:52:01 PDT 2009

Bruce Evans schrieb:
> On Wed, 8 Apr 2009, Christoph Mallon wrote:
>> Also the {in,out}{w,l}() get treated with this constraint. They had no 
>> special case before, so now better code is generated for them.
> 
> Actually, worse code is generated for them on most cases since the
> optimization of loading 32 bits for the usual case of a non-constant
> operand is lost in your version -- see below.  The case of a constant

If you are referring to the u_uint to u_short change, then you are
wrong, see below.

> port address was rare except for 8-bit isa ports "on the motherboard"
> and has become rarer with FreeBSD's runtime configuration becoming
> even more excessive (so instead of "inb $N,%al", the code is normally
> about 10 instructions for bus_space_read_1(sc->bst, sc->bsh, N1)).
>
>> Further, the unnecessary "cld" is removed from {in,out}s{b,w,l}(), 
>> because it is guaranteed by the ABI that the direction flag is 
>> cleared. All in all the code for in/out gets a bit simpler.
> 
> kib@ already pointed out that the direction flag is not guaranteed to be
> cleared in the kernel.  For amd64, this may be a bug in the kernel, but
> for i386 it is a bug in gcc say that the ABI specifies that the direction
> flag is cleared, since there is no standard i386 ABI so gcc must not
> assume that the direction flag is clear, like it has done until recently.

There *is* an ABI spec for i386 and it mandates that DF is cleared on 
function entry and exit. Also other compilers (e.g. LLVM and
ICC) do not have this misbehaviour of GCC < 4.3 to emit cld before every
string operation at all. I would not trust GCC < 4.3 to do this
correctly everywhere either.

> There are lots more similar cld's in bus.h.
> 
> % Index: sys/i386/include/cpufunc.h
> % ===================================================================
> % --- sys/i386/include/cpufunc.h    (Revision 190841)
> % +++ sys/i386/include/cpufunc.h    (Arbeitskopie)
> % @@ -170,177 +170,97 @@
> %      __asm __volatile("hlt");
> %  }
> % % -#if !defined(__GNUCLIKE_BUILTIN_CONSTANT_P) || __GNUCLIKE_ASM < 3
> % -
> % -#define    inb(port)        inbv(port)
> % -#define    outb(port, data)    outbv(port, data)
> % -
> % -#else /* __GNUCLIKE_BUILTIN_CONSTANT_P && __GNUCLIKE_ASM >= 3 */
> 
> The ifdefs are ugly, and this one seems to have been slightly wrong, but
> they provide documentation aboout the features used:
> - __GNUCLIKE_BUILTIN_CONSTANT_P: no longer needed
> - __GNUCLIKE_ASM >= 3: to ensure that the "i" constraint is available.
>   This used to be __GNUC__ >= 2.  If that was correct then >= 3 is
>   stricter than necessary; otherwise >= 2 was not as strict as it
>   should have been, since some versions of gcc-2 certainly supported
>   the "i" constraint.
> 
> A precise translation of the above would be __GNU_PREREQ__(maj, min) ||
> defined(__INTEL_COMPILER), where (maj, min) is the version of gcc that
> implemented the "N" constraint and you should check that the Intel
> compiler (all version?) support the "N" constraint.

This is unnecessary, even GCC 2.95 (10 years old, the oldest release you 
can get from official sources) supports this constraint 
(gcc-2.95/gcc/config/i386/i386.h, line 932). This shows how old and 
stale this hack is.

> % - * The unnecessary test `(port) < 0x10000' is to generate a warning if
> % - * the `port' has type u_short or smaller.  Such types are pessimal.
> % - * This actually only works for signed types.  The range check is
> % - * careful to avoid generating warnings.
> 
> This won't work any more, if it ever did.  I think it only worked for
> the constant case, but for the constant case the width doesn't really
> matter.

The use of u_short for the port enables the compiler to generate a 
warning for too large port numbers.

> % -static __inline u_char
> % -inbc(u_int port)
> % +static inline u_char
> % +inb(u_short port)
> 
> __inline should be globabally changed to inline someday, not starting here.
> Plain inline has been Standard for 10 years now, but it still isn't
> standard -- we're still waiting for gcc to implement C99, and haven't
> switched to gcc-4.3(?) or later which will implement C99 extern inline.

This very file already uses "inline" in other places.

> Changing the type changes the API and breaks an optimization.  Callers

It does not change the API, it corrects it: Passing a port number > 
0xFFFF never worked and the API now reflects this. So if now accidently 
a too large constant port number is passed, the compiler can warn. It's 
a static inline function, so there is no problem with the ABI. About 
optimization see below.

> are supposed to provide an arg of width 32 so that it can be loaded
> as efficiently as possible (instruction prefixes and/or sign extension
> waste code space and time, mainly on old CPUs).  The 0x10000 check
> referred to in the above comment attempts to enforce this.  Casting
> down in the API prevents the 32-bit loads.

When specifying 16 bit input the compiler is perfectly allowed to leave
arbitrary garbage in the upper 16 bits, e.g. do a 32bit load of the
port. Lying about this fact leads to worse code, example:

static inline unsigned char inb(unsigned short port)
/* ALT: static inline unsigned char inb(unsigned int port) */
{
   unsigned char data;
   asm volatile("inb %1, %0" : "=a" (data) : "d" (port));
/* ALT: asm volatile("inb %w1, %0" : "=a" (data) : "d" (port)); */
   return data;
}

unsigned short in2(unsigned short port)
{
   unsigned short res;
   res  = inb(port) << 8;
   res |= inb(++port);
   return res;
}

diff of the assember of the two versions:

--- in_short.s	2009-04-10 07:54:10.000000000 +0200
+++ in_int.s	2009-04-10 07:54:48.000000000 +0200
@@ -4,17 +4,22 @@
  .globl in2
  	.type	in2, @function
  in2:
-	movzwl	4(%esp), %edx
+	pushl	%ebx
+	movzwl	8(%esp), %ebx
+	movzwl	%bx, %eax
+	movl	%eax, %edx
  #APP
  	inb %dx, %al
  #NO_APP
  	movl	%eax, %ecx
-	addl	$1, %edx
+	leal	1(%ebx), %edx
  	sall	$8, %ecx
+	movzwl	%dx, %edx
  #APP
  	inb %dx, %al
  #NO_APP
  	movzbl	%al, %edx
+	popl	%ebx
  	orl	%edx, %ecx
  	movzwl	%cx, %eax
  	ret

Compilers are way better today than back in the GCC 1.x times. Lying to
them to trick them to show a certain behaviour usually is a bad idea and 
tends to backfire.

> %  {
> % -    u_char    data;
> % -
> % -    __asm __volatile("inb %1,%0" : "=a" (data) : "id" 
> ((u_short)(port)));
> 
> The port address was already cast down here in inbc().  This seems to
> have been just because I didn't know gnu asm very well when I wrote
> the above.  The cast has no effect in the usual case where the "i"
> constraint is used, but when the "d" constraint is used (only when
> compiling with -O0?) %1 would be %edx without the cast.  In inbv(), I
> hard-code %dx, but that does't work here "i" case.  The correct method
> seems to be to use %w1 in both places (keep u_int port everywhere and
> use the same method for the "N" constraint).

The correct way to handle this is discussed above.

> % -    return (data);
> % +    u_char data;
> % +    __asm volatile("inb %1, %0" : "=a" (data) : "Nd" (port));
> % +    return data;
> %  }
> 
> Style breakages:
> - lost blank line after declarations.  This file followed the rule about
>   this except for previous breakage.  It even followed it in some cases
>   where the declarations are null.
> - lost parentheses around return value.  This file followed the rule about
>   this in all cases.

Anachronisms.

> Style not-quite-fix:
> - lost the tab after the type in the declaration.  In KNF, such a tab is
>   not used for auto declarations.  However, this file was fairly consistent
>   about breaking the rule.
> 
> Why change __volatile to volatile?  This is similar to changing __inline
> to inline, except plain volatile is more standard.  <sys/cdefs.h> has

You answered this by yourself: volatile is a standard word - there is no
reason to not use this.

> ifdef tangles which are supposed to allow users or cdefs.h itself to
> define away volatile so as to support old compilers, and we use __volatile
> instead of volatile a lot to support this.  This is probably not needed
> for kernel headers.  However, cpufunc.h is sometimes abused in userland.

"Old compilers" cannot compile the FreeBSD source anyway, so what's your 
point? This file can only be used with "new" (not older than a decade or 
so) and very GCC-compatible compilers.

> Fixed version (change nothing except the function name, %1, and the
> constraint):
> 
> %%%
> static __inline u_char
> inb(u_int port)
> {
>     u_char    data;
> 
>     __asm __volatile("inb %w1, %0" : "=a" (data) : "Nd" (port));
>     return (data);
> }
> %%%
> 
> Similarly for the other functions.
> 
> BTW, I sometimes missing having the inverse of %[bwl...] -- a way to
> generate an operand size letter from the type.  gcc-3 and gcc-4 have
> some horribly incomplete support for this.
> 
> % -#endif /* __GNUCLIKE_BUILTIN_CONSTANT_P  && __GNUCLIKE_ASM >= 3*/
> 
> Removing this accidentally fixes >= 2 style bugs in it.
> 
> % -static __inline u_char
> % -inbv(u_int port)
> % -{
> % -    u_char    data;
> % -    /*
> % -     * We use %%dx and not %1 here because i/o is done at %dx and not at
> % -     * %edx, while gcc generates inferior code (movw instead of movl)
> % -     * if we tell it to load (u_short) port.
> % -     */
> 
> Lost this documentation.  Now the magic in my version is in %w in my 
> version.
> The comment shouldn't blame gcc for generating movw -- that is the
> programmer's fault.

This is wrong, it's correct to specify a 16bit input. The compiler 
therefore is perfectly allowed to leave arbitrary garbage in the upper 
16 bits, e.g. do a 32bit load of the port and other things, see above. 
Also GCC does not generate movw here - for neither u_short nor u_int. 
GCC avoids partial register updates.

> gcc on modern machines now generates movswl for loading a u_short port
> to the u_int port specified by the old API, and movswl is fast on modern
> machines, so there is no penalty in using the old API/implementation
> on modern machines even if the caller neglected to start with a u_int
> port.

There is a penalty, see example above.

> % ...
> % -static __inline void
> % -insb(u_int port, void *addr, size_t cnt)
> % +static inline void
> % +insw(u_short port, void *addr, size_t cnt)
> %  {
> % -    __asm __volatile("cld; rep; insb"
> % -             : "+D" (addr), "+c" (cnt)
> % -             : "d" (port)
> % -             : "memory");
> % +    __asm volatile("rep; insw"
> % +        : "+D" (addr), "+c" (cnt)
> % +        : "d" (port)
> % +        : "memory");
> %  }
> 
> The diffs are hard to read, apparently due to diff getting confused
> by your API and style changes and comparing unrelated functions.
> Without the API changes, I think it would have matched insb with insb,
> etc.
> 
> % -static __inline void
> % -outbv(u_int port, u_char data)
> % +static inline void
> % +outb(u_short port, u_char data)
> %  {
> % -    u_char    al;
> % -    /*
> % -     * Use an unnecessary assignment to help gcc's register allocator.
> % -     * This make a large difference for gcc-1.40 and a tiny difference
> % -     * for gcc-2.6.0.  For gcc-1.40, al had to be ``asm("ax")'' for
> % -     * best results.  gcc-2.6.0 can't handle this.
> % -     */
> % -    al = data;
> 
> This can go of course.  I saw dummy assignments bogusly helping the
> allocator in more complicated asms for a version of atomic_cmpset()
> as late as gcc-3, but that seemed to be fixed by gcc-3.3.
> 
> % Index: sys/i386/i386/machdep.c
> % ===================================================================
> % --- sys/i386/i386/machdep.c    (Revision 190841)
> % +++ sys/i386/i386/machdep.c    (Arbeitskopie)
> % @@ -3555,45 +3555,24 @@
> %  #ifdef KDB
> % %  /*
> % - * Provide inb() and outb() as functions.  They are normally only
> % - * available as macros calling inlined functions, thus cannot be
> % - * called from the debugger.
> % - *
> % - * The actual code is stolen from <machine/cpufunc.h>, and de-inlined.
> % + * Provide inb() and outb() as functions.  They are normally only 
> available as
> % + * inline functions, thus cannot be called from the debugger.
> %   */
> 
> This should have been implemented by using cpufunc.h instead of repeating
> it, and for everything in cpufunc.h not just inb and outb, e.g., as is
> done for everything in atomic.h.  This gives a technical reason for
> using __inline -- we want to #undef it and shouldn't #undef a standard
> keyword.

You can neither #undef __inline nor inline, they both are keywords, not
macros. Actually it's the other way round: There is no way to get rid of 
__inline, it is always a keyword to GCC. But you can tell GCC to not 
consider inline as keyword (-std=c89).

> % % -#undef inb
> % -#undef outb
> % -
> %  /* silence compiler warnings */
> % -u_char inb(u_int);
> % -void outb(u_int, u_char);
> % +u_char inb_(u_short);
> % +void outb_(u_short, u_char);
> % %  u_char
> % -inb(u_int port)
> % +inb_(u_short port)
> %  {
> % -    u_char    data;
> % -    /*
> % -     * We use %%dx and not %1 here because i/o is done at %dx and not at
> % -     * %edx, while gcc generates inferior code (movw instead of movl)
> % -     * if we tell it to load (u_short) port.
> % -     */
> % -    __asm __volatile("inb %%dx,%0" : "=a" (data) : "d" (port));
> % -    return (data);
> % +    return inb(port);
> %  }
> % %  void
> % -outb(u_int port, u_char data)
> % +outb_(u_short port, u_char data)
> %  {
> % -    u_char    al;
> % -    /*
> % -     * Use an unnecessary assignment to help gcc's register allocator.
> % -     * This make a large difference for gcc-1.40 and a tiny difference
> % -     * for gcc-2.6.0.  For gcc-1.40, al had to be ``asm("ax")'' for
> % -     * best results.  gcc-2.6.0 can't handle this.
> % -     */
> % -    al = data;
> % -    __asm __volatile("outb %0,%%dx" : : "a" (al), "d" (port));
> % +    outb(port, data);
> %  }
> % %  #endif /* KDB */
> 
> This changes the (ddb undocumented) API and ABI to worse ones.  I
> already don't line having to type "call inb(foo)" (in my x86 debugger,
> inb is a command named "i", with certain space insensitivity that
> allows typing "ifoo").

The "worse" aspect is discussed above. I also changed the non-existent 
documentation of these functions, so the name change should be fine. (:

	Christoph


[PATCH] Simplify in*() and out*() functions of AMD64 and i386

[PATCH] Simplify in() and out() functions of AMD64 and i386