svn commit: r218745 - head/sys/boot/i386/boot2

Thu Feb 17 10:03:49 UTC 2011

On Thu, 17 Feb 2011, Alexander Best wrote:

> On Wed Feb 16 11, Alexander Best wrote:
>> On Wed Feb 16 11, Alexander Best wrote:
>>> On Wed Feb 16 11, Alexander Best wrote:
>>>> On Wed Feb 16 11, Warner Losh wrote:
>>>> i think without this code uint32_t x serves no purpose any longer:
>>>>
>>>> /usr/git-freebsd-head/sys/boot/i386/boot2/boot2.c:322:20: warning: unused variable 'x' [-Wunused-variable]
>>>>     uint32_t addr, x;
>>>>                    ^

Removing it won't save any space, however.

>>> also due to
>>>
>>> /usr/git-freebsd-head/sys/boot/i386/boot2/boot2.c:631:8: warning: cast from 'caddr_t' (aka 'char *') to 'uint32_t *' (aka 'unsigned int *') increases required alignment from 1 to 4 [-Wcast-align]
>>>         t1 = *(uint32_t *)PTOV(0x46c);
>>>               ^~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> i think t0 and t1 can be turned into uint8_t's and PTOV(0x46c); can be casted
>>> to (uint8_t *), instead of (uint32_t *).

Maybe, but further changes would be required.  The variable is a 32-bit
tick counter.  I think the tick frequency is 18.2 Hz.  This will wrap
after some time, and the code is already broken when it wraps.  This
brokenness is not too important since this code is normally used soon
after booting and the wrap occurs after 7.25 years, but if you reduce
the counts to uint8_t's then the wrap occurs after only 14 seconds,
so you should be more careful.  From an old version of boot2.c:

% static int
% keyhit(unsigned ticks)

Ticks might also be best as uint8_t, especially if t0 and t1 are changed
to uint8_t.  It always has the value 3*SECOND or 5*SECOND, where SECOND
is the tick frequency 18.2 rounded to 18 (rounding gives an error of 1
tick after 5 seconds; this can be fixed using floating point evaluated
at compile time).  I think the compiler is prevented from reducing it
to its constant value by lack of (implicit) inlining.

% {
%     uint32_t t0, t1;
% 
%     if (opts & 1 << RBX_NOINTR)
% 	return 0;

The only change in the current verion is to obfuscate the bit test using
a macro.  Such optimizations make it hard to see where the bloat is.

%     t0 = 0;
%     for (;;) {
% 	if (xgetc(1))
% 	    return 1;
% 	t1 = *(uint32_t *)PTOV(0x46c);
% 	if (!t0)
% 	    t0 = t1;

Might be able to rewrite this so it is auto-initializing without a special
case.  Try putting 't0 = *(uint32_t *)PTOV(0x46c);' outside of the loop.
I think it is intentionally not done this way since PTOV(0x46c) involves
the large address constant 0x46c (which tends to generate large code
(see below)) and the PTOV() macro makes this worse due to the bad
implementation of this macro.  (PTOV(pa) is ((caddr_t)(pa) - __base).
Subtracting the base makes no sense, and defeats the normal address
mode base+offset.  However, here the address of PTOV(0x46C) is best
calculated (with the bad PTOV()) by loading 0x46C into a register and
subtracting __base.  Then we have a pointer to it and can load t0
initially by just dereferencing this pointer.  With the better PTOV(pa)
of (__actual_base_with_the_correct_type + (pa)) the best calculation
of the address is to load the actual base into a register and not add
0x46C to that register if we are only using PTOV(0x46c) once, but when
PTOV(046c) is used twice the best code is less clear.)

% 	if (t1 < t0 || t1 >= t0 + ticks)
% 	    return 0;

Oops, this does handle wrap, but in a slow way and sloppy way.  t1 <
t0 means that the counter has wrapped.  We then return immediately.
Not too bad if there is 1 error 7.25 years after booting.  With the
counter reduced to 8 bits, this error after 14 seconds would be
annoying.  But the usual method of handling cyclic counters takes
less code and is less sloppy:

 	if ((counter_size_t)(t1 - t0) >= ticks)
 	    return 0;

We must be careful with the cast.  Then this works with a counter_size_t
of uint8_t, since the maximum value for `ticks' is 70 which fits in 8
bits.  Also, we can almost guarantee that this code isn't interrupted,
so we won't miss a single change of the counter.  With interruption,
we would have to be very unlucky to miss a counter rollover after 7.25
7.25 years or even after 14 seconds, and even if we miss we would
normally only wait at most `ticks' ticks extra; then we do a less-wrong
thing than returning early.

%     }
% }

> this will actually increase the size.

Maybe.  Smaller types often give larger object code due to extra prefix
bytes (in i386 code) and extra instructions to promote the types before
using them.  16-bit types in 32-bit i386 code are especially bad, since
they require prefixes.  In the above, t1 < t0 gives the same size object
code with either uint8_t or unisigned.  t1 >= t0 + ticks gives larger
object code with uint8_t.  At least 1 movl will have to become movzbl
to promote a uint8_t.  But with everything uint8_t, only 8-bit
instructions should be needed.

>> with this additional change the code fits when compiled with clang:
>>
>> diff --git a/sys/boot/i386/boot2/sio.S b/sys/boot/i386/boot2/sio.S
>> index 7b8e9eb..d745129 100644
>> --- a/sys/boot/i386/boot2/sio.S
>> +++ b/sys/boot/i386/boot2/sio.S
>> @@ -29,11 +29,11 @@
>>  sio_init:      movw $SIO_PRT+0x3,%dx           # Data format reg
>>                 movb $SIO_FMT|0x80,%al          # Set format
>>                 outb %al,(%dx)                  #  and DLAB
>> -               pushl %edx                      # Save
>> +               pushb %dl                       # Save
>>                 subb $0x3,%dl                   # Divisor latch reg
>>                 movl 0x8(%esp),%eax             # Set
>>                 outw %ax,(%dx)                  #  BPS
>> -               popl %edx                       # Restore
>> +               popb %dl                        # Restore
>>                 movb $SIO_FMT,%al               # Clear
>>                 outb %al,(%dx)                  #  DLAB
>>                 incl %edx                       # Modem control reg
>
> cannot push a single byte onto the stack on i386 32 bit.

Indeed.  It could only save a space by not generating any code for pushb.
Compiling with -Wa,-al shows the null code, but also shows errors.

push/pop of a single byte only works on i386 for pushing a single
_immediate_ byte.  The result is still to push a 16- ot 32-bit word,
with the immediate byte sign extended, but when you have a constant
value that is between -128 and 127, "push[lw] $value" is by far the
shortest way to put this value in memory (2 bytes, 1 for the opcode
and 1 for the value).  In contrast, storing a general constant value
to a general address often takes 10 bytes (2 bytes for the opcode,
4 for the memory address, and 4 for the value; with a reduction of
only the last from 4 to 1 for a small immediate value).

>> ...since we're only modifying %dl in subb $0x3,%dl, we don't need to push/pop
>> a 32 bit value, but only 8 bits.

gcc also has little idea of how to generate small code for this.  Use
of any byte register except %al should be avoided, since for example
subb $3,%dl takes 3 bytes while subb $3,%al takes 2 bytes.  However
in sio.S, the result is needed in %dx.  Operating on %dl is already a
trick to save space, but is is far from best: unlike operations, movb
to %al is no more efficient than movb to any byte register.  subb
$3,%dl should therefore be written as movb $mumble,%dl to save 1 byte.
$mumble is the device's register set address, plus the device register
offset, all mod 256, and the everything must be known at compile time
not cross a 256-byte boundary for this to work.  The source code is
then more readable (no magic $3, which is the difference between the
offset of the device register previously used (very context dependent)
and the device register to be used next; instead, just the offset of
the device reguster to be used next combined with the base).  ISTR
seeing gcc doing this optimization starting from C code, but it rarely
gets a chance since the base address is rarely known at compile time.
Here we direct the code generated precisely and don't do it as well
as a compiler could starting from C code -- sio.S never needed to be
in asm.  In sio.S, the base is known at compile time (as SIOPRT), and
xx50 ports are always aligned at 8-byte boundaries and have 8 internal
registers so there is no problem with 256-byte boundaries.

Here is serial.S from the old and better boot loader i386/boot/biosboot
which I use, translated into C according to the above ideas.  When
compiled with -Os -fomit-frame-pointer, it has size 112 bytes, but
serial.S gives size 105 bytes so this doesn't quite prove my case that
C is better even for things than are carefully optimized in asm.
gcc-3.3.3 gives size 128 bytes, so gcc is actually getting closer to
being competitive with asm programmers.

serial.c:

% /*
%  * The serial port interface routines implement a simple polled i/o
%  * interface to a standard serial port.  Due to the space restrictions
%  * for the boot blocks, no BIOS support is used (since BIOS requires
%  * expensive real/protected mode switches), instead the rudimentary
%  * BIOS support is duplicated here.
%  *
%  * XXX modes switches shouldn't be expensive.  An int86() function
%  * could be used, so that new BIOS calls could be written in C.  This
%  * would probably take more space, however.  The switch could be done
%  * automagically by trapping interrupts.  This would allow space-efficient
%  * calls of the form asm("int nn" : "outputs" : "inputs").
%  *
%  * The base address and speed for the i/o port are passed from the
%  * Makefile in the COMCONSOLE and CONSPEED preprocessor macros.  The
%  * line control parameters are currently hard-coded to 8 bits, no
%  * parity, 1 stop bit (8N1).  This can be changed in init_serial().
%  *
%  * XXX FIXME we now use the kernel's default 8N1.
%  */
% 
% #ifdef broken
% #include "boot.h"
% #endif
% 
% #include <sys/types.h>
% #include <sys/tty.h>
% 
% #include <machine/cpufunc.h>
% 
% #include <dev/ic/ns16550.h>
% #include </sys/dev/sio/sioreg.h>	/* XXX incompletely globalized defs */
% #define	COMCONSOLE	0x3f8		/* XXX another; normally in makefile */
% 
% #define	COMBRD(x)	(1843200 / (16 * (x)))
% #define	CONADDR		COMCONSOLE
% 
% /*
%  * Write a character to the serial port.
%  */
% /*
%  * XXX should be named serial_putchar().
%  */
% void
% serial_putc(ch)
% 	int ch;
% {
% 	int timeout;
% 
% 	for (timeout = 10000; --timeout != 0;)
% 		if ((inb(CONADDR + com_lsr) & LSR_TXRDY))
% 			break;
% 	outb(CONADDR + com_data, ch);
% }
% 
% /*
%  * Read a character from the serial port.
%  */
% /*
%  * XXX should be named serial_getchar().
%  */
% int
% serial_getc()
% {
% 	unsigned char ch;
% 
% 	while (!(inb(CONADDR + com_lsr) & LSR_RXRDY))
% 		;
% 	ch = inb(CONADDR + com_data);
% 
% 	/* Remove any parity bit. */
% 	ch &= 0x7f;
% 
% 	/* Convert DEL to BACKSPACE. */
% 	if (ch == 0x7f)
% 		ch = 8;
% 
% 	return (ch);
% }
% 
% /*
%  * Return nonzero if getc() wouldn't block.
%  */
% /*
%  * XXX should be named serial_ischar().
%  * XXX FIXME this should be static inline to save space.
%  */
% int
% serial_ischar()
% {
% 	return (inb(CONADDR + com_lsr) & LSR_RXRDY);
% }
% 
% /*
%  * Initialize port CONADDR to speed CONSPEED, line settings 8N1.
%  */
% /*
%  * XXX should be named serial_init().
%  * XXX FIXME his should be static inline to save space.
%  */
% void
% init_serial()
% {
% 	outb(CONADDR + com_cfcr, CFCR_DLAB);
% 	outb(CONADDR + com_dlbl, COMBRD(CONSPEED));
% 	outb(CONADDR + com_dlbh, COMBRD(CONSPEED) >> 8);
% 
% 	/* Disable fifo to reduce worst-case busy-wait. */
% 	outb(CONADDR + com_fcr, 0);
% 
% 	outb(CONADDR + com_cfcr, CFCR_8BITS);	/* 8N1 */
% 	outb(CONADDR + com_mcr, MCR_DTR | MCR_RTS);
% 
% 	/* Flush (unconditionally the first time) all input. */
% 	do
% 		inb(CONADDR + com_data);
% 	while (inb(CONADDR + com_lsr) & LSR_RXRDY);
% }

Code generated from the above by gcc-4.2.1 -Os -fomit-frame-pointer -Wa,-al:

% GAS LISTING /var/tmp//cc2FFF3R.s 			page 1
% 
% 
%    1              		.file	"w.c"
%    2              		.text
%    3              	.globl serial_getc
%    4              		.type	serial_getc, @function
%    5              	serial_getc:
%    6              	.L3:
%    7 0000 BAFD0300 		movl	$1021, %edx
%    7      00

This takes 5 bytes (1 byte opcode and 4-byte immediate value), but
should take 4 bytes (1 byte operand size prefix, 1 byte opcode and
2-byte immediate value.  My old asms in cpufunc.h are probably at fault
here.  Those intentionally use "u_int port" to avoid the prefix.  This
optimizes for time instead of space (the prefix costs a cycle on old
x86's).

The value is 0x3f8+3, obfuscated in the usual way by gcc (gcc generates
decimal constants even when the source code uses hex constants :-().

%    8              	#APP
%    9 0005 EC       		inb %dx,%al
%   10              	#NO_APP

Only the low 16 bits of %edx are used.  This allows a 16-bit load to work.
But the above did a 32-bit load to optimize for time.

%   11 0006 A801     		testb	$1, %al
%   12 0008 74F6     		je	.L3
%   13 000a B2F8     		movb	$-8, %dl

Here we see gcc perfectly optimizing the adjustment from 0x3f8+3 to 0x3f8
-- it doesn't subtract 3, but loads 0xf8.  The obfuscation of the value
is even better here -- 0xf8 technically can't be loaded since the immediate
byte is signed, but that is how the value should be written.

%   14              	#APP
%   15 000c EC       		inb %dx,%al
%   16              	#NO_APP
%   17 000d 83E07F   		andl	$127, %eax

It could use andb to save a byte, but this prepares a good return value at
a cost of only 1 byte.

%   18 0010 3C7F     		cmpb	$127, %al
%   19 0012 7502     		jne	.L5
%   20 0014 B008     		movb	$8, %al
%   21              	.L5:
%   22 0016 0FB6C0   		movzbl	%al, %eax

This wastes 3 bytes.  We already zero-extended using the andl.  Perhaps
the source code can be tweaked to make this clear to the compiler.

%   23 0019 C3       		ret

I think the interface is intentionally pessimized to return a 32 bit value
and thus require wasting 1 byte for the above andl, since if we don't
extend once here then the callers will tend to extend more than once later.

%   24              		.size	serial_getc, .-serial_getc
%   25              	.globl serial_putc
%   26              		.type	serial_putc, @function
%   27              	serial_putc:
%   28 001a B9102700 		movl	$10000, %ecx
%   28      00
%   29 001f EB0A     		jmp	.L11
%   30              	.L12:
%   31 0021 BAFD0300 		movl	$1021, %edx
%   31      00
%   32              	#APP
%   33 0026 EC       		inb %dx,%al
%   34              	#NO_APP
%   35 0027 A820     		testb	$32, %al
%   36 0029 7503     		jne	.L13
%   37              	.L11:
%   38 002b 49       		decl	%ecx
%   39 002c 75F3     		jne	.L12

gcc apparently doesn't know about the `loop' instruction which does this
in 2 bytes.  The `loop*' instructions are useless and normally slower
except:
- they save space
- on 8088's, the instruction fetcher was so weak that saving space in
   instruction space also saved time in just about all cases.

%   40              	.L13:
%   41 002e BAF80300 		movl	$1016, %edx
%   41      00

This loads the full register (32 bits, though 16 would be enough) since
it is jumped to.  Better generated code would load CONADDR into %[e]dx
initially so that it is available for low-byte-only modifications
throughout.

%   42 0033 8A442404 		movb	4(%esp), %al

This function returns void so it automatically avoids extra code to
extend the return value, but its parameter may have been excessively
extended when passed here.

%   43              	#APP
%   44 0037 EE       		outb %al,%dx
%   45              	#NO_APP
%   46 0038 C3       		ret
%   47              		.size	serial_putc, .-serial_putc
%   48              	.globl init_serial
%   49              		.type	init_serial, @function
%   50              	init_serial:
%   51 0039 B9FB0300 		movl	$1019, %ecx
%   51      00
%   52 003e B080     		movb	$-128, %al
% GAS LISTING /var/tmp//cc2FFF3R.s 			page 2
% 
% 
%   53 0040 89CA     		movl	%ecx, %edx
%   54              	#APP
%   55 0042 EE       		outb %al,%dx
%   56              	#NO_APP
%   57 0043 B00C     		movb	$12, %al
%   58 0045 B2F8     		movb	$-8, %dl
%   59              	#APP
%   60 0047 EE       		outb %al,%dx
%   61              	#NO_APP
%   62 0048 31C0     		xorl	%eax, %eax
%   63 004a B2F9     		movb	$-7, %dl
%   64              	#APP
%   65 004c EE       		outb %al,%dx
%   66              	#NO_APP
%   67 004d B2FA     		movb	$-6, %dl
%   68              	#APP
%   69 004f EE       		outb %al,%dx
%   70              	#NO_APP
%   71 0050 B003     		movb	$3, %al
%   72 0052 89CA     		movl	%ecx, %edx
%   73              	#APP
%   74 0054 EE       		outb %al,%dx
%   75              	#NO_APP
%   76 0055 B2FC     		movb	$-4, %dl
%   77              	#APP
%   78 0057 EE       		outb %al,%dx
%   79              	#NO_APP
%   80              	.L16:
%   81 0058 BAF80300 		movl	$1016, %edx
%   81      00
%   82              	#APP
%   83 005d EC       		inb %dx,%al
%   84              	#NO_APP
%   85 005e B2FD     		movb	$-3, %dl
%   86              	#APP
%   87 0060 EC       		inb %dx,%al
%   88              	#NO_APP
%   89 0061 A801     		testb	$1, %al
%   90 0063 75F3     		jne	.L16
%   91 0065 C3       		ret
%   92              		.size	init_serial, .-init_serial

I didn't try hard to optimize the initialization function for use here,
but may have written parts of the asm that it is based on and wrote
them for small size, and tried hard to make the cpufuncs not turn
small code into large code.  The results are not quite perfect -- the
initial load of is into %ecx when it should be direct into %dx, and
there is another load into %edx.

%   93              	.globl serial_ischar
%   94              		.type	serial_ischar, @function
%   95              	serial_ischar:
%   96 0066 BAFD0300 		movl	$1021, %edx
%   96      00
%   97              	#APP
%   98 006b EC       		inb %dx,%al
%   99              	#NO_APP
%  100 006c 83E001   		andl	$1, %eax

andb and a byte return should be enough.

%  101 006f C3       		ret
%  102              		.size	serial_ischar, .-serial_ischar

Inlining might reduce the loading of %dx even more.  This function is
called twice, so it is not quite small enough to always inline.

%  103              		.ident	"GCC: (GNU) 4.2.1 20070719  [FreeBSD]"

Bruce