x86 boot code build

Sat Oct 6 07:28:39 UTC 2012

On Fri, Oct 5, 2012 at 3:44 PM, Bruce Evans <brde at optusnet.com.au> wrote:
[..]
> Here are results of a current run of old test code: on core2
> (ref10-i386): results only for a data size of 4K (for much smaller
> sizes, simple methods are best, and for much larger sizes, all
> reasonable methods are limited by the speed of main memory and cache
> overheads, and all reasonable methods have the same speed, except ones
> using movnt* are faster since they bypass the caches):

A while ago I experimented with switching 32 bit binaries into 64 bit
mode while running under a 64 bit OS for things like data copies.  The
differences between 32 and 64 bit used to be substantial for the
dumber data copy methods.  And of course the overhead of getting into
and out of 64 bit mode at the time was prohibitive on an Intel
processor (compared to an AMD).

Short version to explain the concept:

bcopy64:
        pushl   %ebx
        pushl   %esi
        pushl   %edi
        call    base
base:
        popl    %esi
        movl    %esi,%edx
        addl    $to64-base,%edx
        pushl   $43     /* GSEL(GUCODE_SEL, SEL_UPL) */
        pushl   %edx
        lretl
        .code64
to64:
        movq    %rsi,%r9
        addq    $to32-base,%r9
        movq    16(%rsp),%rsi   /* src */
        movq    24(%rsp),%rdi   /* dst */
        movq    32(%rsp),%rdx   /* len */
[... 64 bit bcopy trimmed...]

2:
        /* Jump back to 32 bit code segment */
        pushq   $27     /* GSEL(GUCODE32_SEL, SEL_UPL) */
        pushq   %r9
        lretq
        .code32
        .p2align 4
to32:
        popl    %edi
        popl    %esi
        popl    %ebx
        ret

Of course, this requires regular i386 code running on an amd64 kernel.
At the time it was quite safe because signal delivery would reset %cs
to deliver signals in 32 bit mode and all 64 bits of all registers
were context switched, even for a 32 bit application.
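
For readers puzzling over the constants 43 and 27 pushed before each
far return: an x86 segment selector packs the descriptor table index
into bits 3 and up, the table indicator into bit 2, and the requested
privilege level into bits 0-1.  The sketch below redoes that
arithmetic with a GSEL()-style macro.  The slot numbers 5 and 3 are
inferred from the two constants in the listing, not taken from a
kernel header, and the ASSUMED_* names are made up for illustration;
GDT layouts can differ between kernel versions.

```c
#include <stdint.h>

/* Requested privilege level for user mode (ring 3). */
#define SEL_UPL 3

/* Build a GDT selector: index in bits 3..15, TI bit clear, RPL in bits 0..1. */
#define GSEL(s, r) (((s) << 3) | (r))

/*
 * Assumed GDT slot numbers, inferred from the constants 43 and 27 in
 * the listing.  The names are hypothetical.
 */
enum {
    ASSUMED_GUCODE32_SEL = 3,   /* 32 bit user code segment */
    ASSUMED_GUCODE_SEL   = 5    /* 64 bit user code segment */
};

/* Recover the descriptor index from a selector value. */
static inline uint16_t selector_index(uint16_t sel)
{
    return (uint16_t)(sel >> 3);
}

/* Recover the requested privilege level from a selector value. */
static inline uint16_t selector_rpl(uint16_t sel)
{
    return (uint16_t)(sel & 3);
}
```

With these assumptions, GSEL(ASSUMED_GUCODE_SEL, SEL_UPL) gives the 43
pushed on the way into 64 bit mode and GSEL(ASSUMED_GUCODE32_SEL,
SEL_UPL) gives the 27 pushed on the way back.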

This was part of a larger skunkworks project I did at work called
"EMM64" (a reference to the old DOS EMM386).

It was a set of >4GB extensions and management code for a regular 32
bit app.  One of the things we used it for was to mmap a 16GB file
above the 4GB mark and use some hand-rolled hash search code.  This
allowed us to use some very large hashed key/value stores (before the
days of things like memcached).
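
The post does not describe the actual hash layout, so the following is
only a generic sketch of the idea of a key/value table living in a
memory-mapped file.  Everything in it is an assumption made for
illustration: the fixed 16 byte keys, the slot layout, the FNV-1a hash,
linear probing, and the table_* function names.  It uses plain mmap()
on an ordinary host and does not attempt the above-4GB mapping from a
32 bit process.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define KEY_LEN 16              /* hypothetical fixed key size */

/* Hypothetical slot layout; an all-zero slot is an empty slot. */
struct slot {
    uint8_t  used;
    char     key[KEY_LEN];
    uint64_t value;
};

/* FNV-1a, 64 bit: a simple stand-in hash function. */
static uint64_t hash_key(const char key[KEY_LEN])
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < KEY_LEN; i++) {
        h ^= (uint8_t)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Map a file sized for nslots slots; returns NULL on failure. */
static struct slot *table_map(const char *path, size_t nslots)
{
    size_t len = nslots * sizeof(struct slot);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}

/* Linear probe: return the slot holding key, or the first empty slot. */
static struct slot *table_probe(struct slot *t, size_t nslots,
                                const char key[KEY_LEN])
{
    size_t start = (size_t)(hash_key(key) % nslots);
    for (size_t i = 0; i < nslots; i++) {
        struct slot *s = &t[(start + i) % nslots];
        if (!s->used || memcmp(s->key, key, KEY_LEN) == 0)
            return s;
    }
    return NULL;                /* table full and key not present */
}

/* Insert or overwrite; returns 0 on success, -1 if the table is full. */
static int table_put(struct slot *t, size_t nslots,
                     const char key[KEY_LEN], uint64_t value)
{
    struct slot *s = table_probe(t, nslots, key);
    if (s == NULL)
        return -1;
    memcpy(s->key, key, KEY_LEN);
    s->value = value;
    s->used = 1;
    return 0;
}

/* Look up key; returns 0 and fills *value if found, -1 otherwise. */
static int table_get(struct slot *t, size_t nslots,
                     const char key[KEY_LEN], uint64_t *value)
{
    struct slot *s = table_probe(t, nslots, key);
    if (s == NULL || !s->used)
        return -1;
    *value = s->value;
    return 0;
}
```

Keys must be padded to KEY_LEN bytes by the caller, for example with
char k[KEY_LEN] = "alpha";.  A freshly extended file reads back as
zeros, which is why an all-zero slot doubles as "empty" here.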

Highlights:
mmap64() etc from a 32 bit process.. to put data above 4GB.
call64(): quick and easy trampoline to minimize assembler code.
dlopen64(), dlsym64(), dlcall64() etc: basically allowed you to
compile a severely limited 64 bit .so file and (relatively) easily call
it from an otherwise unmodified 32 bit application.

The use case was to sneak some memory/performance critical patches into
certain 32 bit apps that couldn't/wouldn't be recompiled at work.

Allocating space above the 4GB mark was entirely different than simply
running a small chunk of 64 bit code in 4GB of address space. I had a
kernel module that "patched" a few things in the VM and elf32
wrappers.  If it crashed.. well.. elf32 core files didn't express 64
bit mappings too well.

Yeah, it was quite a hack.. by all definitions of the term.

In any case, I'd be curious to know if people could hand tune a hybrid
set of 32+64 data manipulation code to outperform pure i386.  It was
clear at the time that badly written hybrid code outperformed badly
written 32 bit code.
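
For anyone who wants to try the comparison, here is a minimal timing
harness sketch in C.  It is not the test code quoted at the top of the
thread; the function names (copy_bytes, copy_words, time_copy) are
made up for illustration.  copy_words moves one native word at a time,
so the same source copies 4 bytes per step when built for i386 and 8
when built for amd64.  An optimizing compiler may well rewrite either
loop, so treat any numbers with suspicion and check the generated
code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t len);

/* Byte-at-a-time copy: the "dumb" baseline. */
static void copy_bytes(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len--)
        *d++ = *s++;
}

/* Native-word-at-a-time copy with a byte loop for the tail. */
static void copy_words(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (len >= sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, s, sizeof w);    /* alignment-safe word load */
        memcpy(d, &w, sizeof w);    /* alignment-safe word store */
        s += sizeof w;
        d += sizeof w;
        len -= sizeof w;
    }
    while (len--)
        *d++ = *s++;
}

/* Average CPU time per call in nanoseconds, measured with clock(). */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t len, unsigned iters)
{
    clock_t t0 = clock();

    for (unsigned i = 0; i < iters; i++)
        fn(dst, src, len);

    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / iters;
}
```

Calling time_copy(copy_words, dst, src, 4096, iters) with a large
iters on both an i386 and an amd64 build gives a first rough data
point at the 4K size mentioned in the quoted results.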

-- 
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com; KI6FJV
"All of this is for nothing if we don't go to the stars" - JMS/B5
"If Java had true garbage collection, most programs would delete
themselves upon execution." -- Robert Sewell