[CFR] mge driver / elf reloc

Bruce Evans brde at optusnet.com.au
Mon Jul 21 17:53:22 UTC 2014


On Mon, 21 Jul 2014, Ian Lepore wrote:

> On Mon, 2014-07-21 at 08:46 -0600, Warner Losh wrote:
>> On Jul 20, 2014, at 5:10 PM, John-Mark Gurney <jmg at funkthat.com> wrote:
>>
>>> Tim Kientzle wrote this message on Sun, Jul 20, 2014 at 15:25 -0700:
>>>>
>>>> On Jul 20, 2014, at 3:05 PM, John-Mark Gurney <jmg at funkthat.com> wrote:
>>>>
>>>>> Ian Lepore wrote this message on Sat, Jul 19, 2014 at 16:54 -0600:
>>>>>> Sorry to take so long to reply to this, I'm trying to get caught up.  I
>>>>>> see you've already committed the mge fixes.  I think the ELF alignment
>>>>>> fix looks good and should also be committed.
>>>>>
>>>>> So, re the elf alignment...
>>>>>
>>>>> I think we should get a set of macros that handle load/stores to/from
>>>>> unaligned addresses that are transparent to the caller....  I need
>>>>> these for some other code I'm writing...
>>>>>
>>>>> I thought Open/Net had these available, but I can't seem to find them
>>>>> right now...
>>>>
>>>> $ man 9 byteorder
>>>>
>>>> is most of what you want, lacking only some aliases to pick
>>>> the correct macro for native byte order.
>>>
>>> Um, those don't help if you want native endian order…
>>
>> Ummm, yes they do. enc converts from native order, dec decodes to native byte
>> order. They are more general than the traditional ntoh* functions, since they
>> also work on byte streams that may not be completely aligned when sitting in
>> memory. Which is what you are asking for.
>>
>>> Also, only the enc/dec functions are documented to work on non-aligned
>>> addresses, so that doesn't help in most cases…
>>
>> They work on all addresses. They are even documented to work on any address:
>>
>>      The be16enc(), be16dec(), be32enc(), be32dec(), be64enc(), be64dec(),
>>      le16enc(), le16dec(), le32enc(), le32dec(), le64enc(), and le64dec()
>>      functions encode and decode integers to/from byte strings on any align-
>>      ment in big/little endian format.
>>
>> So they are quite useful in general. Peeking under the covers at the implementation
>> also shows they will work for any alignment, so I’m having trouble understanding
>> where this objection is really coming from.
>
> The functionality requested was alignment-safe copy/assign without any
> endian change.  The code in question was conceptually something like
>
>   if (pointer & 0x03)
>      do-alignment-safe-thing
>   else
>      directly-deref-the-pointer

The enc/dec functions could be pessimized like that, but are actually
pessimized in other ways.

Pessimizations in the above include conditional branches for the check
and large code.  Everything has to be inlined, or else it is much slower
than a direct dereference; but then it has scattered branches that may
exhaust the branch prediction cache (if any) and large code that may
exhaust other caches.

The enc/dec functions don't have any branches, but they have large code
like:

% static __inline void
% le32enc(void *pp, uint32_t u)
% {
% 	uint8_t *p = (uint8_t *)pp;
% 
% 	p[0] = u & 0xff;
% 	p[1] = (u >> 8) & 0xff;
% 	p[2] = (u >> 16) & 0xff;
% 	p[3] = (u >> 24) & 0xff;
% }

Fastest on x86 (unless alignment check exceptions are enabled in CR0)
is to just do the access and let the hardware combine the bytes.  The
accesses should probably be written something like le32enc() above,
but more carefully.  Compilers should convert the above to a single
32-bit access on x86 (unless alignment check exceptions...).  The
pessimizations are that compilers aren't that smart, and/or the
enc/dec functions are written in such a way that compilers are
constrained from doing this.  IIRC, clang does this for one direction
only and gcc-4.2.1 is not smart enough to do this for either direction.
IIRC, the problematic direction is the above one, where the value is
returned indirectly.  Sprinkling __restrict might help.
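
For the native-order case that was asked for, a sketch (untested; the
name u32enc_unaligned is mine, not an existing interface) is to funnel
the access through a fixed-size memcpy(), which compilers normally
lower to a single 32-bit store on x86 and to bytewise stores on
strict-alignment arches:

#include <stdint.h>
#include <string.h>

static __inline void
u32enc_unaligned(void *pp, uint32_t u)
{
	/* Native byte order, any alignment. */
	memcpy(pp, &u, sizeof(u));
}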

When there is endianness conversion, there must be both an access
(hopefully 32 bits) and swapping operation on a register, unless all
the bytes are copied from memory to memory 1 or 2 at a time.  clang
is smart about converting the large expressions in __bswapNN_gen()
into single bswap instructions if the original rvalue is in a
register, but IIRC it is not so smart for the equivalent conversions
written with bytes and indirections.  (x86 and IIRC some other arches
have the __bswapNN_gen() macros.  These macros are just as MI as the
enc/dec inlines and more carefully written but they have not been
deduplicated due to namespace problems inhibiting a complete cleanup of
the MD endian.h files.)
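
As an illustration of the register case (only the general shape, not
the exact __bswapNN_gen() text), a swap written as one expression on a
register value is the kind of code that clang turns into a single
bswap instruction:

#include <stdint.h>

static __inline uint32_t
swap32_reg(uint32_t u)
{
	/* Shift-and-mask swap on a register; clang recognizes the
	 * idiom and emits one bswap on x86. */
	return ((u & 0xff) << 24 | (u & 0xff00) << 8 |
	    (u & 0xff0000) >> 8 | (u & 0xff000000) >> 24);
}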

Accesses are often pessimized by the __packed mistake.  This bug is
still uncommon in the ipv4 headers, where it is used with minimal damage
in struct ip.  struct ip is declared as __packed and __aligned(4).  Here
__aligned(4) says that the struct is aligned normally although it is
packed.  ipv6 headers ask for full pessimizations by declaring almost
everything as __packed without __aligned(N).  This tells the compiler
that the struct might only be 1-byte aligned, so all accesses to it
must be 1 byte at a time except on arches like x86 where the above
optimization applies (the optimization is much easier to do when the
accesses are not expressed bytewise in the code).  Bytewise accesses are
also less inherently atomic.  I think there used to be a problem with
__packed and __aligned() attributes not being inherited by function
pointers, but I can't find any problem now.
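
To make the difference concrete, a minimal sketch (invented field
names, not the real header layouts; __packed and __aligned() are the
<sys/cdefs.h> macros):

struct hdr_bad {
	uint32_t field;		/* compiler must assume alignment 1 */
} __packed;

struct hdr_ok {
	uint32_t field;		/* packed, but normally (4-byte) aligned */
} __packed __aligned(4);

Loading hdr_bad.field gives the bytewise mess below on strict-alignment
arches; loading hdr_ok.field gives a single 32-bit load.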

ia64 code for loading an int from a __packed struct (p.x):

%         addl r16 = @ltoffx(p#), r1
%         ;;
%         ld8.mov r16 = [r16], p#
%         ;;
%         mov r14 = r16
%         ;;
%         ld1 r15 = [r14], 1
%         ;;
%         ld1 r14 = [r14]
%         ;;
%         shl r14 = r14, 8
%         ;;
%         or r14 = r15, r14
%         adds r15 = 2, r16
%         ;;
%         ld1 r15 = [r15]
%         ;;
%         shl r15 = r15, 16
%         ;;
%         or r15 = r14, r15
%         adds r16 = 3, r16
%         ;;
%         ld1 r8 = [r16]
%         ;;
%         shl r8 = r8, 24
%         ;;
%         or r8 = r15, r8

It seems to have 5 memory references.  The enc/dec32 functions produce
a similar mess on x86 when the compiler can't optimize them.

This is with gcc.  clang doesn't work on ia64 and/or pluto.

ia64 code for loading an int from a __packed __aligned(4) struct (p.x):

%         addl r14 = @ltoffx(p#), r1
%         ;;
%         ld8.mov r14 = [r14], p#
%         ;;
%         ld4 r8 = [r14]

This behaviour can probably be exploited in the enc/dec functions.
When there is no endianness conversion, write 32-bit accesses as
p->x where p is a pointer to a __packed __aligned(1) struct containing
x.  The compiler will then produce the above mess on ia64 and a single
32-bit access if possible.  No ifdefs required.  When there is an
endianness conversion, do it on a register with the value p->x.

Lightly tested implementation:

% struct _le32 {
% 	uint32_t _x;
% } __packed;
% 
% #define	le32enc(p, u)	(((struct _le32 *)(p))->_x = (u))

This is simpler than the inline version, and fixes some namespace errors.
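
The decode direction would presumably be the mirror image (my sketch,
not part of the tested change; like le32enc() above it omits the swap,
so it only matches the real le32dec() on little-endian hosts):

#define	le32dec(p)	(((const struct _le32 *)(p))->_x)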

If you want to avoid the mess on ia64, this can be done at compile time
if the alignment is then known to be more than 1.  I think it can often
be known, as for struct ip (don't generate a misaligned struct ip, and
if you start with one then copy to an aligned one).  The enc/dec
inline functions could handle this by being converted to macros that
take an alignment parameter.  If the alignment is not known until
runtime... don't do that.

To add an alignment parameter to the above, use something like:

#define	_sle32(a)	struct _le32 { uint32_t _x; } __packed __aligned(a)
#define	le32enc(p, u, a)	(((_sle32(a) *)(p))->_x = (u))

Macroizing the struct declaration to add the alignment parameter to it
and avoiding backslashes made the code less readable but even shorter.

This was tested on i386 and ia64 with alignments 1 and 4, and gave
the expected output in asm.
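
A hypothetical caller (mine, for illustration) that knows its buffer
is 4-byte aligned would then write:

static void
put_len(char *buf, uint32_t len)
{
	/* Alignment 4 is a promise from the caller, so the store
	 * can be a single 32-bit access instead of bytewise. */
	le32enc(buf, len, 4);
}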

Bruce
