Replace bcopy() to update ether_addr

Fri Aug 24 01:08:38 UTC 2012

luigi wrote:

> On Wed, Aug 22, 2012 at 03:21:06PM -0400, John Baldwin wrote:
> > On Wednesday, August 22, 2012 2:54:07 pm Adrian Chadd wrote:
> > > On 22 August 2012 05:02, John Baldwin <jhb at freebsd.org> wrote:
> > > > On Tuesday, August 21, 2012 12:34:42 pm Adrian Chadd wrote:
> > > >> Hi,
> > > >>
> > > >> What about just creating an ETHER_ADDR_COPY(dst, src) and putting that
> > > >> in a relevant include file, then hide the ugliness there?

Good for source code ugliness and bloat (not just in the macro).

> > > >> The same benefits will likely appear when copying wifi MAC addresses
> > > >> to/from headers.

Why stop there?  Change all assignments to use the ASSIGN() macro.  The
compiler doesn't understand assignment, but we do ;-).

> > > >> Thanks, I'm glad someone noticed this.
> > > >
> > > > I doubt we even _need_ the ugliness.  We should just use *dst = *src
> > > > unless there is a compelling reason not to.
> > > 
> > > Because it's not very clear? :-) I'd much prefer my array-of-things
> > > copies to be explicit.
> > 
> > Eh?  'struct foo *src, *dst; *dst = *src' is pretty bog-standard C.  That 
> > isn't really all that obtuse.

(Builtin) memcpy is better for this in general (no sarcasm now).  It
reduces to a struct copy if the object being copy is a struct, and is
not limited to structs.

However, ethernet address types are now fully pessimized, so compilers
are almost forced to assume that they are fully misaligned.  From
net/ethernet.h:

% /*
%  * Structure of a 10Mb/s Ethernet header.
%  */

Wow, 10Mb/s.

% struct ether_header {
% 	u_char	ether_dhost[ETHER_ADDR_LEN];
% 	u_char	ether_shost[ETHER_ADDR_LEN];
% 	u_short	ether_type;
% } __packed;

This used to be not quite guaranteed to be 16-bit aligned by the short
in it.  Perhaps it is still aligned in practice,  But in 2006, it was
fully pessimized by adding __packed without adding __aligned(2).  This
__packed prevents the struct being padded to a multiple of 32 bits in
size on arm.  It has the side effect of allowing the compiler to not
add padding for alignment before instances of this struct, and allowing
embedded short to be misaligned, so all accesses to this short might
require 2 1-byte load/store instructions and more instructions to
combine the bytes.  On x86, the accesses can be a single load/store
(or sometimes an arithmetic operation), with the hardware doing the
combination, possibly more slowly than for aligned accesses.  The
pessimization is better on other arches.

% /*
%  * Structure of a 48-bit Ethernet address.
%  */
% struct ether_addr {
% 	u_char octet[ETHER_ADDR_LEN];
% } __packed;

This was always just an array of bytes, so it was nevever required to
have more than byte alignment.  However, the compiler was permitted to
add padding before and after it to align it and the thing after it.
__packed on it may prevent this.  I hope that it actually takes a
__packed on the containing struct or on the struct member with this
type to prevent external padding, but __packed on everything may be
needed anyway for __packed on this to have much effect.

An ETHER_ADDR_COPY() macro can only do worse than the compiler here,
since the object is nothing more than an array of bytes after forgetting
the extra packing info that may be known to the compiler.  On x86, the
macro can use 2 possibly-misaligned load/stores, but so can the
compiler, or the macro can use a dynamic comparison of the address to
find the alignment point and then do 1, 2 or 3 aligned load/store
pairs, but so can the compiler, or the macro can use "rep movsb", but
so can the compiler...  The best choice depends on CFLAGS (a static
test that is easiest for the compiler) or on the CPU.  For a complete
implementation, the macro must of course patch the instructions used
to best suit the current CPU.  This would be a lot of work for an
unimportant (~0.1% max) optimization.  On non-x86 or any arch with
strict alignment requirements, the only obviously correct implementions
of the macro are 6 1-byte load/store pairs, or memcpy(), or bcopy().
The compiler can easily beat all of these.

> the thread has probably forked causing people to miss the explanation
> that Bruce gave: quite often the function is called by casting
> arbitrary pointers into 'struct foo *', so the compiler's expectations
> about alignment do not necessarily match the user's lies.

Except for struct ether_addr, the cast would actually reduce alignment
(unless the compiler is smart enough to not do it literally).

> Unfortunately we are building kernels with many compiler checks
> disabled, so there is a fair chance that the compiler will not
> detect such invalid casts.

Er, we check more in the kernel than almost everywhere else.

> Probably addresses are aligned to 2-byte boundaries, but certainly
> not on a 4-byte, which means that a safe copy might require 3
> instructions, even though a compiler could otherwise decide to align
> all non-packed 'struct foo' to a 4- or 8-byte boundary and possibly
> do the copy with 2 or even 1 instruction.

The code actually asks for 1-byte alignment.  From if_ethersubr.c:

% int
% ether_output(struct ifnet *ifp, struct mbuf *m,
% 	struct sockaddr *dst, struct route *ro)
% {
% 	short type;
% 	int error = 0, hdrcmplt = 0;
% 	u_char esrc[ETHER_ADDR_LEN], edst[ETHER_ADDR_LEN];

This asks for raw byte arrays.  gcc tends to align even arrays of char
more than asked for but the code asks for minimal alignment.

% 	struct llentry *lle = NULL;
% 	struct rtentry *rt0 = NULL;
% 	struct ether_header *eh;

This points into an array of data.  Such pointers may lose extra
alignment info that may be present in the original type.

% 	struct pf_mtag *t;
% 	int loop_copy = 1;
% 	int hlen;	/* link layer header length */

Another pessimization technique used in this code is to initialize
many variables in their declarations, despite unbroken versions of
style(9) forbidding this.  Out of order execution and x86 speed
hides the slowness from this about as well as it hides the slowness
from extra memcpy()'s.  (gcc likes to do all of these inititializations
"early" in program order, but with out of order execution there is
no strict order and you can't do much better by delaying or avoiding
the initializations.)

Bruce