svn commit: r315522 - in head: contrib/binutils/ld/emulparams sys/conf

Sun Mar 19 15:46:35 UTC 2017

> On Sun, 19 Mar 2017, Ed Maste wrote:
> 
> > Log:
> >  use INT3 instead of NOP for x86 binary padding
> >
> >  We should never end up executing the inter-function padding, so we
> >  are better off faulting than silently carrying on to whatever function
> >  happens to be next.
> >
> >  Note that LLD will soon do this by default (although it currently pads
> >  with zeros).
> >
> >  Reviewed by:	dim, kib
> >  MFC after:	1 month
> >  Sponsored by:	The FreeBSD Foundation
> >  Differential Revision:	https://reviews.freebsd.org/D10047
> 
> Is this a pessimization?  Instruction prefetch near the end of almost
> every function now fetches INT3 instead of NOP.  Both have to be
> decoded to decide whether to speculatively execute them.  INT3 is
> unlikely to be speculatively executed, but it takes extra work to
> decide not to do so.
> 
> Functions normally end with a RET or unconditional JMP, and then branch
> prediction usually prevents speculative execution beyond the end, so the
> pessimization must be small.
> 
> Intra-function padding that is executed now uses "fat NOP" instructions
> like null LEA's since this is faster to execute than a long string of
> NOPs.  This is less readable than NOPs or even INT3's.  Of course, INT3
> can't be used for executed padding.  I think fat NOPs are also used for
> intra-function padding that is not executed.  That is just harder to
> read unless it is needed to avoid the possible pessimization in this
> commit.
> The intra-function code with nops might look like:
> 
>  		jmp	over
>  		nop
>  		# 7 nops altogether
>  		nop
>  	over:
> 
> or
> 
>  		jmp	over
>  		nullpad7	# single 7 byte null padding instruction
>  	over:
> 
> and it is likely to be CPU-dependent whether 7 possibly-speculatively
> executed nops take more or less resources than 1 possibly-speculatively
> executed fancy instruction.   I would expect the fancy instructions to
> take more resources each.
> 
> Fancy LEAs don't seem such a good choice for executed padding either.
> amd64 uses lots of REX prefixes instead of fancy instructions, since
> these are designed to have low overheads.  They certainly aren't
> executed separately.  On i386, the same technique with lots of older
> prefixes is not used much, probably because all prefixes have high
> overheads on old i386's.  They can be as slow as NOPs although they
> aren't executed separately.

As a middle ground, what about using N of something really easy for
the decoder/branch predictor to grovel over (e.g. NOPs), followed by
a single INT3 at the end of the block, so that if we do fall into the
padding we still end up getting the desired effect?
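
To make the byte layout of that suggestion concrete, here is a rough
sketch in Python.  It is illustration only, not how ld or LLD generate
padding; the name pad_block is made up for this example.  The only
facts it relies on are the one-byte encodings 0x90 (NOP) and 0xCC
(INT3).

```python
NOP = 0x90   # one-byte x86 NOP
INT3 = 0xCC  # one-byte x86 breakpoint trap


def pad_block(length: int) -> bytes:
    """Return `length` padding bytes: NOPs with a single trailing INT3.

    Hypothetical helper for illustration; a real linker picks its fill
    pattern through its own mechanisms.
    """
    if length < 0:
        raise ValueError("padding length must be non-negative")
    if length == 0:
        return b""
    # length - 1 cheap NOPs, then one INT3 so that falling through the
    # padding still traps before reaching the next function.
    return bytes([NOP]) * (length - 1) + bytes([INT3])


if __name__ == "__main__":
    # Show a 7-byte gap as space-separated hex bytes.
    print(" ".join(f"{b:02x}" for b in pad_block(7)))
```

For a 7-byte gap this gives six NOP bytes and one INT3 byte.  Whether
the decoder really finds that cheaper than a run of INT3s is the open
question above and would have to be measured.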

> Bruce
-- 
Rod Grimes                                                 rgrimes at freebsd.org