svn commit: r333461 - head/sys/amd64/amd64

Konstantin Belousov kostikbel at gmail.com
Thu May 10 19:38:29 UTC 2018


On Fri, May 11, 2018 at 03:31:46AM +1000, Bruce Evans wrote:
> On Thu, 10 May 2018, Konstantin Belousov wrote:
> 
> > Log:
> >  Make fpusave() and fpurestore() on amd64 ifuncs.
> >
> >  From now on, linking amd64 kernel requires either lld or newer ld.bfd.
> 
> This breaks building with gcc:
> 
> XX cc1: warnings being treated as errors
> XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored [-Wattributes]
> XX ../../../amd64/amd64/fpu.c:195: warning: redundant redeclaration of 'fpusave' [-Wredundant-decls]
> XX ./machine/fpu.h:64: warning: previous declaration of 'fpusave' was here
> XX ../../../amd64/amd64/fpu.c:202: warning: 'ifunc' attribute directive ignored [-Wattributes]
> XX ../../../amd64/amd64/fpu.c:202: warning: redundant redeclaration of 'fpurestore' [-Wredundant-decls]
> XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here
Yes.  Nothing unexpected.
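
For reference, the construct that gcc rejects here is the GNU ifunc function
attribute.  A minimal sketch of the pattern, with illustrative names rather
than the exact fpu.c code:

	void	fpusave_fxsave(void *addr);	/* implementation variants */
	void	fpusave_xsave(void *addr);

	extern int use_xsave;		/* set during CPU feature probing */

	/*
	 * The resolver runs once, when relocation processing (at boot, in
	 * the kernel case) handles the IRELATIVE entry, and returns the
	 * implementation that calls to fpusave() will bind to.
	 */
	static void (*fpusave_resolver(void))(void *)
	{
		return (use_xsave ? fpusave_xsave : fpusave_fxsave);
	}

	void	fpusave(void *addr) __attribute__((ifunc("fpusave_resolver")));

The selection thus happens once, instead of testing use_xsave on every call,
which is the point of the exercise.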

> 
> After building fpu.o with clang, linking doesn't require a newer ld, but
> booting does -- after linking with an older ld, the boot panics with a
> page fault near fork().
Yes.  I noted this in the commit message.

emaste will shortly commit a check that prevents using an inadequate linker
to build the kernel.

> 
> I used the current as and ld with the old gcc until early this year.  That
> setup is a bit fragile and regressed with lld.  The regression was minor -- lld
> expanded the bss in the data section, giving a few MB of bloat.  Now the
> bloat relative to old ld is only 29K in the text section.
> 
> Debugging and presumably profiling is also broken:
> 
> XX db> x/ia fpususpend,20
> XX fpususpend:     pushq   %rbp
> XX fpususpend+0x1: movq    %rsp,%rbp
> XX fpususpend+0x4: pushq   %rbx
> XX fpususpend+0x5: pushq   %rax
> XX fpususpend+0x6: movl    %cr0,%ebx
> XX fpususpend+0x9: clts
> XX fpususpend+0xb: call    0xffffffff80299960
> XX fpususpend+0x10:        movl    %ebx,%cr0
> XX fpususpend+0x13:        addq    $0x8,%rsp
> XX fpususpend+0x17:        popq    %rbx
> XX fpususpend+0x18:        popq    %rbp
> XX fpususpend+0x19:        ret
> XX db> x/ia 0xffffffff80299960
> XX 0xffffffff80299960:     jmpl    *0x56769a(%rip)
> 
> ddb still doesn't understand pc-relative offsets.  Decoding this with
> "p/x 0xffffffff80299960+6" gives brwsection+0x1000.  At this address
> is a pointer to fpusave_xsave since the hardware supports xsave.
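> (That is, rip points just past the 6-byte jmp, so the slot is at
> 0xffffffff80299960 + 6 + 0x56769a = 0xffffffff80801000; that slot holds
> the resolved function pointer.)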
> 
> This looks like a small or null pessimization.  The branch in the old
> version is probably faster.  The following simple test on Haswell shows
> that all reasonable versions have almost the same speed when cached:
Yes, I already noted, and mjg noted as well, that ifunc calls are directed
through the PLT.  I remember that this was not the case when I did it the
first time, but both the compiler and the linker were different then.

I tried a quick experiment with adding -fvisibility=hidden to the kernel
compilation, but it still leaves the ifunc relocations pointing at the PLT
(and the PLT correctly consists only of ifuncs).
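
For reference, the shape of the indirection, sketched from memory rather
than taken from actual kernel disassembly (fpusave_slot is an invented name
for the resolved-pointer slot):

	call	fpusave			# call site binds to the PLT stub
	...
	fpusave@plt:
	jmp	*fpusave_slot(%rip)	# slot written when the IRELATIVE
					# relocation was resolved at boot

Compiling with -fno-plt would instead emit the indirect call through the
slot at each call site, dropping the extra jump; whether that interacts
well with the kernel linker is a separate question.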

I might look at it some more later.  Anyway, there are more uses coming, and
the SMAP and pmap applications of ifuncs are clearer wins.  I selected
amd64/fpu.c for the initial commit to check the toolchain in the wild, since
this case is the simplest and, I hope, the easiest to diagnose if it fails.


> 
> XX #include <sys/cdefs.h>
> XX 
> XX #ifndef FUNC
> XX #define	FUNC	cfoo
> XX #endif
> XX 
> XX int FUNC(void);
> XX 
> XX int asmifunctargfoo(void);
> XX int (*ifuncaddr)(void) = asmifunctargfoo;
> XX int xsave = 1;
> XX 
> XX int
> XX cfoo(void)
> XX {
> XX 	return (xsave != 0 ? 2 : 1);
> XX }
> XX 
> XX __asm(".text; asmbranchfoo: cmpl $0,xsave(%rip); je 1f; movl $2,%eax; ret; 1: movl $1,%eax; ret");
> XX __asm(".text; asmnobranchfoo: cmpl $0,xsave(%rip); setne %al; movzbl %al,%eax; ret");
> XX __asm(".text; asmifuncfoo: jmp *ifuncaddr(%rip)");
> XX __asm(".text; asmifunctargfoo: movl $2,%eax; ret");	/* for xsave != 0 */
> XX 
> XX int
> XX main(void)
> XX {
> XX 	volatile int i, res;
> XX 
> XX 	res = 0;
> XX 	for (i = 0; i < 1000000000; i++)
> XX 		res += FUNC();
> XX 	return (0);
> XX }
> 
> Compile this with gcc -o foo foo.c -O -DFUNC={cfoo,asmbranchfoo,asmnobranchfoo,
> asmifuncfoo}.
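> (With 10^9 iterations, wall-clock seconds translate directly to ns per
> call; e.g. cc -O -DFUNC=asmifuncfoo -o foo foo.c && time ./foo.)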
> 
> cfoo, asmbranchfoo and asmifuncfoo take 1.71-1.72 seconds (~7 cycles at
> 4.08 GHz) on Haswell.  Only asmnobranchfoo is slower -- it takes about
> 1.78 seconds.  That is a whole quarter of a cycle longer (7.25 cycles).
> 
> The uncached case is hard to test.  This loop runs in parallel with itself
> as far as possible, so the main limit is the dependency chain, and branch
> prediction latency might not matter provided it is less than
> the loop time of 7 cycles.  In normal use, probably nothing stays cached
> because it is rarely called.  This also means that optimizations are in
> the noise.
> 
> I think the direct branch can be statically predicted better than the
> indirect branch, and it is easy to arrange, using excessive __predict_*
> annotations, that the prediction is right on some hardware.
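> 
> A sketch of such an arrangement (illustrative, assuming xsave is the
> common case, and reusing the test's names):
> 
> XX int
> XX cfoo_predicted(void)
> XX {
> XX 	if (__predict_true(xsave != 0))	/* from <sys/cdefs.h> */
> XX 		return (2);
> XX 	return (1);
> XX }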
> 
> Bruce

