projects/clang380-import -r294962+ based powerpc (32-bit) buildworld -j 6: make gets SEGV, a partial smoking gun?

Mark Millard markmi at dsl-only.net
Fri Feb 5 10:39:01 UTC 2016


On 2016-Feb-4, at 11:46 PM, Mark Millard <markmi at dsl-only.net> wrote:
> 
> The problem:
> 
> For a clang 3.8.0 based buildworld TARGET_ARCH=powerpc installation attempting "make -j 6 buildworld"  (run on 4 powerpc cores) eventually gets a segmentation fault in a  make instance. (More details later.) "make buildworld" does not fault.
> 
> I expect that the details that I describe below implies some form of intermittency, such as a race condition.
> 
> (This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 in place so that signal delivery maintains the modulo 16 byte stack/frame alignment status instead of changing the alignment.)
> 
> (clang 3.8.0 targeting powerpc (32-bit) is known to be able to introduce more stack alignment dependencies by sometimes "or-ing" in offset-bits into some aligned-address lower bits instead of using addition. But I do not know if that is involved here somehow.)
> 
> 
> 
> What is always involved and what varies:
> 
> In all cases the failure was r31 being used as a frame-pointer with the value zero in r31 at the time of the address calculation, even when when the dereference of the address was later. r1 still seemed to be a valid stack pointer in all cases.
> 
> In every case the faulting routine had called one or more routines during its operation --and those had returned. There was an example or two of a self-contained routine that was recursive that got the failure.
> 
> In some cases prior calls in the faulting routine had non-zero r31 values when they returned. (There was later r31 usage that did not fault.)
> 
> Overall the call chains varied widely for various example faults, although some call context is more common as a failure point.
> 
> Use of ktrace with "-di -t cs" and use of kdump for extracting for the failing process shows the same 5 line sequence before every example "PSIG SIGSEGV". What was before those 5 lines varied across the various kdsump outputs.
> 
> I used ktrace/kdump commands of the structure:
> 
> ktrace -di -f /usr/obj/make.out -t cs -p ???
> kdump -E -f /usr/obj/make.out -p ??? > /var/tmp/make_ktrace_sigsegv_??.txt
> 
> Example results (showing the 5 lines and PSIG SIGSEGV):
> 
> (3 prior "sigreturn JUSTRETURN" among what is not shown)
>> 65158 make     0.205791 PSIG  SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 65158 make     0.205822 CALL  write(0x3,0x189e914,0x1)
>> 65158 make     0.205847 RET   write 1
>> 65158 make     0.205869 CALL  sigreturn(0xffffbb50)
>> 65158 make     0.205923 RET   sigreturn JUSTRETURN
>> 65158 make     0.205962 PSIG  SIGSEGV SIG_DFL code=SEGV_MAPERR
> 
> (365 prior "sigreturn JUSTRETURN" among what is not shown)
>>   599 make     5.552305 PSIG  SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>>   599 make     5.552323 CALL  write(0x3,0x189e914,0x1)
>>   599 make     5.552337 RET   write 1
>>   599 make     5.552347 CALL  sigreturn(0xffffbb30)
>>   599 make     5.552358 RET   sigreturn JUSTRETURN
>>   599 make     5.552381 PSIG  SIGSEGV SIG_DFL code=SEGV_MAPERR
> 
> (287 prior "sigreturn JUSTRETURN" among what is not shown)
>> 75728 make     4.141097 PSIG  SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 75728 make     4.141116 CALL  write(0x3,0x189e914,0x1)
>> 75728 make     4.141154 RET   write 1
>> 75728 make     4.141349 CALL  sigreturn(0xffffbaa0)
>> 75728 make     4.141366 RET   sigreturn JUSTRETURN
>> 75728 make     4.141404 PSIG  SIGSEGV SIG_DFL code=SEGV_MAPERR
> 
> (273 prior "sigreturn JUSTRETURN" among what is not shown)
>> 12195 make     27.213277 PSIG  SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 12195 make     27.213322 CALL  write(0x3,0x189e914,0x1)
>> 12195 make     27.213346 RET   write 1
>> 12195 make     27.213361 CALL  sigreturn(0xffffb1e0)
>> 12195 make     27.213383 RET   sigreturn JUSTRETURN
>> 12195 make     27.213418 PSIG  SIGSEGV SIG_DFL code=SEGV_MAPERR
> 
> (789 prior "sigreturn JUSTRETURN" among what is not shown)
>> 50545 make     80.255162 PSIG  SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 50545 make     80.255192 CALL  write(0x3,0x189e914,0x1)
>> 50545 make     80.255219 RET   write 1
>> 50545 make     80.255241 CALL  sigreturn(0xffffafa0)
>> 50545 make     80.255265 RET   sigreturn JUSTRETURN
>> 50545 make     80.255317 PSIG  SIGSEGV SIG_DFL code=SEGV_MAPERR
> 
> The 5 line sequence is not sufficient for the problem to occur but appears to be necessary: There were sometimes hundreds of prior "PSIG SIGCHLD". . ."RET sigreturn JUSTRETURNS" sequences for which they were not followed by "PSIG SIGSEGV". But every failure tested with ktrace has the 5 lines as an immediate prefix in the list for the process.
> 
> Which instance of make varied and where in make the fault happens varied. The "-E" elapsed times above and those JUSTRETURN counts give a solid clue to there being variability in when the fault happens.
> 
> I'll use some script log file sizes for the buidlworld as another indication of variability. I've sorted them:
> 
> 2942664
> 3304207
> 3342660
> 3474585
> 3941983
> 
> so spanning from 2.9 MBytes to 3.9 MBytes. I've since gotten a few with less and some with more.
> 
> 
> Note: A couple of times with ktrace being involved it failed at an earlier stage than I've seen otherwise. It may be that ktrace being involved makes the problem more likely/frequent.
> 
> 
> 
> Context basics (quad core PowerMac running TARGET_ARCH=powerpc (32-bit)):
> 
> # freebsd-version -ku; uname -aKU
> 11.0-CURRENT
> 11.0-CURRENT
> FreeBSD FBSDG4C1 11.0-CURRENT FreeBSD 11.0-CURRENT #2 r294962M: Mon Feb  1 00:31:03 PST 2016     markmi at FreeBSDx64:/usr/obj/clang_gcc421/powerpc.powerpc/usr/src/sys/GENERICvtsc-NODEBUG  powerpc 1100097 1100097
> 
> This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 in place so that signal delivery maintains the modulo 16 byte stack/frame alignment status instead of changing the alignment.
> 
> buildkernel was via gcc 4.2.1
> buildworld was via clang 3.8.0
> 
> 
> 
> I'm not sure that I'm going to get much farther in tracking down the source of the race(?) that leads to the SEGV's.

I should have given an example of clang 3.8.0 produced post-amble code that ends up picking up the 0x0 and putting it is r31 (the Frame Pointer for clang's produced code).

0x1801b8c <Buf_AddBytes+104>:	lwz     r30,24(r31)
0x1801b90 <Buf_AddBytes+108>:	lwz     r29,20(r31)
0x1801b94 <Buf_AddBytes+112>:	lwz     r28,16(r31)
0x1801b98 <Buf_AddBytes+116>:	lwz     r27,12(r31)
0x1801b9c <Buf_AddBytes+120>:	lwz     r26,8(r31)
0x1801ba0 <Buf_AddBytes+124>:	addi    r1,r1,32   # Stack pointer adjusted first
0x1801ba4 <Buf_AddBytes+128>:	lwz     r0,4(r1)
0x1801ba8 <Buf_AddBytes+132>:	lwz     r31,-4(r1) # Then Frame Pointer load happens
                                                   # "outside" the new stack range.
0x1801bac <Buf_AddBytes+136>:	mtlr    r0
0x1801bb0 <Buf_AddBytes+140>:	blr

Here is the related bl and following code:

   0x182516c <VarSYSVMatch+248>:	bl      0x1801b24 <Buf_AddBytes>
   0x1825170 <VarSYSVMatch+252>:	lis     r3,396
=> 0x1825174 <VarSYSVMatch+256>:	lwz     r4,16(r31)
   0x1825178 <VarSYSVMatch+260>:	lwz     r3,-13260(r3)
   0x182517c <VarSYSVMatch+264>:	subf.   r3,r4,r3
   0x1825180 <VarSYSVMatch+268>:	bne     0x18251c4 <VarSYSVMatch+336>

The "=>" points to a "sometimes SEGV" place from r31=0x0 being involved in a clang 3.8.0 buildworld's based "make" after it resumes normal operation after handling a SIGCHLD delivery.

So the "lwz r31,-4(r1)" earlier above resulted in a 0x0 in r31 just before the SEGV.

By contrast the gcc 4.2.1 generated code does not use r31 as a frame pointer and has:

0x1801db8 <Buf_AddBytes+96>:	lwz     r0,52(r1)
0x1801dbc <Buf_AddBytes+100>:	lwz     r27,28(r1)
0x1801dc0 <Buf_AddBytes+104>:	lwz     r28,32(r1)
0x1801dc4 <Buf_AddBytes+108>:	mtlr    r0
0x1801dc8 <Buf_AddBytes+112>:	lwz     r29,36(r1)
0x1801dcc <Buf_AddBytes+116>:	lwz     r30,40(r1)
0x1801dd0 <Buf_AddBytes+120>:	lwz     r31,44(r1)
0x1801dd4 <Buf_AddBytes+124>:	addi    r1,r1,48   # Stack pointer is adjusted last.
                                                   # The prior code worked in the range.
0x1801dd8 <Buf_AddBytes+128>:	blr

and:

0x1823cac <VarSYSVMatch+340>:	bl      0x1801d58 <Buf_AddBytes>
0x1823cb0 <VarSYSVMatch+344>:	b       0x1823c0c <VarSYSVMatch+180>
. . . backing up to the jump point . . .
0x1823c0c <VarSYSVMatch+180>:	lwz     r0,28(r1) # Stack pointer used, not a Frame Pointer
0x1823c10 <VarSYSVMatch+184>:	lwz     r9,0(r25)
0x1823c14 <VarSYSVMatch+188>:	xor.    r0,r0,r9
0x1823c18 <VarSYSVMatch+192>:	li      r9,0
0x1823c1c <VarSYSVMatch+196>:	li      r3,1
0x1823c20 <VarSYSVMatch+200>:	bne-    0x1823cb4 <VarSYSVMatch+348>

As r31 is not used as a frame pointer the 4.2.1 code can not get the SEGV that the clang 3.8.0 code sometimes gets.

===
Mark Millard
markmi at dsl-only.net




More information about the freebsd-ppc mailing list