projects/clang380-import -r294962+ based powerpc (32-bit) buildworld -j 6: make gets SEGV, a partial smoking gun?
Mark Millard
markmi at dsl-only.net
Fri Feb 5 10:39:01 UTC 2016
On 2016-Feb-4, at 11:46 PM, Mark Millard <markmi at dsl-only.net> wrote:
>
> The problem:
>
> For a clang 3.8.0 based buildworld TARGET_ARCH=powerpc installation attempting "make -j 6 buildworld" (run on 4 powerpc cores) eventually gets a segmentation fault in a make instance. (More details later.) "make buildworld" does not fault.
>
> I expect that the details that I describe below implies some form of intermittency, such as a race condition.
>
> (This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 in place so that signal delivery maintains the modulo 16 byte stack/frame alignment status instead of changing the alignment.)
>
> (clang 3.8.0 targeting powerpc (32-bit) is known to be able to introduce more stack alignment dependencies by sometimes "or-ing" in offset-bits into some aligned-address lower bits instead of using addition. But I do not know if that is involved here somehow.)
>
>
>
> What is always involved and what varies:
>
> In all cases the failure was r31 being used as a frame-pointer with the value zero in r31 at the time of the address calculation, even when when the dereference of the address was later. r1 still seemed to be a valid stack pointer in all cases.
>
> In every case the faulting routine had called one or more routines during its operation --and those had returned. There was an example or two of a self-contained routine that was recursive that got the failure.
>
> In some cases prior calls in the faulting routine had non-zero r31 values when they returned. (There was later r31 usage that did not fault.)
>
> Overall the call chains varied widely for various example faults, although some call context is more common as a failure point.
>
> Use of ktrace with "-di -t cs" and use of kdump for extracting for the failing process shows the same 5 line sequence before every example "PSIG SIGSEGV". What was before those 5 lines varied across the various kdsump outputs.
>
> I used ktrace/kdump commands of the structure:
>
> ktrace -di -f /usr/obj/make.out -t cs -p ???
> kdump -E -f /usr/obj/make.out -p ??? > /var/tmp/make_ktrace_sigsegv_??.txt
>
> Example results (showing the 5 lines and PSIG SIGSEGV):
>
> (3 prior "sigreturn JUSTRETURN" among what is not shown)
>> 65158 make 0.205791 PSIG SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 65158 make 0.205822 CALL write(0x3,0x189e914,0x1)
>> 65158 make 0.205847 RET write 1
>> 65158 make 0.205869 CALL sigreturn(0xffffbb50)
>> 65158 make 0.205923 RET sigreturn JUSTRETURN
>> 65158 make 0.205962 PSIG SIGSEGV SIG_DFL code=SEGV_MAPERR
>
> (365 prior "sigreturn JUSTRETURN" among what is not shown)
>> 599 make 5.552305 PSIG SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 599 make 5.552323 CALL write(0x3,0x189e914,0x1)
>> 599 make 5.552337 RET write 1
>> 599 make 5.552347 CALL sigreturn(0xffffbb30)
>> 599 make 5.552358 RET sigreturn JUSTRETURN
>> 599 make 5.552381 PSIG SIGSEGV SIG_DFL code=SEGV_MAPERR
>
> (287 prior "sigreturn JUSTRETURN" among what is not shown)
>> 75728 make 4.141097 PSIG SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 75728 make 4.141116 CALL write(0x3,0x189e914,0x1)
>> 75728 make 4.141154 RET write 1
>> 75728 make 4.141349 CALL sigreturn(0xffffbaa0)
>> 75728 make 4.141366 RET sigreturn JUSTRETURN
>> 75728 make 4.141404 PSIG SIGSEGV SIG_DFL code=SEGV_MAPERR
>
> (273 prior "sigreturn JUSTRETURN" among what is not shown)
>> 12195 make 27.213277 PSIG SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 12195 make 27.213322 CALL write(0x3,0x189e914,0x1)
>> 12195 make 27.213346 RET write 1
>> 12195 make 27.213361 CALL sigreturn(0xffffb1e0)
>> 12195 make 27.213383 RET sigreturn JUSTRETURN
>> 12195 make 27.213418 PSIG SIGSEGV SIG_DFL code=SEGV_MAPERR
>
> (789 prior "sigreturn JUSTRETURN" among what is not shown)
>> 50545 make 80.255162 PSIG SIGCHLD caught handler=0x180aae0 mask=0x0 code=CLD_EXITED
>> 50545 make 80.255192 CALL write(0x3,0x189e914,0x1)
>> 50545 make 80.255219 RET write 1
>> 50545 make 80.255241 CALL sigreturn(0xffffafa0)
>> 50545 make 80.255265 RET sigreturn JUSTRETURN
>> 50545 make 80.255317 PSIG SIGSEGV SIG_DFL code=SEGV_MAPERR
>
> The 5 line sequence is not sufficient for the problem to occur but appears to be necessary: There were sometimes hundreds of prior "PSIG SIGCHLD". . ."RET sigreturn JUSTRETURNS" sequences for which they were not followed by "PSIG SIGSEGV". But every failure tested with ktrace has the 5 lines as an immediate prefix in the list for the process.
>
> Which instance of make varied and where in make the fault happens varied. The "-E" elapsed times above and those JUSTRETURN counts give a solid clue to there being variability in when the fault happens.
>
> I'll use some script log file sizes for the buidlworld as another indication of variability. I've sorted them:
>
> 2942664
> 3304207
> 3342660
> 3474585
> 3941983
>
> so spanning from 2.9 MBytes to 3.9 MBytes. I've since gotten a few with less and some with more.
>
>
> Note: A couple of times with ktrace being involved it failed at an earlier stage than I've seen otherwise. It may be that ktrace being involved makes the problem more likely/frequent.
>
>
>
> Context basics (quad core PowerMac running TARGET_ARCH=powerpc (32-bit)):
>
> # freebsd-version -ku; uname -aKU
> 11.0-CURRENT
> 11.0-CURRENT
> FreeBSD FBSDG4C1 11.0-CURRENT FreeBSD 11.0-CURRENT #2 r294962M: Mon Feb 1 00:31:03 PST 2016 markmi at FreeBSDx64:/usr/obj/clang_gcc421/powerpc.powerpc/usr/src/sys/GENERICvtsc-NODEBUG powerpc 1100097 1100097
>
> This is with the content of sys/powerpc/powerpc/sigcode32.S -r295186 in place so that signal delivery maintains the modulo 16 byte stack/frame alignment status instead of changing the alignment.
>
> buildkernel was via gcc 4.2.1
> buildworld was via clang 3.8.0
>
>
>
> I'm not sure that I'm going to get much farther in tracking down the source of the race(?) that leads to the SEGV's.
I should have given an example of clang 3.8.0 produced post-amble code that ends up picking up the 0x0 and putting it is r31 (the Frame Pointer for clang's produced code).
0x1801b8c <Buf_AddBytes+104>: lwz r30,24(r31)
0x1801b90 <Buf_AddBytes+108>: lwz r29,20(r31)
0x1801b94 <Buf_AddBytes+112>: lwz r28,16(r31)
0x1801b98 <Buf_AddBytes+116>: lwz r27,12(r31)
0x1801b9c <Buf_AddBytes+120>: lwz r26,8(r31)
0x1801ba0 <Buf_AddBytes+124>: addi r1,r1,32 # Stack pointer adjusted first
0x1801ba4 <Buf_AddBytes+128>: lwz r0,4(r1)
0x1801ba8 <Buf_AddBytes+132>: lwz r31,-4(r1) # Then Frame Pointer load happens
# "outside" the new stack range.
0x1801bac <Buf_AddBytes+136>: mtlr r0
0x1801bb0 <Buf_AddBytes+140>: blr
Here is the related bl and following code:
0x182516c <VarSYSVMatch+248>: bl 0x1801b24 <Buf_AddBytes>
0x1825170 <VarSYSVMatch+252>: lis r3,396
=> 0x1825174 <VarSYSVMatch+256>: lwz r4,16(r31)
0x1825178 <VarSYSVMatch+260>: lwz r3,-13260(r3)
0x182517c <VarSYSVMatch+264>: subf. r3,r4,r3
0x1825180 <VarSYSVMatch+268>: bne 0x18251c4 <VarSYSVMatch+336>
The "=>" points to a "sometimes SEGV" place from r31=0x0 being involved in a clang 3.8.0 buildworld's based "make" after it resumes normal operation after handling a SIGCHLD delivery.
So the "lwz r31,-4(r1)" earlier above resulted in a 0x0 in r31 just before the SEGV.
By contrast the gcc 4.2.1 generated code does not use r31 as a frame pointer and has:
0x1801db8 <Buf_AddBytes+96>: lwz r0,52(r1)
0x1801dbc <Buf_AddBytes+100>: lwz r27,28(r1)
0x1801dc0 <Buf_AddBytes+104>: lwz r28,32(r1)
0x1801dc4 <Buf_AddBytes+108>: mtlr r0
0x1801dc8 <Buf_AddBytes+112>: lwz r29,36(r1)
0x1801dcc <Buf_AddBytes+116>: lwz r30,40(r1)
0x1801dd0 <Buf_AddBytes+120>: lwz r31,44(r1)
0x1801dd4 <Buf_AddBytes+124>: addi r1,r1,48 # Stack pointer is adjusted last.
# The prior code worked in the range.
0x1801dd8 <Buf_AddBytes+128>: blr
and:
0x1823cac <VarSYSVMatch+340>: bl 0x1801d58 <Buf_AddBytes>
0x1823cb0 <VarSYSVMatch+344>: b 0x1823c0c <VarSYSVMatch+180>
. . . backing up to the jump point . . .
0x1823c0c <VarSYSVMatch+180>: lwz r0,28(r1) # Stack pointer used, not a Frame Pointer
0x1823c10 <VarSYSVMatch+184>: lwz r9,0(r25)
0x1823c14 <VarSYSVMatch+188>: xor. r0,r0,r9
0x1823c18 <VarSYSVMatch+192>: li r9,0
0x1823c1c <VarSYSVMatch+196>: li r3,1
0x1823c20 <VarSYSVMatch+200>: bne- 0x1823cb4 <VarSYSVMatch+348>
As r31 is not used as a frame pointer the 4.2.1 code can not get the SEGV that the clang 3.8.0 code sometimes gets.
===
Mark Millard
markmi at dsl-only.net
More information about the freebsd-ppc
mailing list