svn commit: r339876 - head/libexec/rtld-elf

Wed Nov 7 23:09:52 UTC 2018

[I note what I've failed to find a way to investigate.]

On 2018-Nov-7, at 11:53, Mark Millard <marklmi26-fbsd at yahoo.com> wrote:

> [I trace code associated with bl <00001322.plt_call.getenv>
> in the two contexts and extend the range over which things
> appear to match: up to some point after the branch
> b <__glink_PLTresolve> .]
> 
> On 2018-Nov-6, at 19:12, Mark Millard <marklmi26-fbsd at yahoo.com> wrote:
> 
>> [I present a little information about the longer-existing
>> failure's odd backtrace for /libexec/ld-elf.so.1 /bin/ls
>> --but on powerpc64 FreeBSD instead of 32-bit powerpc FreeBSD.]
>> 
>> On 2018-Nov-2, at 11:50, Konstantin Belousov <kostikbel at gmail.com> wrote:
>> 
>>> On Fri, Nov 02, 2018 at 10:38:08AM -0700, Mark Millard wrote:
>>>> On 2018-Nov-2, at 8:52 AM, Konstantin Belousov <kostikbel at gmail.com> wrote:
>>>> 
>>>>> . . .
>>>> 
>>>> That seems better. But it crashes during /bin/ls execution
>>>> ( 0x0180???? addresses ), apparently in a library routine
>>>> ( 0x41?????? addresses ):
>>>> 
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> 0x411220b4 in ?? ()
>>>> (gdb) bt
>>>> #0  0x411220b4 in ?? ()
>>>> #1  0x4112200c in ?? ()
>>>> #2  0x01803c84 in ?? ()
>>>> #3  0x018023b4 in ?? ()
>>>> #4  0x010121a0 in .rtld_start () at /usr/src/libexec/rtld-elf/powerpc/rtld_start.S:112
>>>> 
>>>> Using a normal gdb run of /bin/ls suggests:
>>>> 
>>>> #2  0x01803c84 in ?? () should be in main and seems to be: bl 0x1818914 <getopt_long@plt>
>>>> #3  0x018023b4 in ?? () should be in _start
>>>> 
>>>> Looking in the test context:
>>>> 
>>>> 0x1803c80:	bl      0x1818914
>>>> 0x1803c84:	cmpwi   cr7,r3,-1
>>>> 
>>>> and:
>>>> 
>>>> 0x1818914:	li      r11,59
>>>> 0x1818918:	b       0x18186f4
>>>> 
>>>> and:
>>>> 
>>>> 0x18186f4:	rlwinm  r11,r11,2,0,29
>>>> 0x18186f8:	addis   r11,r11,386
>>>> 0x18186fc:	lwz     r11,-30316(r11)
>>>> 0x1818700:	mtctr   r11
>>>> 0x1818704:	bctr
>>>> 
>>>> Breaking at the bctr and using info reg:
>>>> 
>>>> r11            0x4125ffa0	1093009312
>>>> 
>>>> It looks like there is some amount of
>>>> activity before the traceback addresses
>>>> show up.
>>>> 
>>>> I've not found a good way to fill in the "in ??()"
>>>> (or analogous) information. The addresses 0x411220??
>>>> do not match up with a normal run of /bin/ls from
>>>> gdb: the addresses can not be accessed.
>>>> 
>>>> 
>>>> 
>>>> It does appear that the code is in /lib/libc.so.7 in the
>>>> test context:
>>>> 
>>>> Breakpoint 2, reloc_non_plt (obj=0x41041600, obj_rtld=0x41104b57, flags=4, lockstate=0x0) at /usr/src/libexec/rtld-elf/powerpc/reloc.c:338
>>>> . . .
>>>> 
>>> There seems to be an issue with the direct execution mode on ppc.
>>> Even otherwise working ld-elf.so.1 segfaults if I try to use it as
>>> standalone binary.
>>> 
>>> But if I specify patched ld-elf.so.1 as the interpreter for some program,
>>> using 'cc -Wl,-I,<path>/ld-elf.so.1' it works.  So I see there two bugs,
>>> one is regression due to textsize calculation, which should be fixed by
>>> my patch.  Another is the direct exec problem.
>> 
>> I've got a little more information about the odd backtrace
>> from the /libexec/ld-elf.so.1 /bin/ls failure that the
>> prior patch allowed getting to, although for a powerpc64
>> example context.
>> 
>> The information only identifies where the code was
>> in /bin/ls and /lib/libc.so.7 in the backtrace. For
>> libc.so.7 I found the same code sequences in a gdb of
>> /bin/ls directly, matching one first, using the addresses
>> vs. those in the /libexec/ld-elf.so.1 /bin/ls process to
>> find offsets for going back and forth, and then used
>> that to find the 2nd backtrace address's material.
>> 
>> Overall it suggests to me that (in somewhat 
>> symbolic terms):
>> 
>> bl     <00001322.plt_call.getenv>
>> 
>> eventually led to executing the wrong code.
>> 
>> 
>> The supporting detail is as follows.
>> 
>> The /libexec/ld-elf.so.1 part of the backtrace was
>> easy to find where the code was:
>> 
>> (gdb) run /bin/ls
>> Starting program: /libexec/ld-elf.so.1 /bin/ls
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x000000080118d81c in ?? ()
>> (gdb) bt
>> #0  0x000000080118d81c in ?? ()
>> #1  0x000000080118d920 in ?? ()
>> #2  0x0000000010002558 in ?? ()
>> #3  0x00000000100037b0 in ?? ()
>> #4  0x0000000001018450 in ._rtld_start () at /usr/src/libexec/rtld-elf/powerpc64/rtld_start.S:104
>> Backtrace stopped: frame did not save the PC
>> 
>> (gdb) 
>> 101		ld      %r7,128(%r1)	/* exit proc */
>> 102		ld      %r8,136(%r1)	/* ps_strings */
>> 103	
>> 104		blrl	/* _start(argc, argv, envp, obj, cleanup, ps_strings) */
>> 105	
>> 106		li      %r0,1		/* _exit() */
>> 107		sc
>> 
>> 
>> The /bin/ls part of the backtrace was easy to find
>> where the code was:
>> 
>> (gdb) symbol-file /bin/ls
>> Load new symbol table from "/bin/ls"? (y or n) y
>> Reading symbols from /bin/ls...Reading symbols from /usr/lib/debug//bin/ls.debug...done.
>> done.
>> (gdb) bt
>> #0  0x000000080118d81c in ?? ()
>> #1  0x000000080118d920 in ?? ()
>> #2  0x0000000010002558 in main (argc=<optimized out>, argv=0x80134bdb0) at /usr/src/bin/ls/ls.c:268
>> #3  0x00000000100037b0 in _start (argc=<optimized out>, argv=0x3fffffffffffdb70, env=0x3fffffffffffdb88, obj=<optimized out>, cleanup=<optimized out>, ps_strings=<optimized out>)
>>   at /usr/src/lib/csu/powerpc64/crt1.c:96
>> #4  0x0000000001018450 in ?? ()
>> #5  0x0000000000000000 in ?? ()
>> 
>> (gdb) fr 3 
>> #3  0x00000000100037b0 in _start (argc=<optimized out>, argv=0x3fffffffffffdb70, env=0x3fffffffffffdb88, obj=<optimized out>, cleanup=<optimized out>, ps_strings=<optimized out>)
>>   at /usr/src/lib/csu/powerpc64/crt1.c:96
>> 96		exit(main(argc, argv, env));
>> (gdb) down
>> #2  0x0000000010002558 in main (argc=<optimized out>, argv=0x80134bdb0) at /usr/src/bin/ls/ls.c:268
>> 268		while ((ch = getopt_long(argc, argv,
>> 
>> 
>> 
>> For the messy /lib/libc.so.7 part of the backtrace both
>> addresses are in getopt_internal. I show extractions from
>> the gdb /bin/ls output because it has helpful symbolic
>> information displayed. But that means that the addresses
>> are offset from those in the bt for the failure process.
>> 
>> For #1  0x000000080118d920 in ?? () I end up with:
>> 
>> (gdb) x/32i 0x81019b6c0+0xad0-0x880
>>  0x81019b910 <getopt_internal+592>:	stw     r9,0(r18)
>>  0x81019b914 <getopt_internal+596>:	addis   r3,r2,-5
>>  0x81019b918 <getopt_internal+600>:	addi    r3,r3,30120
>>  0x81019b91c <getopt_internal+604>:	bl      0x81018dfe0 <00001322.plt_call.getenv>
>>  0x81019b920 <getopt_internal+608>:	ld      r2,40(r1)
>> 
>> (The machine code around it all matches around
>> 0x000000080118d920 in the failure context.)
>> 
>> The getenv call in the source is the 2nd line of:
>> 
>>       if (posixly_correct == -1 || optreset)
>>               posixly_correct = (getenv("POSIXLY_CORRECT") != NULL);
>> 
>> For #0  0x000000080118d81c in ?? () I end up with:
>> 
>> (gdb) x/32i 0x81019b6c0+0xad0-0x880-0x110
>>  0x81019b800 <getopt_internal+320>:	bne     cr7,0x81019b868 <getopt_internal+424>
>>  0x81019b804 <getopt_internal+324>:	lwa     r5,0(r29)
>>  0x81019b808 <getopt_internal+328>:	stw     r17,0(r18)
>>  0x81019b80c <getopt_internal+332>:	cmpw    cr7,r5,r19
>>  0x81019b810 <getopt_internal+336>:	bge     cr7,0x81019ba60 <getopt_internal+928>
>>  0x81019b814 <getopt_internal+340>:	rldicr  r9,r5,3,60
>>  0x81019b818 <getopt_internal+344>:	ldx     r10,r20,r9
>>  0x81019b81c <getopt_internal+348>:	lbz     r9,0(r10)
>> 
>> with the failure being that r10 is zero in that last
>> line above. Again the surrounding code matches.
>> 
>> The source code line is reported to be:
>> 
>>               if (*(place = nargv[optind]) != '-' ||
>> 
>> I got the line number information from breakpoints 3 and 4
>> below (from the gdb /bin/ls process):
>> 
>> (gdb) info br
>> Num     Type           Disp Enb Address            What
>> 1       breakpoint     keep y   0x0000000010002360 in main at /usr/src/bin/ls/ls.c:231
>> 	breakpoint already hit 1 time
>> 3       breakpoint     keep y   0x000000081019b81c in getopt_internal at /usr/src/lib/libc/stdlib/getopt_long.c:411
>> 4       breakpoint     keep y   0x000000081019b91c in getopt_internal at /usr/src/lib/libc/stdlib/getopt_long.c:379
>> 
>> Line 379 has the getenv call, matching the machine code showing
>> the call.
>> 
>> (I set the breakpoints just as a way of using "info br" to list
>> the information later.)
>> 
>> Overall this seems to suggest that:
>> 
>> bl     <00001322.plt_call.getenv>
>> 
>> led to something odd happening and got to the wrong
>> code.
>> 
>> That is all the additional information that I have
>> at this point. I hope it is of some use.
>> 
> 
> I'll note that the normal case's execution does the
> getenv call but does not execute the lbz r9,0(r10)
> related code.
> 
> I'll also note that for the libc.so.7 code
> the /libexec/ld-elf.so.1 /bin/ls code
> addresses are bigger than the /bin/ls
> addresses by:
> 
> 0x81019b920 - 0x80118d920 = 0xF00E000
> 
> I use this to go back and forth, checking for matching
> code as I go.
> 
> Presenting the normal /bin/ls use in gdb first for
> up to b <__glink_PLTresolve> :
> 
> I'd already shown:
> 
>  0x81019b91c <getopt_internal+604>:	bl      0x81018dfe0 <00001322.plt_call.getenv>
> 
> Looking:
> 
>   0x81018dfe0 <00001322.plt_call.getenv>:	std     r2,40(r1)
>   0x81018dfe4 <00001322.plt_call.getenv+4>:	ld      r12,480(r2)
>   0x81018dfe8 <00001322.plt_call.getenv+8>:	mtctr   r12
>   0x81018dfec <00001322.plt_call.getenv+12>:	ld      r11,496(r2)
>   0x81018dff0 <00001322.plt_call.getenv+16>:	ld      r2,488(r2)
>   0x81018dff4 <00001322.plt_call.getenv+20>:	cmpldi  r2,0
>   0x81018dff8 <00001322.plt_call.getenv+24>:	bnectr+ 
>   0x81018dffc <00001322.plt_call.getenv+28>:	b       0x81030f3dc <getenv@plt>
> 
> As for getenv@plt :
> 
>   0x81030f3dc <getenv@plt>:	li      r0,925
>   0x81030f3e0 <getenv@plt+4>:	b       0x81030d6c8 <__glink_PLTresolve>
> 
> 
> Note that 0x81018dfe0 - 0xF00E000 = 0x80117ffe0 .
> 
> Back in the /libexec/ld-elf.so.1 /bin/ls context:
> 
> (gdb) bt
> #0  0x000000080118d81c in ?? ()
> #1  0x000000080118d920 in ?? () [Just after the bl <00001322.plt_call.getenv> .]
> #2  0x0000000010002558 in ?? ()
> #3  0x00000000100037b0 in ?? ()
> #4  0x0000000001018450 in ?? ()
> #5  0x0000000000000000 in ?? ()
> 
> (gdb) x/i 0x000000080118d920-0x4
>   0x80118d91c:	bl      0x80117ffe0
> 
> So this matches what was calculated earlier.
> 
> (gdb) x/32i 0x81018dfe0-0xf00e000 
>   0x80117ffe0:	std     r2,40(r1)
>   0x80117ffe4:	ld      r12,480(r2)
>   0x80117ffe8:	mtctr   r12
>   0x80117ffec:	ld      r11,496(r2)
>   0x80117fff0:	ld      r2,488(r2)
>   0x80117fff4:	cmpldi  r2,0
>   0x80117fff8:	bnectr+ 
>   0x80117fffc:	b       0x8013013dc
> 
> (gdb) x/2i 0x8013013dc
>   0x8013013dc:	li      r0,925
>   0x8013013e0:	b       0x8012ff6c8
> 
> 0x81030d6c8 - 0x8012ff6c8 = 0xF00E000
> 
> Still matching up to this point.
> 
> So the two contexts seem to match up to
> some point after: b <__glink_PLTresolve> .
> 
> I've not looked beyond this.
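
The back-and-forth address translation used in the quoted text can be
sketched as below. This is only an illustration: the names LIBC_DELTA,
to_failing, and to_normal are made up here, and the single constant offset
is assumed to hold for every libc address, which the quoted comparisons
suggest but do not prove.

```python
# Offset between the libc mapping in the normal "gdb /bin/ls" process and
# the one in the "/libexec/ld-elf.so.1 /bin/ls" process, computed from the
# pair of matching addresses given in the quoted text.
LIBC_DELTA = 0x81019b920 - 0x80118d920


def to_failing(addr_normal):
    """Map a libc address from the normal run to the failing run."""
    return addr_normal - LIBC_DELTA


def to_normal(addr_failing):
    """Map a libc address from the failing run to the normal run."""
    return addr_failing + LIBC_DELTA


# (normal, failing) address pairs that appear in the quoted gdb output.
pairs = [
    (0x81018dfe0, 0x80117ffe0),  # 00001322.plt_call.getenv
    (0x81030f3dc, 0x8013013dc),  # getenv@plt
    (0x81030d6c8, 0x8012ff6c8),  # __glink_PLTresolve
]
for normal, failing in pairs:
    assert to_failing(normal) == failing

print(hex(LIBC_DELTA))              # 0xf00e000
print(hex(to_normal(0x80118d81c)))  # frame #0, seen from the normal run
```

The last line maps the crash address of frame #0 to 0x81019b81c, which is
the address of breakpoint 3 (getopt_long.c:411) in the normal run.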

[ Based on normal-case text for better symbolic information . . . ]

For the failing context and its use of the below
code (presentation edited):

  <00001322.plt_call.getenv>:	std     r2,40(r1)
  <00001322.plt_call.getenv+4>:	ld      r12,480(r2)
  <00001322.plt_call.getenv+8>:	mtctr   r12
  <00001322.plt_call.getenv+12>:	ld      r11,496(r2)
  <00001322.plt_call.getenv+16>:	ld      r2,488(r2)
  <00001322.plt_call.getenv+20>:	cmpldi  r2,0
  <00001322.plt_call.getenv+24>:	bnectr+ 

I've not come up with a way to investigate the potential
indirect jump (bnectr+) and what sets up for it. (The
branch following the bnectr+ seems okay.)
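
As a reading aid, the stub can be modeled as below. This is a sketch, not
rtld or libc code: plt_call_stub, mem, and the example values are
hypothetical, and only the offsets 480/488/496 and the branch structure
come from the disassembly. Treating the three doublewords as a function
descriptor (entry, TOC, environment) is an assumption.

```python
def plt_call_stub(mem, r2, lazy_target):
    """Return (next_pc, r2, r11) as the stub would leave them.

    mem maps an address to the 64-bit value stored there (a toy model).
    lazy_target stands for the address of <getenv@plt>.
    """
    # std r2,40(r1) saves the caller's r2 on the stack; not modeled here.
    ctr = mem[r2 + 480]     # ld r12,480(r2) ; mtctr r12
    r11 = mem[r2 + 496]     # ld r11,496(r2)
    new_r2 = mem[r2 + 488]  # ld r2,488(r2)
    if new_r2 != 0:         # cmpldi r2,0 ; bnectr+
        return ctr, new_r2, r11
    return lazy_target, new_r2, r11  # b <getenv@plt>


# Hypothetical values, chosen only to exercise both paths.
TOC = 0x10000
LAZY = 0x8013013dc  # getenv@plt address in the failing context
unbound = {TOC + 480: 0x0, TOC + 488: 0x0, TOC + 496: 0x0}
bound = {TOC + 480: 0x4000, TOC + 488: 0x20000, TOC + 496: 0x0}

print([hex(v) for v in plt_call_stub(unbound, TOC, LAZY)])
print([hex(v) for v in plt_call_stub(bound, TOC, LAZY)])
```

Under this model the indirect jump is decided entirely by the three
doublewords at r2+480 on entry to the stub, so those are the values that
would need checking in the failing process (for example with something
like x/3gx $r2+480 in gdb at the stub's first instruction).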

The same applies to the bctr in (edited):

(gdb) disass __glink_PLTresolve
Dump of assembler code for function __glink_PLTresolve:
   <+0>:	mflr    r12
   <+4>:	bcl     20,4*cr7+so,<__glink_PLTresolve+8>
   <+8>:	mflr    r11
   <+12>:	ld      r2,-16(r11)
   <+16>:	mtlr    r12
   <+20>:	add     r11,r2,r11
   <+24>:	ld      r12,0(r11)
   <+28>:	ld      r2,8(r11)
   <+32>:	mtctr   r12
   <+36>:	ld      r11,16(r11)
   <+40>:	bctr
End of assembler dump.

Registers such as ctr, r12, and r11 seem to have
been replaced by the time of the crash. (r12
seems to point to strncmp and ctr has the value
0xf . r11 is 0x0 .)
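
For reference, what __glink_PLTresolve computes before its bctr can be
modeled as below. Again this is only a sketch read off the disassembly:
glink_pltresolve, mem, OFFSET, and the table contents are hypothetical.
The entry address is the failing-context one from the quoted text.

```python
MASK64 = (1 << 64) - 1


def glink_pltresolve(mem, entry):
    """Return (ctr, r2, r11) at the final bctr.

    mem maps an address to the 64-bit value stored there (a toy model).
    entry is the address of __glink_PLTresolve+0.
    """
    r11 = entry + 8             # bcl ; mflr r11: address of the +8 instruction
    off = mem[r11 - 16]         # ld r2,-16(r11): doubleword at entry-8
    r11 = (off + r11) & MASK64  # add r11,r2,r11: address of a 3-doubleword table
    ctr = mem[r11]              # ld r12,0(r11) ; mtctr r12
    r2 = mem[r11 + 8]           # ld r2,8(r11)
    r11_out = mem[r11 + 16]     # ld r11,16(r11)
    return ctr, r2, r11_out


ENTRY = 0x8012ff6c8  # __glink_PLTresolve in the failing context
OFFSET = 0x1000      # hypothetical value of the doubleword at ENTRY-8
table = ENTRY + 8 + OFFSET
mem = {
    ENTRY - 8: OFFSET,
    table: 0x1111,       # hypothetical branch target
    table + 8: 0x2222,   # hypothetical r2 value
    table + 16: 0x3333,  # hypothetical r11 value
}

print([hex(v) for v in glink_pltresolve(mem, ENTRY)])
```

In this model the bctr target depends only on the doubleword stored just
before the routine and on the three doublewords it points to, so those
four values are the ones to compare between the normal and failing runs.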

At this point, it does not look like I'll be much
help for analyzing this failure on the powerpc
families.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)