A potential fix for arm64's: sh`forkshell child-process path after fork sometimes has a bad stack pointer value
Mark Millard
markmi at dsl-only.net
Wed Feb 15 01:08:31 UTC 2017
On 2017-Feb-14, at 9:17 AM, Mark Millard <markmi at dsl-only.net> wrote:
> On 2017-Feb-14, at 8:56 AM, Andrew Turner <andrew at fubar.geek.nz> wrote:
>
>> On Tue, 14 Feb 2017 08:35:54 -0800
>> Mark Millard <markmi at dsl-only.net> wrote:
>>
>>> The following change has let my test run for 8.5 hours so far without
>>> a fork-failure in sh`forkshell :
>>>
>>> # svnlite diff /usr/src/sys/arm64/arm64/swtch.S
>>> Index: /usr/src/sys/arm64/arm64/swtch.S
>>> ===================================================================
>>> --- /usr/src/sys/arm64/arm64/swtch.S (revision 312982)
>>> +++ /usr/src/sys/arm64/arm64/swtch.S (working copy)
>>> @@ -241,6 +241,12 @@
>>> mov fp, #0 /* Stack traceback stops here. */
>>> bl _C_LABEL(fork_exit)
>>>
>>> + /*
>>> + * Disable interrupts to avoid
>>> + * overwriting sp_el0 and spsr_el1 by an IRQ exception.
>>> + */
>>> + msr daifset, #2
>>> +
>>> /* Restore sp and lr */
>>> ldp x0, x1, [sp]
>>> msr sp_el0, x0
>>> @@ -263,12 +269,6 @@
>>> ldp x28, x29, [sp, #TF_X + 28 * 8]
>>> /* Skip x30 as it was restored above as lr */
>>>
>>> - /*
>>> - * Disable interrupts to avoid
>>> - * overwriting spsr_el1 by an IRQ exception.
>>> - */
>>> - msr daifset, #2
>>> -
>>> /* Restore elr and spsr */
>>> ldp x0, x1, [sp, #16]
>>> msr elr_el1, x0
>>>
>>> I'm going to switch to attempting a self-hosted buildworld
>>> buildkernel again.
>>
>> Can you try the patch in https://reviews.freebsd.org/D9593. It moves
>> loading of sp_el0 until after interrupts have been disabled.
>>
>> Andrew
>
> Sure. I'll stop the self-hosted buildworld buildkernel and
> switch over to your source.
>
> One minor point:
>
> /* Skip x30 as it was restored above as lr */
>
> now should say something like:
>
> /* Skip x30 as it is restored below as lr */
As reported on https://reviews.freebsd.org/D9593, the
buildworld buildkernel test stopped during buildworld
with two sh processes failing.

But the core files do not suggest stack corruption
to me, nor was fork active. My test code
recorded its before-fork and after-fork stack address
examples and they were equal, as they should be.

It appeared that simply restarting the buildworld
buildkernel would let it continue, so I restarted it.
It has in fact continued and is still building.

I see no reason to count the stoppage against the
change, and I'll say so in new comments on
https://reviews.freebsd.org/D9593 once the build
completes or fails and I report on that.
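
For illustration only, a before/after-fork stack address check of
the kind described above could look roughly like the C sketch below.
This is not the actual test code: the function name
check_fork_stack_address and the use of a local variable's address
as a stand-in for the stack pointer are assumptions made here.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the actual test code): fork once and check
 * that the child sees a stack local at the same address that was
 * recorded before the fork.
 * Returns 1 on a match, 0 on a mismatch, -1 on a fork/wait error.
 */
int
check_fork_stack_address(void)
{
	volatile int marker = 0;
	/* Address of a stack local, recorded before fork(). */
	uintptr_t before = (uintptr_t)&marker;
	int status;
	pid_t pid;

	pid = fork();
	if (pid == -1)
		return (-1);

	if (pid == 0) {
		/* Child: same frame, so the address should be unchanged. */
		uintptr_t after = (uintptr_t)&marker;
		_exit(after == before ? 0 : 1);
	}

	/* Parent: collect the child's verdict from its exit status. */
	if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
		return (-1);
	return (WEXITSTATUS(status) == 0 ? 1 : 0);
}
```

Calling such a helper in a long-running loop and stopping at the
first 0 result would be one way to look for an intermittent failure.
Whether a corrupted sp shows up as a mismatch, rather than as a
crash in the child, depends on how the compiler addresses the locals.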
Failure details (both cores are basically the same
for these details):
(lldb) up
frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=<unavailable>, ptr=<unavailable>, tcache=<unavailable>, slow_path=<unavailable>) + 304 at jemalloc_jemalloc.c:1889
1886 usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
1887 prof_free(tsd, ptr, usize);
1888 } else if (config_stats || config_valgrind)
-> 1889 usize = isalloc(tsd_tsdn(tsd), ptr, config_prof);
1890 if (config_stats)
1891 *tsd_thread_deallocatedp_get(tsd) += usize;
1892
(lldb) print config_stats
(const bool) $0 = true
(lldb) print config_valgrind
(const bool) $1 = false
So the new failure was actually during config_stats activity,
which is apparently enabled by default for how I built
-r312982.
The actual abort initiation was from:
(lldb) up
frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
325 RTREE_GET_LEAF(RTREE_HEIGHT_MAX-1)
326 #undef RTREE_GET_SUBTREE
327 #undef RTREE_GET_LEAF
-> 328 default: not_reached();
329 }
330 #undef RTREE_GET_BIAS
331 not_reached();
The backtraces for the pair look similar; this is one of them:
(lldb) bt
* thread #1: tid = 100137, 0x0000000040554e54 libc.so.7`_thr_kill + 8, name = 'sh', stop reason = signal SIGABRT
* frame #0: 0x0000000040554e54 libc.so.7`_thr_kill + 8
frame #1: 0x0000000040554e18 libc.so.7`__raise(s=6) + 64 at raise.c:52
frame #2: 0x0000000040554d8c libc.so.7`abort + 84 at abort.c:65
frame #3: 0x00000000405340fc libc.so.7`huge_node_get [inlined] __je_rtree_get(dependent=true) + 308 at rtree.h:328
frame #4: 0x00000000405340dc libc.so.7`huge_node_get [inlined] __je_chunk_lookup(dependent=true) at chunk.h:89
frame #5: 0x00000000405340dc libc.so.7`huge_node_get(ptr=<unavailable>) + 276 at jemalloc_huge.c:11
frame #6: 0x0000000040534114 libc.so.7`__je_huge_salloc(tsdn=<unavailable>, ptr=<unavailable>) + 24 at jemalloc_huge.c:434
frame #7: 0x000000004054c84c libc.so.7`ifree [inlined] __je_arena_salloc(demote=false) + 32 at arena.h:1426
frame #8: 0x000000004054c82c libc.so.7`ifree [inlined] __je_isalloc(demote=false) at jemalloc_internal.h:1045
frame #9: 0x000000004054c82c libc.so.7`ifree(tsd=<unavailable>, ptr=<unavailable>, tcache=<unavailable>, slow_path=<unavailable>) + 304 at jemalloc_jemalloc.c:1889
frame #10: 0x000000004054cd94 libc.so.7`__free(ptr=0x0000000040a17520) + 148 at jemalloc_jemalloc.c:2016
frame #11: 0x0000000000411328 sh`ckfree(p=<unavailable>) + 32 at memalloc.c:88
frame #12: 0x0000000000407cd8 sh`clearcmdentry + 76 at exec.c:505
frame #13: 0x0000000000406bfc sh`evalcommand(cmd=<unavailable>, flags=<unavailable>, backcmd=<unavailable>) + 3476 at eval.c:1182
frame #14: 0x0000000000405570 sh`evaltree(n=0x0000000040a1c270, flags=<unavailable>) + 212 at eval.c:290
frame #15: 0x000000000041105c sh`cmdloop(top=<unavailable>) + 252 at main.c:231
frame #16: 0x0000000000410ed0 sh`main(argc=<unavailable>, argv=<unavailable>) + 660 at main.c:178
frame #17: 0x0000000000402f30 sh`__start + 360
frame #18: 0x0000000040434658 ld-elf.so.1`.rtld_start + 24 at rtld_start.S:41
===
Mark Millard
markmi at dsl-only.net