ptrace(2) debugging

Konstantin Belousov kostikbel at gmail.com
Mon Nov 25 14:45:55 UTC 2019


On Sun, Nov 24, 2019 at 09:26:40PM -0600, Kyle Evans wrote:
> On Sun, Nov 24, 2019 at 8:25 PM Kyle Evans <kevans at freebsd.org> wrote:
> >
> > On Sun, Nov 24, 2019 at 9:17 AM Kyle Evans <kevans at freebsd.org> wrote:
> > >
> > > On Sun, Nov 24, 2019 at 5:40 AM Konstantin Belousov <kostikbel at gmail.com> wrote:
> > > >
> > > > On Sun, Nov 24, 2019 at 12:01:04AM -0600, Kyle Evans wrote:
> > > > > Hi,
> > > > >
> > > > > I'm working on implementing `reptyr -T` on FreeBSD because I'm pretty
> > > > > bad about starting long-running jobs outside of tmux and often desire
> > > > > to reparent these jobs into tmux. I've gotten to a point where it's
> > > > > getting stuck in waitpid(2) when attempting to work over the session
> > > > > leader to ignore SIGHUP. The chain of operations looks roughly like
> > > > > this:
> > > > >
> > > > > PT_ATTACH -> waitpid -> kill(SIGCONT) -> PT_TO_SCE -> waitpid ->
> > > > > PT_TO_SCE -> waitpid
> > > > >
> > > > > > Each waitpid is paired with a PT_LWPINFO. The first waitpid
> > > > > observes SIGSTOP. The second waitpid observes SIGCONT. I would expect
> > > > > the third to observe PL_FLAG_SCE on ptrace_lwpinfo->pl_flags, but
> > > > > instead it actually hangs as the target process is now sleep-inhibited
> > > > > and stuck in "pause" wchan.
> > > > >
> > > > > I've uploaded a truss excerpt at [0] in case it's helpful -- pid=10204
> > > > > is the process I'm reparenting, initially just attached/detached to
> > > > > make sure reptyr *can* do this. pid=10187 is the sshd that it's
> > > > > running under, and pid=10188 is the shell running under that.
> > > > >
> > > > > Anyone have good advice on debugging this? It seems like it might be
> > > > > some kind of kernel bug, as it's already done this same dance once
> > > > > before when grabbing sshd and my attempts to distill it down to a
> > > > > simple test case failed. The FreeBSD part of reptyr needed some love,
> > > > > though, so that can't be discounted either.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Kyle Evans
> > > > >
> > > > > [0] https://people.freebsd.org/~kevans/truss.log
> > > > > How much work would it be to provide a self-contained standalone test?
> > >
> > > I'm still struggling to write a self-contained example...
> > > unfortunately a basic attach and trace them all to syscall entry isn't
> > > sufficient. I'm slowly removing surface area from reptyr to try and
> > > narrow it down; its operations between attaching to sshd and the
> > > misbehavior are quite extensive, as it mmaps a page into the target,
> > > opens a socket established by reptyr and passes an fd back over it.
> >
> > I managed to narrow it down, kind of. The problem is specifically with
> > trying to trace zsh as a session leader. Easiest reproducer is to
> > change shell to zsh and run this:
> > https://people.freebsd.org/~kevans/ptrace_test.c -> you'll hang and
> > have to ^C that sucker. My experiments showed that running this on zsh
> > spawned any other way is fine, and changing shell to /bin/sh is also
> > fine.
> >
> 
> Follow up part three, zsh is in sigsuspend() while a child is
> executing and this is the cause. More effective reproducer:
> https://people.freebsd.org/~kevans/ptrace_test2.c -> the behavior
> makes a little more sense to me, but that seems less than ideal.
It is still not quite a reproducer.  Can you modify the test to clearly
indicate what you want vs. what you get?

I see the legitimate loop of the parent (debugger) doing
 59326 ptrace_test2 CALL  ptrace(PT_ATTACH,0xe7bf,0x1,0)
 59326 ptrace_test2 RET   ptrace 0
 59326 ptrace_test2 CALL  wait4(0xe7bf,0x7fffffffe3bc,0,0)
 59326 ptrace_test2 RET   wait4 59327/0xe7bf
 59326 ptrace_test2 CALL  ptrace(PT_LWPINFO,0xe7bf,0x7fffffffe3c0,0xa0)
 59326 ptrace_test2 RET   ptrace 0
 59326 ptrace_test2 CALL  ptrace(PT_FOLLOW_FORK,0xe7bf,0,0x1)
 59326 ptrace_test2 RET   ptrace 0
 59326 ptrace_test2 CALL  ptrace(PT_GETREGS,0xe7bf,0x7fffffffe300,0)
 59326 ptrace_test2 RET   ptrace 0
 59326 ptrace_test2 CALL  kill(0xe7bf,SIGCONT)
 59326 ptrace_test2 RET   kill 0
 59326 ptrace_test2 CALL  ptrace(PT_TO_SCE,0xe7bf,0x1,0)
 59326 ptrace_test2 RET   ptrace 0
 59326 ptrace_test2 CALL  wait4(0xe7bf,0x7fffffffe25c,0,0)
 59326 ptrace_test2 RET   wait4 59327/0xe7bf
 59326 ptrace_test2 CALL  ptrace(PT_LWPINFO,0xe7bf,0x7fffffffe260,0xa0)
 59326 ptrace_test2 RET   ptrace 0
 59326 ptrace_test2 CALL  ptrace(PT_DETACH,0xe7bf,0x1,0)
 59326 ptrace_test2 RET   ptrace 0
in a loop, and I do not see anything wrong with it.
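
One possible way to make a test state what is wanted vs. what is observed
is to wrap each wait in a small checking helper. The sketch below is only
an illustration and is not taken from ptrace_test2.c or reptyr: the names
expect_stop() and demo_self_stop() are hypothetical. To stay portable and
runnable without a debugger, the demo uses a child that stops itself and
waits with WUNTRACED; for a ptrace-attached target the same helper would
presumably be called with options 0, as in the wait4() calls above.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for one state change of `pid` and report
 * whether it is a stop caused by `want_sig`.
 * Returns 1 on match, 0 on mismatch, -1 if waitpid() fails.
 */
int
expect_stop(pid_t pid, int want_sig, int options)
{
	int status;

	assert(want_sig > 0);
	if (waitpid(pid, &status, options) == -1) {
		perror("waitpid");
		return (-1);
	}
	if (WIFSTOPPED(status) && WSTOPSIG(status) == want_sig) {
		printf("ok: pid %ld stopped with signal %d\n",
		    (long)pid, want_sig);
		return (1);
	}
	if (WIFSTOPPED(status))
		printf("MISMATCH: wanted stop signal %d, got %d\n",
		    want_sig, WSTOPSIG(status));
	else
		printf("MISMATCH: wanted stop signal %d, got status 0x%x\n",
		    want_sig, (unsigned)status);
	return (0);
}

/*
 * Demo without ptrace: fork a child that stops itself with SIGSTOP,
 * check the reported stop against `want_sig`, then kill and reap the
 * child. Returns the result of expect_stop(), or -1 if fork() fails.
 */
int
demo_self_stop(int want_sig)
{
	pid_t pid;
	int rc;

	pid = fork();
	if (pid == -1)
		return (-1);
	if (pid == 0) {
		raise(SIGSTOP);		/* child stops here */
		_exit(0);
	}
	rc = expect_stop(pid, want_sig, WUNTRACED);
	kill(pid, SIGKILL);		/* clean up the stopped child */
	waitpid(pid, NULL, 0);
	return (rc);
}
```

Calling demo_self_stop(SIGSTOP) should report a match, while passing any
other signal number should print a MISMATCH line, so a hang or an
unexpected stop shows up as an explicit difference from the expectation.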

