Race condition in debugger?
David Xu
davidxu at freebsd.org
Mon Apr 18 21:37:47 PDT 2005
Peter Edwards wrote:
>[Very late response: I just experienced the same problem and
>remembered the issue had been brought up before]
>
>On 2/14/05, Greg 'groggy' Lehey <grog at freebsd.org> wrote:
>
>
>>I'm having some problems with userland gdb on recent -CURRENT builds:
>>at some point it hangs.
>>
>>Specifically, I'm setting a conditional breakpoint like this:
>>
>> b Minsert_blockletpointer if I->inode_num == 0x1f0bb
>>
>>inode_num increments for 1, so I hit this breakpoint about 100,000
>>times. Or I should. What happens is that the debugger hangs at some
>>point on the way. ktrace shows multiple copies of:
>>
>> 12325 gdb CALL ptrace(12,0x3026,0xbfbfd5e0,0)
>> 12325 gdb RET ptrace 0
>> 12325 gdb CALL ptrace(PT_STEP,0x3026,0x1,0)
>> 12325 gdb RET ptrace 0
>> 12325 gdb CALL wait4(0xffffffff,0xbfbfd808,0,0) <-- stops here
>> 12325 gdb RET wait4 12326/0x3026
>> 12325 gdb CALL kill(0x3026,0)
>> 12325 gdb RET kill 0
>> 12325 gdb CALL ptrace(PT_GETREGS,0x3026,0xbfbfd5c0,0)
>>
>>When it hangs, it's at the call to wait4, as shown. It looks like the
>>completion of the ptrace request isn't being reported back.
>>
>>
>
>I think I know what's going on with this, and I have a feeling that
>there's a couple of other wait()-related issues that were left open on
>the lists that might be explained by the issue.
>
>Here's my hypothesis: kern_wait() checks each child of the current
>process to see if it has exited, or should otherwise report status
>to wait/wait3/wait4/waitpid. If it finds that all candidate children
>have nothing to report, it goes to sleep, waiting to be awoken by a
>child reporting status, and then repeats the process. It looks a bit
>like this:
>
>kern_wait()
>{
>loop:
>        foreach child of self {
>                if (child has status to report)
>                        return status;
>        }
>        lock self
>        msleep(on "self")
>        unlock self
>        goto loop;
>}
>
>The problem is that no lock guarantees that the conditions checked in
>the inner loop still hold by the time the current process locks its own
>"struct proc" and invokes msleep(). That is, you can miss the wakeup
>generated by a particular child between checking that child in the
>inner loop and going to sleep. (The race is most likely to happen on an
>SMP machine or with PREEMPTION, but the acquisition of curproc's lock
>could possibly cause the issue if it needed to sleep.)
>
>I can at least reproduce this for the ptrace/gdb case, but AFAICT, it
>could happen for the standard wait()/exit() path, too. I worked up a
>patch that fixes the problem in two steps: those parts of the kernel
>that wake the process up now set a flag in the parent's flags and do
>the wakeup while holding the parent process lock, and kern_wait checks
>whether this flag has been set before sleeping. (A simpler solution
>would be to hold the parent lock across the bulk of kern_wait, but from
>what I can gather this would lead to at least one LOR.)
>
>I've been unable to reproduce the problem with a kernel with this
>patch, and using a nice sprinkling of printfs can show that when GDB
>hangs, the race has just occurred.
>
>Anyone got opinions on this?
>Cheers,
>Peadar.
>
>
I just found another case: if the parent masks SIGCHLD, we get the
race too. I have tested the patch and it works. I will tweak the patch
and commit it soon.
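
For illustration only, the flag-based approach described in the quoted
message can be sketched in userland with pthreads. This is not the
kernel patch: struct parent, struct child, the "woken" flag,
child_report() and wait_sketch() are all hypothetical names, with a
mutex and condition variable standing in for the parent process lock
and the msleep()/wakeup() pair.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a child process; 0 means "nothing to report". */
struct child {
	pthread_mutex_t lock;
	int status;
};

/* Hypothetical stand-in for the parent's struct proc. */
struct parent {
	pthread_mutex_t lock;	/* plays the role of the parent process lock */
	pthread_cond_t cv;	/* plays the role of the sleep channel */
	bool woken;		/* set by a reporting child, under parent lock */
	struct child *children;
	int nchildren;
};

/* Child side: publish the status first, then flag and wake the parent. */
void
child_report(struct parent *p, struct child *c, int status)
{
	assert(status != 0);
	pthread_mutex_lock(&c->lock);
	c->status = status;
	pthread_mutex_unlock(&c->lock);

	pthread_mutex_lock(&p->lock);
	p->woken = true;	/* survives even if the parent is not asleep yet */
	pthread_cond_signal(&p->cv);
	pthread_mutex_unlock(&p->lock);
}

/* Parent side: scan the children, and sleep only if nobody reported since. */
int
wait_sketch(struct parent *p)
{
	for (;;) {
		for (int i = 0; i < p->nchildren; i++) {
			struct child *c = &p->children[i];

			pthread_mutex_lock(&c->lock);
			int st = c->status;
			c->status = 0;
			pthread_mutex_unlock(&c->lock);
			if (st != 0)
				return (st);
		}
		/* A child may report right here, after the scan. */
		pthread_mutex_lock(&p->lock);
		while (!p->woken)
			pthread_cond_wait(&p->cv, &p->lock);
		p->woken = false;	/* consume the flag, then rescan */
		pthread_mutex_unlock(&p->lock);
	}
}
```

Because the flag is set and tested under the same lock, a report that
lands between the scan and the sleep is seen as woken == true and
causes a rescan instead of a lost wakeup. A stale flag costs at most
one extra pass over the children.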
David Xu