sem(4) lockup in python?

Adrian Chadd adrian at freebsd.org
Wed Jan 11 16:51:17 UTC 2012


... yes, enable WITNESS already and see if you can find LORs. :)

(Sheesh, that's what it's for! :)
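
For anyone following along, a minimal sketch of a debugging kernel config
with these checks turned on. It assumes a custom config file that includes
GENERIC, on a branch where GENERIC doesn't already enable them. The ident
DEBUGSEM is a made-up name; the option names are the stock ones.

```
include GENERIC
ident   DEBUGSEM            # hypothetical name, pick your own

options INVARIANTS          # extra kernel sanity checks
options INVARIANT_SUPPORT   # required by INVARIANTS
options WITNESS             # lock order verification, reports LORs
options WITNESS_SKIPSPIN    # skip spin mutexes to keep the overhead down
options DEADLKRES           # deadlock resolver
options KDB                 # kernel debugger framework
options DDB                 # interactive in-kernel debugger
```

Expect a noticeably slower kernel with WITNESS enabled, so this is for
reproducing the hang rather than for production builds.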


Adrian

On 11 January 2012 08:47, Garrett Cooper <yanegomi at gmail.com> wrote:
> On Wed, Jan 11, 2012 at 6:33 AM, Ivan Voras <ivoras at freebsd.org> wrote:
>> On 11 January 2012 14:06, John Baldwin <jhb at freebsd.org> wrote:
>>> On Wednesday, January 11, 2012 6:21:18 am Ivan Voras wrote:
>>>> The lang/python27 port can optionally be built with the support for
>>>> POSIX semaphores - i.e. sem(4). This option is labeled as experimental
>>>> so it may be that the code is simply incorrect. I've tried it and get
>>>> frequent hangs with the python process in the "usem" state. The kernel
>>>> stack is as follows and looks reasonable:
>>>>
>>>> # procstat -kk 19008
>>>>    PID    TID COMM             TDNAME           KSTACK
>>>>
>>>> 19008 101605 python           -                mi_switch+0x174
>>>> sleepq_catch_signals+0x2f4 sleepq_wait_sig+0x16 _sleep+0x269
>>>> do_sem_wait+0xa19 __umtx_op_sem_wait+0x51 amd64_syscall+0x450
>>>> Xfast_syscall+0xf7
>>>>
>>>> The process doesn't react to SIGINT or SIGTERM but fortunately reacts to
>>>> SIGKILL.
>>>>
>>>> This could be an error in Python code but OTOH this code is not
>>>> FreeBSD-specific so it's unlikely.
>>>
>>> This is using the new umtx-based semaphore code that David Xu wrote.  He is
>>> probably the best person to ask (cc'd).
>>>
>>
>> Ok, I've encountered the problem repeatedly while building databases/tdb,
>> which uses Python in its build process (though it may need something else
>> running in parallel to provoke the problem).
>
> Glad to see that iXsystems isn't the only one ([1] -- please add a "me
> too" to the PR). The problem is that we do FreeNAS nightlies and they
> frequently get stuck building tdb (10%~20% of the time) and it sticks
> when doing interactive builds as well. The issue appears to be
> exacerbated when we have more builds running in parallel on the same
> machine. I've also run into the same issue compiling talloc because it
> uses the same waf infrastructure as tdb, which was designed to "speed
> things up by forcing builds to be parallelized" (it runs
> kern.smp.ncpus jobs instead of -j 1). Furthermore, it seems to occur
> regardless of whether WITH_SEM is enabled in python (build.ix's copy
> of python doesn't have it enabled, but streetfighter.ix, my system
> bayonetta, etc. do).
>
> I haven't actually enabled WITNESS or the deadlock resolver and
> checked for LORs / deadlocks, but that might be an alternate avenue to
> pursue in debugging the issue. My gut is that the issue lies in the
> python interpreter's subprocess handling and/or GIL code: the race
> window around a command finishing is small, so in most cases python's
> code wins and continues on as usual. It could also be some
> non-thread-safe code running in parallel and touching things in the
> python interpreter that it shouldn't. It would also be interesting to
> see what python3k brings to the table, but using that would introduce
> some extra unknowns into the equation.
>
> It can be reproduced by running continuous builds of talloc or tdb.
>
> Thanks!
> -Garrett
>
> 1. http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/163489
> _______________________________________________
> freebsd-hackers at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
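
If rebuilding talloc or tdb in a loop is too heavy, the pattern Garrett
suspects (several threads each spawning and waiting on child processes, the
way waf does) can be approximated with a small stress script. This is only a
sketch under that assumption: the function name, job counts and timeout are
made up for illustration, and nothing in this thread confirms that it
triggers the hang.

```python
import subprocess
import sys
import threading


def run_parallel_commands(jobs=4, rounds=5, timeout=30.0):
    """Start `jobs` threads that each run `rounds` trivial child processes.

    Returns (completed_children, stuck_threads). A non-zero stuck count
    means a worker was still blocked when its join timed out.
    """
    completed = []
    lock = threading.Lock()

    def worker():
        for _ in range(rounds):
            # Trivial child: start an interpreter and exit immediately.
            subprocess.check_call([sys.executable, "-c", "pass"])
            with lock:
                completed.append(1)

    threads = [threading.Thread(target=worker) for _ in range(jobs)]
    for t in threads:
        t.daemon = True  # do not block interpreter exit on a hung worker
        t.start()
    for t in threads:
        t.join(timeout)  # wait at most `timeout` seconds per thread
    stuck = sum(1 for t in threads if t.is_alive())
    return len(completed), stuck


if __name__ == "__main__":
    done, stuck = run_parallel_commands()
    print("children completed: %d, threads stuck: %d" % (done, stuck))
```

Running it repeatedly with `jobs` set near kern.smp.ncpus, and grabbing
`procstat -kk` on the python PID whenever a thread is reported stuck, would
show whether the process is sitting in the same "usem" sleep.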

