sem(4) lockup in python?

Garrett Cooper yanegomi at gmail.com
Wed Jan 11 16:47:13 UTC 2012


On Wed, Jan 11, 2012 at 6:33 AM, Ivan Voras <ivoras at freebsd.org> wrote:
> On 11 January 2012 14:06, John Baldwin <jhb at freebsd.org> wrote:
>> On Wednesday, January 11, 2012 6:21:18 am Ivan Voras wrote:
>>> The lang/python27 port can optionally be built with the support for
>>> POSIX semaphores - i.e. sem(4). This option is labeled as experimental
>>> so it may be that the code is simply incorrect. I've tried it and get
>>> frequent hangs with the python process in the "usem" state. The kernel
>>> stack is as follows and looks reasonable:
>>>
>>> # procstat -kk 19008
>>>    PID    TID COMM             TDNAME           KSTACK
>>>
>>> 19008 101605 python           -                mi_switch+0x174
>>> sleepq_catch_signals+0x2f4 sleepq_wait_sig+0x16 _sleep+0x269
>>> do_sem_wait+0xa19 __umtx_op_sem_wait+0x51 amd64_syscall+0x450
>>> Xfast_syscall+0xf7
>>>
>>> The process doesn't react to SIGINT or SIGTERM but fortunately reacts to
>>> SIGKILL.
>>>
>>> This could be an error in Python code but OTOH this code is not
>>> FreeBSD-specific so it's unlikely.
>>
>> This is using the new umtx-based semaphore code that David Xu wrote.  He is
>> probably the best person to ask (cc'd).
>>
>
> Ok, I've encountered the problem repeatedly while building databases/tdb:
>  it uses Python in the build process (but maybe it needs something else in
> parallel to provoke the problem).

Glad to see that iXsystems isn't the only one hitting this ([1] --
please add a "me too" to the PR). We build FreeNAS nightlies, and they
get stuck building tdb roughly 10-20% of the time; interactive builds
hang as well. The issue appears to get worse when more builds run in
parallel on the same machine. I've also run into the same issue
compiling talloc, because it uses the same waf infrastructure as tdb,
which was designed to "speed things up by forcing builds to be
parallelized" (it runs kern.smp.ncpus jobs instead of -j 1).
Furthermore, it seems to occur regardless of whether WITH_SEM is
enabled in python (build.ix's copy of python doesn't have it enabled,
but streetfighter.ix, my system bayonetta, etc. do).

I haven't actually enabled WITNESS or the deadlock resolver and
checked for LORs / deadlocks, but that might be an alternate avenue to
pursue in debugging the issue. My gut feeling is that the issue lies
in the code that handles subprocesses and/or the GIL in the python
interpreter, and that the window for the race between a command
actually finishing and not is small, so in most cases python's code
wins and continues on as usual. It could also be some non-threadsafe
code running in parallel and touching things that it shouldn't in the
python interpreter. It would also be interesting to see what python3k
brings to the table, but using that would introduce some extra
unknowns into the equation.
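
One way to probe the subprocess/threading hypothesis outside of waf is
a small stress test that spawns short-lived children from several
threads at once. The following is only a sketch under assumptions: it
needs Python 3.3 or later, the function name stress_subprocess and its
parameters are made up for illustration, and a clean run does not rule
the race out.

```python
import subprocess
import sys
import threading
import time


def stress_subprocess(threads=8, spawns_per_thread=50, deadline=120.0):
    """Spawn trivial children from several threads in parallel.

    Returns the number of worker threads that had not finished when
    the deadline (in seconds) expired; 0 means nothing got stuck.
    All names and defaults here are hypothetical.
    """
    finished = []
    lock = threading.Lock()

    def worker():
        for _ in range(spawns_per_thread):
            # Each call forks a child interpreter and waits for it.
            subprocess.call([sys.executable, "-c", "pass"])
        with lock:
            finished.append(True)

    # Daemon threads, so a stuck worker cannot block interpreter exit.
    pool = [threading.Thread(target=worker, daemon=True)
            for _ in range(threads)]
    for t in pool:
        t.start()

    end = time.monotonic() + deadline
    for t in pool:
        t.join(max(0.0, end - time.monotonic()))

    with lock:
        return threads - len(finished)
```

A nonzero return value would be the cue to look at the stuck process
with procstat -kk, as in the original report.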

It can be reproduced by running continuous builds of talloc or tdb.
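
A minimal sketch of such a reproduction loop follows. It assumes
Python 3.3 or later (for the timeout argument to wait()); the function
name run_until_hang, its parameters, and the default timeout are
hypothetical and not part of any port or of waf.

```python
import subprocess


def run_until_hang(cmd, max_runs=100, timeout=600.0, keep_hung=False):
    """Run cmd repeatedly until one run exceeds timeout seconds.

    Returns (attempt, pid) for the first run that timed out, or None
    if all max_runs runs completed in time. With keep_hung=True the
    stuck process is left alive so it can be inspected.
    """
    for attempt in range(1, max_runs + 1):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            if not keep_hung:
                # kill() sends SIGKILL on POSIX systems.
                proc.kill()
                proc.wait()
            return attempt, proc.pid
    return None
```

For example, passing a command list that rebuilds tdb or talloc as cmd
and setting keep_hung=True leaves the hung process around, so the
returned pid can be handed to procstat -kk. The right build command
and a sensible timeout depend on the machine and are left to the
caller.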

Thanks!
-Garrett

1. http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/163489


More information about the freebsd-hackers mailing list