cvs commit: src/lib/libkse/thread thr_kern.c

Arno J. Klaassen arno at heho.snv.jussieu.fr
Fri Jan 18 13:46:34 PST 2008


Hello,


>LF On Dec 2, 2007, at 09:31, Arno J. Klaassen wrote:

>LF > For info, the attached patch, which partially reverts mfc of rev 1.286
>LF > of kern_fork.c, seems to work as well (without the above patch to
>LF > be clear),

>LF I just upgraded our 8-core build server from pre-november 6-STABLE to
>LF 6.3-RELEASE, and ran into this issue, causing our fork-heavy builder
>LF processes to lock up regularly.

>LF Your suggested patch (reverting the 1.286 MFC to sys/kern/
>LF kern_fork.c) allows our builds to run to completion;

Well, all I can say is that the box where I had the problem is a heavily
used production server, and it has been running flawlessly and
uninterrupted since the end of November with the kern_fork.c partial revert.

It doesn't seem to hurt or disrupt anything else (that I use).

> JE .. the reason it was changed was that the
> JE previous code results in heavily loaded threaded processes that
> JE fork, hanging in indefinite lockups IN THE KERNEL. Eventually
> JE the whole machine would become unusable.  In particular when
> JE there is NFS being used but in other situations too. So I'm
> JE damned if I do and damned if I don't on this.

Maybe; we (now) use FreeBSD almost exclusively for Java + NFS (with some
vestiges of C[++] holding out). I only got this problem on 2x2 SMP
RELENG_6, not on RELENG_[67] UP or 1x2 SMP. I had a 'similar' problem
on 2x2 SMP RELENG_7, which was band-aided with rev 1.128 of
lib/libkse/thread/thr_kern.

JE> > please do a ktrace of the program and send that to me
JE> >
JE> Here's my guess as to what is happening:
JE> this is not based on code..
JE>
JE> thread 1 calls the dummy fork(3)
JE> thread 2 calls the dummy fork(3)
JE> thread 1 calls fork(2), (the syscall, from within the dummy fork)
JE> thread 2 calls fork(2) (the real one in the kernel)
JE>      thread 1 proceeds
JE>      thread 2 blocks on a VM lock until thread 1 completes
JE>      kernel duplicates the memory space
JE> thread 1 returns from fork(2)
JE> thread 1 takes out mutex X inside dummy fork(3)
JE>      thread 2 proceeds in the kernel on forking.
JE>      kernel duplicates the memory space (including mutex X)
JE> thread 2 returns from kernel and looks for mutex X
JE> thread 2 in client tries to take out mutex X inside dummy fork(3) and
JE> waits.
JE> thread 1 releases mutex X
JE> thread 2 proceeds
JE> ================================
JE> in child1 thread 1 runs fine.
JE> in child2 thread 2 waits for thread 1 to drop the mutex
JE>   (there is no thread 1)
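
To make the scenario above concrete, here is a minimal C sketch. It is
my own illustration, not code from libkse, libthr or the kernel, and the
names mutex_x, holder and child_sees_locked_mutex are made up for the
example. One thread holds a mutex while another thread forks; the child
gets a copy of the locked mutex but not the thread that would unlock it.
The child uses pthread_mutex_trylock() instead of pthread_mutex_lock()
so that it reports the problem rather than hanging. Calling pthread
functions in the child of a threaded process is not guaranteed to be
safe by POSIX; it is done here only to inspect the copied state.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for "mutex X" in the scenario above. */
static pthread_mutex_t mutex_x = PTHREAD_MUTEX_INITIALIZER;
static int ready_pipe[2];    /* holder -> forker: "mutex X is locked" */
static int release_pipe[2];  /* forker -> holder: "you may unlock now" */

/* Plays "thread 1": takes mutex X and keeps it across the fork. */
static void *holder(void *arg)
{
    char c = 'x';

    (void)arg;
    pthread_mutex_lock(&mutex_x);
    if (write(ready_pipe[1], &c, 1) != 1)
        c = 0;                       /* nothing useful to do on error */
    if (read(release_pipe[0], &c, 1) != 1)
        c = 0;
    pthread_mutex_unlock(&mutex_x);
    return NULL;
}

/*
 * Plays "thread 2": forks while the other thread owns mutex X.
 * Returns 1 if the child found mutex X locked, 0 if it did not,
 * and -1 on any setup error.
 */
int child_sees_locked_mutex(void)
{
    pthread_t t;
    pid_t pid;
    int status;
    char c = 0;

    if (pipe(ready_pipe) != 0 || pipe(release_pipe) != 0)
        return -1;
    if (pthread_create(&t, NULL, holder, NULL) != 0)
        return -1;
    if (read(ready_pipe[0], &c, 1) != 1)   /* wait until X is held */
        return -1;

    pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread exists here, so nobody
         * will ever unlock the copied mutex X. */
        int rc = pthread_mutex_trylock(&mutex_x);
        _exit(rc == EBUSY ? 1 : 0);
    }

    /* Parent: let the holder thread unlock and finish. */
    if (write(release_pipe[1], &c, 1) != 1)
        return -1;
    pthread_join(t, NULL);

    if (pid < 0 || waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling child_sees_locked_mutex() from a small main() (linked with the
thread library) should report that the child sees the mutex as busy; a
child that called pthread_mutex_lock() at that point would wait forever,
which is the "there is no thread 1" situation.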

[ .. alternative .. ]

DE > I suppose it is malloc() that is getting into an inconsistent
DE > state in the child.


I'm not qualified to judge the FreeBSD internals, but both sound
plausible to me, in the sense that the thread library does not seem to
matter: the problem is easy to provoke with libpthread on RELENG_6, just
a bit less easy to provoke with libthr on RELENG_6 (see PR 116667 and
166668), harder to provoke with libpthread on RELENG_7 (the band-aid
above is sufficient for me to not be able to reproduce it again), and I
am unable to provoke it with libthr on RELENG_7.


Hope this helps to get a clue.


Best regards,

Arno

