fork wedging (I think)

Daniel O'Connor doconnor at gsoft.com.au
Thu Mar 1 01:19:09 UTC 2007


We run a set of meteor and wind radar systems based on FreeBSD and one of them
I am seeing an odd problem that does not occur very often (once a month
perhaps).

The main process we use for gathering data regularly fork()s children off to
process data (at each 30 minute acquisition cycle). The data is read from an 
external acquisition chassis using a (slow) custom PCI card. Typical data 
rates are 1-2Mbyte/sec. It will also fork off processes to send email,
although unfortunately I don't know how to extract a Tcl stack trace from 
inside gdb :(

The problem seems to occur when the acquisition process (called the Recorder)
tries to fork off a new child. I gathered some information during an earlier
failure (mid January) show below. I tried to get a crash dump today when it
failed again but as I ran sysctl -a the system wedged. I think processes got
stuck in allproc (I pressed ctrl-t on an ls process running another session
opened before I ran sysctl).

  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
19999  1453     1  44  -8 -5 17056 11380 piperd I<s   ??  144:41.70 /usr/home/radar/skiymet/libexec/Recorder /usr/home/radar/skiymet/libexec/acquisition/sks.tcl /usr/home/radar/skiymet/etc/ud3
19999 12293  1453  44  96 -5 17056    16 user m D<L   ??    0:00.00 /usr/home/radar/skiymet/libexec/Recorder /usr/home/radar/skiymet/libexec/acquisition/sks.tcl /usr/home/radar/skiymet/etc/ud3

Note that under normal circumstances there would only be one Recorder process.

eureka:~>gdb $GSHOME/libexec/Recorder
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
(gdb) eureka:~>fg
gdb /usr/home/radar/skiymet/libexec/Recorder

(gdb)
(gdb) attach 1453
Attaching to program: /usr/home/radar/skiymet/libexec/Recorder, process 1453
Reading symbols from /usr/local/lib/libtcl84.so.1...done.
Loaded symbols for /usr/local/lib/libtcl84.so.1
Reading symbols from /lib/libm.so.4...done.
Loaded symbols for /lib/libm.so.4
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /usr/home/radar/skiymet/lib/acquisition//synccheck.plugin...done.
Loaded symbols for /usr/home/radar/skiymet/lib/acquisition//synccheck.plugin
Reading symbols from /usr/home/radar/skiymet/lib/acquisition//integrate.plugin...done.
Loaded symbols for /usr/home/radar/skiymet/lib/acquisition//integrate.plugin
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
0x00000008009c395c in read () from /lib/libc.so.6
(gdb) bt
#0  0x00000008009c395c in read () from /lib/libc.so.6
#1  0x000000080072f77f in TclpCreateProcess () from /usr/local/lib/libtcl84.so.1
#2  0x0000000800717d25 in TclCreatePipeline () from /usr/local/lib/libtcl84.so.1
#3  0x00000008007186d0 in Tcl_OpenCommandChannel () from /usr/local/lib/libtcl84.so.1
#4  0x0000000800704af8 in Tcl_ExecObjCmd () from /usr/local/lib/libtcl84.so.1
#5  0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#6  0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84.so.1
#7  0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so.1
#8  0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.so.1
#9  0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#10 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84.so.1
#11 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so.1
#12 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.so.1
#13 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#14 0x00000008006cb1e4 in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#15 0x0000000800721511 in Tcl_AfterObjCmd () from /usr/local/lib/libtcl84.so.1
#16 0x0000000800720ccc in Tcl_CreateTimerHandler () from /usr/local/lib/libtcl84.so.1
#17 0x0000000800711fea in Tcl_ServiceEvent () from /usr/local/lib/libtcl84.so.1
#18 0x0000000800712357 in Tcl_DoOneEvent () from /usr/local/lib/libtcl84.so.1
#19 0x000000000041b272 in gsio_read (interp=0x58d400, iov=0xe14408, iovcnt=4, timeout=81000) at gsio.c:234
#20 0x000000000040a3fb in rawdata_wakeup (interp=0x58d400, inqueue=0x7fffffff6aa0, outqueue=0xdf,
    flags=0x7fffffffa584, token=0xffffff000713c340) at data.c:179
#21 0x00000000004327b3 in proc_run (interp=0x58d400) at procedure.c:528
#22 0x00000000004306b5 in ud_Stream (cd=0xd, interp=0x58d400, argc=2, argv=0x0) at procedure.c:157
#23 0x00000008006c840b in TclInvokeStringCommand () from /usr/local/lib/libtcl84.so.1
#24 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#25 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84.so.1
#26 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so.1
#27 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.so.1
#28 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#29 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84.so.1
#30 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so.1
#31 0x000000080071b7ad in TclObjInterpProc () from /usr/local/lib/libtcl84.so.1
#32 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#33 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84.so.1
#34 0x00000008006f462d in TclCompEvalObj () from /usr/local/lib/libtcl84.so.1
#35 0x00000008006cbc86 in Tcl_EvalObjEx () from /usr/local/lib/libtcl84.so.1
#36 0x00000008006ceca2 in Tcl_CatchObjCmd () from /usr/local/lib/libtcl84.so.1
#37 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#38 0x00000008006f0b03 in TclExprFloatError () from /usr/local/lib/libtcl84.so.1
#39 0x00000008006f485a in Tcl_ExprObj () from /usr/local/lib/libtcl84.so.1
#40 0x00000008006ca3b3 in Tcl_ExprBooleanObj () from /usr/local/lib/libtcl84.so.1
#41 0x00000008006d1979 in Tcl_IfObjCmd () from /usr/local/lib/libtcl84.so.1
#42 0x00000008006c9c72 in TclEvalObjvInternal () from /usr/local/lib/libtcl84.so.1
#43 0x00000008006cb1e4 in Tcl_EvalEx () from /usr/local/lib/libtcl84.so.1
#44 0x000000080070b092 in Tcl_FSEvalFile () from /usr/local/lib/libtcl84.so.1
#45 0x000000080070ea30 in Tcl_Main () from /usr/local/lib/libtcl84.so.1
#46 0x00000000004281fe in main (argc=13, argv=0x7fffffff6aa0) at main.c:49
(gdb)
#19 0x000000000041b272 in gsio_read (interp=0x58d400, iov=0xe14408, iovcnt=4, timeout=81000) at gsio.c:234
234                 Tcl_DoOneEvent(TCL_ALL_EVENTS | TCL_DONT_WAIT);
(gdb)
#20 0x000000000040a3fb in rawdata_wakeup (interp=0x58d400, inqueue=0x7fffffff6aa0, outqueue=0xdf,
    flags=0x7fffffffa584, token=0xffffff000713c340) at data.c:179
179         if (dop->do_ifmod->m.ifmod.if_read(interp, dop->do_iov, DATA_NBUFS, timeout))

eureka:~>kill 1453
eureka:~>kill 1453
1453: No such process
eureka:~>ps -axlwww | grep Rec
  UID   PID  PPID CPU PRI NI   VSZ   RSS MWCHAN STAT  TT       TIME COMMAND
19999 12293     1  44  96 -5 17056    16 user m D<L   ??    0:00.00 /usr/home/radar/skiymet/libexec/Recorder /usr/home/radar/skiymet/libexec/acquisition/sks.tcl /usr/home/radar/skiymet/etc/ud3
19999 56552 53088   0  96  0   388   272 -      R+    p3    0:00.00 grep Rec
eureka:~>kill 12293
eureka:~>kill -9 12293
eureka:~>kill -9 12293
eureka:~>fstat /dev/gsio0
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W NAME
radar    Recorder   12293    3 /dev        149 crw-rw-r--   gsio0 rw  /dev/gsio0

Interestingly close-on-exec is set on that file descriptor, so it shouldn't 
be keeping the gsio0 device busy like that.

Does anyone have any suggestions for debugging it? I will try and get a crash
dump the next time it happens although it's going to be difficult if I can't
use sysctl :(

(I don't have console access and the system is in a very remote location just
to make it extra fun)

The FreeBSD version is 6.1 with some minor mods to add the various custom
drivers we're using.

Anyone have any suggestions for tracking the problem down? Or better yet a
solution :) I can upgrade it to RELENG_6 if it will fix the problem but I'd
rather not do so without a reasonable idea it will improve things.

Thanks.

-- 
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
  -- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 187 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20070301/9b9c428c/attachment.pgp


More information about the freebsd-stable mailing list