misc/166340: Process under FreeBSD 9.0 hangs in uninterruptable
sleep with apparently no syscall (empty wchan)
Christian Esken
Christian.Esken at trivago.com
Tue Mar 27 16:40:15 UTC 2012
The following reply was made to PR kern/166340; it has been noted by GNATS.
From: Christian Esken <Christian.Esken at trivago.com>
To: bug-followup at freebsd.org
Cc: Konstantin Belousov <kostikbel at gmail.com>, avg at freebsd.org
Subject: Re: misc/166340: Process under FreeBSD 9.0 hangs in uninterruptable
sleep with apparently no syscall (empty wchan)
Date: Tue, 27 Mar 2012 17:30:48 +0200
--=-3sl93MaMYlu/dkvvUUBe
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Konstantin Belousov wrote:
> Thank you for the data. Semi-obviously, the callout_stop() call in
> sleepq_check_timeout() have to return 0, otherwise we would not call
> mi_switch() there. But I do not see how this can happen, because
> the callout state, printed from kgdb, still indicates that callout
> is pending. Callout cannot be reset while in sleepq code.
>
> So there are two possible routes to go forward: preferrable is for
> you to extract the self-contained C program that would illustrate
> the issue and send this sample to me. Second is to recompile your
> kernel with INVARIANTS/WITNESS and possibly KTR and see what happen.
I repeated the test with INVARIANTS/WITNESS and KTR compiled in
(actually WITNESS was already included during the last test).
I ran KTR with nothing filtered out, and formatted the dump with
"ktrdump -cftH -i ktr.out". The whole log is excessive (1GB), so
I have extrated two short sections (see attachment).
The first section shows the last action of the application, namely a
succselful sendto() to a TCP socket, and then waiting for an answer via
recvfrom().
The second section illustrates the lock/unlock sequence of the sleep
mutex for the recfrom(). It goes like LOCK, LOCK, UNLOCK.
This time the signal status is different. We have a pending signal:
USER PID PPID PENDING CAUGHT IGNORED BLOCKED STAT WCHAN
nobody 9163 1 4000 80005006 79f88010 0 D -
Looks like SIGPROF (27). Just wondering where it comes from.
By the way: I evaluated the possibility to implement a standalone test
case. It would be extremely complicated, as the issue is while writing
to the socket, and thus it would require extracting the socket code from
the Thrift procect (http://thrift.apache.org/ ).
Christian
--=-3sl93MaMYlu/dkvvUUBe
Content-Disposition: attachment; filename="wait_recvfrom.txt"
Content-Type: text/plain; name="wait_recvfrom.txt"; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Last actions of pid 9163 (serelog):
sendto() succesful
recvfrom() waits for data, uisng sleep mutex
7463551 5 5644560314159 /usr/src/sys/kern/kern_sx.c:352 0xfffffe01972b2480 XUNLOCK (sx) so_snd_sx 0xfffffe0344b07490 r = 0 at /usr/src/sys/kern/uipc_sockbuf.c:160
7463552 1 5644560316280 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe01bcf4d000 LOCK (sleep mutex) kqueue 0xfffffe03eef5eb00 r = 0 at /usr/src/sys/kern/kern_event.c:1779
7463553 5 5644560319107 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:149 0xfffffe01972b2480 syscall: p=0xfffffe01bc1b0910 error=0 return 0x77d 0x77d
7463557 5 5644560329931 /usr/src/sys/kern/subr_trap.c:101 0xfffffe01972b2480 userret: thread 0xfffffe01972b2480 (pid 9163, serelog)
7463559 4 5644560336733 /usr/src/sys/vm/uma_core.c:1975 0xfffffe0008364480 uma_zalloc_arg thread 8364480 zone mbuf flags 1
7463561 6 5644560344432 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:84 0xfffffe01972ae900 syscall: td=0xfffffe01972ae900 pid 9767 gettimeofday (0x7fffffffac40, 0, 0x43fd18)
7463562 1 5644560347528 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe01bcf4d000 UNLOCK (sleep mutex) kqueue 0xfffffe03eef5eb00 r = 0 at /usr/src/sys/kern/kern_event.c:1796
7463563 6 5644560348788 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:149 0xfffffe01972ae900 syscall: p=0xfffffe01bc1ad000 error=0 return 0 0x43fd18
7463564 5 5644560351047 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187 0xfffffe01972b2480 syscall sendto exit thread 0xfffffe01972b2480 pid 9163 proc serelog
7463565 1 5644560354848 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe01bcf4d000 LOCK (sleep mutex) kqueue 0xfffffe02f7525700 r = 0 at /usr/src/sys/kern/kern_event.c:1779
7463567 5 5644560360499 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:84 0xfffffe01972b2480 syscall: td=0xfffffe01972b2480 pid 9163 gettimeofday (0x7fffffffc600, 0, 0xffffffff
)
7463568 2 5644560364617 /usr/src/sys/kern/kern_mutex.c:356 0xfffffe004e384000 _mtx_lock_sleep: taskqueue contested (lock=0xfffffe0008375000) at /usr/src/sys/kern/kern_mutex.c
:147
7463569 5 5644560366559 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:149 0xfffffe01972b2480 syscall: p=0xfffffe01bc1b0910 error=0 return 0 0xffffffff
7463570 3 5644560369374 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe0008375000 UNLOCK (sleep mutex) taskqueue 0xfffffe004e2604b0 r = 0 at /usr/src/sys/kern/subr_taskqueue.c:21
6
7463572 6 5644560378372 /usr/src/sys/kern/subr_trap.c:101 0xfffffe01972ae900 userret: thread 0xfffffe01972ae900 (pid 9767, httpd)
7463573 3 5644560380378 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe0008375000 LOCK (sleep mutex) bio queue 0xffffffff81146ad0 r = 0 at /usr/src/sys/geom/geom_io.c:77
7463574 4 5644560385085 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe0008364480 LOCK (sleep mutex) mbuf 0xfffffe043ffe5e10 r = 0 at /usr/src/sys/vm/uma_core.c:2013
7463575 1 5644560388724 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe01bcf4d000 UNLOCK (sleep mutex) kqueue 0xfffffe02f7525700 r = 0 at /usr/src/sys/kern/kern_event.c:1796
7463576 6 5644560391648 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187 0xfffffe01972ae900 syscall gettimeofday exit thread 0xfffffe01972ae900 pid 9767 proc httpd
7463577 3 5644560394202 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe0008375000 UNLOCK (sleep mutex) bio queue 0xffffffff81146ad0 r = 0 at /usr/src/sys/geom/geom_io.c:84
7463578 1 5644560396480 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe01bcf4d000 LOCK (sleep mutex) kqueue 0xfffffe01e96bd000 r = 0 at /usr/src/sys/kern/kern_event.c:1779
7463580 5 5644560404315 /usr/src/sys/kern/subr_trap.c:101 0xfffffe01972b2480 userret: thread 0xfffffe01972b2480 (pid 9163, serelog)
7463581 2 5644560408289 /usr/src/sys/kern/kern_mutex.c:374 0xfffffe004e384000 _mtx_lock_sleep: spinning on 0xfffffe004e2604b0 held by 0xfffffe0008375000
7463582 5 5644560410639 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187 0xfffffe01972b2480 syscall gettimeofday exit thread 0xfffffe01972b2480 pid 9163 proc serelog
7463583 2 5644560414981 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe004e384000 LOCK (sleep mutex) taskqueue 0xfffffe004e2604b0 r = 0 at /usr/src/sys/kern/kern_mutex.c:147
7463584 5 5644560419167 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:84 0xfffffe01972b2480 syscall: td=0xfffffe01972b2480 pid 9163 recvfrom (0x6, 0x7fffffffc734, 0x4)
7463585 1 5644560421688 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe01bcf4d000 UNLOCK (sleep mutex) kqueue 0xfffffe01e96bd000 r = 0 at /usr/src/sys/kern/kern_event.c:1796
7463586 5 5644560423707 /usr/src/sys/kern/kern_sx.c:291 0xfffffe01972b2480 XLOCK (sx) so_rcv_sx 0xfffffe0344b073a0 r = 0 at /usr/src/sys/kern/uipc_sockbuf.c:148
7463587 3 5644560426330 /usr/src/sys/geom/geom_io.c:678 0xfffffe0008375000 g_up biodone bp 0xfffffe004e8a0740 provider da0s1 off 134525138944 len 131072
7463588 5 5644560428231 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe01972b2480 LOCK (sleep mutex) so_rcv 0xfffffe0344b07380 r = 0 at /usr/src/sys/kern/uipc_socket.c:1488
This is the last place where I see 0xfffffe01972b2480 or "pid 9163".
The value behind so_rcv os 0xfffffe0344b07380, and that one comes twice again:
7464953 4 5644563693233 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe0008364480 LOCK (sleep mutex) so_rcv 0xfffffe0344b07380 r = 0 at /usr/src/sys/netinet/tcp_input.c:2834
7464954 0 5644563696896 /usr/src/sys/kern/kern_mutex.c:205 0xfffffe004e75a900 LOCK (sleep mutex) KNOTE 0xfffffe004e104010 r = 0 at /usr/src/sys/vm/uma_core.c:2013
7464955 6 5644563699004 /usr/src/sys/kern/subr_trap.c:101 0xfffffe01972ae900 userret: thread 0xfffffe01972ae900 (pid 9767, httpd)
7464956 1 5644563701340 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe01bcf4d000 UNLOCK (sleep mutex) kqueue 0xfffffe01f3fb8900 r = 0 at /usr/src/sys/kern/kern_event.c:1796
7464957 6 5644563704636 /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:187 0xfffffe01972ae900 syscall gettimeofday exit thread 0xfffffe01972ae900 pid 9767 proc httpd
7464958 3 5644563706986 /usr/src/sys/geom/geom_io.c:165 0xfffffe0008375000 #2 0xffffffff807c5a45 at g_io_schedule_up+0x175
7464959 4 5644563708981 /usr/src/sys/kern/kern_mutex.c:222 0xfffffe0008364480 UNLOCK (sleep mutex) so_rcv 0xfffffe0344b07380 r = 0 at /usr/src/sys/kern/uipc_sockbuf.c:201
What I find notewothy is, that the lock/unlock sequence with so_rcv 0xfffffe0344b07380 looks odd:
LOCK, LOCK, UNLOCK
Natively one would expect to see one more UNLOCK like this:
LOCK, UNLOCK, LOCK, UNLOCK
--=-3sl93MaMYlu/dkvvUUBe--
More information about the freebsd-bugs
mailing list