How best to debug locking/scheduler problems

John Baldwin jhb at freebsd.org
Wed Jun 17 21:18:11 UTC 2009


On Wednesday 17 June 2009 3:52:54 pm Mel Flynn wrote:
> On Wednesday 17 June 2009 04:15:26 John Baldwin wrote:
> > On Tuesday 16 June 2009 7:01:45 pm Mel Flynn wrote:
> > > On Tuesday 16 June 2009 11:02:42 John Baldwin wrote:
> > > > On Tuesday 16 June 2009 1:52:23 pm Mel Flynn wrote:
> > > > > Hi John,
> > > > >
> > > > > On Tuesday 16 June 2009 04:19:57 John Baldwin wrote:
> > > > > > On Monday 15 June 2009 5:53:05 pm Mel Flynn wrote:
> > > > > > >   PID    TID COMM             TDNAME           KSTACK
> > > > > > >  4283 100215 kdeinit4         -                mi_switch
> > > > > > > turnstile_wait _mtx_lock_sleep uipc_peeraddr kern_getpeername
> > > > > > > getpeername syscall Xint0x80_syscall
> > > > > > > % ps -ww 4283
> > > > > > >   PID  TT  STAT      TIME COMMAND
> > > > > > >  4283  ??  T      0:00.38 kdeinit4: kdeinit4: kio_http http
> > > > > > > local:/tmp/ksocket-mel/klauncherxJ1635.slave-socket
> > > > > > > local:/tmp/ksocket- mel/plasmayC1653.slave-socket (kdeinit4)
> > > > > > >
> > > > > > > %ls -l /tmp/ksocket-mel/
> > > > > > >
> > > > > > > total 2
> > > > > > > -rw-rw-r--  1 mel  wheel  62 Jun 14 22:55 KSMserver__0
> > > > > > > srw-------  1 mel  wheel   0 Jun 14 22:55 kdeinit4__0
> > > > > > > srwxrwxr-x  1 mel  wheel   0 Jun 14 22:55
> > > > > > > klauncherxJ1635.slave-socket
> > > > > >
> > > > > > You can use kgdb and the scripts at www.freebsd.org/~jhb/gdb. 
> > > > > > Simply run 'kgdb' as root and do 'lcd /folder/with/scripts' and
> > > > > > 'source gdb6'. You can then do 'lockchain 4283' to find who holds
> > > > > > the lock this thread is blocked on and what state they are in.
> > > > >
> > > > > Looks like a deadlock:
> > > > >
> > > > > (kgdb) lockchain 4283
> > > > >  thread 100215 (pid 4283, kdeinit4) blocked on lock 0xc64374a0
> > > > > "unp_mtx" thread 100122 (pid 1635, klauncher) blocked on lock
> > > > > 0xc6806348 "unp_mtx" thread 100215 (pid 4283, kdeinit4) blocked on
> > > > > lock 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, klauncher) 
blocked
> > > > > on lock 0xc6806348 "unp_mtx" thread 100215 (pid 4283, kdeinit4)
> > > > > blocked on lock 0xc64374a0 "unp_mtx" thread 100122 (pid 1635,
> > > > > klauncher) blocked on lock 0xc6806348 "unp_mtx" thread 100215 (pid
> > > > > 4283, kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" thread 100122
> > > > > (pid 1635, klauncher) blocked on lock 0xc6806348 "unp_mtx" thread
> > > > > 100215 (pid 4283, kdeinit4) blocked on lock 0xc64374a0 "unp_mtx"
> > > > > thread 100122 (pid 1635, klauncher) blocked on lock 0xc6806348
> > > > > "unp_mtx" thread 100215 (pid 4283, kdeinit4) blocked on lock
> > > > > 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, klauncher) blocked on
> > > > > lock 0xc6806348 "unp_mtx" thread 100215 (pid 4283, kdeinit4) blocked
> > > > > on lock 0xc64374a0 "unp_mtx" thread 100122 (pid 1635, klauncher)
> > > > > blocked on lock 0xc6806348 "unp_mtx" thread 100215 (pid 4283,
> > > > > kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" thread 100122 (pid
> > > > > 1635, klauncher) blocked on lock 0xc6806348 "unp_mtx" thread 100215
> > > > > (pid 4283, kdeinit4) blocked on lock 0xc64374a0 "unp_mtx" thread
> > > > > 100122 (pid 1635, klauncher) blocked on lock 0xc6806348 "unp_mtx"
> > > > > thread 100215 (pid 4283, kdeinit4) blocked on lock 0xc64374a0
> > > > > "unp_mtx" thread 100122 (pid 1635, klauncher) blocked on lock
> > > > > 0xc6806348 "unp_mtx" DEADLOCK
> > > > >
> > > > > Looking through the scripts now to see how I can get more info on 
the
> > > > > call chain and hoping I don't panic the machine ;). It is quite
> > > > > random to reproduce.
> > > >
> > > > In kgdb you can simply do 'tid 100122' followed by 'bt' and 'tid
> > > > 100215' followed by 'bt'.
> > >
> > > Cool, thanks for helping John. Of course it pretty much shows me what
> > > procstat -k shows and can't get any info on the userland part, but I can
> > > fully inspect the locks and threads.
> > >
> > > Both threads are in TDS_INHIBITED state, and blocked on:
> > > (kgdb) frame 0
> > > #0  sched_switch (td=0xc5971240, newtd=0xc4d39900, flags=259)
> > >     at /usr/src/sys/kern/sched_ule.c:1864
> > > 1864                    cpuid = PCPU_GET(cpuid);
> >
> > That doesn't really tell us anything except that it isn't running.  We 
know
> > it is actually blocked on a lock, and we need the full stack trace to see
> > where the two threads were trying to acquire the locks, hence 'bt'.  '
> > procstat -k' output would be fine, too.
> 
> the respective bt's:
> (kgdb) tid 100122
>     at /usr/src/sys/kern/kern_mutex.c:447
> #4  0xc06a68a5 in uipc_peeraddr (so=0xc64309a8, nam=0xe79a2c70)
>     at /usr/src/sys/kern/uipc_usrreq.c:682
> #5  0xc06a1e71 in kern_getpeername (td=0xc56e8900, fd=12, sa=0xe79a2c70, 
> alen=0xe79a2c6c)
>     at /usr/src/sys/kern/uipc_syscalls.c:1566
> 
> (kgdb) tid 100215
> (kgdb) bt
>     at /usr/src/sys/kern/kern_mutex.c:447
> #4  0xc06a68a5 in uipc_peeraddr (so=0xc6976338, nam=0xe9ae9c70)
>     at /usr/src/sys/kern/uipc_usrreq.c:682
> #5  0xc06a1e71 in kern_getpeername (td=0xc5971240, fd=7, sa=0xe9ae9c70, 
> alen=0xe9ae9c6c)
>     at /usr/src/sys/kern/uipc_syscalls.c:1566

These are the key frames.  It looks like uipc_peeraddr() tries to lock two unp 
locks w/o any protection from the global unp linkage lock.  I've changed it 
to use the same locking as uipc_accept() where it first grabs a read lock on 
the linkage lock and then just locks the other end of the connection to copy 
out its sockaddr.

--- //depot/user/jhb/socket/kern/uipc_usrreq.c
+++ /home/jhb/work/p4/socket/kern/uipc_usrreq.c
@@ -671,7 +671,7 @@
 	KASSERT(unp != NULL, ("uipc_peeraddr: unp == NULL"));
 
 	*nam = malloc(sizeof(struct sockaddr_un), M_SONAME, M_WAITOK);
-	UNP_PCB_LOCK(unp);
+	UNP_LINK_RLOCK();
 	/*
 	 * XXX: It seems that this test always fails even when connection is
 	 * established.  So, this else clause is added as workaround to
@@ -681,7 +681,7 @@
 	if (unp2 != NULL) {
 		UNP_PCB_LOCK(unp2);
 		if (unp2->unp_addr != NULL)
-			sa = (struct sockaddr *) unp->unp_conn->unp_addr;
+			sa = (struct sockaddr *) unp2->unp_addr;
 		else
 			sa = &sun_noname;
 		bcopy(sa, *nam, sa->sa_len);
@@ -690,7 +690,7 @@
 		sa = &sun_noname;
 		bcopy(sa, *nam, sa->sa_len);
 	}
-	UNP_PCB_UNLOCK(unp);
+	UNP_LINK_RUNLOCK();
 	return (0);
 }
 
@@ -850,7 +850,7 @@
 		 * return the slightly counter-intuitive but otherwise
 		 * correct error that the socket is not connected.
 		 *
-		 * Locking here must be done carefully: the inkage lock
+		 * Locking here must be done carefully: the linkage lock
 		 * prevents interconnections between unpcbs from changing, so
 		 * we can traverse from unp to unp2 without acquiring unp's
 		 * lock.  Socket buffer locks follow unpcb locks, so we can

-- 
John Baldwin


More information about the freebsd-hackers mailing list