LOR with turnstile chain and sio

Wed Jan 21 23:17:39 PST 2004

On Wed, 21 Jan 2004, Kris Kennaway wrote:

> On Wed, Jan 21, 2004 at 10:21:38AM -0800, Kris Kennaway wrote:
> > Well, I managed to panic bento in under 12 hours of use, after Joe had
> > been using it for the past 2 months without problems :-)
>
> It panicked again a few minutes after rebooting:

It got a page fault (probably a null pointer) in propagate_priority().
The rest is bugs in the debugging code.  It's interesting that it managed
to print messages about a bug in siocnputc() (actually mostly in witness)
using siocnputc().

> lock order reversal
>  1st 0xc06e6864 turnstile chain (turnstile chain) @ kern/subr_turnstile.c:233

The turnstile lock seems to be tc->tc_lock.  sched_lock is also held in
propagate_priority() and doesn't seem to cause any LORs or other problems
with serial consoles, but holding it for calls to printf() at least used
to be fatal for syscons consoles.

>  2nd 0xc0717580 sio (sio) @ dev/sio/sio.c:3203
> Stack backtrace:
> backtrace(c06923f4,c0717580,c06d2020,c06d2020,c06a0ede) at backtrace+0x17

The LOR is bogus.  cnputc() must be callable at almost any time so that it
can print debugging messages like this one, so acquiring sio_lock must work
no matter what other locks are held.

> witness_lock(c0717580,8,c06a0ede,c83,3f8) at witness_lock+0x672
> _mtx_lock_spin_flags(c0717580,0,c06a0ed5,c83,c6758aa8) at _mtx_lock_spin_flags+0xda
> siocnputc(c06d21a0,6b,5,e1bffa1c,6b) at siocnputc+0x81
> cnputc(6b,c06e9688,1,c64f1640,c06a51d9) at cnputc+0x7a
> putchar(6b,e1bffa1c,c06e97c8,c06e9688,c06e7c80) at putchar+0x6c
> kvprintf(c06a51d8,c0529640,e1bffa1c,a,e1bffa3c) at kvprintf+0x8d

I think the printf is about a trap with interrupts disabled.  That shouldn't
happen any more than the page fault, but it is certain to happen for page
faults in propagate_priority() in -current, since sched_lock is held and
spinlocks mask interrupts too much.

> printf(c06a51d8,c,1,c068e464,14b) at printf+0x57
> trap(c06e0018,ffc00010,c06e0010,b4,c64f2740) at trap+0xd7

Got a page fault in propagate_priority().

> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0xc052e9f0, esp = 0xe1bffac0, ebp = 0xe1bffae4 ---
> propagate_priority(c64f1640,0,c0691bdc,1dd,c06e7080) at propagate_priority+0x220
> turnstile_wait(0,c06e34e0,c64f13c0,1cc,c06e34e0) at turnstile_wait+0x33d
> _mtx_lock_sleep(c06e34e0,0,c06929fb,2e7,c6846500) at _mtx_lock_sleep+0x125
> _mtx_lock_flags(c06e34e0,0,c06929fb,2e7,0) at _mtx_lock_flags+0x98
> kern_select(c64f1640,7,bfbfed40,0,0) at kern_select+0x47
> select(c64f1640,e1bffd14,c06a525a,3ee,5) at select+0x66
> syscall(2f,2f,2f,bfbfedc0,80810a0) at syscall+0x2c0
> Xint0x80_syscall() at Xint0x80_syscall+0x1d
> --- syscall (93), eip = 0x28105b8f, esp = 0xbfbfed0c, ebp = 0xbfbfee58 ---
>
> and then page faulted and went into a hard lockup.

Perhaps it recursed some more into the debugging code.

I used the following in an old uncommitted version of sio to prevent sio
locking going anywhere near witness even in the non-console case:

%%%
Index: sio.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/sio/sio.c,v
retrieving revision 1.418
diff -u -2 -r1.418 sio.c

--- sio.c	18 Jan 2004 12:26:33 -0000	1.418
+++ sio.c	19 Jan 2004 00:18:06 -0000
@@ -522,7 +640,7 @@
 	while (sio_inited != 2)
 		if (atomic_cmpset_int(&sio_inited, 0, 1)) {
+			/* XXX might still need MTX_QUIET. */
 			mtx_init(&sio_lock, sio_driver_name, NULL,
-			    (comconsole != -1) ?
-			    MTX_SPIN | MTX_QUIET : MTX_SPIN);
+				 MTX_SPIN | MTX_NOWITNESS);
 			atomic_store_rel_int(&sio_inited, 2);
 		}
%%%

The current version uses lower level locking so witnessing the locking
is just unavailable.

Bruce