Re: Patched gpsd and /dev/pps0 results in "sleeping thread" kernel panic

From: Ian Lepore <ian_at_freebsd.org>
Date: Wed, 01 Sep 2021 15:07:33 UTC
On Tue, 2021-08-31 at 22:35 -0600, Warner Losh wrote:
> On Tue, Aug 31, 2021 at 9:25 PM Warner Losh <imp@bsdimp.com> wrote:
> 
> > 
> > 
> > On Tue, Aug 31, 2021 at 9:13 PM Ian Lepore <ian@freebsd.org> wrote:
> > 
> > > On Tue, 2021-08-31 at 21:07 -0600, Warner Losh wrote:
> > > > On Tue, Aug 31, 2021 at 8:11 PM Craig Leres <leres@freebsd.org>
> > > > wrote:
> > > > 
> > > > > 
> > > > > On 8/31/21 6:54 PM, Warner Losh wrote:
> > > > > > Can you post a traceback? We recently tightened up the
> > > > > > sleeping
> > > > > conditions
> > > > > > and many of the panics are straight forward to fix
> > > > > 
> > > > > Is this sufficient?
> > > > > 
> > > > 
> > > > Maybe. I'm guessing that the pps driver isn't setting
> > > > driver_mtx for
> > > > the
> > > > pps_state that it keeps, so when we're sleeping in pps_ioctl
> > > > we're
> > > > not
> > > > dropping the the ppbus lock, which leads to this diagnostic.
> > > > 
> > > > Warner
> > > > 
> > > > 
> > > 
> > > 
> > > Exactly so, I've just been digging through the code.  It's not an
> > > easy
> > > fix because the ppbus pps driver doesn't have access to the ppbus
> > > mutex
> > > at the point where it's registering to be a pps driver.
> > > 
> > 
> > I just came to the conclusion that either we'd need to create a
> > call in
> > ppbus
> > to set this mutex, or to return its lock...
> > 
> > 
> > > Craig, is there any chance you can configure your hardware so
> > > that the
> > > PPS signal is connected to either the CTS or DCD pin of a uart
> > > (even
> > > the same uart that's communicating with the gps receiver should
> > > work if
> > > you use DCD)?  You'll get jitter and accuracy performance
> > > comparable to
> > > the parallel port pin, and avoid this mutex/sleeping problem.
> > > 
> > 
> > That's easier, but depending on the hardware might be harder :(
> > 
> 
> Either I'm missing something (likely am), or this might fix it up,
> or at least get away from the warning:
> 
> https://reviews.freebsd.org/D31763
> 
> Note: I can't recall why ppbus has to be locked for this call.
> This code dates from the very earliest days of locking and
> so may do things simply because it seemed like a good idea
> without a specific notion as to what that lock is protecting. If
> so, the real fix may be to not take the lock in pps_ioctl at
> all and maybe instead use a reference count (the most
> often reason for 'a good idea' was to keep the device
> from going away, though this is a parent lock, not a
> child one so I'm less sure about that being the reason).
> 
> Warner
> 
> 

Yeah, figuring out whether it's safe to drop/re-aquire the ppsbus lock
at this point was the harder part of the work; I'm not very familiar
with the ppsbus stuff in general.

Warner, I pointed out a small change needed to your proposed patch in
D31763.  In the past I've found that one can't propose a different
patch in Phab without commandeering the entire change, so I just posted
an inline comment with the couple extra lines of code needed, hoping
you have more time today than I do to prepare a new patch.

-- Ian


> > Warner
> > 
> > 
> > > -- Ian
> > > 
> > > 
> > > 
> > > > >                 Craig
> > > > > 
> > > > > toc2 4 # kgdb /boot/kernel/kernel /var/crash/vmcore.1
> > > > > GNU gdb (GDB) 10.2 [GDB v10.2 for FreeBSD]
> > > > > Copyright (C) 2021 Free Software Foundation, Inc.
> > > > > License GPLv3+: GNU GPL version 3 or later
> > > > > <http://gnu.org/licenses/gpl.html>
> > > > > This is free software: you are free to change and
> > > > > redistribute it.
> > > > > There is NO WARRANTY, to the extent permitted by law.
> > > > > Type "show copying" and "show warranty" for details.
> > > > > This GDB was configured as "x86_64-portbld-freebsd12.2".
> > > > > Type "show configuration" for configuration details.
> > > > > For bug reporting instructions, please see:
> > > > > <https://www.gnu.org/software/gdb/bugs/>.
> > > > > Find the GDB manual and other documentation resources online
> > > > > at:
> > > > >      <http://www.gnu.org/software/gdb/documentation/>.
> > > > > 
> > > > > For help, type "help".
> > > > > Type "apropos word" to search for commands related to
> > > > > "word"...
> > > > > Reading symbols from /boot/kernel/kernel...
> > > > > Reading symbols from /boot/kernel.LBLNET/kernel.debug...
> > > > > 
> > > > > Unread portion of the kernel message buffer:
> > > > > Sleeping thread (tid 101408, pid 2046) owns a non-sleepable
> > > > > lock
> > > > > KDB: stack backtrace of thread 101408:
> > > > > sched_switch() at sched_switch+0x630/frame 0xfffffe0070e4f760
> > > > > mi_switch() at mi_switch+0xd4/frame 0xfffffe0070e4f790
> > > > > sleepq_catch_signals() at sleepq_catch_signals+0x403/frame
> > > > > 0xfffffe0070e4f7e0
> > > > > sleepq_timedwait_sig() at sleepq_timedwait_sig+0x14/frame
> > > > > 0xfffffe0070e4f820
> > > > > _sleep() at _sleep+0x1b3/frame 0xfffffe0070e4f8a0
> > > > > pps_ioctl() at pps_ioctl+0x298/frame 0xfffffe0070e4f8f0
> > > > > ppsioctl() at ppsioctl+0x48/frame 0xfffffe0070e4f920
> > > > > devfs_ioctl() at devfs_ioctl+0xb0/frame 0xfffffe0070e4f970
> > > > > VOP_IOCTL_APV() at VOP_IOCTL_APV+0x7b/frame
> > > > > 0xfffffe0070e4f9a0
> > > > > vn_ioctl() at vn_ioctl+0x16a/frame 0xfffffe0070e4fab0
> > > > > devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame
> > > > > 0xfffffe0070e4fad0
> > > > > kern_ioctl() at kern_ioctl+0x2b7/frame 0xfffffe0070e4fb30
> > > > > sys_ioctl() at sys_ioctl+0xfa/frame 0xfffffe0070e4fc00
> > > > > amd64_syscall() at amd64_syscall+0x387/frame
> > > > > 0xfffffe0070e4fd30
> > > > > fast_syscall_common() at fast_syscall_common+0xf8/frame
> > > > > 0xfffffe0070e4fd30
> > > > > --- syscall (54, FreeBSD ELF64, sys_ioctl), rip =
> > > > > 0x8004c899a, rsp
> > > > > =
> > > > > 0x7fffdfdfc6a8, rbp = 0x7fffdfdfc730 ---
> > > > > panic: sleeping thread
> > > > > cpuid = 8
> > > > > time = 1630456142
> > > > > KDB: stack backtrace:
> > > > > db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
> > > > > 0xfffffe005ab73ab0
> > > > > vpanic() at vpanic+0x17b/frame 0xfffffe005ab73b00
> > > > > panic() at panic+0x43/frame 0xfffffe005ab73b60
> > > > > propagate_priority() at propagate_priority+0x282/frame
> > > > > 0xfffffe005ab73b90
> > > > > turnstile_wait() at turnstile_wait+0x30c/frame
> > > > > 0xfffffe005ab73be0
> > > > > __mtx_lock_sleep() at __mtx_lock_sleep+0x199/frame
> > > > > 0xfffffe005ab73c70
> > > > > ppcintr() at ppcintr+0x2a0/frame 0xfffffe005ab73c90
> > > > > ithread_loop() at ithread_loop+0x23c/frame 0xfffffe005ab73cf0
> > > > > fork_exit() at fork_exit+0x7e/frame 0xfffffe005ab73d30
> > > > > fork_trampoline() at fork_trampoline+0xe/frame
> > > > > 0xfffffe005ab73d30
> > > > > --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> > > > > Uptime: 17m20s
> > > > > Dumping 607 out of 12240
> > > > > MB:..3%..11%..22%..32%..43%..51%..61%..72%..82%..93%
> > > > > 
> > > > > __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
> > > > > 55      /usr/src/sys/amd64/include/pcpu_aux.h: No such file
> > > > > or
> > > > > directory.
> > > > > (kgdb) bt
> > > > > #0  __curthread () at
> > > > > /usr/src/sys/amd64/include/pcpu_aux.h:55
> > > > > #1  doadump (textdump=1) at ../../../kern/kern_shutdown.c:371
> > > > > #2  0xffffffff80b83ada in kern_reboot (howto=260)
> > > > >      at ../../../kern/kern_shutdown.c:451
> > > > > #3  0xffffffff80b83f33 in vpanic (fmt=<optimized out>,
> > > > > ap=<optimized out>)
> > > > >      at ../../../kern/kern_shutdown.c:880
> > > > > #4  0xffffffff80b83d53 in panic (fmt=<unavailable>)
> > > > >      at ../../../kern/kern_shutdown.c:807
> > > > > #5  0xffffffff80be7152 in propagate_priority
> > > > > (td=0xfffff80196959740)
> > > > >      at ../../../kern/subr_turnstile.c:228
> > > > > #6  0xffffffff80be7d1c in turnstile_wait
> > > > > (ts=0xfffff8000357a780,
> > > > >      owner=<optimized out>, queue=0) at
> > > > > ../../../kern/subr_turnstile.c:785
> > > > > #7  0xffffffff80b62ca9 in __mtx_lock_sleep
> > > > > (c=0xfffff80003969bd0,
> > > > >      v=<optimized out>) at ../../../kern/kern_mutex.c:654
> > > > > #8  0xffffffff8086fcc0 in ppcintr (arg=0xfffff80003969b00)
> > > > >      at ../../../dev/ppc/ppc.c:1546
> > > > > #9  0xffffffff80b4637c in intr_event_execute_handlers
> > > > > (p=<optimized
> > > > > out>,
> > > > >      ie=0xfffff800030d9d00) at ../../../kern/kern_intr.c:1143
> > > > > #10 ithread_execute_handlers (p=<optimized out>,
> > > > > ie=0xfffff800030d9d00)
> > > > >      at ../../../kern/kern_intr.c:1156
> > > > > #11 ithread_loop (arg=0xfffff800039aa7e0) at
> > > > > ../../../kern/kern_intr.c:1236
> > > > > #12 0xffffffff80b42e1e in fork_exit (
> > > > >      callout=0xffffffff80b46140 <ithread_loop>,
> > > > > arg=0xfffff800039aa7e0,
> > > > >      frame=0xfffffe005ab73d40) at
> > > > > ../../../kern/kern_fork.c:1080
> > > > > #13 <signal handler called>
> > > > > (kgdb)
> > > > > 
> > > 
> > > 
> > >