Help debugging stable/10

Tue Dec 16 23:27:03 UTC 2014

On 17/12/2014 02:59, John Baldwin wrote:
> On Wednesday, December 10, 2014 8:38:21 pm Shane Ambler wrote:
>> Since upgrading to 10.1 (RC2) I have had trouble getting uptimes greater
>> than 1 day. I have little experience debugging the OS so could use some
>> help.
>>
>> # uname -a
>> FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275364: Tue Dec
>> 2 08:13:06 ACDT 2014
>> root at leader.local:/usr/obj/usr/src-stable/sys/GENERIC amd64
>>
>> This is on an ASUS P8H61-M LE/USB3 corei5 8GB with 3x 2TB Seagate drives
>> in raidz.
>> Full backtraces and dmesg at http://shaneware.biz/freebsddebugdata/
>>
>> The thing that breaks which forces me to reset the machine is that I am
>> unable to start new processes. Existing processes continue to work I
>> just can't start new ones. Some simple commands do work but top ps
>> procstat usbconfig all fail to start. I have been able use script to
>> get some backtraces from kgdb before restarting.
>
> It looks like your processes are hanging on a global lock used by sysctl(3).
> That would explain hangs in top/ps/procstat as they all use sysctls.  For
> example:
>
> Thread 709 (Thread 101172):
> #0  sched_switch (td=0xfffff8006cccd490, newtd=<value optimized out>, flags=<value optimized out>)
>      at /usr/src-stable/sys/kern/sched_ule.c:1945
> #1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
> #2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
> #3  0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff8157adc0, tid=18446735279441892496,
>      opts=<value optimized out>, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681
> #4  0xffffffff80936004 in userland_sysctl (td=<value optimized out>, name=0xfffffe021e889930, namelen=3,
>      old=<value optimized out>, oldlenp=<value optimized out>, inkernel=<value optimized out>,
>      new=<value optimized out>, newlen=<value optimized out>, retval=<value optimized out>, flags=0) at sx.h:152
> #5  0xffffffff80935e64 in sys___sysctl (td=0xfffff8006cccd490, uap=0xfffffe021e889a40)
>      at /usr/src-stable/sys/kern/kern_sysctl.c:1536
> #6  0xffffffff80d29c51 in amd64_syscall (td=0xfffff8006cccd490, traced=0) at subr_syscall.c:134
> #7  0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
> #8  0x0000000800b6e64a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> ---Type <return> to continue, or q <return> to quit---
>
> To hang like this means that some sysctl handler is running for a long time.
>
> Looking in your traces, this appears to be the problematic sysctl:
>
> Thread 397 (Thread 100422):
> #0  sched_switch (td=0xfffff8001049f920, newtd=<value optimized out>, flags=<value optimized out>)
>      at /usr/src-stable/sys/kern/sched_ule.c:1945
> #1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
> #2  0xffffffff8097189a in sleepq_timedwait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:652
> #3  0xffffffff8093320e in _sleep (ident=<value optimized out>, lock=<value optimized out>,
>      priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>,
>      flags=<value optimized out>) at /usr/src-stable/sys/kern/kern_synch.c:251
> #4  0xffffffff80891ae3 in g_waitfor_event (func=<value optimized out>, arg=<value optimized out>,
>      flag=<value optimized out>) at /usr/src-stable/sys/geom/geom_event.c:427
> #5  0xffffffff808933d9 in sysctl_kern_geom_conftxt (oidp=<value optimized out>, arg1=<value optimized out>,
>      arg2=<value optimized out>, req=0xfffffe021dc13868) at /usr/src-stable/sys/geom/geom_kern.c:169
> #6  0xffffffff80935a9e in sysctl_root (arg1=<value optimized out>, arg2=<value optimized out>)
>      at /usr/src-stable/sys/kern/kern_sysctl.c:1500
> #7  0xffffffff80936078 in userland_sysctl (td=<value optimized out>, name=0xfffffe021dc13930,
>      namelen=<value optimized out>, old=<value optimized out>, oldlenp=<value optimized out>,
>      inkernel=<value optimized out>, new=<value optimized out>, newlen=<value optimized out>,
>      retval=<value optimized out>, flags=0) at /usr/src-stable/sys/kern/kern_sysctl.c:1610
> #8  0xffffffff80935e64 in sys___sysctl (td=0xfffff8001049f920, uap=0xfffffe021dc13a40)
>      at /usr/src-stable/sys/kern/kern_sysctl.c:1536
> #9  0xffffffff80d29c51 in amd64_syscall (td=0xfffff8001049f920, traced=0) at subr_syscall.c:134
> #10 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
> #11 0x00000008019c864a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
>
> (Note that HEAD has some changes by mjg@ (cc'd) to allow concurrent sysctls
> that would at least make this less painful.)
>
> The root issue is why is geom hanging.  Hmm, that thread is also stuck on an
> sx lock:
>
> Thread 259 (Thread 100015):
> #0  sched_switch (td=0xfffff80005170000, newtd=<value optimized out>, flags=<value optimized out>)
>      at /usr/src-stable/sys/kern/sched_ule.c:1945
> #1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
> #2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
> #3  0xffffffff8093284a in _sx_xlock_hard (sx=0xffffffff816451e0, tid=18446735277701922816,
>      opts=<value optimized out>, file=0x0, line=0) at /usr/src-stable/sys/kern/kern_sx.c:681
> #4  0xffffffff808912b2 in g_run_events () at sx.h:152
> #5  0xffffffff808fad3a in fork_exit (callout=0xffffffff80893240 <g_event_procbody>, arg=0x0, frame=0xfffffe01ef98aac0)
>      at /usr/src-stable/sys/kern/kern_fork.c:996
> #6  0xffffffff80d0ef3e in fork_trampoline () at /usr/src-stable/sys/amd64/amd64/exception.S:606
> #7  0x0000000000000000 in ?? ()
>
> That is probably g_topology_lock.  If so, it is held by this thread
> (g_dev_open() grabs it around g_access()):
>
> Thread 673 (Thread 101936):
> #0  sched_switch (td=0xfffff80162200490, newtd=<value optimized out>, flags=<value optimized out>)
>      at /usr/src-stable/sys/kern/sched_ule.c:1945
> #1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
> #2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
> #3  0xffffffff80933227 in _sleep (ident=<value optimized out>, lock=<value optimized out>,
>      priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>,
>      flags=<value optimized out>) at /usr/src-stable/sys/kern/kern_synch.c:255
> #4  0xffffffff8030b9b3 in daopen (dp=<value optimized out>) at /usr/src-stable/sys/cam/scsi/scsi_da.c:1295
> #5  0xffffffff8089013b in g_disk_access (pp=0xfffff800097f4c00, r=<value optimized out>, w=<value optimized out>,
>      e=<value optimized out>) at /usr/src-stable/sys/geom/geom_disk.c:136
> #6  0xffffffff80894f3e in g_access (cp=0xfffff800097f2e00, dcr=1, dcw=0, dce=0)
>      at /usr/src-stable/sys/geom/geom_subr.c:913
> #7  0xffffffff8088df12 in g_dev_open (dev=<value optimized out>, flags=<value optimized out>,
>      fmt=<value optimized out>, td=<value optimized out>) at /usr/src-stable/sys/geom/geom_dev.c:361
> #8  0xffffffff808180f2 in devfs_open (ap=<value optimized out>) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:1086
> ---Type <return> to continue, or q <return> to quit---
> #9  0xffffffff80e44fd1 in VOP_OPEN_APV (vop=<value optimized out>, a=<value optimized out>) at vnode_if.c:469
> #10 0xffffffff809d9bb4 in vn_open_vnode (vp=0xfffff8017d14d000, fmode=1, cred=0xfffff8000c2e4b00,
>      td=0xfffff80162200490, fp=0xfffff80081060230) at vnode_if.h:196
> #11 0xffffffff809d97ac in vn_open_cred (ndp=0xfffffe021e97b880, flagp=0xfffffe021e97b95c,
>      cmode=<value optimized out>, vn_open_flags=<value optimized out>, cred=0x0, fp=0xfffff80081060230)
>      at /usr/src-stable/sys/kern/vfs_vnops.c:256
> #12 0xffffffff809d2def in kern_openat (td=0xfffff80162200490, fd=-100,
>      path=0x7fffffffe99e <Error reading address 0x7fffffffe99e: Bad address>, pathseg=UIO_USERSPACE, flags=1,
>      mode=<value optimized out>) at /usr/src-stable/sys/kern/vfs_syscalls.c:1091
> #13 0xffffffff80d29c51 in amd64_syscall (td=0xfffff80162200490, traced=0) at subr_syscall.c:134
> #14 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
> #15 0x0000000800f8dcea in ?? ()
> Previous frame inner to this frame (corrupt stack?)
>
> Line 1295 of scsi_da.c in a stable/10 checkout of today is this:
>
>          /* Wait for the disk size update.  */
>          error = cam_periph_sleep(periph, &softc->disk->d_mediasize, PRIBIO,
>              "dareprobe", 0);
>
> To be stuck here means that the request dareprobe() queued is stuck behind
> some other request on the device.  Looking in your traces, there are several
> threads all waiting for an ioctl() to a passX device to complete:
>
> Thread 432 (Thread 100542):
> #0  sched_switch (td=0xfffff801acb8a920, newtd=<value optimized out>, flags=<value optimized out>)
>      at /usr/src-stable/sys/kern/sched_ule.c:1945
> #1  0xffffffff80933801 in mi_switch (flags=260, newtd=0x0) at /usr/src-stable/sys/kern/kern_synch.c:493
> #2  0xffffffff8097110a in sleepq_wait (wchan=0x0, pri=0) at /usr/src-stable/sys/kern/subr_sleepqueue.c:617
> #3  0xffffffff80933227 in _sleep (ident=<value optimized out>, lock=<value optimized out>,
>      priority=<value optimized out>, wmesg=<value optimized out>, sbt=<value optimized out>, pr=<value optimized out>,
>      flags=<value optimized out>) at /usr/src-stable/sys/kern/kern_synch.c:255
> ---Type <return> to continue, or q <return> to quit---
> #4  0xffffffff802ddc0e in cam_periph_runccb (ccb=0xfffff80187888000, error_routine=0xffffffff8030d2a0 <passerror>,
>      camflags=CAM_RETRY_SELTO, sense_flags=18, ds=0xfffff8000984fa20) at /usr/src-stable/sys/cam/cam_periph.c:969
> #5  0xffffffff8030d233 in passdoioctl (dev=<value optimized out>, cmd=<value optimized out>,
>      addr=0xfffff800b50f4800 "", flag=<value optimized out>, td=<value optimized out>)
>      at /usr/src-stable/sys/cam/scsi/scsi_pass.c:680
> #6  0xffffffff8030cf32 in passioctl (dev=0xfffff80009835e00, cmd=3302496258, addr=0xfffff800b50f4800 "", flag=3,
>      td=0xfffff801acb8a920) at /usr/src-stable/sys/cam/scsi/scsi_pass.c:533
> #7  0xffffffff80819a99 in devfs_ioctl_f (fp=0xfffff800b5d6f320, com=3302496258, data=0xfffff800b50f4800,
>      cred=<value optimized out>, td=0xfffff801acb8a920) at /usr/src-stable/sys/fs/devfs/devfs_vnops.c:759
> #8  0xffffffff8097d865 in kern_ioctl (td=0xfffff801acb8a920, fd=<value optimized out>, com=0) at file.h:320
> #9  0xffffffff8097d560 in sys_ioctl (td=0xfffff801acb8a920, uap=0xfffffe021de9aa40)
>      at /usr/src-stable/sys/kern/sys_generic.c:718
> #10 0xffffffff80d29c51 in amd64_syscall (td=0xfffff801acb8a920, traced=0) at subr_syscall.c:134
> #11 0xffffffff80d0eceb in Xfast_syscall () at /usr/src-stable/sys/amd64/amd64/exception.S:391
> #12 0x0000000800ff3f7a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
>
> I suspect the daopen() is hanging on the disk associated with the given
> passX device.
>
> Are you running smartd or anything else that accesses passX devices?
>
> Also, do you have a coredump from this or were you just running kgdb against
> the live system?  If you have a coredump you can run ps against it (or use
> some kgdb macros I have) to see the command name of the process doing the
> passX ioctls.  We could also see which passX devices were blocking to see if
> this is related to either AHCI or USB.  Also, do you possibly have any errors
> in your dmesg from the disk devices?
>

Thanks for looking at this.

I updated a few days ago -
FreeBSD leader.local 10.1-STABLE FreeBSD 10.1-STABLE #0 r275731: Sat
Dec 13 08:36:29 ACDT 2014 
root at leader.local:/usr/obj/usr/src/sys/GENERIC  amd64

I haven't got any core dumps as I only get to the stage of no process
creation (yesterday I had it freeze completely) leaving no choice but
to reset. I leave a root terminal open so that I can run kgdb before
resetting.

I have sysutils/smartmontools installed but I'm not running smartd,
I just have the smart status in daily logs.

I'll vote for USB - I have two USB memsticks that I use almost daily -
previously (since 10.1RC2) I have had trouble with them. Failure to get
the dev entries created, followed with no console messages of device
insert/remove. I had noticed this before being unable to create
processes so wasn't sure which came first. usbconfig was sometimes the
first process I noticed failure to run. I haven't seen this for a
while, so not in stable, maybe in release but definitely in RC 2-4

I think it was in RC3 I got errors of -
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
a couple of times with the USB memstick - again not recently.

Before release I had an issue with constant zfs writes - I haven't been
able to re-create a lockup since RC4 so don't know if that is related.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=194654

I'm not seeing any device errors in dmesg, log/messages or log/console.log

Someone had mentioned lockups with latest nvidia driver that going back
to the previous had fixed, I am getting some x related crashes so was
going to look at that.

Dec 17 03:21:16 leader kernel: pid 62431 (xcmmtst_big), uid 0: exited on 
signal 10 (core dumped)
Dec 17 03:22:11 leader kernel: pid 69694 (xprobe_gas_x8632), uid 0: 
exited on signal 11 (core dumped)
Dec 17 03:22:11 leader kernel: pid 69779 (xprobe_3dnow), uid 0: exited 
on signal 4 (core dumped)
Dec 17 03:22:42 leader kernel: pid 78410 (xprobe_gas_x8632), uid 0: 
exited on signal 11 (core dumped)
Dec 17 03:22:43 leader kernel: pid 78554 (xprobe_3dnow), uid 0: exited 
on signal 4 (core dumped)
Dec 17 03:46:21 leader kernel: pid 40770 (xdfc), uid 0: exited on signal 
10 (core dumped)
Dec 17 03:46:26 leader kernel: pid 41000 (xdmmtst), uid 0: exited on 
signal 10 (core dumped)
Dec 17 03:50:55 leader kernel: pid 52533 (xdfc), uid 0: exited on signal 
10 (core dumped)
Dec 17 03:51:30 leader kernel: pid 54062 (xdfc), uid 0: exited on signal 
11 (core dumped)

I could run a build of kernel and world with debugging or (is it 
witness?) for a few days if that would help get more info.

-- 
FreeBSD - the place to B...Software Developing

Shane Ambler