Suspected libkvm infinite loop

Tue Mar 10 21:59:19 UTC 2015

On Tue, Mar 10, 2015 at 02:10:09PM -0400, John Baldwin wrote:
> On Tuesday, March 10, 2015 10:17:07 AM Nick Frampton wrote:
> > Hi,
> > 
> > For the past several months, we have had an intermittent problem where a
> > process calling kvm_openfiles(3) or kvm_getprocs(3) (not sure which) gets
> > stuck in an infinite loop and goes to 100% cpu. We have just observed
> > "fstat -m" do the same thing and suspect it may be the same problem.
> > 
> > Our environment is a 10.1-RELEASE-p6 amd64 guest running in VirtualBox, with
> > ufs root and zfs /home.
> > 
> > Has anyone else experienced this? Is there anything we can do to investigate
> > the problem further?
> 
> Loops in programs using libkvm are often caused by the program trying to read 
> kernel data structures while they are changing.  However, if you use sysctls 
> to fetch this data instead, you should be able to get a stable snapshot of the 
> system state without getting stuck in a possible loop.  I believe for libkvm 
> to use sysctl instead of /dev/kmem you have to pass a NULL for the kernel and 
> "/dev/null" for the core image.  fstat -m should be doing that by default 
> however, so if it is not that, can you ktrace fstat when it is spinning to see 
> if it is spinning in userland or in the kernel?  If you see no activity via 
> ktrace, then it is spinning in one of the two places without making any system 
> calls, etc.  You can attach to it with gdb to pause it, then see where gdb 
> thinks it is.  If gdb hangs attaching to it, then it is stuck in the kernel.  
> 
> If gdb attaches to it ok, then it is spinning in userland.  Unfortunately, for 
> gdb to be useful, you really need debug symbols.  We don't currently provide 
> those for release binaries or binaries provided via freebsd-update (though 
> that is being worked on for 11.0).  If you build from source, then the 
> simplest way to get this is to add 'WITH_DEBUG_FILES=yes' to /etc/src.conf and 
> rebuild your world without NO_CLEAN.  If you are building from source and are 
> able to reproduce with those binaries, then after attaching to the process 
> with gdb, use 'bt' to see where it is hung and reply with that.
> 
> If it is hanging in the kernel, then you will need to use the kernel debugger 
> to see where it is hanging.  The simplest way to do this is probably to force 
> a crash via the debug.kdb.panic sysctl (set it to a non-zero value).  You will 
> then need to fire up kgdb on the crash dump after it reboots, switch to the 
> fstat process via the 'proc <pid>' command and get a backtrace via 'bt'.

It sounds like this issue might be the one fixed in r272566: if the
KERN_PROC_ALL sysctl is read with an insufficiently large buffer, an
sbuf error return value could bubble up and be treated as ERESTART,
resulting in a loop.

This can be confirmed with something like

  dtrace -n 'syscall:::entry /pid == $target/{@[probefunc] = count();} tick-3s {exit(0);}' -p <pid of looping proc>

If the output consists solely of __sysctl, this bug is likely the
culprit.
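
If you'd rather not eyeball the aggregation, here is a rough Python
sketch of the same check. It assumes the dtrace output has been saved
as text and that each aggregation row is a probe function name followed
by a count; the helper names syscall_counts and only_sysctl are made up
for illustration and are not part of any FreeBSD tool.

```python
def syscall_counts(dtrace_output):
    """Parse 'name count' rows from a DTrace count() aggregation."""
    counts = {}
    for line in dtrace_output.splitlines():
        fields = line.split()
        # Aggregation rows have exactly two fields: the key and an integer count.
        # Anything else (blank lines, dtrace status messages) is skipped.
        if len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = counts.get(fields[0], 0) + int(fields[1])
    return counts


def only_sysctl(dtrace_output):
    """True if at least one syscall was seen and every one was __sysctl."""
    counts = syscall_counts(dtrace_output)
    return bool(counts) and set(counts) == {"__sysctl"}
```

A True result from only_sysctl() only says the process did nothing but
call __sysctl during the sample window, which is consistent with this
bug but not proof of it.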

-Mark