svn commit: r247116 - in head/sys: fs/nfs fs/nfsclient kern nfsclient sys tools

Mon Feb 25 10:36:31 UTC 2013

On Mon, 25 Feb 2013 10:50:19 +0200
Konstantin Belousov <kostikbel at gmail.com> wrote:

> On Mon, Feb 25, 2013 at 08:13:13PM +1300, Andrew Turner wrote:
> > On Thu, 21 Feb 2013 19:02:50 +0000 (UTC)
> > John Baldwin <jhb at FreeBSD.org> wrote:
> > 
> > > Author: jhb
> > > Date: Thu Feb 21 19:02:50 2013
> > > New Revision: 247116
> > > URL: http://svnweb.freebsd.org/changeset/base/247116
> > > 
> > > Log:
> > >   Further refine the handling of stop signals in the NFS client.
> > > The changes in r246417 were incomplete as they did not add
> > > explicit calls to sigdeferstop() around all the places that
> > > previously passed SBDRY to _sleep().  In addition,
> > > nfs_getcacheblk() could trigger a write RPC from getblk()
> > > resulting in sigdeferstop() recursing. Rather than manually
> > > deferring stop signals in specific places, change the VFS_*() and
> > > VOP_*() methods to defer stop signals for filesystems which
> > > request this behavior via a new VFCF_SBDRY flag. Note that this
> > > has to be a VFC flag rather than a MNTK flag so that it works
> > > properly with VFS_MOUNT() when the mount is not yet fully
> > > constructed.  For now, only the NFS clients set this new flag
> > > in VFS_SET(). A few other related changes:
> > >   - Add an assertion to ensure that TDF_SBDRY doesn't leak to
> > > userland.
> > >   - When a lookup request uses VOP_READLINK() to follow a symlink,
> > > mark the request as being on behalf of the thread performing the
> > > lookup (cnp_thread) rather than using a NULL thread pointer.  This
> > > causes NFS to properly handle signals during this VOP on an
> > > interruptible mount.
> > >   
> > >   PR:		kern/176179
> > >   Reported by:	Russell Cattelan (sigdeferstop() recursion)
> > >   Reviewed by:	kib
> > >   MFC after:	1 month
> > 
> > This change is causing init to crash for me on armv6. I'm
> > netbooting a PandaBoard and it appears init is receiving a SIGABRT
> > before it gets into main().
> > 
> > Do you have any idea where I could look to track down why it is
> > doing this?
> 
> It is weird. SIGABRT sent by the kernel usually means that execve(2)
> already destroyed the previous address space of the process, but the
> new image cannot be activated, most likely due to image format error
> discovered too late, or resource shortage.
> 
> It could be that some NFS RPC fails after the patch, but I cannot
> imagine why. You would need to track this down. Also, verify that the
> init binary is correct.
> 
> I tried amd64 netboot, and it worked fine.

It looks like this change is not the issue; it only changed the
symptom enough that I did not realise I was seeing an issue that
previously crashed the kernel. I reinstated this change but limited
the kernel to half the memory, and it booted correctly.

The real issue appears to be related to something in the vm layer not
working on ARM boards with too much memory (the threshold is somewhere
between 512MiB and 1GiB).

Andrew
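
For readers following the quoted commit log, here is a rough userspace
model of the idea it describes: a wrapper defers stop signals only for
filesystems that set a flag, and a nested call does not defer again or
undo the outer deferral. This is only a sketch and not the kernel code.
Every name prefixed with model_ or MODEL_ is hypothetical and stands in
for the real thread flag, filesystem flag and VFS/VOP wrappers.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_VFCF_SBDRY 0x1u	/* hypothetical per-filesystem flag */
#define MODEL_TDF_SBDRY  0x1u	/* hypothetical per-thread flag */

struct model_thread { unsigned flags; };
struct model_fs { unsigned vfc_flags; };

typedef int (*model_op_t)(struct model_thread *, struct model_fs *, int);

/* Returns true only if this call set the flag, so nesting is harmless. */
static bool
model_defer_stop(struct model_thread *td)
{
	if (td->flags & MODEL_TDF_SBDRY)
		return (false);
	td->flags |= MODEL_TDF_SBDRY;
	return (true);
}

/* Only the caller that set the flag is expected to clear it. */
static void
model_allow_stop(struct model_thread *td)
{
	assert(td->flags & MODEL_TDF_SBDRY);
	td->flags &= ~MODEL_TDF_SBDRY;
}

/* Wrapper: defer stops around op only if the filesystem asks for it. */
static int
model_call(struct model_thread *td, struct model_fs *fs, model_op_t op,
    int depth)
{
	bool owner;
	int rv;

	owner = (fs->vfc_flags & MODEL_VFCF_SBDRY) != 0 &&
	    model_defer_stop(td);
	rv = op(td, fs, depth);
	if (owner)
		model_allow_stop(td);
	return (rv);
}

/*
 * An operation that re-enters the wrapper, loosely like a block read
 * that triggers a write RPC.  Returns 1 if stops were deferred at
 * every nesting level, including after each inner call returned.
 */
static int
model_op(struct model_thread *td, struct model_fs *fs, int depth)
{
	int ok;

	ok = (td->flags & MODEL_TDF_SBDRY) != 0;
	if (depth > 0) {
		ok = model_call(td, fs, model_op, depth - 1) && ok;
		/* The inner call must not have cleared the outer deferral. */
		ok = ok && (td->flags & MODEL_TDF_SBDRY) != 0;
	}
	return (ok);
}
```

In this model only the outermost model_call() on a flagged filesystem
clears the thread flag, so the flag is never left set on return, which
is roughly what the TDF_SBDRY assertion in the log is meant to catch.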