9.1-stable crashes while copying data from a NFS mounted directory

Fri Jan 25 02:39:43 UTC 2013

John Baldwin wrote:
> On Thursday, January 24, 2013 4:22:12 pm Konstantin Belousov wrote:
> > On Thu, Jan 24, 2013 at 09:50:52PM +0100, Christian Gusenbauer
> > wrote:
> > > On Thursday 24 January 2013 20:37:09 Konstantin Belousov wrote:
> > > > On Thu, Jan 24, 2013 at 07:50:49PM +0100, Christian Gusenbauer
> > > > wrote:
> > > > > On Thursday 24 January 2013 19:07:23 Konstantin Belousov
> > > > > wrote:
> > > > > > On Thu, Jan 24, 2013 at 08:03:59PM +0200, Konstantin
> > > > > > Belousov wrote:
> > > > > > > On Thu, Jan 24, 2013 at 06:05:57PM +0100, Christian
> > > > > > > Gusenbauer wrote:
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I'm using 9.1 stable svn revision 245605 and I get the
> > > > > > > > panic below
> > > > > > > > if I execute the following commands (as single user):
> > > > > > > >
> > > > > > > > # swapon -a
> > > > > > > > # dumpon /dev/ada0s3b
> > > > > > > > # mount -u /
> > > > > > > > # ifconfig age0 inet 192.168.2.2 mtu 6144 up
> > > > > > > > # mount -t nfs -o rsize=32768 data:/multimedia /mnt
> > > > > > > > # cp /mnt/Movies/test/a.m2ts /tmp
> > > > > > > >
> > > > > > > > then the system panics almost immediately. I'll attach
> > > > > > > > the stack
> > > > > > > > trace.
> > > > > > > >
> > > > > > > > Note, that I'm using jumbo frames (6144 byte) on a 1Gbit
> > > > > > > > network,
> > > > > > > > maybe that's the cause for the panic, because the bcopy
> > > > > > > > (see stack
> > > > > > > > frame #15) fails.
> > > > > > > >
> > > > > > > > Any clues?
> > > > > > >
> > > > > > > I tried a similar operation with the nfs mount of
> > > > > > > rsize=32768 and mtu
> > > > > > > 6144, but the machine runs HEAD and em instead of age. I
> > > > > > > was unable
> > > > > > > to reproduce the panic on the copy of the 5GB file from
> > > > > > > nfs mount.
> > > > >
> > > > > Hmmm, I did a quick test. If I do not change the MTU, so just
> > > > > configuring
> > > > > age0 with
> > > > >
> > > > > # ifconfig age0 inet 192.168.2.2 up
> > > > >
> > > > > then I can copy all files from the mounted directory without
> > > > > any
> > > > > problems, too. So it's probably age0 related?
> > > >
> > > > From your backtrace and the buffer printout, I see somewhat
> > > > strange thing.
> > > > The buffer data address is 0xffffff8171418000, while kernel
> > > > faulted
> > > > at the attempt to write at 0xffffff8171413000, which is is lower
> > > > then
> > > > the buffer data pointer, at the attempt to bcopy to the buffer.
> > > >
> > > > The other data suggests that there were no overflow of the data
> > > > from the
> > > > server response. So it might be that mbuf_len(mp) returned
> > > > negative number
> > > > ? I am not sure is it possible at all.
> > > >
> > > > Try this debugging patch, please. You need to add INVARIANTS etc
> > > > to the
> > > > kernel config.
> > > >
> > > > diff --git a/sys/fs/nfs/nfs_commonsubs.c
> > > > b/sys/fs/nfs/nfs_commonsubs.c
> > > > index efc0786..9a6bda5 100644
> > > > --- a/sys/fs/nfs/nfs_commonsubs.c
> > > > +++ b/sys/fs/nfs/nfs_commonsubs.c
> > > > @@ -218,6 +218,7 @@ nfsm_mbufuio(struct nfsrv_descript *nd,
> > > > struct uio
> > > > *uiop, int siz) }
> > > >  				mbufcp = NFSMTOD(mp, caddr_t);
> > > >  				len = mbuf_len(mp);
> > > > + KASSERT(len > 0, ("len %d", len));
> > > >  			}
> > > >  			xfer = (left > len) ? len : left;
> > > >  #ifdef notdef
> > > > @@ -239,6 +240,8 @@ nfsm_mbufuio(struct nfsrv_descript *nd,
> > > > struct uio
> > > > *uiop, int siz) uiop->uio_resid -= xfer;
> > > >  		}
> > > >  		if (uiop->uio_iov->iov_len <= siz) {
> > > > + KASSERT(uiop->uio_iovcnt > 1, ("uio_iovcnt %d",
> > > > + uiop->uio_iovcnt));
> > > >  			uiop->uio_iovcnt--;
> > > >  			uiop->uio_iov++;
> > > >  		} else {
> > > >
> > > > I thought that server have returned too long response, but it
> > > > seems to
> > > > be not the case from your data. Still, I think the patch below
> > > > might be
> > > > due.
> > > >
> > > > diff --git a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > b/sys/fs/nfsclient/nfs_clrpcops.c index be0476a..a89b907 100644
> > > > --- a/sys/fs/nfsclient/nfs_clrpcops.c
> > > > +++ b/sys/fs/nfsclient/nfs_clrpcops.c
> > > > @@ -1444,7 +1444,7 @@ nfsrpc_readrpc(vnode_t vp, struct uio
> > > > *uiop, struct
> > > > ucred *cred, NFSM_DISSECT(tl, u_int32_t *, NFSX_UNSIGNED);
> > > >  			eof = fxdr_unsigned(int, *tl);
> > > >  		}
> > > > - NFSM_STRSIZ(retlen, rsize);
> > > > + NFSM_STRSIZ(retlen, len);
> > > >  		error = nfsm_mbufuio(nd, uiop, retlen);
> > > >  		if (error)
> > > >  			goto nfsmout;
> > >
I think this patch is appropriate, although I don't see it as too
critical. It just tightens the "sanity check" on the read reply
length (which should never exceed what the client requested).

nfsm_mbufuio() shouldn't transfer more than the uio structure can
handle, even if the replied read size is larger than requested.

It does seem that nfsm_mbufuio() should apply a sanity check on
m_len. I think m_len == 0 is ok, but negative or very large should
be checked for. Maybe just return EBADRPC after a printf() instead
of a KASSERT(), as a safety belt against a trashed m_len from a driver
or ???

rick

> > > I applied your patches and now I get a
> > >
> > > panic: len -4
> > > cpuid = 1
> > > KDB: enter: panic
> > > Dumping 377 out of 6116
> > > MB:..5%..13%..22%..34%..43%..51%..64%..73%..81%..94%
> > >
> > This means that the age driver either produced corrupted mbuf chain,
> > or filled wrong negative value into the mbuf len field. I am quite
> > certain that the issue is in the driver.
> >
> > I added the net@ to Cc:, hopefully you could get help there.
> 
> And I've cc'd Pyun who has written most of this driver and is likely
> the one
> most familiar with its handling of jumbo frames.
> 
> --
> John Baldwin
> _______________________________________________
> freebsd-fs at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"