Debugging newnfs

Fri Jun 20 14:58:43 UTC 2014

The server side is a set of vlans on a lagg of four igb interfaces.  The Xen
side is the same setup, with the VMs in question attached to two different
vlans.

Many different mounts, but the mount options all look like this:

nfsv3,tcp,resvport,hard,cto,lockd,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=4048762,timeout=120,retrans=2
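
To compare these options against the suggestions in the quoted reply below, here is a minimal Python sketch that splits the option string and flags the items worth a second look. The helper names (parse_mount_opts, review) and the 32K threshold argument are my own illustration, not part of any NFS tool; the option string is the one pasted above.

```python
# Mount option string copied verbatim from this message.
OPTS = (
    "nfsv3,tcp,resvport,hard,cto,lockd,sec=sys,acdirmin=3,acdirmax=60,"
    "acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,"
    "wsize=65536,readdirsize=65536,readahead=1,wcommitsize=4048762,"
    "timeout=120,retrans=2"
)


def parse_mount_opts(opts):
    """Split a comma-separated option string into a dict.

    Bare flags such as "tcp" or "hard" map to True (hypothetical helper).
    """
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def review(opts, io_limit=32768):
    """Return notes on options that the quoted reply suggests checking."""
    parsed = parse_mount_opts(opts)
    notes = []
    for key in ("rsize", "wsize"):
        if key in parsed and int(parsed[key]) > io_limit:
            notes.append(f"{key}={parsed[key]} exceeds {io_limit}")
    if "soft" in parsed:
        notes.append("soft mount in use")
    if "tcp" not in parsed:
        notes.append("not using tcp")
    return notes


if __name__ == "__main__":
    for note in review(OPTS):
        print(note)
```

For the string above this flags only rsize and wsize; the mounts are already "hard" and "tcp".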

The permissions do not change, but the same operation, when repeated,
succeeds or fails at random.

There aren't any clients concurrently accessing the same mount.
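
On the TSO point in the quoted reply below: my rough understanding of why 64K rsize/wsize matters is the arithmetic sketched here. It assumes 2048-byte mbuf clusters and three extra mbufs for headers; both numbers are my assumptions, chosen so that the 64K case lands on the 35 mbufs mentioned in the reply, and the function name is made up for illustration.

```python
import math

MCLBYTES = 2048    # assumed mbuf cluster size in bytes
HEADER_MBUFS = 3   # assumed extra mbufs for protocol headers (hypothetical)


def mbufs_for_io(io_size, cluster=MCLBYTES, headers=HEADER_MBUFS):
    """Estimate the length of the mbuf chain for one NFS read or write."""
    return math.ceil(io_size / cluster) + headers


if __name__ == "__main__":
    for size in (65536, 32768):
        print(size, mbufs_for_io(size))
```

Under those assumptions a 64K transfer needs 35 mbufs and a 32K transfer needs 19, which would explain why dropping rsize,wsize to 32K is a reasonable test.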

On Fri, Jun 20, 2014 at 9:16 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Daniel Mayfield wrote:
> > I have a very strange problem between an NFS server running FreeBSD
> > 10 w/ ZFS and a number of FreeBSD 10 VMs running on a XenServer 6.2
> > SP1 host.  The problem manifests as seemingly random permissions
> > issues and/or IO errors on the clients when the ZFS pool is busy.
> >  There are no entries in dmesg on either side, and no errors logged
> > in nfsstat either.  If I keep the traffic down, the errors subside,
> > but not completely.  Other than tcpdump, how can I go about
> > debugging this?
> >
> Well, you didn't mention what mount options you are using or what
> network interfaces that you are using, but here's a few things that
> might be worth looking at...
>
> The TSO max transmit segments issue:
> - Without going into all the details (there have been some recent
>   commits like r264630 to try and alleviate this), if a net device
>   driver cannot handle 35 mbufs in a transmit TSO segment, things
>   will get broken.
>   - Xen/netfront is a weird exception, which I think is ok so long
>     as lagg or a vlan isn't layered on top of it.
> --> If you can disable TSO on both server and clients, or reduce
>     rsize,wsize to 32K on all client mounts, and see if the problem
>     persists, that is probably the best way to check this. (Since
>     Xen/netfront is such a weird case, I am not 100% sure that doing
>     the above will fix this problem, if it is being used.)
>
> I also don't know if it is possible to have corrupted packets due to
> a hardware problem (bad memory or...) where the Xen/netfront world
> doesn't catch it.
>
> If you use the "soft" mount option, you could easily get this when
> the server is slow to respond. I'd strongly recommend using "tcp"
> and not "soft" for your mounts. ("nfsstat -m" on the client will
> show you what the actual mount options in use are. These can be
> somewhat different from what is specified on the command line, since
> servers limit rsize/wsize, as an example.)
>
> When you get a "permissions failure" case, check on the server to
> see if the permissions for the file appear correct on ZFS. If they
> are (or the problem disappears when you retry a command without
> changing permissions), you could have a caching issue. Other than
> capturing the packets and looking at them in wireshark (which knows
> NFS, unlike tcpdump) all you can do is try fiddling with the mount
> options related to caching and see if that helps. (Note that NFS
> does not have a cache coherency protocol, so if files are concurrently
> shared among multiple clients, all bets are off w.r.t. what the
> behaviour is. jhb@ is much better at this than I, since he seems
> to find lots of these weird cases at his workplace.)
>
> Good luck with it, rick
>
> > Dan
>