FreeBSD 10.1 can't "make -j5 buildworld" over NFS?

Rick Macklem rmacklem at uoguelph.ca
Thu Apr 16 21:47:19 UTC 2015


J David wrote:
> On Wed, Apr 15, 2015 at 10:18 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> > Well, the NFS client is almost identical in the two systems. (A couple
> > of NFSv4 specific changes and a removal of a redundant check for creation
> > of a hard link across mount points are the only ones I can see.)
> >
> > As such, I'd suspect userland differences. There is a different "make"
> > in 10 (which I don't think is in 9.3?), so this would be a good starting
> > point.
> 
> That may be, but this problem only occurs over NFS.  It does not
> happen with local UFS or ZFS.  So perhaps the new make is exercising
> the NFS client differently than the old one, revealing the problem.
> 
> > Btw, "stale NFS file handle" means that the file has been deleted on the
> > server.
> 
> Yes it does.  And the make always dies during cleandir, during which
> things are being aggressively deleted.
> 
> It does seem like that's the *only* stage that has problems.  I.e. if
> "make cleanworld" is run before "make -j5 buildworld" then the
> parallel build will succeed.  Hopefully that means it will be
> relatively easy to narrow down / reproduce the problem behavior.
> 
> However, in my experience, stale NFS file handles usually occur when
> one client deletes things out from under another client (and/or after
> a server reboot, which is not the case here).  In this case, this is
> the only client that can even mount the relevant partition as
> read-write, much less writing to it.  It's like the 10.1 client is
> caching that stuff exists even after it removes it, leading to errors
> from the server when it tries to access them again.  It's pretty
> unusual (again, in my experience) for a single client to trip over
> *itself* when deleting things.
> 
> Thanks!
> 
First, I will point out that the NFS protocol is not POSIX compliant and,
as such, there will always be cases where applications that work on POSIX
compliant file systems don't work on NFS.

When the NFS Remove RPC is done, the file is removed on the server. (NFS does
not know whether the file is open and does not maintain POSIX opens on files.)
A "trick" used by the NFS client to approximate POSIX is called "silly rename".
When the client sees that a file is still open on the machine, an unlink(2)
becomes "rename the file to .nfsXXX and then do a Remove RPC on it when the
open count goes to 0". This normally avoids "stale NFS file handle" within a
single client. This "trick" is not "race free" when done between multiple
clients, but for a single client I am not aware of a problem with it.
The FreeBSD client does do silly rename, so I doubt this is the problem.
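
A minimal sketch of the POSIX behaviour that silly rename tries to preserve,
assuming a POSIX system with Python available. The helper name
read_after_unlink is hypothetical and only for illustration. It unlinks a file
while a descriptor is still open, reads through that descriptor, and lists any
".nfs*" entries in the directory. On a local file system the list would be
expected to be empty; on an NFS mount a silly-renamed entry may show up while
the file is still open.

```python
import os
import tempfile


def read_after_unlink(directory):
    """Unlink a file while it is still open, then read it via the descriptor."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"still here")
        os.unlink(path)  # the name goes away; the open descriptor stays valid
        gone = not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)
        # Assumption: an NFS client doing silly rename leaves a ".nfs*" entry
        # in the directory until the last open reference is closed.
        leftovers = [n for n in os.listdir(directory) if n.startswith(".nfs")]
    finally:
        os.close(fd)
    return gone, data, leftovers
```

Running this against a directory on the NFS mount and against a local
directory is one way to check whether the client is doing the rename at all.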

It may be something as simple as make expecting ENOENT for a remove and
instead getting ESTALE from NFS when the file has already been deleted.
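
To illustrate the difference, here is a sketch of a remove that treats ESTALE
the same way as ENOENT. This is not what make does; the helper name
remove_if_present is hypothetical, and whether ignoring ESTALE is safe depends
on the caller.

```python
import errno
import os


def remove_if_present(path):
    """Remove path; return False if it was already gone, True if removed."""
    try:
        os.unlink(path)
        return True
    except OSError as e:
        # ENOENT: the name does not exist.
        # ESTALE: the file handle is stale, i.e. already deleted on the server.
        if e.errno in (errno.ENOENT, errno.ESTALE):
            return False
        raise  # anything else is a real error
```

A tool that only checks for ENOENT would take the "real error" path when the
NFS client returns ESTALE for a file that no longer exists.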

rick



More information about the freebsd-fs mailing list