FreeBSD 9.1 NFSv4 client attribute cache not caching ?

Bruce Evans brde at optusnet.com.au
Mon Apr 15 10:28:40 UTC 2013


On Sun, 14 Apr 2013, Rick Macklem wrote:

> Paul van der Zwan wrote:
>> On 14 Apr 2013, at 5:00 , Rick Macklem <rmacklem at uoguelph.ca> wrote:
>>
>> Thanks for taking the effort to send such an extensive reply.
>>
>>> Paul van der Zwan wrote:
>>>> On 12 Apr 2013, at 16:28 , Paul van der Zwan <paulz at vanderzwan.org>
>>>> wrote:
> ...
>>> In NFSv3, each RPC is defined and usually includes attributes for
>>> files
>>> before and after the operation (implicit getattrs not counted in the
>>> RPC
>>> counts reported by nfsstat).
>>>
>>> For NFSv4, every RPC is a compound built up of a list of Operations
>>> like
>>> Getattr. Since the NFSv4 server doesn't know what the compound is
>>> doing,
>>> nfsstat reports the counts of Operations for the NFSv4 server, so
>>> the counts
>>> will be much higher than with NFSv3, but do not reflect the number
>>> of RPCs being done.
>>> To get NFSv4 nfsstat output that can be compared to NFSv3, you need
>>> to
>>> do the command on the client(s) and it still is only roughly the
>>> same.
>>> (I just realized this should be documented in man nfsstat.)
>>>
>> I ran nfsstat -s -v 4 on the server and saw the number of requests
>> being done.
>> They were in the order of a few thousand per second for a single
>> FreeBSD 9.1 client
>> doing a make buildworld.
>>
> Yes, but as I noted above, for NFSv4, these are counts of operations,
> not RPCs. Each RPC in NFSv4 consists of several operations. For example,
> for read it is something like:
> - PutFH, Read, Getattr
>
> As such, you need to do "nfsstat -e -c" on the client in order to
> see how many RPCs are happening.

Does it show the number of physical RPCs, or only "roughly the same"?
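
For reference, the pair of commands being compared, as I understand the
flags (-e selects the counters of the new NFS code, -c the client side,
-s the server side):

    # on the client: per-RPC counts
    nfsstat -e -c
    # on the server: per-operation counts; each NFSv4 RPC is a compound of
    # operations (e.g. PutFH + Read + Getattr), so these run much higher
    nfsstat -e -s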

>>> For the FreeBSD NFSv4 client, the compounds include Getattr
>>> operations
>>> similar to what NFSv3 does. It doesn't do a Getattr on the directory
>>> for Lookup, because that would have made the compound much more
>>> complex.
>>> I don't think this will have a significant performance impact, but
>>> will
>>> result in some additional Getattr RPCs.
>>>
>> I ran snoop on port 2049 on the server and I saw a large number of
>> lookups.
>> A lot of them seem to be for directories which are part of the
>> filenames of
>> the compiler and include files which are on the nfs-mounted /usr/obj.
>> The same names keep reappearing so it looks like there is no caching
>> being done on
>> the client.

When I worked on this in ~2007, unnecessary RPCs for lookup were a
large cause of slowness.  This was fixed, at least in nfsv3.  Almost
all RPCs for makeworld (closer to 99% than 90%) should now be for opens
of the excessively layered and polluted include files, since they are
opened so often compared with other files and every open goes to the
server ("nocto" should fix this).  There are lots of lookups for the
include files too, but the lookups are properly cached.

>> I tried the nocto option in /etc/fstab but it does not show up when mount
>> lists
>> the mounted filesystems, so I am not sure if it is being used.
> Head (and I think stable9) is patched so that "nfsstat -m" shows
> all the options actually being used. For 9.1, you just have to trust
> that it has been set.

This doesn't work on ref10-amd64 running 10.0-CURRENT from Apr 5.  nfsstat -m
gives null output.  Plain nfsstat confirms that there are some nfs mounts,
with so much activity on them that many of the cache counts are negative
after 9 days of uptime.
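
For anyone wanting to try "nocto", a hypothetical fstab entry (server name
and paths are placeholders), plus the check on a system with the patched
nfsstat:

    # NFSv4 mount of /usr/obj with close-to-open consistency disabled
    server:/usr/obj  /usr/obj  nfs  rw,nfsv4,nocto  0  0
    # on a patched head/stable9 this should echo back the options actually
    # in effect, including nocto
    nfsstat -m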

> ...
>> I tried a make buildworld buildkernel with /usr/obj on a local FS in the
>> Vbox VM;
>> that completed in about 2 hours. With /usr/obj on an NFS v4 filesystem
>> it takes
>> about a day. A twelvefold increase in elapsed time makes using NFSv4
>> unusable
>> for this use case.

That is extremely slow.  Here I am unhappy with the makeworld time over
nfs staying at about 13 minutes despite attempts to improve this, but I
only have old slow hardware (a 2-core 2GHz Turion laptop).  I also have
a modified FreeBSD-5, which avoids some of the bloat in -current.  My best
time without excessive tuning was:

@ --------------------------------------------------------------
@ >>> make world completed on Fri Nov  2 23:35:11 EST 2007
@                    (started Fri Nov  2 23:21:27 EST 2007)
@ --------------------------------------------------------------
@       823.53 real      1295.80 user       192.46 sys
@ 
@  Lookup  Read Access Fsstat Other   Total
@  127134 23214 624060  24764    99  799271

The kernel was current at the time, but userland was ~5.2.  Newer
kernels (1-2 years old) are only a bit slower and don't require any
modifications to get similar RPC counts (with Getattr instead of Access).
/usr including /usr/bin and /usr/src was on nfs, but /bin and /usr/obj
were local.  Everything fits in RAM caches so there was no disk activity
except for new reads and new writes.  Network latency was tuned to 60
usec (min for ping).
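
The 60 usec figure is just the minimum RTT reported by ping, e.g. (the
hostname is a placeholder):

    ping -q -c 100 nfs-server
    # use the "min" field of the round-trip min/avg/max summary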

When nfs was pessimized, the above RPC counts blew out to no more than 2
million.  Suppose you have 2 million RPCs with a latency of just 65 usec.
That gives a total latency of 130 seconds.  Not too bad, but significant
compared with the 823 seconds of real time.  The latency is amortized by
having more than 1 CPU and/or building concurrently.  Then progress can
usually be made in some threads while others are blocked waiting for the
RPCs.  However, many networks have latencies much larger than 65 usec.  On
the freebsd cluster now, the min latency is about 250 usec, and since it
has multiple users the latency is sometimes over 1 msec.  2 million RPCs
with a latency of 1 msec take 2000 seconds, which is a lot compared with a
build time of 823 seconds.
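
Spelling out the arithmetic for the three latencies mentioned (the 250 usec
row is just the same calculation at the cluster's current minimum):

    2,000,000 RPCs *  65 usec =  130 seconds
    2,000,000 RPCs * 250 usec =  500 seconds
    2,000,000 RPCs *   1 msec = 2000 seconds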

I consider "nocto" as excessive tuning, since although it would help
makeworld benchmarks it is unsafe in general.  Of course I tried my
version of it in the above.  (The above RPC counts are with the following
critical modifications that weren't in FreeBSD at the time:
- negative caching
- fix for broken dotdot caching
- fix for broken "cto".  It did twice as many RPCs as needed.)
Adding the equivalent of "nocto" reduced the RPC counts significantly,
but only reduced the real time by about 20 (?) seconds.

> Source builds on NFS mounts are notoriously slow. A big part of this is

Only when misconfigured.  The nfs build time in the above is between 5%
and 10% slower than the local build time.

> the synchronous writes that get done because there is only one dirty
> byte range for a block and the loader loves to write small non-contiguous
> areas of its output file.

Writing to nfs would be slow, but I made /usr/obj local to avoid it.  Also,
in other (kernel build) tests where object files are written to the current
directory, which is on nfs, the non-separate object directory is mounted
async on the server, so it is fast enough.  Now my reference is building
a FreeBSD-4 kernel.  My best times were:
- 32+ seconds (src and obj on nfs, async, -j4)
- 30- seconds (src and obj on ffs, async, -j4)
- 64+ (?) seconds (src and obj on nfs, async, -j1)
- 58 (?) seconds (src and obj on ffs, async, -j1)
(/usr on nfs, /bin on ffs).  Without parallelism, everything has to wait
for the RPCs, and even with low network latency this costs 5-10%.
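
For reference, the server-side async setup is just a remount of the
exported filesystem; the path below is a placeholder, and async of course
trades crash safety for speed:

    # remount an already-mounted ffs filesystem with async writes
    mount -u -o async /usr/src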

>> Too bad the server hangs when I use nfsv3 mount for /usr/obj.
> Try this mount command:
> mount -t nfs -o nfsv3,nolockd ...
> (I do builds of the src tree NFS mounted, so the only reason I can
> think that it would hang would be a rpc.lockd issue.)
> If this works, I suspect it will still be slow, but it would be nice to
> find out how much slower NFSv4 is for your case.

That is needed to localize the slowness anyway.  It might be just in the server.

Bruce

