High rate of NFS cache misses after upgrading from 10.3-prerelease to 11.1-release

Sat Apr 14 23:18:30 UTC 2018

Niels Kobschätzki wrote:
>On 04/14/2018 03:49 AM, Rick Macklem wrote:
>> Niels Kobschätzki wrote:
>>> sorry for the cross-posting but so far I had no real luck on the forum
>>> or on question, thus I want to try my luck here as well.
>> I read email lists but don't do the other stuff, so I just saw this yesterday.
>> Short answer, I haven't a clue why cache hits rate would have changed.
>>
>> The code that decides if there is a hit/miss for the attribute cache is in
>> ncl_getattrcache() and the code hasn't changed between 10.3->11.1,
>> except the old code did a mtx_lock(&Giant), but I can't imagine how that
>> would affect the code.
>>
>> You might want to:
>> # sysctl -a | fgrep vfs.nfs
>> for both the 10.3 and 11.1 systems, to check if any defaults have somehow
>> been changed. (I don't recall any being changed, but??)
>
>I did that and there did nothing change.
>
>> If you go into ncl_getattrcache() {it's in sys/fs/nfsclient/nfs_clsubs.c}
>> and add a printf() for "time_second" and "np->n_mtime.tv_sec" near the
>> top, where it calculates "timeo" from it.
>> Running this hacked kernel might show you if either of these fields is bogus.
>> (You could then printf() "timeo" and "np->n_attrtimeo" just before the "if"
>> clause that increments "attrcache_misses", which is where the cache misses
>> happen to see why it is missing the cache.)
>> If you could do this for the 10.3 kernel as well, this might indicate why the
>> miss rate has increased?
>
>I will do this next week. On monday we switch for other reasons to other
>nfs-servers and when we see that they run stable, I will do this next.
With a miss rate of 2.7%, I doubt printing the above will help. I thought
you were seeing a high miss rate.

>Btw. I calculated now the percentages. The old servers had a attr miss
>rate of something like 0.004%, while the upgraded one has more like
>2.7%. This is till low from what I've read (I remember that you should
>start adjusting acreg* when you hit more than 40% misses) but far higher
>than before.
You could try increasing acregmin, acregmax and see if the misses are reduced.
(The only risk with increasing the cache timeout is that, if another client changes
 the attributes, then the client will use stale ones for longer. Usually, this doesn't
 cause serious problems.)
To be honest, a Getattr RPC is pretty low overhead, so I doubt the increase
to 2.7% will affect your application's performance, but it is interesting that
it increased.

You might also try increasing acdirmin, acdirmax in case it is the directory
attributes that are having cache misses.

Oh, and check that your time of day clocks are in sync with the server,
since the caches are time based, since there is no cache coherency protocol
in NFS.
[good stuff snipped]
rick