NFS server bottlenecks

Rick Macklem rmacklem at uoguelph.ca
Sat Oct 13 21:43:25 UTC 2012


Ivan Voras wrote:
> On 5 October 2012 14:54, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> > Yes, the algorithm for UDP is to trim the least recently used
> > entries, and the LRU list is how "least recently used" is determined.
> >
> > Without that, you'd need to do something like timestamp the entries
> > and then scan the entire list to find the least recently used ones.
> 
> Hi,
> 
> I'm reading the cache code, and would like to hear from you if my
> understanding of the LRU list is correct:
> 
> * The list is only necessary for the UDP case
Yes. LRU is not a good criterion for selecting which cached entries to
replace (actually trim away) in the TCP case, so the list is only
needed for UDP.

> * Entries are added to the lru list only in nfsrc_getudp(), if the
> entry is not cached; if it is cached, it is moved to the front
> * An entry is also moved to the front in nfsrvd_updatecache()
A new entry is put at the head of the list, since it is the most
recent. When entries are referenced, they are moved to the head of
the list, since they have been used recently.

> * nfsrc_trimcache() removes entries from the LRU list, but the
> condition is somewhat complex.
It trims from the end of the list, since those entries haven't been
recently used.
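
To make the list handling concrete, here is a minimal sketch using the
sys/queue.h TAILQ macros. The names below (rc_lru, nfsrc_udplru,
nfsrc_touch and so on) are just illustrative, not necessarily what the
actual code uses:

#include <sys/queue.h>

struct nfsrvcache {
        TAILQ_ENTRY(nfsrvcache) rc_lru; /* LRU list linkage */
        /* ... hash linkage, xid, the cached reply, flags ... */
};
TAILQ_HEAD(nfsrc_lruhead, nfsrvcache);

static struct nfsrc_lruhead nfsrc_udplru =
    TAILQ_HEAD_INITIALIZER(nfsrc_udplru);

/* A cache hit moves the entry to the head (most recently used). */
static void
nfsrc_touch(struct nfsrvcache *rp)
{
        TAILQ_REMOVE(&nfsrc_udplru, rp, rc_lru);
        TAILQ_INSERT_HEAD(&nfsrc_udplru, rp, rc_lru);
}

/* Trimming pops entries off the tail, the least recently used end. */
static struct nfsrvcache *
nfsrc_trim_one(void)
{
        struct nfsrvcache *rp;

        rp = TAILQ_LAST(&nfsrc_udplru, nfsrc_lruhead);
        if (rp != NULL)
                TAILQ_REMOVE(&nfsrc_udplru, rp, rc_lru);
        return (rp);    /* caller frees the cached reply and the entry */
}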

> Can you tell me why there is a
> RC_REFCNT flag, and also a rc_refcnt field? I.e. wouldn't the field be
> enough?
The rc_refcnt is only used for NFSv4 entries that reference open/lock
state. Such entries are marked with RC_REFCNT, and while RC_REFCNT is
set and rc_refcnt > 0, the entry cannot be thrown away. It will go away
when the next open/lock operation in sequence is received.

To be honest, since NFSv4 should be using TCP, this should never be
set for UDP. The code probably handles the case for UDP, since during
early testing UDP was still being used for NFSv4.
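
In the trimming loop, that check would look something like this
(reusing the illustrative names from the sketch above, and assuming the
flag lives in an rc_flag field; nfsrc_freecache() and nfsrc_overlimit()
are hypothetical helpers):

static void
nfsrc_trim_lru(void)
{
        struct nfsrvcache *rp, *nextrp;

        /* Walk from the LRU tail, the least recently used end. */
        TAILQ_FOREACH_REVERSE_SAFE(rp, &nfsrc_udplru, nfsrc_lruhead,
            rc_lru, nextrp) {
                if ((rp->rc_flag & RC_REFCNT) != 0 && rp->rc_refcnt > 0)
                        continue;       /* pinned by open/lock state */
                TAILQ_REMOVE(&nfsrc_udplru, rp, rc_lru);
                nfsrc_freecache(rp);
                if (!nfsrc_overlimit())
                        break;
        }
}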

> * Is it true that the LRU list is essentially completely orthogonal to
> the hash tables?
Yes.

> I.e. there is nothing which specifically requires that
> a specific hash table be locked at the same time as the LRU list? My
> goal with all this is to see if the global lru list could be broken
> down similarly to the hash entries, to rework the locking for less
> contention.
It is true that you don't need to hold both locks concurrently if you
have separate locks for the UDP hash buckets. However, since all
threads handling UDP requests will contend on the one lock for the
global LRU list anyway, also having to grab a lock for a hash bucket
list seems to me like it would just add overhead?

If, as you proposed, you use separate LRU lists for each hash bucket,
then how do you know that the least recently used entry in one hash
bucket isn't much more recently used than the least recently used entry
in another hash bucket? (The hash code is using the xid, which might be
about the same for different clients at the same time.)

If you were to switch to hashing on the client IP address, then requests
from the same client would land in the same hash bucket. This would be
good for selecting the least recently used entries, but poor for
spreading the cached entries across the hash buckets, since at any given
point in time only a few clients will be actively generating RPC
requests, I think?
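
For illustration only, with made-up macro names, the two choices of
hash key would look like:

#include <netinet/in.h>

#define NFSRC_NBUCKETS          20      /* illustrative table size */

/* Hashing on the xid spreads entries evenly across the buckets, but
 * mixes requests from different clients in every bucket: */
#define NFSRC_HASH_XID(xid)     ((xid) % NFSRC_NBUCKETS)

/* Hashing on the client address keeps each client's requests in one
 * bucket, which suits per-bucket LRU, but can leave most buckets idle
 * when only a few clients are active: */
#define NFSRC_HASH_ADDR(sin) \
        (ntohl((sin)->sin_addr.s_addr) % NFSRC_NBUCKETS)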

If you timestamp the entries as well as moving them to the head of the
LRU list for their bucket, then the trimming code could compare
timestamps and decide which entries should be replaced. That would be
more work for the trimming code, but if it isn't executed too
frequently, it could minimize contention on the locks, as a tradeoff
against increased cache storage and more overhead while trimming is
being done. I have no idea if this is worth doing? (I'm always
surprised that people still prefer UDP to TCP for NFS, but some do
seem to. ;-)
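
Something like this is what I have in mind (again just a sketch,
reusing the illustrative names from above and assuming a time_t
rc_timestamp field is added to the entry; time_uptime is the kernel's
cheap seconds-granularity uptime clock):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

struct nfsrc_bucket {
        struct mtx              b_lock; /* protects only this bucket */
        struct nfsrc_lruhead    b_lru;  /* this bucket's LRU list */
};
static struct nfsrc_bucket nfsrc_udpbucket[NFSRC_NBUCKETS];

/* Hit/insert path: touch the entry under its own bucket's lock. */
static void
nfsrc_touch_bucket(struct nfsrc_bucket *b, struct nfsrvcache *rp)
{
        mtx_lock(&b->b_lock);
        rp->rc_timestamp = time_uptime; /* seconds should be plenty */
        TAILQ_REMOVE(&b->b_lru, rp, rc_lru);
        TAILQ_INSERT_HEAD(&b->b_lru, rp, rc_lru);
        mtx_unlock(&b->b_lock);
}

/* Trim path: evict everything older than a cutoff, tail first, one
 * bucket at a time. More work than popping a single global tail, but
 * it only runs when the cache is over its limit. */
static void
nfsrc_trim_buckets(time_t cutoff)
{
        struct nfsrc_bucket *b;
        struct nfsrvcache *rp;
        int i;

        for (i = 0; i < NFSRC_NBUCKETS; i++) {
                b = &nfsrc_udpbucket[i];
                mtx_lock(&b->b_lock);
                while ((rp = TAILQ_LAST(&b->b_lru, nfsrc_lruhead)) != NULL &&
                    rp->rc_timestamp < cutoff) {
                        TAILQ_REMOVE(&b->b_lru, rp, rc_lru);
                        nfsrc_freecache(rp);    /* hypothetical helper */
                }
                mtx_unlock(&b->b_lock);
        }
}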

rick
ps: I hope you didn't mind me adding the mailing list. I'd like others to
    be able to comment/read the discussion.

