NFS server bottlenecks

Nikolay Denev ndenev at gmail.com
Wed Oct 10 14:42:19 UTC 2012


On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem at uoguelph.ca>
>> wrote:
>> 
>>> Garrett Wollman wrote:
>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>> <rmacklem at uoguelph.ca> said:
>>>> 
>>>>>> Simple: just use a separate mutex for each list that a cache
>>>>>> entry
>>>>>> is on, rather than a global lock for everything. This would
>>>>>> reduce
>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>> don't have the means to measure it yet.
>>>>>> 
>>>>> Well, since the cache trimming is removing entries from the lists,
>>>>> I
>>>>> don't
>>>>> see how that can be done with a global lock for list updates?
>>>> 
>>>> Well, the global lock is what we have now, but the cache trimming
>>>> process only looks at one list at a time, so not locking the list
>>>> that
>>>> isn't being iterated over probably wouldn't hurt, unless there's
>>>> some
>>>> mechanism (that I didn't see) for entries to move from one list to
>>>> another. Note that I'm considering each hash bucket a separate
>>>> "list". (One issue to worry about in that case would be cache-line
>>>> contention in the array of hash buckets; perhaps
>>>> NFSRVCACHE_HASHSIZE
>>>> ought to be increased to reduce that.)
>>>> 
>>> Yea, a separate mutex for each hash list might help. There is also
>>> the
>>> LRU list that all entries end up on, that gets used by the trimming
>>> code.
>>> (I think? I wrote this stuff about 8 years ago, so I haven't looked
>>> at
>>> it in a while.)
>>> 
>>> Also, increasing the hash table size is probably a good idea,
>>> especially
>>> if you reduce how aggressively the cache is trimmed.
>>> 
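
As a rough sketch of the per-bucket locking idea (the names and layout are
hypothetical and do not match the real sys/fs/nfsserver/nfs_nfsdcache.c
structures): a mutex per hash bucket lets lookups that hash to different
buckets run in parallel, while the global LRU list used for trimming would
still need its own lock. Something like:

/*
 * Hypothetical DRC hash table with one mutex per bucket instead of a
 * single global cache mutex.  Bucket mutexes would be mtx_init()ed at
 * cache setup time (not shown).
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define DRC_HASHSIZE	500		/* a larger table also spreads contention */

struct drc_entry {
	LIST_ENTRY(drc_entry)	de_hash;	/* per-bucket chain */
	TAILQ_ENTRY(drc_entry)	de_lru;		/* global LRU, for trimming */
	uint32_t		de_xid;
	/* ... cached reply, timestamps, flags ... */
};

struct drc_bucket {
	struct mtx		db_mtx;		/* protects db_head only */
	LIST_HEAD(, drc_entry)	db_head;
} __aligned(CACHE_LINE_SIZE);			/* avoid false sharing between buckets */

static struct drc_bucket drc_table[DRC_HASHSIZE];

/*
 * A lookup touches a single bucket, so lookups that hash to different
 * buckets no longer serialize against each other.  A real version would
 * keep the bucket locked (or take a reference) while the cached reply
 * is used, instead of returning the bare pointer.
 */
static struct drc_entry *
drc_lookup(uint32_t xid)
{
	struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
	struct drc_entry *de;

	mtx_lock(&b->db_mtx);
	LIST_FOREACH(de, &b->db_head, de_hash) {
		if (de->de_xid == xid)
			break;
	}
	mtx_unlock(&b->db_mtx);
	return (de);
}
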
>>>>> Only doing it once/sec would result in a very large cache when
>>>>> bursts of
>>>>> traffic arrive.
>>>> 
>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>> 
>>> This code was originally "production tested" on a server with
>>> 1Gbyte,
>>> so times have changed a bit;-)
>>> 
>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>> things.
>>>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>>>> wish)
>>>>> and having a bunch more "cache trimming threads" would just
>>>>> increase
>>>>> contention, wouldn't it?
>>>> 
>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>> mutex for much longer than any individual nfsd service thread has
>>>> any
>>>> need to, and having N threads doing that in parallel is why it's so
>>>> heavily contended. If there's only one thread doing the trim, then
>>>> the nfsd service threads aren't spending time contending on the
>>>> mutex (it will be held less frequently and for shorter periods).
>>>> 
>>> I think the little drc2.patch which will keep the nfsd threads from
>>> acquiring the mutex and doing the trimming most of the time, might
>>> be
>>> sufficient. I still don't see why a separate trimming thread will be
>>> an advantage. I'd also be worried that the one cache trimming thread
>>> won't get the job done soon enough.
>>> 
>>> When I did production testing on a 1Gbyte server that saw a peak
>>> load of about 100RPCs/sec, it was necessary to trim aggressively.
>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>> longer relevant, I recently recall someone trying to run FreeBSD
>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>> 
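
To make that idea concrete (this is only an illustration, not the actual
drc2.patch, which is not reproduced here; the names and the highwater value
are invented): each nfsd thread does a cheap unlocked size check and only
takes the cache mutex to trim when the cache has actually grown too large.

/*
 * Illustration of the "don't trim on every RPC" idea.  Not the actual
 * drc2.patch; names and the threshold are invented.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

#define DRC_HIGHWATER	20000		/* hypothetical tuning knob */

static volatile int	drc_cachesize;	/* entry count, updated under drc_mtx */
static struct mtx	drc_mtx;	/* the existing global cache mutex */

static void
drc_trim_if_needed(void)
{
	/*
	 * Unlocked read: it can be slightly stale, so a thread may trim
	 * a bit early or a bit late, but most RPCs skip the mutex
	 * entirely instead of contending on it for every request.
	 */
	if (drc_cachesize <= DRC_HIGHWATER)
		return;

	mtx_lock(&drc_mtx);
	while (drc_cachesize > DRC_HIGHWATER) {
		/*
		 * Placeholder for the real trimming work: unlink the
		 * oldest entry from the LRU list, free its cached
		 * reply, and decrement the count.
		 */
		drc_cachesize--;
	}
	mtx_unlock(&drc_mtx);
}
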
>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>> response times (the time the nfsd thread spends trimming the
>>>>> cache).
>>>>> As noted, I think this time would be negligible compared to disk
>>>>> I/O
>>>>> and network transit times in the total RPC response time?
>>>> 
>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>> network connectivity, spinning on a contended mutex takes a
>>>> significant amount of CPU time. (For the current design of the NFS
>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>> should give that a try once I'm able to do more testing.)
>>>> 
>>> Have fun with it. Let me know when you have what you think is a good
>>> patch.
>>> 
>>> rick
>>> 
>>>> -GAWollman
>> 
>> My quest for IOPS over NFS continues :)
>> So far I'm not able to achieve more than about 3000 8K read requests
>> per second over NFS, while the server achieves much more locally.
>> And this is all from a file that is completely in the ARC cache, so
>> no disk I/O is involved.
>> 
> Just out of curiosity, why do you use 8K reads instead of 64K reads?
> Since the RPC overhead (including the DRC functions) is per RPC, doing
> fewer larger RPCs should usually work better. (Sometimes large rsize/wsize
> values generate too large a burst of traffic for a network interface to
> handle and then the rsize/wsize has to be decreased to avoid this issue.)
> 
> And, although this experiment seems useful for testing patches that try
> and reduce DRC CPU overheads, most "real" NFS servers will be doing disk
> I/O.
> 

This is the default block size that Oracle and probably most databases use.
Oracle also uses larger blocks, but for small random reads in OLTP applications this is what is used.


>> I've snatched a sample DTrace script from the net:
>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>> 
>> And modified it for our new NFS server:
>> 
>> #!/usr/sbin/dtrace -qs
>> 
>> fbt:kernel:nfsrvd_*:entry
>> {
>> 	self->ts = timestamp;
>> 	@counts[probefunc] = count();
>> }
>> 
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 /
>> {
>> 	this->delta = (timestamp - self->ts) / 1000000;
>> }
>> 
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 && this->delta > 100 /
>> {
>> 	@slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>> }
>> 
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 /
>> {
>> 	@dist[probefunc, "ms"] = quantize(this->delta);
>> 	self->ts = 0;
>> }
>> 
>> END
>> {
>> 	printf("\n");
>> 	printa("function %-20s %@10d\n", @counts);
>> 	printf("\n");
>> 	printa("function %s(), time in %s:%@d\n", @dist);
>> 	printf("\n");
>> 	printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>> }
>> 
>> And here's a sample of the output from one or two minutes during a run
>> of Oracle's ORION benchmark tool from a Linux machine, against a 32G
>> file on an NFS mount over 10G Ethernet:
>> 
>> [16:01]root at goliath:/home/ndenev# ./nfsrvd.d
>> ^C
>> 
>> function nfsrvd_access 4
>> function nfsrvd_statfs 10
>> function nfsrvd_getattr 14
>> function nfsrvd_commit 76
>> function nfsrvd_sentcache 110048
>> function nfsrvd_write 110048
>> function nfsrvd_read 283648
>> function nfsrvd_dorpc 393800
>> function nfsrvd_getcache 393800
>> function nfsrvd_rephead 393800
>> function nfsrvd_updatecache 393800
>> 
>> function nfsrvd_access(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>> 1 | 0
>> 
>> function nfsrvd_statfs(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>> 1 | 0
>> 
>> function nfsrvd_getattr(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>> 1 | 0
>> 
>> function nfsrvd_sentcache(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>> 1 | 0
>> 
>> function nfsrvd_rephead(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>> 1 | 0
>> 
>> function nfsrvd_updatecache(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>> 1 | 0
>> 
>> function nfsrvd_getcache(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>> 1 | 1
>> 2 | 0
>> 4 | 1
>> 8 | 0
>> 
>> function nfsrvd_write(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>> 1 | 5
>> 2 | 4
>> 4 | 0
>> 
>> function nfsrvd_read(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>> 1 | 19
>> 2 | 3
>> 4 | 2
>> 8 | 0
>> 16 | 1
>> 32 | 0
>> 64 | 0
>> 128 | 0
>> 256 | 1
>> 512 | 0
>> 
>> function nfsrvd_commit(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>> 1 |@@@@@@@ 14
>> 2 | 0
>> 4 |@ 1
>> 8 |@ 1
>> 16 | 0
>> 32 |@@@@@@@ 14
>> 64 |@ 2
>> 128 | 0
>> 
>> 
>> function nfsrvd_commit(), time in ms for >= 100 ms:
>> value ------------- Distribution ------------- count
>> < 100 | 0
>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>> 150 | 0
>> 
>> function nfsrvd_read(), time in ms for >= 100 ms:
>> value ------------- Distribution ------------- count
>> 250 | 0
>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>> 350 | 0
>> 
>> 
>> Looks like the NFS server cache functions are quite fast, but called
>> extremely frequently.
>> 
> Yep, they are called for every RPC.
> 
> I may try coding up a patch that replaces the single mutex with
> one for each hash bucket, for TCP.
> 
> I'll post if/when I get this patch to a testing/review stage, rick
> 

Cool.

I've readjusted the precision of the DTrace script a bit, and I can now see
the following three functions taking most of the time: nfsrvd_getcache(),
nfsrc_trimcache() and nfsrvd_updatecache().

This was recorded during an Oracle benchmark run called SLOB, which caused
99% CPU load on the NFS server.
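
A simplified picture of why exactly those three functions dominate
(schematic only, not the real nfs_nfsdcache.c code; in the real server the
trim may be invoked from inside the cache functions, but the serialization
point is the same): every RPC does a cache lookup, a reply update and a trim
pass, and all three take the one global cache mutex.

/*
 * Schematic per-RPC flow pieced together from the thread; not the real
 * code.  At the ~400,000 RPCs in the sample above, the single cache
 * mutex is taken and released on the order of a million times.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx drc_mtx;		/* the single global cache mutex */

static void
nfsd_handle_rpc(void)
{
	/* 1. Duplicate request cache lookup (nfsrvd_getcache()). */
	mtx_lock(&drc_mtx);
	/* ... hash lookup; a hit means resend the cached reply ... */
	mtx_unlock(&drc_mtx);

	/* 2. Do the actual operation (read, write, ...); no cache lock held. */

	/* 3. Record the reply (nfsrvd_updatecache()). */
	mtx_lock(&drc_mtx);
	/* ... update the entry and its place on the LRU list ... */
	mtx_unlock(&drc_mtx);

	/* 4. Trim old entries (nfsrc_trimcache()), again under the same lock. */
	mtx_lock(&drc_mtx);
	/* ... walk the lists and free expired entries ... */
	mtx_unlock(&drc_mtx);
}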


>> I hope someone can find this information useful.
>> 


