NFS server bottlenecks

Nikolay Denev ndenev at gmail.com
Thu Oct 11 05:46:55 UTC 2012


On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem at uoguelph.ca>
>> wrote:
>> 
>>> Nikolay Denev wrote:
>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem at uoguelph.ca>
>>>> wrote:
>>>> 
>>>>> Garrett Wollman wrote:
>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>>>> <rmacklem at uoguelph.ca> said:
>>>>>> 
>>>>>>>> Simple: just use a separate mutex for each list that a cache
>>>>>>>> entry is on, rather than a global lock for everything. This
>>>>>>>> would reduce the mutex contention, but I'm not sure how
>>>>>>>> significantly since I don't have the means to measure it yet.
>>>>>>>> 
>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>> lists, I don't see how that can be done with a global lock for
>>>>>>> list updates?
>>>>>> 
>>>>>> Well, the global lock is what we have now, but the cache
>>>>>> trimming process only looks at one list at a time, so not
>>>>>> locking the list that isn't being iterated over probably
>>>>>> wouldn't hurt, unless there's some mechanism (that I didn't see)
>>>>>> for entries to move from one list to another. Note that I'm
>>>>>> considering each hash bucket a separate "list". (One issue to
>>>>>> worry about in that case would be cache-line contention in the
>>>>>> array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be
>>>>>> increased to reduce that.)
>>>>>> 
>>>>> Yea, a separate mutex for each hash list might help. There is
>>>>> also the LRU list that all entries end up on, that gets used by
>>>>> the trimming code. (I think? I wrote this stuff about 8 years
>>>>> ago, so I haven't looked at it in a while.)
>>>>> 
>>>>> Also, increasing the hash table size is probably a good idea,
>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>> 
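
Just to make the per-bucket idea concrete for anyone following along,
here is a rough userland sketch of what is being described (made-up
names and sizes, nothing like the actual in-kernel DRC code):

/*
 * Userland sketch of per-hash-bucket locking for a request cache.
 * Names and sizes are invented; this is not the kernel code.
 */
#include <pthread.h>
#include <stdint.h>
#include <sys/queue.h>

#define CACHE_HASHSIZE 500                      /* cf. NFSRVCACHE_HASHSIZE */

struct centry {
        LIST_ENTRY(centry) ce_hash;             /* per-bucket hash chain */
        uint32_t ce_xid;                        /* request XID (lookup key) */
        /* ... cached reply, timestamps, LRU linkage, etc. ... */
};

static struct cbucket {
        pthread_mutex_t cb_lock;                /* one mutex per bucket ... */
        LIST_HEAD(, centry) cb_head;            /* ... protecting only this chain */
} buckets[CACHE_HASHSIZE];

static void
cache_init(void)
{
        for (int i = 0; i < CACHE_HASHSIZE; i++) {
                pthread_mutex_init(&buckets[i].cb_lock, NULL);
                LIST_INIT(&buckets[i].cb_head);
        }
}

/*
 * A lookup only locks the bucket its XID hashes to, so concurrent
 * lookups that land in different buckets no longer serialize on one
 * global mutex.  (A real cache would also have to deal with the
 * global LRU list used for trimming, which is what makes this
 * non-trivial.)
 */
static struct centry *
cache_lookup(uint32_t xid)
{
        struct cbucket *b = &buckets[xid % CACHE_HASHSIZE];
        struct centry *ce;

        pthread_mutex_lock(&b->cb_lock);
        LIST_FOREACH(ce, &b->cb_head, ce_hash) {
                if (ce->ce_xid == xid)
                        break;
        }
        pthread_mutex_unlock(&b->cb_lock);
        return (ce);
}
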
>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>> bursts of traffic arrive.
>>>>>> 
>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>> 
>>>>> This code was originally "production tested" on a server with
>>>>> 1Gbyte, so times have changed a bit ;-)
>>>>> 
>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>> improve things. There are N nfsd threads already (N can be
>>>>>>> bumped up to 256 if you wish) and having a bunch more "cache
>>>>>>> trimming threads" would just increase contention, wouldn't it?
>>>>>> 
>>>>>> Only one cache-trimming thread. The cache trim holds the
>>>>>> (global) mutex for much longer than any individual nfsd service
>>>>>> thread has any need to, and having N threads doing that in
>>>>>> parallel is why it's so heavily contended. If there's only one
>>>>>> thread doing the trim, then the nfsd service threads aren't
>>>>>> spending time contending on the mutex (it will be held less
>>>>>> frequently and for shorter periods).
>>>>>> 
>>>>> I think the little drc2.patch, which will keep the nfsd threads
>>>>> from acquiring the mutex and doing the trimming most of the time,
>>>>> might be sufficient. I still don't see why a separate trimming
>>>>> thread will be an advantage. I'd also be worried that the one
>>>>> cache-trimming thread won't get the job done soon enough.
>>>>> 
>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>> 
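
As I read the description of drc2.patch above, the idea is roughly the
shape sketched below: a cheap unlocked check against the highwater
mark, so the nfsd threads skip the trim path entirely most of the
time. This is only a guess at the shape, in userland C with invented
names, not the actual patch:

/*
 * Sketch of "only trim when over the highwater mark" as described
 * above.  Invented names; not the actual drc2.patch.
 */
#include <pthread.h>
#include <stdatomic.h>

#define TCP_CACHE_HIGHWATER 100000              /* cf. vfs.nfsd.tcphighwater */

static atomic_int cache_size;                   /* number of cached entries */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

static void
cache_trim_locked(void)
{
        /* Walk the LRU list and free stale entries (omitted here). */
}

/*
 * Called on every RPC.  The common case is one atomic load and no
 * mutex acquisition at all; the lock is only taken, and the expensive
 * trim only runs, once the cache has grown past the highwater mark.
 */
static void
cache_maybe_trim(void)
{
        if (atomic_load(&cache_size) <= TCP_CACHE_HIGHWATER)
                return;                         /* fast path: skip the trim */

        pthread_mutex_lock(&cache_lock);
        cache_trim_locked();
        pthread_mutex_unlock(&cache_lock);
}
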
>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>> threads doing it would be a (I believe negligible) increase in
>>>>>>> RPC response times (the time the nfsd thread spends trimming
>>>>>>> the cache). As noted, I think this time would be negligible
>>>>>>> compared to disk I/O and network transit times in the total RPC
>>>>>>> response time?
>>>>>> 
>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and
>>>>>> 10G network connectivity, spinning on a contended mutex takes a
>>>>>> significant amount of CPU time. (For the current design of the
>>>>>> NFS server, it may actually be a win to turn off adaptive
>>>>>> mutexes -- I should give that a try once I'm able to do more
>>>>>> testing.)
>>>>>> 
>>>>> Have fun with it. Let me know when you have what you think is a
>>>>> good patch.
>>>>> 
>>>>> rick
>>>>> 
>>>>>> -GAWollman
>>>> 
>>>> My quest for IOPS over NFS continues :)
>>>> So far I'm not able to achieve more than about 3000 8K read
>>>> requests per second over NFS, while the server delivers much more
>>>> locally. And this is all from a file that is completely in the ARC
>>>> cache, so no disk I/O is involved.
>>>> 
>>> Just out of curiosity, why do you use 8K reads instead of 64K
>>> reads? Since the RPC overhead (including the DRC functions) is per
>>> RPC, doing fewer, larger RPCs should usually work better.
>>> (Sometimes large rsize/wsize values generate too large a burst of
>>> traffic for a network interface to handle, and then the rsize/wsize
>>> has to be decreased to avoid this issue.)
>>> 
>>> And, although this experiment seems useful for testing patches
>>> that try to reduce DRC CPU overheads, most "real" NFS servers will
>>> be doing disk I/O.
>>> 
>> 
>> This is the default block size that Oracle, and probably most other
>> databases, use. Oracle also uses larger blocks, but for small random
>> reads in OLTP workloads 8K is what gets issued.
>> 
> If the client is doing 8K reads, you could increase the readahead
> ("readahead=N", with N up to 16) to try to increase the bandwidth.
> (But if the CPU is 99% busy, then I don't think it will matter.)

I'll check whether this can be set, as we are testing not only with the Linux NFS client,
but also with Oracle's built-in, so-called DirectNFS client that is part of the application itself.
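
For reference, on a FreeBSD client those would just be mount options,
along the lines of the hypothetical example below (server name and
paths are placeholders); the Linux client and Oracle's DirectNFS each
have their own, different tuning knobs:

# FreeBSD client: larger rsize/wsize plus more read-ahead
mount -t nfs -o nfsv3,tcp,rsize=65536,wsize=65536,readahead=8 \
    nfsserver:/export/oradata /mnt/oradata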

> 
>> 
>>>> I've snatched a sample DTrace script from the net:
>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>>> and modified it for our new NFS server:
>>>> 
>>>> #!/usr/sbin/dtrace -qs
>>>> 
>>>> /* Count every nfsrvd_* call and record its entry timestamp. */
>>>> fbt:kernel:nfsrvd_*:entry
>>>> {
>>>>     self->ts = timestamp;
>>>>     @counts[probefunc] = count();
>>>> }
>>>> 
>>>> /* Elapsed time per call, in milliseconds. */
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 /
>>>> {
>>>>     this->delta = (timestamp - self->ts) / 1000000;
>>>> }
>>>> 
>>>> /* Finer-grained histogram for calls that took more than 100 ms. */
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 && this->delta > 100 /
>>>> {
>>>>     @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>> }
>>>> 
>>>> /* Overall latency distribution per function. */
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 /
>>>> {
>>>>     @dist[probefunc, "ms"] = quantize(this->delta);
>>>>     self->ts = 0;
>>>> }
>>>> 
>>>> END
>>>> {
>>>>     printf("\n");
>>>>     printa("function %-20s %@10d\n", @counts);
>>>>     printf("\n");
>>>>     printa("function %s(), time in %s:%@d\n", @dist);
>>>>     printf("\n");
>>>>     printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>> }
>>>> 
>>>> And here's a sample of the output from one or two minutes during
>>>> a run of Oracle's ORION benchmark tool from a Linux machine,
>>>> against a 32G file on an NFS mount over 10G Ethernet:
>>>> 
>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>> ^C
>>>> 
>>>> function nfsrvd_access 4
>>>> function nfsrvd_statfs 10
>>>> function nfsrvd_getattr 14
>>>> function nfsrvd_commit 76
>>>> function nfsrvd_sentcache 110048
>>>> function nfsrvd_write 110048
>>>> function nfsrvd_read 283648
>>>> function nfsrvd_dorpc 393800
>>>> function nfsrvd_getcache 393800
>>>> function nfsrvd_rephead 393800
>>>> function nfsrvd_updatecache 393800
>>>> 
>>>> function nfsrvd_access(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>> 1 | 0
>>>> 
>>>> function nfsrvd_statfs(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>> 1 | 0
>>>> 
>>>> function nfsrvd_getattr(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>> 1 | 0
>>>> 
>>>> function nfsrvd_sentcache(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>> 1 | 0
>>>> 
>>>> function nfsrvd_rephead(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>> 1 | 0
>>>> 
>>>> function nfsrvd_updatecache(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>> 1 | 0
>>>> 
>>>> function nfsrvd_getcache(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>> 1 | 1
>>>> 2 | 0
>>>> 4 | 1
>>>> 8 | 0
>>>> 
>>>> function nfsrvd_write(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>> 1 | 5
>>>> 2 | 4
>>>> 4 | 0
>>>> 
>>>> function nfsrvd_read(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>> 1 | 19
>>>> 2 | 3
>>>> 4 | 2
>>>> 8 | 0
>>>> 16 | 1
>>>> 32 | 0
>>>> 64 | 0
>>>> 128 | 0
>>>> 256 | 1
>>>> 512 | 0
>>>> 
>>>> function nfsrvd_commit(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>>>> 1 |@@@@@@@ 14
>>>> 2 | 0
>>>> 4 |@ 1
>>>> 8 |@ 1
>>>> 16 | 0
>>>> 32 |@@@@@@@ 14
>>>> 64 |@ 2
>>>> 128 | 0
>>>> 
>>>> 
>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>> value ------------- Distribution ------------- count
>>>> < 100 | 0
>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>> 150 | 0
>>>> 
>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>> value ------------- Distribution ------------- count
>>>> 250 | 0
>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>> 350 | 0
>>>> 
>>>> 
>>>> Looks like the NFS server cache functions are quite fast, but
>>>> called extremely frequently.
>>>> 
>>> Yep, they are called for every RPC.
>>> 
>>> I may try coding up a patch that replaces the single mutex with
>>> one for each hash bucket, for TCP.
>>> 
>>> I'll post if/when I get this patch to a testing/review stage, rick
>>> 
>> 
>> Cool.
>> 
>> I've readjusted the precision of the DTrace script a bit, and I can
>> now see the following three functions taking most of the time:
>> nfsrvd_getcache(), nfsrc_trimcache(), and nfsrvd_updatecache().
>> 
>> This was recorded during an Oracle benchmark run called SLOB, which
>> caused 99% CPU load on the NFS server.
>> 
> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
> (Assuming the mounts are TCP ones.)
> 
> Have fun with it, rick
> 

I had upped it, but probably not enough. I'm now running with vfs.nfsd.tcphighwater set
to some ridiculously large number, and with NFSRVCACHE_HASHSIZE set to 500.
So far it looks like a good improvement, as those functions no longer show up in the
DTrace script output. I'll run some more benchmarks and tests today.
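
For the archives, the two knobs in question look roughly like this
(the highwater value below is only a placeholder, and as far as I can
tell NFSRVCACHE_HASHSIZE is a compile-time constant in the new NFS
server's cache code, so changing it means rebuilding the kernel):

# /etc/sysctl.conf -- raise the TCP DRC highwater mark
# (placeholder value; pick something well above the peak cache size)
vfs.nfsd.tcphighwater=100000

# NFSRVCACHE_HASHSIZE is a #define in the kernel sources; bump it and
# rebuild, e.g.:
#define NFSRVCACHE_HASHSIZE     500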

Thanks!

>> 
>>>> I hope someone can find this information useful.
>>>> 


