NFS server bottlenecks
Nikolay Denev
ndenev at gmail.com
Tue Oct 9 14:12:48 UTC 2012
On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> Garrett Wollman wrote:
>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>> <rmacklem at uoguelph.ca> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't see how that can be done without a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
>
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
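To make the per-bucket idea concrete, here is a rough sketch of what it could
look like. All names and sizes are made up for illustration (this is not the
actual sys/fs/nfsserver code), and the LRU list mentioned above would still
need its own lock, or a per-bucket LRU, so it doesn't become the new single
point of contention:

/*
 * Illustrative sketch: one mutex per hash bucket instead of one
 * global mutex for the whole reply cache.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define CRC_HASHSIZE 500                /* illustrative; cf. NFSRVCACHE_HASHSIZE */

struct crcentry {                       /* hypothetical cache entry */
    LIST_ENTRY(crcentry)  ce_hash;      /* per-bucket hash chain */
    TAILQ_ENTRY(crcentry) ce_lru;       /* global LRU, protected by crc_lru_mtx */
    /* ... xid, client address, cached reply, timestamps ... */
};

static struct crcbucket {
    struct mtx            b_mtx;        /* protects only this bucket's chain */
    LIST_HEAD(, crcentry) b_chain;
} crc_hash[CRC_HASHSIZE];

static struct mtx crc_lru_mtx;          /* still global, but held only briefly */
static TAILQ_HEAD(, crcentry) crc_lru = TAILQ_HEAD_INITIALIZER(crc_lru);

static void
crc_init(void)
{
    int i;

    mtx_init(&crc_lru_mtx, "crclru", NULL, MTX_DEF);
    for (i = 0; i < CRC_HASHSIZE; i++) {
        mtx_init(&crc_hash[i].b_mtx, "crchash", NULL, MTX_DEF);
        LIST_INIT(&crc_hash[i].b_chain);
    }
}

A lookup or insert then only takes the one bucket mutex, so two nfsd threads
working on different buckets no longer contend with each other.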
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things.
>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>> wish)
>>> and having a bunch more "cache trimming threads" would just increase
>>> contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
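For illustration, a single trimming thread along the lines Garrett describes
could be started roughly like this. This is only a sketch of the idea, not an
actual patch; the helper and all names are hypothetical:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/kthread.h>
#include <sys/proc.h>

static int nfsrc_trim_wanted;           /* sleep channel / wakeup flag */

static void nfsrc_trim_all(void);       /* hypothetical: one full trim pass */

static void
nfsrc_trimmer(void *arg __unused)
{
    for (;;) {
        /* Wake up when poked by an nfsd thread, or at least once a second. */
        tsleep(&nfsrc_trim_wanted, PVFS, "nfstrim", hz);
        nfsrc_trim_all();
    }
}

/* Called once when the nfsd starts up. */
static void
nfsrc_start_trimmer(void)
{
    kproc_create(nfsrc_trimmer, NULL, NULL, 0, 0, "nfscachetrim");
}

The service threads would then only wakeup(&nfsrc_trim_wanted) when the cache
looks too big, instead of doing the long trim pass themselves.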
> I think the little drc2.patch, which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread will be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
>
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recall someone recently trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
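From the description of drc2.patch above, the gist seems to be a cheap size
check so the nfsd threads skip the mutex and the trim pass most of the time.
Roughly like this (a guess at the shape, not the actual patch; all names and
numbers are illustrative):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

extern struct mtx nfsrc_mtx;            /* stands in for the real global cache mutex */
static volatile u_int nfsrc_size;       /* current number of cached entries */
static u_int nfsrc_highwater = 4096;    /* illustrative trim threshold */

static void nfsrc_trim_locked(void);    /* hypothetical: walk the LRU, drop old entries */

static void
nfsrc_maybe_trim(void)
{
    /*
     * Unlocked read: a slightly stale size is fine here, it only
     * decides whether this RPC bothers to trim at all.
     */
    if (nfsrc_size <= nfsrc_highwater)
        return;

    mtx_lock(&nfsrc_mtx);
    nfsrc_trim_locked();
    mtx_unlock(&nfsrc_mtx);
}

That keeps the trim out of the common RPC path, at the price of letting the
cache grow somewhat past the threshold during a burst, which is the trade-off
discussed above about how aggressively to trim.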
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman
My quest for IOPS over NFS continues :)
So far I'm not able to achieve more than about 3000 8K read requests per second over NFS
(only about 24 MB/s on a 10G link), while the same test run locally on the server gives much more.
And this is all from a file that is completely in the ARC, so no disk I/O is involved.
I've snatched a sample DTrace script from the net [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
and modified it for our new NFS server:
#!/usr/sbin/dtrace -qs

/* Record the entry time and count every nfsrvd_* call. */
fbt:kernel:nfsrvd_*:entry
{
    self->ts = timestamp;
    @counts[probefunc] = count();
}

/* Elapsed time for this call, converted from ns to ms. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
    this->delta = (timestamp - self->ts) / 1000000;
}

/* Finer-grained histogram for calls that took more than 100 ms. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 && this->delta > 100 /
{
    @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
}

/* Power-of-two latency distribution for every call. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
    @dist[probefunc, "ms"] = quantize(this->delta);
    self->ts = 0;
}

END
{
    printf("\n");
    printa("function %-20s %@10d\n", @counts);
    printf("\n");
    printa("function %s(), time in %s:%@d\n", @dist);
    printf("\n");
    printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
}
And here's sample output captured over a minute or two while running Oracle's ORION benchmark
tool from a Linux machine against a 32G file on an NFS mount over 10G Ethernet:
[16:01]root@goliath:/home/ndenev# ./nfsrvd.d
^C
function nfsrvd_access 4
function nfsrvd_statfs 10
function nfsrvd_getattr 14
function nfsrvd_commit 76
function nfsrvd_sentcache 110048
function nfsrvd_write 110048
function nfsrvd_read 283648
function nfsrvd_dorpc 393800
function nfsrvd_getcache 393800
function nfsrvd_rephead 393800
function nfsrvd_updatecache 393800
function nfsrvd_access(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
1 | 0
function nfsrvd_statfs(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
1 | 0
function nfsrvd_getattr(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
1 | 0
function nfsrvd_sentcache(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
1 | 0
function nfsrvd_rephead(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
1 | 0
function nfsrvd_updatecache(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
1 | 0
function nfsrvd_getcache(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
1 | 1
2 | 0
4 | 1
8 | 0
function nfsrvd_write(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
1 | 5
2 | 4
4 | 0
function nfsrvd_read(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
1 | 19
2 | 3
4 | 2
8 | 0
16 | 1
32 | 0
64 | 0
128 | 0
256 | 1
512 | 0
function nfsrvd_commit(), time in ms:
value ------------- Distribution ------------- count
-1 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
1 |@@@@@@@ 14
2 | 0
4 |@ 1
8 |@ 1
16 | 0
32 |@@@@@@@ 14
64 |@ 2
128 | 0
function nfsrvd_commit(), time in ms for >= 100 ms:
value ------------- Distribution ------------- count
< 100 | 0
100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
150 | 0
function nfsrvd_read(), time in ms for >= 100 ms:
value ------------- Distribution ------------- count
250 | 0
300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
350 | 0
Looks like the NFS server cache functions are quite fast, but they are called extremely frequently.
I hope someone finds this information useful.