Bad ZFS - NFS interaction? [ was: NFS server bottlenecks ]

Nikolay Denev ndenev at gmail.com
Mon Oct 15 15:08:24 UTC 2012


On Oct 15, 2012, at 5:06 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 13, 2012, at 6:22 PM, Nikolay Denev <ndenev at gmail.com> wrote:
>> 
>>> 
>>> On Oct 13, 2012, at 5:05 AM, Rick Macklem <rmacklem at uoguelph.ca>
>>> wrote:
>>> 
>>>> I wrote:
>>>>> Oops, I didn't get the "readahead" option description
>>>>> quite right in the last post. The default read ahead
>>>>> is 1, which does result in "rsize * 2", since there is
>>>>> the read + 1 readahead.
>>>>> 
>>>>> "rsize * 16" would actually be for the option "readahead=15"
>>>>> and for "readahead=16" the calculation would be "rsize * 17".
>>>>> 
>>>>> However, the example was otherwise ok, I think? rick
>>>> 
>>>> I've attached the patch drc3.patch (it assumes drc2.patch has
>>>> already been
>>>> applied) that replaces the single mutex with one for each hash list
>>>> for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.
>>>> 
>>>> These patches are also at:
>>>> http://people.freebsd.org/~rmacklem/drc2.patch
>>>> http://people.freebsd.org/~rmacklem/drc3.patch
>>>> in case the attachments don't get through.
>>>> 
>>>> rick
>>>> ps: I haven't tested drc3.patch a lot, but I think it's ok?
>>> 
>>> drc3.patch applied and built cleanly, and shows a nice improvement!
>>> 
>>> I've done a quick benchmark using iozone over the NFS mount from the
>>> Linux host.
>>> 
>>> drc2.patch (but with NFSRVCACHE_HASHSIZE=500)
>>> 
>>> 	TEST WITH 8K
>>> 	-------------------------------------------------------------------------------------------------
>>>       Auto Mode
>>>       Using Minimum Record Size 8 KB
>>>       Using Maximum Record Size 8 KB
>>>       Using minimum file size of 2097152 kilobytes.
>>>       Using maximum file size of 2097152 kilobytes.
>>>       O_DIRECT feature enabled
>>>       SYNC Mode.
>>>       OPS Mode. Output is in operations per second.
>>>       Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o
>>>       -O -i 0 -i 1 -i 2
>>>       Time Resolution = 0.000001 seconds.
>>>       Processor cache size set to 1024 Kbytes.
>>>       Processor cache line size set to 32 bytes.
>>>       File stride size set to 17 * record size.
>>>                 KB  reclen   write  rewrite    read  reread  random read  random write
>>>            2097152       8    1919     1914    2356    2321         2335          1706
>>> 
>>> 	TEST WITH 1M
>>> 	-------------------------------------------------------------------------------------------------
>>>       Auto Mode
>>>       Using Minimum Record Size 1024 KB
>>>       Using Maximum Record Size 1024 KB
>>>       Using minimum file size of 2097152 kilobytes.
>>>       Using maximum file size of 2097152 kilobytes.
>>>       O_DIRECT feature enabled
>>>       SYNC Mode.
>>>       OPS Mode. Output is in operations per second.
>>>       Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o
>>>       -O -i 0 -i 1 -i 2
>>>       Time Resolution = 0.000001 seconds.
>>>       Processor cache size set to 1024 Kbytes.
>>>       Processor cache line size set to 32 bytes.
>>>       File stride size set to 17 * record size.
>>>                 KB  reclen   write  rewrite    read  reread  random read  random write
>>>            2097152    1024      73       64     477     486          496            61
>>> 
>>> 
>>> drc3.patch
>>> 
>>> 	TEST WITH 8K
>>> 	-------------------------------------------------------------------------------------------------
>>>       Auto Mode
>>>       Using Minimum Record Size 8 KB
>>>       Using Maximum Record Size 8 KB
>>>       Using minimum file size of 2097152 kilobytes.
>>>       Using maximum file size of 2097152 kilobytes.
>>>       O_DIRECT feature enabled
>>>       SYNC Mode.
>>>       OPS Mode. Output is in operations per second.
>>>       Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o
>>>       -O -i 0 -i 1 -i 2
>>>       Time Resolution = 0.000001 seconds.
>>>       Processor cache size set to 1024 Kbytes.
>>>       Processor cache line size set to 32 bytes.
>>>       File stride size set to 17 * record size.
>>>                 KB  reclen   write  rewrite    read  reread  random read  random write
>>>            2097152       8    2108     2397    3001    3013         3010          2389
>>> 
>>> 
>>> 	TEST WITH 1M
>>> 	-------------------------------------------------------------------------------------------------
>>>       Auto Mode
>>>       Using Minimum Record Size 1024 KB
>>>       Using Maximum Record Size 1024 KB
>>>       Using minimum file size of 2097152 kilobytes.
>>>       Using maximum file size of 2097152 kilobytes.
>>>       O_DIRECT feature enabled
>>>       SYNC Mode.
>>>       OPS Mode. Output is in operations per second.
>>>       Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o
>>>       -O -i 0 -i 1 -i 2
>>>       Time Resolution = 0.000001 seconds.
>>>       Processor cache size set to 1024 Kbytes.
>>>       Processor cache line size set to 32 bytes.
>>>       File stride size set to 17 * record size.
>>>                 KB  reclen   write  rewrite    read  reread  random read  random write
>>>            2097152    1024      80       79     521     536          528            75
>>> 
>>> 
>>> Also, with drc3 the CPU usage on the server is noticeably lower. Most
>>> of the time I could see only the geom g_up/g_down threads and a few
>>> nfsd threads; before the patch the nfsd threads were much more prominent.
>>> 
>>> I expect the performance improvement would be even bigger under heavier load.
>>> 
>>> I'll run some more tests with heavier loads this week.
>>> 
>>> Thanks,
>>> Nikolay
>>> 
>>> 
>> 
>> If anyone is interested here's a FlameGraph generated using DTrace and
>> Brendan Gregg's tools from https://github.com/brendangregg/FlameGraph
>> :
>> 
>> https://home.totalterror.net/freebsd/goliath-kernel.svg
>> 
>> It was sampled during an Oracle database restore from the Linux host
>> over the NFS mount. Currently all I/O on the dataset the Linux machine
>> writes to is stuck; a simple "ls" in the directory hangs for maybe
>> 10-15 minutes and then eventually completes.
>> 
>> Looks like some weird locking issue.
>> 
>> [*] http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
>> 
>> P.S.: The machine runs with drc3.patch for the NFS server.
>> P.S.2: The nfsd server is configured for vfs.nfsd.maxthreads=200,
>> maybe that's too much?
>> 
> You could try trimming the size of vfs.nfsd.tcphighwater down. Remember that,
> with this patch, when you increase this tunable, you are trading space
> for CPU overhead.
> 
> If it's still "running", you could do "vmstat -m" and "vmstat -z" to
> see where the memory is allocated. ("nfsstat -e -s" will tell you the
> size of the cache.)
> 
> rick
>> _______________________________________________
>> freebsd-fs at freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe at freebsd.org"


Are you saying that the time spent in _mtx_spin_lock could be caused by this?
To me it looks like there was some heavy contention in ZFS, maybe specific to
the way it's accessed by the NFS server, or perhaps due to the high maxthreads value?


Here's the nfsstat -s -e output; the CacheSize looks wrong, as it's a negative number. Maybe the counter overflowed?

Server:
Retfailed    Faults   Clients
        0         0         0
OpenOwner     Opens LockOwner     Locks    Delegs 
        0         0         0         0         0 
Server Cache Stats:
   Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
        0         0         0  83500632    -24072     16385



Here are the related sysctls:

vfs.nfsd.request_space_used: 0
vfs.nfsd.request_space_used_highest: 13121808
vfs.nfsd.request_space_high: 13107200
vfs.nfsd.request_space_low: 8738133
vfs.nfsd.request_space_throttled: 0
vfs.nfsd.request_space_throttle_count: 0

Are they related to the same request cache?

I have stats showing that at some point nfsd had allocated all 200 threads,
and vfs.nfsd.request_space_used hit its ceiling too.



