Bad ZFS - NFS interaction? [ was: NFS server bottlenecks ]

Rick Macklem rmacklem at uoguelph.ca
Mon Oct 15 20:34:59 UTC 2012


Nikolay Denev wrote:
> On Oct 15, 2012, at 5:06 PM, Rick Macklem <rmacklem at uoguelph.ca>
> wrote:
> 
> > Nikolay Denev wrote:
> >> On Oct 13, 2012, at 6:22 PM, Nikolay Denev <ndenev at gmail.com>
> >> wrote:
> >>
> >>>
> >>> On Oct 13, 2012, at 5:05 AM, Rick Macklem <rmacklem at uoguelph.ca>
> >>> wrote:
> >>>
> >>>> I wrote:
> >>>>> Oops, I didn't get the "readahead" option description
> >>>>> quite right in the last post. The default read ahead
> >>>>> is 1, which does result in "rsize * 2", since there is
> >>>>> the read + 1 readahead.
> >>>>>
> >>>>> "rsize * 16" would actually be for the option "readahead=15"
> >>>>> and for "readahead=16" the calculation would be "rsize * 17".
> >>>>>
> >>>>> However, the example was otherwise ok, I think? rick
> >>>>
> >>>> I've attached the patch drc3.patch (it assumes drc2.patch has
> >>>> already been applied) that replaces the single mutex with one for
> >>>> each hash list for tcp. It also increases NFSRVCACHE_HASHSIZE to
> >>>> 200.
> >>>>
> >>>> These patches are also at:
> >>>> http://people.freebsd.org/~rmacklem/drc2.patch
> >>>> http://people.freebsd.org/~rmacklem/drc3.patch
> >>>> in case the attachments don't get through.
> >>>>
> >>>> rick
> >>>> ps: I haven't tested drc3.patch a lot, but I think it's ok?
> >>>
> >>> drc3.patch applied and built cleanly and shows a nice improvement!
> >>>
> >>> I've done a quick benchmark using iozone over the NFS mount from the
> >>> Linux host.
> >>>
> >>> drc2.patch (but with NFSRVCACHE_HASHSIZE=500)
> >>>
> >>> 	TEST WITH 8K
> >>> 	-------------------------------------------------------------------------------------------------
> >>>       Auto Mode
> >>>       Using Minimum Record Size 8 KB
> >>>       Using Maximum Record Size 8 KB
> >>>       Using minimum file size of 2097152 kilobytes.
> >>>       Using maximum file size of 2097152 kilobytes.
> >>>       O_DIRECT feature enabled
> >>>       SYNC Mode.
> >>>       OPS Mode. Output is in operations per second.
> >>>       Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>>       Time Resolution = 0.000001 seconds.
> >>>       Processor cache size set to 1024 Kbytes.
> >>>       Processor cache line size set to 32 bytes.
> >>>       File stride size set to 17 * record size.
> >>>             KB  reclen   write  rewrite    read  reread  random read  random write
> >>>        2097152       8    1919     1914    2356    2321         2335          1706
> >>>
> >>> 	TEST WITH 1M
> >>> 	-------------------------------------------------------------------------------------------------
> >>>       Auto Mode
> >>>       Using Minimum Record Size 1024 KB
> >>>       Using Maximum Record Size 1024 KB
> >>>       Using minimum file size of 2097152 kilobytes.
> >>>       Using maximum file size of 2097152 kilobytes.
> >>>       O_DIRECT feature enabled
> >>>       SYNC Mode.
> >>>       OPS Mode. Output is in operations per second.
> >>>       Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>>       Time Resolution = 0.000001 seconds.
> >>>       Processor cache size set to 1024 Kbytes.
> >>>       Processor cache line size set to 32 bytes.
> >>>       File stride size set to 17 * record size.
> >>>             KB  reclen   write  rewrite    read  reread  random read  random write
> >>>        2097152    1024      73       64     477     486          496            61
> >>>
> >>>
> >>> drc3.patch
> >>>
> >>> 	TEST WITH 8K
> >>> 	-------------------------------------------------------------------------------------------------
> >>>       Auto Mode
> >>>       Using Minimum Record Size 8 KB
> >>>       Using Maximum Record Size 8 KB
> >>>       Using minimum file size of 2097152 kilobytes.
> >>>       Using maximum file size of 2097152 kilobytes.
> >>>       O_DIRECT feature enabled
> >>>       SYNC Mode.
> >>>       OPS Mode. Output is in operations per second.
> >>>       Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>>       Time Resolution = 0.000001 seconds.
> >>>       Processor cache size set to 1024 Kbytes.
> >>>       Processor cache line size set to 32 bytes.
> >>>       File stride size set to 17 * record size.
> >>>             KB  reclen   write  rewrite    read  reread  random read  random write
> >>>        2097152       8    2108     2397    3001    3013         3010          2389
> >>>
> >>>
> >>> 	TEST WITH 1M
> >>> 	-------------------------------------------------------------------------------------------------
> >>>       Auto Mode
> >>>       Using Minimum Record Size 1024 KB
> >>>       Using Maximum Record Size 1024 KB
> >>>       Using minimum file size of 2097152 kilobytes.
> >>>       Using maximum file size of 2097152 kilobytes.
> >>>       O_DIRECT feature enabled
> >>>       SYNC Mode.
> >>>       OPS Mode. Output is in operations per second.
> >>>       Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>>       Time Resolution = 0.000001 seconds.
> >>>       Processor cache size set to 1024 Kbytes.
> >>>       Processor cache line size set to 32 bytes.
> >>>       File stride size set to 17 * record size.
> >>>             KB  reclen   write  rewrite    read  reread  random read  random write
> >>>        2097152    1024      80       79     521     536          528            75
> >>>
> >>>
> >>> Also with drc3 the CPU usage on the server is noticeably lower. Most
> >>> of the time I could see only the geom{g_up}/{g_down} threads and a
> >>> few nfsd threads; before that, the nfsd threads were much more
> >>> prominent.
> >>>
> >>> I guess the performance improvement could be even bigger under
> >>> heavier load.
> >>>
> >>> I'll run some more tests with heavier loads this week.
> >>>
> >>> Thanks,
> >>> Nikolay
> >>>
> >>>
> >>
> >> If anyone is interested, here's a FlameGraph [*] generated using
> >> DTrace and Brendan Gregg's tools from
> >> https://github.com/brendangregg/FlameGraph :
> >>
> >> https://home.totalterror.net/freebsd/goliath-kernel.svg
> >>
> >> It was sampled during an Oracle database restore from the Linux host
> >> over the NFS mount. Currently all I/O on the dataset that the Linux
> >> machine writes to is stuck; a simple ls in the directory hangs for
> >> maybe 10-15 minutes and then eventually completes.
> >>
> >> Looks like some weird locking issue.
> >>
> >> [*] http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
> >>
> >> P.S.: The machine runs with drc3.patch for the NFS server.
> >> P.S.2: The nfsd server is configured with vfs.nfsd.maxthreads=200;
> >> maybe that's too much?
> >>
> > You could try trimming vfs.nfsd.tcphighwater down. Remember that,
> > with this patch, when you increase this tunable you are trading
> > space for CPU overhead.
> >
> > If it's still "running", you could do "vmstat -m" and "vmstat -z" to
> > see where the memory is allocated. ("nfsstat -e -s" will tell you the
> > size of the cache.)
> >
> > rick
> >> _______________________________________________
> >> freebsd-fs at freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >> To unsubscribe, send any mail to
> >> "freebsd-fs-unsubscribe at freebsd.org"
> 
> 
> Are you saying that the time spent in _mtx_spin_lock can be because of
> this?
No. I was thinking that memory used by the DRC isn't available to
ZFS and that ZFS might be getting constrained because of this. As I've
said before, I'm not a ZFS guy, but you don't have to look very hard to
find problems related to ZFS running low on what I think they call the
ARC cache. (I believe it is usually a lack of kernel virtual address
space, but I'm not the guy to know if that's correct or how to tell.)
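
If you want to check whether the ARC is actually being squeezed while
the problem is happening, the ARC size is exported as a sysctl. A
minimal sketch of reading it programmatically ("sysctl
kstat.zfs.misc.arcstats.size" from the shell does the same thing):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t arc_size;
	size_t len = sizeof(arc_size);

	/* Current ARC size in bytes; compare against vfs.zfs.arc_max. */
	if (sysctlbyname("kstat.zfs.misc.arcstats.size",
	    &arc_size, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("ARC size: %ju bytes\n", (uintmax_t)arc_size);
	return (0);
}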

> To me it looks like there was some heavy contention in ZFS, maybe
> specific to the way it's accessed by the NFS server? Probably due to
> the high maxthreads value?
> 
Using fewer nfsd threads would set a lower upper limit on load for ZFS,
since that sets the upper limit on the # of concurrent VOP_xxx() calls.
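
Conceptually, each nfsd thread services one RPC at a time, so the
thread count caps ZFS concurrency. A rough sketch of the idea (the
function names here are illustrative, not the actual krpc/nfsd entry
points):

/*
 * Illustrative sketch only: one RPC in flight per nfsd thread, so
 * vfs.nfsd.maxthreads bounds the number of VOP_xxx() calls that can
 * be executing in ZFS at any instant.
 */
for (;;) {
	rqst = get_next_rpc_request();	/* sleeps until a request arrives */
	nfs_service_rpc(rqst);		/* ends in VOP_READ()/VOP_WRITE()/... */
	send_rpc_reply(rqst);
}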

> 
> Here's the "nfsstat -s -e" output; it seems wrong, as there's a
> negative number. Maybe it overflowed?
> 
There was a bug fixed a while ago, where "nfsstat -e -z" would zero
the count out, and then it would go negative when it decreased. It
will also wrap around when it hits 2 billion, since it's a signed
32-bit counter. (jwd@ suggested changing the printf() to at least show
it unsigned, but I don't think we ever got around to a patch.)
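
The display side of that is easy to demonstrate; a minimal standalone
sketch (this is not the nfsstat source, just the wrap/printf
behaviour):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int32_t hits = INT32_MAX;	/* 2147483647, i.e. ~2 billion */

	/*
	 * Strictly undefined in C, but on real two's-complement
	 * machines the increment wraps negative.
	 */
	hits++;				/* becomes -2147483648 */
	printf("%d\n", hits);		/* prints -2147483648 */
	printf("%u\n", (uint32_t)hits);	/* prints 2147483648 */
	return (0);
}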

> Server:
>  Retfailed    Faults   Clients
>          0         0         0
>  OpenOwner     Opens LockOwner     Locks    Delegs
>          0         0         0         0         0
> Server Cache Stats:
>     Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
>          0         0         0  83500632    -24072     16385
> 
> 
> 
> Also here are the following sysctls :
> 
> vfs.nfsd.request_space_used: 0
> vfs.nfsd.request_space_used_highest: 13121808
> vfs.nfsd.request_space_high: 13107200
> vfs.nfsd.request_space_low: 8738133
> vfs.nfsd.request_space_throttled: 0
> vfs.nfsd.request_space_throttle_count: 0
> 
> Are they related to the same request cache?
> 
Nope. They are in the krpc (sys/rpc/svc.c) and control/limit
the space used by requests (mbuf clusters). Again, a bigger
DRC will mean less mbuf/mbuf cluster space available for the
rest of the system.
Reduce vfs.nfsd.tcphighwater and you reduce the mbuf/mbuf cluster
usage for the DRC. (It caches the reply by m_copy()ing the mbuf list.)
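
The caching step is conceptually a one-liner. A sketch of the idea
(m_copym() is the real mbuf copy routine; the cache-entry field name
here is illustrative):

/*
 * Sketch: duplicate the reply mbuf chain and hang the copy off the
 * DRC entry, so every cached reply pins mbufs/clusters until the
 * entry is trimmed.
 */
rp->rc_reply = m_copym(reply, 0, M_COPYALL, M_WAITOK);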

> I have stats that show that at some point nfsd has allocated all 200
> threads, and vfs.nfsd.request_space_used hits the ceiling too.
When all the threads are busy, new requests will be queued in the
receive side of the krpc code, which means more request_space_used.
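
That accounting is what the vfs.nfsd.request_space_* sysctls above
expose. Roughly (a simplified sketch; the sp_space_* names mirror the
svc_pool fields behind those sysctls, and the logic is abbreviated):

/* When a request is received and queued: */
pool->sp_space_used += rqst_size;
if (pool->sp_space_used > pool->sp_space_high) {
	pool->sp_space_throttled = TRUE;	/* stop socket reads */
	pool->sp_space_throttle_count++;
}

/* When an nfsd thread picks a request up and frees its mbufs: */
pool->sp_space_used -= rqst_size;
if (pool->sp_space_throttled &&
    pool->sp_space_used < pool->sp_space_low)
	pool->sp_space_throttled = FALSE;	/* resume reading */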

As I mentioned, use "vmstat -z" to see what the mbuf/mbuf cluster
usage is, among other things.

rick

