RFC: How to fix the NFS/iSCSI vs TSO problem

Rick Macklem rmacklem at uoguelph.ca
Tue Apr 1 02:17:22 UTC 2014


Marcelo Araujo wrote:
> 2014-04-01 9:43 GMT+08:00 Rick Macklem <rmacklem at uoguelph.ca>:
> 
> > Marcelo Araujo wrote:
> > >
> > > Hello Rick,
> > >
> > >
> > > We have made couple more benchmarks here with additional options,
> > > such like '64 threads' and readahead=8.
> > >
> > I can't remember, but if you haven't already done so, another thing
> > to
> > try are these sysctls on the server:
> > sysctl vfs.nfsd.tcphighwater=100000
> > sysctl vfs.nfsd.tcpcachetimeo=300
> >
> > These should reduce the server's CPU overhead (how important these
> > setting are depends on how current your kernel is).
> >
> 
> I haven't done it, I don't have these sysctl on my system.
> 
Ok, yes, I now notice you are using a 9.1 system.

> 
> >
> > >
> > > Now, we add nfsstat and netstat -m into the table.
> > > Here attached is the full benchmark, and I can say, this patch
> > > really
> > > improved the read speed.
> > >
> > I noticed a significant reduction in CPU usage on the server (about
> > 20%).
> > An interesting question would be "Is this CPU reduction a result of
> > avoiding the m_defrag() calls in the ix driver?".
> >
> 
> I do believe it is because avoid m_defrag(), but I didn't try to dig
> into
> it to check if is really m_defrag().
> 
For FreeBSD9.1, you could make this small change to sys/netinet/tcp_output.c
so that it will avoid the m_defrag() calls in the "ix" driver.
(line #750, 751)
Replace:
 if (len > IP_MAXPACKET - hdrlen) {
	len = IP_MAXPACKET - hdrlen;
with
 if (len > 29 * MCLBYTES - hdrlen) {
	len = 29 * MCLBYTES - hdrlen;

I think this will keep the TSO segments at 32 mbufs in the chain.
It would be interesting to see if a system with this patch still
demonstrated the 20% reduction in CPU when the 4kmcl.patch is
applied to it.

> 
> > Unfortunately, the only way I can think of answering this is doing
> > the
> > benchmarks on hardware without the 32 mbuf chain limitation, but I
> > doubt that you can do that?
> >
> 
> No, I don't have any hardware without the 32mbuf limitation.
> 
Your previous post mentioned this network interface:
NIC - 10G Intel X540 that is based on 82599 chipset.
This one has the 32 mbuf limitation.
(See above w.r.t. testing without m_defrag() calls.)

Again, have fun with it, rick

> 
> >
> > Put another way, it would be interesting to compare "with vs
> > without"
> > the patch on machines where the network interface can handle 35
> > mbufs
> > in the transmit chain, so there aren't m_defrag() calls being done
> > for
> > the non-patched case.
> >
> > Anyhow, have fun with it, rick
> >
> 
> Maybe Christopher can do this benchmark as well in his environment.
> 
> 
> >
> > >
> > > I understand your concern about add more one sysctl, however
> > > maybe we
> > > can do something like ZFS does, if it detect the system is AMD
> > > and
> > > have more than X of RAM it enables some options by default, or a
> > > kind of warning can be displayed show the new sysctl option.
> > >
> > >
> > > Of, course other people opinion will be very welcome.
> > >
> > >
> > > Best Regards,
> > >
> > >
> > >
> > > 2014-03-29 6:44 GMT+08:00 Rick Macklem < rmacklem at uoguelph.ca > :
> > >
> > >
> > >
> > >
> > > Marcelo Araujo wrote:
> > > > 2014-03-28 5:37 GMT+08:00 Rick Macklem < rmacklem at uoguelph.ca
> > > > >:
> > > >
> > > > > Christopher Forgeron wrote:
> > > > > > I'm quite sure the problem is on 9.2-RELEASE, not
> > > > > > 9.1-RELEASE
> > > > > > or
> > > > > > earlier,
> > > > > > as a 9.2-STABLE from last year I have doesn't exhibit the
> > > > > > problem.
> > > > > > New
> > > > > > code in if.c at line 660 looks to be what is starting this,
> > > > > > which
> > > > > > makes me
> > > > > > wonder how TSO was being handled before 9.2.
> > > > > >
> > > > > > I also like Rick's NFS patch for cluster size. I notice an
> > > > > > improvement, but
> > > > > > don't have solid numbers yet. I'm still stress testing it
> > > > > > as we
> > > > > > speak.
> > > > > >
> > > > > Unfortunately, this causes problems for small i386 systems,
> > > > > so I
> > > > > am reluctant to commit it to head. Maybe a variant that is
> > > > > only
> > > > > enabled for amd64 systems with lots of memory would be ok?
> > > > >
> > > > >
> > > > Rick,
> > > >
> > > > Maybe you can create a SYSCTL to enable/disable it by the end
> > > > user
> > > > will be
> > > > more reasonable. Also, of course, it is so far safe if only
> > > > 64Bits
> > > > CPU can
> > > > enable this SYSCTL. Any other option seems not OK, will be hard
> > > > to
> > > > judge
> > > > what is lots of memory and what is not, it will depends what is
> > > > running
> > > > onto the system.
> > > >
> > > I guess adding it so it can be optionally enabled via a sysctl
> > > isn't
> > > a bad idea. I think the largest risk here is "how do you tell
> > > people
> > > what the risk of enabling this is"?
> > >
> > > There are already a bunch of sysctls related to NFS that few
> > > people
> > > know how to use. (I recall that Alexander has argued that folk
> > > don't
> > > want
> > > worry about these tunables and I tend to agree.)
> > >
> > > If I do a variant of the patch that uses m_getjcl(..M_WAITOK..),
> > > then
> > > at least the "breakage" is thread(s) sleeping on "btallo", which
> > > is
> > > fairly easy to check for, althouggh rather obscure.
> > > (Btw, I've never reproduced this for a patch that changes the
> > > code to
> > > always use MJUMPAGESIZE mbuf clusters.
> > > I can only reproduce it intermittently when the patch mixes
> > > allocation of
> > > MCLBYTES clusters and MJUMPAGESIZE clusters.)
> > >
> > > I've been poking at it to try and figure out how to get
> > > m_getjcl(..M_NOWAIT..)
> > > to return NULL instead of looping when it runs out of boundary
> > > tags
> > > (to
> > > see if that can result in a stable implementation of the patch),
> > > but
> > > haven't had much luck yet.
> > >
> > > Bottom line:
> > > I just don't like committing a patch that can break the system in
> > > such an
> > > obscure way, even if it is enabled via a sysctl.
> > >
> > > Others have an opinion on this?
> > >
> > > Thanks, rick
> > >
> > >
> > > > The SYSCTL will be great, and in case you don't have time to do
> > > > it,
> > > > I
> > > > can
> > > > give you a hand.
> > > >
> > > > I'm gonna do more benchmarks today and will send another
> > > > report,
> > > > but
> > > > in our
> > > > product here, I'm inclined to use this patch, because 10~20%
> > > > speed
> > > > up
> > > > in
> > > > read for me is a lot. :-)
> > > >
> > > > Thank you so much and best regards,
> > > > --
> > > > Marcelo Araujo
> > > > araujo at FreeBSD.org
> > >
> > >
> > > > _______________________________________________
> > > > freebsd-net at freebsd.org mailing list
> > > > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > > > To unsubscribe, send any mail to
> > > > " freebsd-net-unsubscribe at freebsd.org "
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > Marcelo Araujo
> > > araujo at FreeBSD.org
> >
> 
> 
> 
> --
> Marcelo Araujo
> araujo at FreeBSD.org
> 


More information about the freebsd-fs mailing list