NFS client READ performance on -current

Sun Jul 20 14:31:03 UTC 2014

On Thursday 17 July 2014 19:45:56 Julian Elischer wrote:
> On 7/15/14, 10:34 PM, John Baldwin wrote:
> > On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
> >> Yonghyeon Pyun wrote:
> >>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> >>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> >>>>> John Baldwin wrote:
> >>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> >>>>>>> Russell L. Carter wrote:
> >>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
> >>>>>>>>> Could you please post the dmesg stuff for the network
> >>>>>>>>> interface,
> >>>>>>>>> so I can tell what driver is being used? I'll take a look
> >>>>>>>>> at
> >>>>>>>>> it,
> >>>>>>>>> in case it needs to be changed to use m_defrag().
> >>>>>>>> 
> >>>>>>>> em0: <Intel(R) PRO/1000 Network Connection 7.4.2> port
> >>>>>>>> 0xd020-0xd03f
> >>>>>>>> mem
> >>>>>>>> 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
> >>>>>>>> device 0.0
> >>>>>>>> on
> >>>>>>>> pci2
> >>>>>>>> em0: Using an MSI interrupt
> >>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
> >>>>>>>> 001.000007 [2323] netmap_attach             success for em0
> >>>>>>>> tx
> >>>>>>>> 1/1024
> >>>>>>>> rx
> >>>>>>>> 1/1024 queues/slots
> >>>>>>>> 
> >>>>>>>> This is one of those dual nic cards, so there is em1 as
> >>>>>>>> well...
> >>>>>>> 
> >>>>>>> Well, I took a quick look at the driver and it does use
> >>>>>>> m_defrag(),
> >>>>>>> but
> >>>>>>> I think that the "retry:" label it does a goto after doing so
> >>>>>>> might
> >>>>>>> be in
> >>>>>>> the wrong place.
> >>>>>>> 
> >>>>>>> The attached untested patch might fix this.
> >>>>>>> 
> >>>>>>> Is it convenient to build a kernel with this patch applied
> >>>>>>> and then
> >>>>>>> try
> >>>>>>> it with TSO enabled?
> >>>>>>> 
> >>>>>>> rick
> >>>>>>> ps: It does have the transmit segment limit set to 32. I have
> >>>>>>> no
> >>>>>>> idea if
> >>>>>>> 
> >>>>>>>      this is a hardware limitation.
> >>>>>> 
> >>>>>> I think the retry is not in the wrong place, but the overhead
> >>>>>> of all
> >>>>>> those
> >>>>>> pullups is apparently quite severe.
> >>>>> 
> >>>>> The m_defrag() call after the first failure will just barely
> >>>>> squeeze
> >>>>> the just under 64K TSO segment into 32 mbuf clusters. Then I
> >>>>> think any
> >>>>> m_pullup() done during the retry will allocate an mbuf
> >>>>> (at a glance it seems to always do this when the old mbuf is a
> >>>>> cluster)
> >>>>> and prepend that to the list.
> >>>>> --> Now the list is > 32 mbufs again and the
> >>>>> bus_dmammap_load_mbuf_sg()
> >>>>> 
> >>>>>      will fail again on the retry, this time fatally, I think?
> >>>>> 
> >>>>> I can't see any reason to re-do all the stuff using m_pullup()
> >>>>> and Russell
> >>>>> reported that moving the "retry:" fixed his problem, from what I
> >>>>> understood.
> >>>> 
> >>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all be
> >>>> nops in this
> >>>> case.  It seems the NIC would really like to have all those things
> >>>> in a single
> >>>> segment, but it is not required, so I agree that your patch is
> >>>> fine.
> >>> 
> >>> I recall em(4) controllers have various limitation in TSO. Driver
> >>> has to update IP header to make TSO work so driver has to get a
> >>> writable mbufs.  bpf(4) consumers will see IP packet length is 0
> >>> after this change.  I think tcpdump has a compile time option to
> >>> guess correct IP packet length.  The firmware of controller also
> >>> should be able to access complete IP/TCP header in a single buffer.
> >>> I don't remember more details in TSO limitation but I guess you may
> >>> be able to get more details TSO limitation from publicly available
> >>> Intel data sheet.
> >> 
> >> I think that the patch should handle this ok. All of the m_pullup()
> >> stuff gets done the first time. Then, if the result is more than 32
> >> mbufs in the list, m_defrag() is called to copy the chain. This should
> >> result in all the header stuff in the first mbuf cluster and the map
> >> call is done again with this list of clusters. (Without the patch,
> >> m_pullup() would allocate another prepended mbuf and make the chain
> >> more than 32mbufs again.)
> > 
> > Hmm, I am surprised by the m_pullup() behavior that it doesn't just
> > notice that the first mbuf with a cluster has the desired data already
> > and returns without doing anything.  That is, I'm surprised the first
> > 
> > statement in m_pullup() isn't just:
> > 	if (n->m_len >= len)
> > 	
> > 		return (n);
> 
> I seem to remember that the standard behaviour is for the caller to do
> exactly that.

Huh, the manpage doesn't really state that, and it does check in one case.
However, I think that means that the code in em(4) is busted and should be
checking m_len before all the calls to m_pullup().  I think this will fix
the issue the same as Rick's change but it might also avoid unnecessary
pullups in some cases when defrag isn't needed in the first place.

Index: if_em.c
===================================================================

--- if_em.c	(revision 268570)
+++ if_em.c	(working copy)
@@ -1857,32 +1857,41 @@ retry:
 		 * for IPv6 yet.
 		 */
 		ip_off = sizeof(struct ether_header);
-		m_head = m_pullup(m_head, ip_off);
-		if (m_head == NULL) {
-			*m_headp = NULL;
-			return (ENOBUFS);
+		if (m_head->m_len < ip_off) {
+			m_head = m_pullup(m_head, ip_off);
+			if (m_head == NULL) {
+				*m_headp = NULL;
+				return (ENOBUFS);
+			}
 		}
 		eh = mtod(m_head, struct ether_header *);
 		if (eh->ether_type == htons(ETHERTYPE_VLAN)) {
 			ip_off = sizeof(struct ether_vlan_header);
-			m_head = m_pullup(m_head, ip_off);
+			if (m_head->m_len < ip_off) {
+				m_head = m_pullup(m_head, ip_off);
+				if (m_head == NULL) {
+					*m_headp = NULL;
+					return (ENOBUFS);
+				}
+			}
+		}
+		if (m_head->m_len < ip_off + sizeof(struct ip)) {
+			m_head = m_pullup(m_head, ip_off + sizeof(struct ip));
 			if (m_head == NULL) {
 				*m_headp = NULL;
 				return (ENOBUFS);
 			}
 		}
-		m_head = m_pullup(m_head, ip_off + sizeof(struct ip));
-		if (m_head == NULL) {
-			*m_headp = NULL;
-			return (ENOBUFS);
-		}
 		ip = (struct ip *)(mtod(m_head, char *) + ip_off);
 		poff = ip_off + (ip->ip_hl << 2);
 		if (do_tso) {
-			m_head = m_pullup(m_head, poff + sizeof(struct tcphdr));
-			if (m_head == NULL) {
-				*m_headp = NULL;
-				return (ENOBUFS);
+			if (m_head->m_len < poff + sizeof(struct tcphdr)) {
+				m_head = m_pullup(m_head, poff +
+				    sizeof(struct tcphdr));
+				if (m_head == NULL) {
+					*m_headp = NULL;
+					return (ENOBUFS);
+				}
 			}
 			tp = (struct tcphdr *)(mtod(m_head, char *) + poff);
 			/*
@@ -1889,10 +1898,13 @@ retry:
 			 * TSO workaround:
 			 *   pull 4 more bytes of data into it.
 			 */
-			m_head = m_pullup(m_head, poff + (tp->th_off << 2) + 4);
-			if (m_head == NULL) {
-				*m_headp = NULL;
-				return (ENOBUFS);
+			if (m_head->m_len < poff + (tp->th_off << 2) + 4) {
+				m_head = m_pullup(m_head, poff +
+				    (tp->th_off << 2) + 4);
+				if (m_head == NULL) {
+					*m_headp = NULL;
+					return (ENOBUFS);
+				}
 			}
 			ip = (struct ip *)(mtod(m_head, char *) + ip_off);
 			ip->ip_len = 0;
@@ -1907,24 +1919,33 @@ retry:
 			tp->th_sum = in_pseudo(ip->ip_src.s_addr,
 			    ip->ip_dst.s_addr, htons(IPPROTO_TCP));
 		} else if (m_head->m_pkthdr.csum_flags & CSUM_TCP) {
-			m_head = m_pullup(m_head, poff + sizeof(struct tcphdr));
-			if (m_head == NULL) {
-				*m_headp = NULL;
-				return (ENOBUFS);
+			if (m_head->m_len < poff + sizeof(struct tcphdr)) {
+				m_head = m_pullup(m_head, poff +
+				    sizeof(struct tcphdr));
+				if (m_head == NULL) {
+					*m_headp = NULL;
+					return (ENOBUFS);
+				}
 			}
 			tp = (struct tcphdr *)(mtod(m_head, char *) + poff);
-			m_head = m_pullup(m_head, poff + (tp->th_off << 2));
-			if (m_head == NULL) {
-				*m_headp = NULL;
-				return (ENOBUFS);
+			if (m_head->m_len < poff + (tp->th_off << 2)) {
+				m_head = m_pullup(m_head, poff +
+				    (tp->th_off << 2));
+				if (m_head == NULL) {
+					*m_headp = NULL;
+					return (ENOBUFS);
+				}
 			}
 			ip = (struct ip *)(mtod(m_head, char *) + ip_off);
 			tp = (struct tcphdr *)(mtod(m_head, char *) + poff);
 		} else if (m_head->m_pkthdr.csum_flags & CSUM_UDP) {
-			m_head = m_pullup(m_head, poff + sizeof(struct udphdr));
-			if (m_head == NULL) {
-				*m_headp = NULL;
-				return (ENOBUFS);
+			if (m_head->m_len < poff + sizeof(struct udphdr)) {
+				m_head = m_pullup(m_head, poff +
+				    sizeof(struct udphdr));
+				if (m_head == NULL) {
+					*m_headp = NULL;
+					return (ENOBUFS);
+				}
 			}
 			ip = (struct ip *)(mtod(m_head, char *) + ip_off);
 		}




-- 
John Baldwin