bizarre em + TSO + MSS issue in RELENG_7

Sun Nov 18 15:26:36 PST 2007

On Sun, 18 Nov 2007, Jack Vogel wrote:

> On Nov 18, 2007 11:33 AM, Jack Vogel <jfvogel at gmail.com> wrote:
>>
>> On Nov 18, 2007 12:58 AM, Mike Andrews <mandrews at bit0.com> wrote:
>>>
>>> On Sat, 17 Nov 2007, Mike Andrews wrote:
>>>
>>>> Kip Macy wrote:
>>>>> On Nov 17, 2007 5:28 PM, Mike Andrews <mandrews at bit0.com> wrote:
>>>>>> Kip Macy wrote:
>>>>>>> On Nov 17, 2007 3:23 PM, Mike Andrews <mandrews at bit0.com> wrote:
>>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>>
>>>>>>>>> On Nov 17, 2007 2:33 PM, Mike Andrews <mandrews at bit0.com> wrote:
>>>>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>>>>>
>>>>>>>>>>> On Nov 17, 2007 10:33 AM, Denis Shaposhnikov <dsh at vlink.ru> wrote:
>>>>>>>>>>>> On Sat, 17 Nov 2007 00:42:54 -0500 (EST)
>>>>>>>>>>>> Mike Andrews <mandrews at bit0.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone run into problems with MSS not being respected when
>>>>>>>>>>>>> using TSO, specifically on em cards?
>>>>>>>>>>>> Yes, I wrote about this problem at the beginning of 2007, see
>>>>>>>>>>>>
>>>>>>>>>>>>     http://tinyurl.com/3e5ak5
>>>>>>>>>>>>
>>>>>>>>>>> if_em.c:3502
>>>>>>>>>>>        /*
>>>>>>>>>>>         * Payload size per packet w/o any headers.
>>>>>>>>>>>         * Length of all headers up to payload.
>>>>>>>>>>>         */
>>>>>>>>>>>        TXD->tcp_seg_setup.fields.mss =
>>>>>>>>>>>            htole16(mp->m_pkthdr.tso_segsz);
>>>>>>>>>>>        TXD->tcp_seg_setup.fields.hdr_len = hdr_len;
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Please print out the value of tso_segsz here. It appears to be being
>>>>>>>>>>> set correctly. The only thing I can think of is that t_maxopd is not
>>>>>>>>>>> correct. As tso_segsz is correct here:
>>>>>>>>>> It repeatedly prints 1368 during a 1 meg file transfer over a
>>>>>>>>>> connection with a 1380 MSS.  Any other printf's I can add?  I'm
>>>>>>>>>> working on a web page with tcpdump / firewall log output
>>>>>>>>>> illustrating the issue...
>>>>>>>>> Mike -
>>>>>>>>> Denis' tcpdump output doesn't show oversized segments, something else
>>>>>>>>> appears to be happening there. Can you post your tcpdump output
>>>>>>>>> somewhere?
>>>>>>>> URL sent off-list.
>>>>>>>        if (tso) {
>>>>>>>                m->m_pkthdr.csum_flags = CSUM_TSO;
>>>>>>>                m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
>>>>>>>        }
>>>>>>>
>>>>>>>
>>>>>>> Please print the value of maxopd and optlen under "if (tso)" in
>>>>>>> tcp_output. I think the calculated optlen may be too small.
>>>>>>
>>>>>> maxopd=1380 - optlen=12 = tso_segsz=1368
>>>>>>
>>>>>> Weird though, after this reboot, I had to re-copy a 4 meg file 5 times
>>>>>> to start getting the firewall to log any drops.  Transfer rate was
>>>>>> around 240KB/sec before the firewall started to drop, then it went down
>>>>>> to about 64KB/sec during the 5th copy, and stayed there for subsequent
>>>>>> copies.  The actual packet size the firewall said it was dropping was
>>>>>> varying all over the place still, yet the maxopd/optlen/tso_segsz values
>>>>>> stayed constant.  But it's interesting that it didn't start dropping
>>>>>> immediately after the reboot -- though the transfer rate was still
>>>>>> sub-optimal.
>>>>>
>>>>> Ok, next theory :D. You shouldn't be seeing "bad len" packets from
>>>>> tcpdump. I'm wondering if that means you're sending down more than
>>>>> 64k. Can you please print out the value of mp->m_pkthdr.len around the
>>>>> same place that you printed out tso_segsz? 64k is the generally
>>>>> accepted limit for TSO, I'm wondering if the card firmware does
>>>>> something weird if you give it more.
>>>>
>>>> OK.  In that last message, where I said it took 5 times to start reproducing
>>>> the problem... this time it took until I actually toggled TSO back off and
>>>> back on again, and then it started acting up again.  I don't know what the
>>>> actual trigger is... it's very weird.
>>>>
>>>> Initially, w/ TSO on and it wasn't dropping yet (but was still transferring
>>>> slow)...
>>>>
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
>>>> (etc, always 8306)
>>>>
>>>> After toggling off/on which caused the drops to start (and the speed to drop
>>>> even further):
>>>>
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=7507
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=3053
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1677
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=3037
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2264
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1656
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1902
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1888
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1640
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1871
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2461
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1849
>>>> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2092
>>>>
>>>> and so on, with more seemingly random lengths... but none of them ever over
>>>> 8306, much less 64K.
>>>
>>>
>>> Got a few more data points here.
>>>
>>> I can reproduce this on an i386 kernel, so it isn't amd64 specific.
>>>
>>> I can reproduce this on an 82541EI nic, so it isn't 82573 specific.
>>>
>>> I can't reproduce this on a Marvell Yukon II (msk) nic; it works fine
>>> whether TSO is on or off.
>>>
>>> I can't reproduce this on a bge nic because it doesn't support TSO :)
>>> That's the only other gigabit nic I've got easy access to.
>>>
>>> I can reproduce this with just a Cisco 877W IOS-based router and no Cisco
>>> PIX / ASA firewalls in the way, with the servers on the LAN interface with
>>> "ip tcp adjust-mss 1340" on it, and the downloading client on the Cisco's
>>> 802.11G interface.  This time, the client is a Macbook Pro running
>>> Leopard, and I'm running "tcpdump -i en1 -s 1500 -n -v length \> 1394" on
>>> the Macbook (not the server this time) to find oversize packets, which is
>>> actually handier because I can see how trashed they really get :)
>>>
>>> I can't reproduce this between two machines on the same subnet (though I
>>> can reproduce throughput problems alone).  I haven't tried lowering the
>>> system MSS on one end yet (is there a sysctl to lower the MSS for outbound
>>> connections without lowering the MTU as well?).  If I could do this it
>>> would greatly simplify testing for everyone as they wouldn't have to stick
>>> an MSS-clamping router in the middle.  It doesn't have to be Cisco.
>>>
>>> With this setup, copying to the Mac through the 877W from:
>>>
>>> msk-based server, TSO disabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> msk-based server, TSO enabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> em-based server, TSO disabled: tcpdump reports no problems, file
>>> transfers are fast
>>>
>>> em-based server, TSO enabled: tcpdump reports numerous oversize packets of
>>> varying sizes just as before, AND numerous packets with bad TCP checksums.
>>> The checksum problems aren't limited to only the large packets though.
>>> (That's probably what's causing the throughput problems.)  Toggling rxcsum
>>> and txcsum flags on the server made no difference.  What I haven't tried
>>> yet is hexdumping the packets to see what exactly is getting trashed.
>>>
>>> The problem still comes and goes; sometimes it'll work for a few minutes
>>> after boot, sometimes not; it might be dependent on what other traffic's
>>> going through the box.
>>
>> Hmmm, OK so the data is pointing to something in the em TSO or encap
>> code. I will look into this tomorrow. So the necessary elements are systems
>> on two subnets and em doing the transmitting with TSO?

And a sub-1460 MSS on the client end OR the router doing MSS clamping, 
yes.  I can't yet reproduce it with 1500 byte MTU's or between two 
machines on the same subnet.  I definitely haven't done any tests with 
jumbos...
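
For anyone trying to pick a tcpdump threshold for their own setup, here is
a rough sketch of the arithmetic.  It assumes untagged Ethernet framing,
IPv4 with no IP options, and that TCP options come out of the MSS rather
than being added on top of it (which matches the maxopd/optlen numbers
quoted above).  The macro and function names are made up for illustration;
nothing here is taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed header sizes: untagged Ethernet II, IPv4 without options,
 * base TCP header.  Illustrative names, not kernel macros. */
#define SK_ETHER_HDR 14u
#define SK_IPV4_HDR  20u
#define SK_TCP_HDR   20u

/*
 * Largest frame length, link header included, that should ever appear
 * on the wire for a given MSS.  TCP options reduce the payload instead
 * of growing the frame, so the bound does not depend on them.
 */
uint32_t
max_frame_len(uint32_t mss)
{
    assert(mss > 0);
    return SK_ETHER_HDR + SK_IPV4_HDR + SK_TCP_HDR + mss;
}

/* Largest MSS that fits in a given IP MTU under the same assumptions. */
uint32_t
mss_for_mtu(uint32_t mtu)
{
    assert(mtu > SK_IPV4_HDR + SK_TCP_HDR);
    return mtu - SK_IPV4_HDR - SK_TCP_HDR;
}
```

With the router clamping to 1340 this gives 1394, which is the threshold
used in the tcpdump filter quoted above; with a 1380 MSS it gives 1434,
and a plain 1500 byte MTU gives the usual 1460 MSS.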

> BTW, not to dodge the problem, but this is a case where I'd say it's absurd
> to be using TSO. Is the link at 1G or 100Mb?

It's reproducible at either speed, but I personally am perfectly happy 
leaving TSO disabled on my production boxes -- I've got my workaround, it 
performs, I'm cool.  At this point I'm pursuing a fix more for others' 
benefit because some other people are having at least throughput issues -- 
and for my own weirdo curiosity.

If a fix doesn't make 7.0-RELEASE (and I almost hate to say this), might it 
be worth disabling TSO by default in RELENG_7_0 but leaving it on in RELENG_7?
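
P.S. In case it helps whoever compares captures against the debug output
quoted earlier in this thread, below is a small sketch of what I would
expect a correct TSO split to look like for a given m_pkthdr.len, hdr_len
and tso_segsz.  This is only the arithmetic; whether the em hardware is
supposed to split exactly this way is an assumption on my part, and the
struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Expected result of segmenting one TSO request (illustrative type). */
struct tso_plan {
    uint32_t nsegs;      /* number of frames the NIC should emit */
    uint32_t full_frame; /* upper bound: on-wire length of a full frame */
    uint32_t last_frame; /* on-wire length of the final frame */
};

/*
 * pkt_len: mp->m_pkthdr.len as printed in the debug output
 * hdr_len: length of all headers up to the payload
 * mss:     tso_segsz, i.e. payload bytes per frame
 */
struct tso_plan
tso_expected(uint32_t pkt_len, uint32_t hdr_len, uint32_t mss)
{
    struct tso_plan p;
    uint32_t payload, rem;

    assert(mss > 0 && pkt_len > hdr_len);
    payload = pkt_len - hdr_len;       /* bytes of TCP data to split */
    rem = payload % mss;               /* data left for a short last frame */
    p.nsegs = payload / mss + (rem != 0);
    p.full_frame = hdr_len + mss;      /* headers are replicated per frame */
    p.last_frame = hdr_len + (rem != 0 ? rem : mss);
    return p;
}
```

For the steady 8306 byte case (hdr_len 66, tso_segsz 1368) that works out
to 7 frames: six of 1434 bytes and a final one of 98 bytes.  For the 1677
byte case it is 2 frames, 1434 and 309 bytes.  Anything on the wire larger
than hdr_len + tso_segsz would not fit this model.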