bizarre em + TSO + MSS issue in RELENG_7

Sun Nov 18 00:59:11 PST 2007

On Sat, 17 Nov 2007, Mike Andrews wrote:

> Kip Macy wrote:
>> On Nov 17, 2007 5:28 PM, Mike Andrews <mandrews at bit0.com> wrote:
>>> Kip Macy wrote:
>>>> On Nov 17, 2007 3:23 PM, Mike Andrews <mandrews at bit0.com> wrote:
>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>> 
>>>>>> On Nov 17, 2007 2:33 PM, Mike Andrews <mandrews at bit0.com> wrote:
>>>>>>> On Sat, 17 Nov 2007, Kip Macy wrote:
>>>>>>> 
>>>>>>>> On Nov 17, 2007 10:33 AM, Denis Shaposhnikov <dsh at vlink.ru> wrote:
>>>>>>>>> On Sat, 17 Nov 2007 00:42:54 -0500 (EST)
>>>>>>>>> Mike Andrews <mandrews at bit0.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Has anyone run into problems with MSS not being respected when using
>>>>>>>>>> TSO, specifically on em cards?
>>>>>>>>> Yes, I wrote about this problem on the beginning of 2007, see
>>>>>>>>>
>>>>>>>>>     http://tinyurl.com/3e5ak5
>>>>>>>>> 
>>>>>>>> if_em.c:3502
>>>>>>>>        /*
>>>>>>>>         * Payload size per packet w/o any headers.
>>>>>>>>         * Length of all headers up to payload.
>>>>>>>>         */
>>>>>>>>        TXD->tcp_seg_setup.fields.mss = htole16(mp->m_pkthdr.tso_segsz);
>>>>>>>>        TXD->tcp_seg_setup.fields.hdr_len = hdr_len;
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Please print out the value of tso_segsz here. It appears to be being
>>>>>>>> set correctly. The only thing I can think of is that t_maxopd is not
>>>>>>>> correct. As tso_segsz is correct here:
>>>>>>> It repeatedly prints 1368 during a 1 meg file transfer over a connection
>>>>>>> with a 1380 MSS.  Any other printf's I can add?  I'm working on a web page
>>>>>>> with tcpdump / firewall log output illustrating the issue...
>>>>>> Mike -
>>>>>> Denis' tcpdump output doesn't show oversized segments, something else
>>>>>> appears to be happening there. Can you post your tcpdump output
>>>>>> somewhere?
>>>>> URL sent off-list.
>>>>        if (tso) {
>>>>                m->m_pkthdr.csum_flags = CSUM_TSO;
>>>>                m->m_pkthdr.tso_segsz = tp->t_maxopd - optlen;
>>>>        }
>>>> 
>>>> 
>>>> Please print the value of maxopd and optlen under "if (tso)" in
>>>> tcp_output. I think the calculated optlen may be too small.
>>> 
>>> maxopd=1380 - optlen=12 = tso_segsz=1368
>>> 
>>> Weird though, after this reboot, I had to re-copy a 4 meg file 5 times
>>> to start getting the firewall to log any drops.  Transfer rate was
>>> around 240KB/sec before the firewall started to drop, then it went down
>>> to about 64KB/sec during the 5th copy, and stayed there for subsequent
>>> copies.  The actual packet size the firewall said it was dropping was
>>> varying all over the place still, yet the maxopd/optlen/tso_segsz values
>>> stayed constant.  But it's interesting that it didn't start dropping
>>> immediately after the reboot -- though the transfer rate was still
>>> sub-optimal.
>> 
>> Ok, next theory :D. You shouldn't be seeing "bad len" packets from
>> tcpdump. I'm wondering if that means you're sending down more than
>> 64k. Can you please print out the value of mp->m_pkthdr.len around the
>> same place that you printed out tso_segsz? 64k is the generally
>> accepted limit for TSO, I'm wondering if the card firmware does
>> something weird if you give it more.
>
> OK.  In that last message, where I said it took 5 times to start reproducing 
> the problem... this time it took until I actually toggled TSO back off and 
> back on again, and then it started acting up again.  I don't know what the 
> actual trigger is... it's very weird.
>
> Initially, w/ TSO on and it wasn't dropping yet (but was still transferring 
> slow)...
>
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=8306
> (etc, always 8306)
>
> After toggling off/on which caused the drops to start (and the speed to drop 
> even further):
>
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=7507
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=3053
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1677
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=3037
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2264
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1656
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1902
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1888
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1640
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1871
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2461
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=1849
> BIT0 DEBUG: tso_segsz=1368  hdr_len=66  mp->m_pkthdr.len=2092
>
> and so on, with more seemingly random lengths... but none of them ever over 
> 8306, much less 64K.
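
For reference, the debug values above can be turned into what should
appear on the wire if the card splits each chain strictly by tso_segsz.
This is only a sketch under assumptions: tso_split() and struct tso_plan
are hypothetical names invented for illustration (nothing like them is in
if_em.c), and it assumes hdr_len (66 here) counts the Ethernet, IP and TCP
headers and is replicated in front of every segment.

```c
#include <stdint.h>

/* Hypothetical helper, not driver code: expected split of one TSO chain. */
struct tso_plan {
    uint32_t nsegs;      /* number of frames the NIC should emit */
    uint32_t last_pay;   /* payload bytes carried by the final frame */
    uint32_t max_frame;  /* largest frame, headers included */
};

static struct tso_plan
tso_split(uint32_t pkt_len, uint32_t hdr_len, uint32_t segsz)
{
    struct tso_plan p = { 0, 0, 0 };
    uint32_t payload, largest;

    if (segsz == 0 || pkt_len <= hdr_len)
        return p;                               /* nothing to segment */

    payload = pkt_len - hdr_len;
    p.nsegs = (payload + segsz - 1) / segsz;    /* round up */
    p.last_pay = payload - (p.nsegs - 1) * segsz;
    largest = (p.nsegs > 1) ? segsz : payload;  /* a lone frame may be short */
    p.max_frame = hdr_len + largest;
    return p;
}
```

With len=8306, hdr_len=66 and tso_segsz=1368 this gives 7 frames, the last
one carrying 32 bytes of payload, and no frame larger than 1434 bytes.
Under these assumptions, anything larger than that leaving the card would
mean the segment size was not honoured.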

Got a few more data points here.

I can reproduce this on an i386 kernel, so it isn't amd64 specific.

I can reproduce this on an 82541EI NIC, so it isn't 82573 specific.

I can't reproduce this on a Marvell Yukon II (msk) NIC; it works fine 
whether TSO is on or off.

I can't reproduce this on a bge NIC because it doesn't support TSO :)
That's the only other gigabit NIC I've got easy access to.

I can reproduce this with just a Cisco 877W IOS-based router and no Cisco 
PIX / ASA firewalls in the way, with the servers on the LAN interface with 
"ip tcp adjust-mss 1340" on it, and the downloading client on the Cisco's 
802.11g interface.  This time, the client is a MacBook Pro running 
Leopard, and I'm running "tcpdump -i en1 -s 1500 -n -v length \> 1394" on 
the MacBook (not the server this time) to find oversize packets, which is 
actually handier because I can see how trashed they really get :)
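
The 1394 in that filter is consistent with the clamped MSS plus fixed
headers.  A small sketch of the arithmetic follows, assuming an untagged
14-byte Ethernet header as seen by tcpdump, a 20-byte IPv4 header with no
options, and a 20-byte base TCP header; TCP options come out of the MSS
budget, so they don't raise the limit.  The helper and macro names are
made up for illustration.

```c
#include <stdint.h>

#define ETHER_HDR_BYTES 14u  /* untagged Ethernet II header (assumption) */
#define IPV4_HDR_BYTES  20u  /* IPv4 header without options (assumption) */
#define TCP_HDR_BYTES   20u  /* base TCP header; options count against MSS */

/* Hypothetical helper: largest captured frame a clamped MSS should allow. */
static uint32_t
max_frame_for_mss(uint32_t mss)
{
    return ETHER_HDR_BYTES + IPV4_HDR_BYTES + TCP_HDR_BYTES + mss;
}

/* Nonzero if a captured frame is larger than the clamp permits. */
static int
frame_is_oversize(uint32_t frame_len, uint32_t mss)
{
    return frame_len > max_frame_for_mss(mss);
}
```

For an MSS of 1340 the limit comes out at 1394 bytes, so "length \> 1394"
matches exactly the frames that exceed the clamp.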

I can't reproduce the oversize packets between two machines on the same 
subnet (though I can still reproduce the throughput problems on their 
own).  I haven't tried lowering the system MSS on one end yet (is there a 
sysctl to lower the MSS for outbound connections without lowering the MTU 
as well?).  If I could do this it would greatly simplify testing for 
everyone, as they wouldn't have to stick an MSS-clamping router in the 
middle.  The router doesn't have to be a Cisco.

With this setup, copying to the Mac through the 877W from:

msk-based server, TSO disabled: tcpdump reports no problems, file 
transfers are fast

msk-based server, TSO enabled: tcpdump reports no problems, file 
transfers are fast

em-based server, TSO disabled: tcpdump reports no problems, file 
transfers are fast

em-based server, TSO enabled: tcpdump reports numerous oversize packets of 
varying sizes just as before, AND numerous packets with bad TCP checksums. 
The checksum problems aren't limited to only the large packets though. 
(That's probably what's causing the throughput problems.)  Toggling rxcsum 
and txcsum flags on the server made no difference.  What I haven't tried 
yet is hexdumping the packets to see what exactly is getting trashed.
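
When it comes to checking hexdumped segments by hand, the standard
Internet checksum (RFC 1071) is easy to recompute.  The following is a
minimal sketch, not code from the driver or the stack: rfc1071_cksum() is
a hypothetical name, and for TCP the caller is assumed to pass one
contiguous buffer holding the pseudo-header, the TCP header (checksum
field included) and the payload.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * RFC 1071 ones'-complement checksum over a byte buffer, treating the
 * data as big-endian 16-bit words.  Intended for buffers up to 64 KB.
 */
static uint16_t
rfc1071_cksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;    /* pad odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);    /* fold carries back in */
    return (uint16_t)~sum;
}
```

If the transmitted checksum field is left in place, a correct segment
yields 0 from this function; any other result means the segment was
damaged or the checksum was computed over different data.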

The problem still comes and goes; sometimes it'll work for a few minutes 
after boot, sometimes not; it might be dependent on what other traffic's 
going through the box.