icmp packets on em larger than 1472 [SEC=UNCLASSIFIED]

Pyun YongHyeon pyunyh at gmail.com
Thu Nov 11 21:05:34 UTC 2010


On Thu, Nov 11, 2010 at 08:10:57AM -0800, Kevin Oberman wrote:
> > Date: Wed, 10 Nov 2010 23:49:56 -0800 (PST)
> > From: Kirill Yelizarov <ykirill at yahoo.com>
> > 
> > 
> > 
> > --- On Thu, 11/11/10, Kevin Oberman <oberman at es.net> wrote:
> > 
> > > From: Kevin Oberman <oberman at es.net>
> > > Subject: Re: icmp packets on em larger than 1472 [SEC=UNCLASSIFIED]
> > > To: "Wilkinson, Alex" <alex.wilkinson at dsto.defence.gov.au>
> > > Cc: freebsd-stable at freebsd.org
> > > Date: Thursday, November 11, 2010, 8:26 AM
> > > > Date: Thu, 11 Nov 2010 13:01:26 +0800
> > > > From: "Wilkinson, Alex" <alex.wilkinson at dsto.defence.gov.au>
> > > > Sender: owner-freebsd-stable at freebsd.org
> > > > 
> > > > 
> > > >     0n Wed, Nov 10, 2010 at 04:21:12AM -0800, Kirill Yelizarov wrote: 
> > > > 
> > > >     >All my em cards running 8.1 stable don't reply to icmp echo requests packets larger than 1472 bytes.
> > > >     >
> > > >     >On stable 7.2 the same hardware works as expected:
> > > >     ># ping -s 1500 192.168.64.99
> > > >     >PING 192.168.64.99 (192.168.64.99): 1500 data bytes
> > > >     >1508 bytes from 192.168.64.99: icmp_seq=0 ttl=63 time=1.249 ms
> > > >     >1508 bytes from 192.168.64.99: icmp_seq=1 ttl=63 time=1.158 ms
> > > >     >
> > > >     >Here is the dump on em interface
> > > >     >15:06:31.452043 IP 192.168.66.65 > *****: ICMP echo request, id 28729, seq 5, length 1480
> > > >     >15:06:31.452047 IP 192.168.66.65 > ****: icmp
> > > >     >15:06:31.452069 IP **** > 192.168.66.65: ICMP echo reply, id 28729, seq 5, length 1480
> > > >     >15:06:31.452071 IP *** > 192.168.66.65: icmp
> > > >     > 
> > > >     >Same ping from same source (it's a 8.1 stable with fxp interface) to em card running 8.1 stable
> > > >     >#pciconf -lv
> > > >     >em0@pci0:3:4:0:    class=0x020000 card=0x10798086 chip=0x10798086 rev=0x03 hdr=0x00
> > > >     >    vendor     = 'Intel Corporation'
> > > >     >    device     = 'Dual Port Gigabit Ethernet Controller (82546EB)'
> > > >     >    class      = network
> > > >     >    subclass   = ethernet
> > > >     >
> > > >     ># ping -s 1472 192.168.64.200
> > > >     >PING 192.168.64.200 (192.168.64.200): 1472 data bytes
> > > >     >1480 bytes from 192.168.64.200: icmp_seq=0 ttl=63 time=0.848 ms
> > > >     >^C
> > > >     >
> > > >     ># ping -s 1473 192.168.64.200
> > > >     >PING 192.168.64.200 (192.168.64.200): 1473 data bytes
> > > >     >^C
> > > >     >--- 192.168.64.200 ping statistics ---
> > > >     >4 packets transmitted, 0 packets received, 100.0% packet loss
> > > > 
> > > > works fine for me:
> > > > 
> > > > FreeBSD 8.1-STABLE #0 r213395
> > > > 
> > > > em0@pci0:0:25:0:    class=0x020000 card=0x3035103c chip=0x10de8086 rev=0x02 hdr=0x00
> > > >     vendor     = 'Intel Corporation'
> > > >     device     = 'Intel Gigabit network connection (82567LM-3 )'
> > > >     class      = network
> > > >     subclass   = ethernet
> > > > 
> > > > #ping -s 1473 host
> > > > PING host(192.168.1.1): 1473 data bytes
> > > > 1481 bytes from 192.168.1.1: icmp_seq=0 ttl=253 time=31.506 ms
> > > > 1481 bytes from 192.168.1.1: icmp_seq=1 ttl=253 time=31.493 ms
> > > > 1481 bytes from 192.168.1.1: icmp_seq=2 ttl=253 time=31.550 ms
> > > > ^C
> > > 
> > > The reason the '-s 1500' worked was that the packets were fragmented. If
> > > I add the '-D' option, '-s 1473' fails on v7 and v8. Are the V8 systems
> > > where you see it failing without the '-D' on the same network segment?
> > > If not, it is likely that an intervening device is refusing to fragment
> > > the packet. (Some routers deliberately don't fragment ICMP Echo Request
> > > packets.) 
> > 
> > If i set -D -s 1473 sender side refuses to ping and that is
> > correct. All mentioned above machines are behind the same router and
> > switch. Same hardware running v7 is working while v8 is not. And i
> > never saw such problems before.  Also correct me if i'm wrong but the
> > dump shows that the packet arrived. I'll try driver from head and will
> > post here results.
> 
> I did a bit more looking at this today and I see that something bogus is
> going on and it MAY be the em driver.
> 
> I tried 1473 data byte pings without the DF flag. I then captured the
> packets on both ends (where the sending system has a bge (Broadcom GE)
> and the responding end has an em (Intel) card).
> 
> What I saw was the fragmented IP packets all being received by the
> system with the em interface and an ICMP Echo Reply being sent back,
> again fragmented. I saw the reply on both ends, so both interfaces were
> able to fragment an over-sized packet, transmit the two pieces, and
> receive the two pieces. The em device could re-assemble them properly,
> but the bge device does not seem to re-assemble them correctly or else
> has a problem with ICMP packets bigger than MTU size.
> 
> When I send from the em system, I see the packets and fragments all
> arrive in good form, but the system never sends out a reply. Since this
> is a kernel function, it may be a driver, but I suspect that it is in
> the IP stack since I am seeing the problem with a Broadcom card and I
> see the data all arriving.
> 
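
For reference, the 1472 byte boundary in this thread follows from the
standard Ethernet MTU: 1500 bytes minus a 20 byte IPv4 header (assuming
no IP options) minus the 8 byte ICMP echo header. The sketch below is
only an illustration of that arithmetic and of how a larger echo
request would be split; the function name and constants are made up for
this example and are not taken from the kernel.

```python
IP_HEADER = 20    # IPv4 header, assuming no options
ICMP_HEADER = 8   # ICMP echo request/reply header
MTU = 1500        # standard Ethernet MTU

def fragments(payload_len, mtu=MTU):
    """List (offset, data_len, more_fragments) for each IPv4 fragment
    needed to carry an ICMP echo with payload_len data bytes."""
    total = ICMP_HEADER + payload_len        # bytes carried inside IP
    per_frag = (mtu - IP_HEADER) // 8 * 8    # fragment data is a multiple of 8
    result = []
    offset = 0
    while offset < total:
        size = min(per_frag, total - offset)
        result.append((offset, size, offset + size < total))
        offset += size
    return result

# Largest ping payload that still fits in one unfragmented packet.
max_unfragmented = MTU - IP_HEADER - ICMP_HEADER

print(max_unfragmented)
print(fragments(1472))
print(fragments(1473))
print(fragments(1500))
```

With '-s 1473' the second fragment carries a single byte, and with
'-s 1500' the first fragment carries 1480 bytes, which matches the
"length 1480" in the tcpdump output quoted earlier.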

Most ethernet controllers, including bge(4), have a setting that
specifies how much RX buffer space is allocated to receive a
frame. When the controller receives a frame larger than that RX
buffer size, it drops the frame. Because the oversized frame is
silently dropped in the driver layer, the upper stack has no chance
to reply with an ICMP "fragmentation needed" message for frames
that have the don't fragment bit set. This is where correct MTU
configuration plays an important role in the driver layer. If you
want to handle oversized frames you also have to set the correct
MTU on the interface. However, all controllers should be able to
receive a standard MTU sized frame including a VLAN tag, so no
special configuration is needed when you handle standard MTU sized
frames. Some old controllers can't handle VLAN oversized frames, so
you would have no way to send or receive them.
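
As a rough illustration of the size check described above, here is a
simplified model. It is not the bge(4) code; the function names are
hypothetical, and it assumes the RX limit is derived from the
configured MTU plus the Ethernet header, one VLAN tag and the CRC.

```python
ETHER_HDR_LEN = 14   # destination MAC + source MAC + EtherType
ETHER_CRC_LEN = 4    # frame check sequence
VLAN_TAG_LEN = 4     # one 802.1Q tag

def max_rx_frame(mtu):
    """Largest frame the modelled controller accepts for a given MTU."""
    return mtu + ETHER_HDR_LEN + VLAN_TAG_LEN + ETHER_CRC_LEN

def rx_accepts(ip_packet_len, mtu, vlan=False):
    """True if a frame carrying ip_packet_len bytes of IP fits the limit.
    A frame that does not fit is dropped before the stack sees it."""
    frame_len = ip_packet_len + ETHER_HDR_LEN + ETHER_CRC_LEN
    if vlan:
        frame_len += VLAN_TAG_LEN
    return frame_len <= max_rx_frame(mtu)

print(max_rx_frame(1500))                  # standard MTU plus VLAN tag
print(rx_accepts(1500, 1500, vlan=True))   # standard sized, tagged
print(rx_accepts(9000, 1500))              # jumbo frame, MTU not raised
print(rx_accepts(9000, 9000))              # jumbo frame, MTU raised
```

In this model a jumbo frame arriving on an interface left at MTU 1500
is rejected, so nothing above the driver gets a chance to answer it.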

em(4) controllers have different receive logic that allows an
oversized frame to be chained across multiple RX buffers. So up to
a certain point, which depends on the jumbo frame size the
controller supports, em(4) can receive these oversized frames
regardless of MTU configuration with the help of the driver. The
chaining is done in the driver layer and adds overhead (chaining +
multiple mbuf allocation), but it has its own advantages.
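
The chaining can be pictured with a small sketch. This is not the
em(4) driver code, just a model under the assumption that each RX
buffer is a 2048 byte cluster and that a frame larger than one buffer
is spread over several buffers which the driver then links back into
one frame. The helper names are invented for the example.

```python
RX_BUF_SIZE = 2048   # assumed size of one RX buffer (one mbuf cluster)

def split_into_buffers(frame, buf_size=RX_BUF_SIZE):
    """Model the controller spreading one received frame over buffers."""
    return [frame[i:i + buf_size] for i in range(0, len(frame), buf_size)]

def reassemble(chain):
    """Model the driver chaining the buffers back into a single frame."""
    return b"".join(chain)

standard = bytes(1518)   # standard sized frame including CRC
jumbo = bytes(9018)      # 9000 byte MTU frame including header and CRC

print(len(split_into_buffers(standard)))   # buffers used by a standard frame
print(len(split_into_buffers(jumbo)))      # buffers used by a jumbo frame
print(reassemble(split_into_buffers(jumbo)) == jumbo)
```

The extra buffers per frame are where the additional allocation and
chaining overhead comes from.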

I was not able to reproduce the issue with em(4)/bge(4) on
CURRENT and these drivers worked as expected. 

> I think Jack can probably relax, but some patch to the network stack
> seems to have broken at least ICMP processing. And, since the bge system
> was updated to 8-Stable on October 20 while the em system was updated
> back on August 9, I suspect the flaw was not driver related and was
> committed between August 9 and Oct. 20.
> 
> I think this needs to go to the network list where the folks who tinker
> with that part of the kernel tend to hang out. Sorry for the cross-post.
> -- 
> R. Kevin Oberman, Network Engineer
> Energy Sciences Network (ESnet)
> Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
> E-mail: oberman at es.net			Phone: +1 510 486-8634
> Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751
> _______________________________________________
> freebsd-net at freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe at freebsd.org"

