icmp packets on em larger than 1472 [SEC=UNCLASSIFIED]

Kevin Oberman oberman at es.net
Thu Nov 11 21:37:23 UTC 2010


> From: Pyun YongHyeon <pyunyh at gmail.com>
> Date: Thu, 11 Nov 2010 13:04:36 -0800
> 
> On Thu, Nov 11, 2010 at 08:10:57AM -0800, Kevin Oberman wrote:
> > > Date: Wed, 10 Nov 2010 23:49:56 -0800 (PST)
> > > From: Kirill Yelizarov <ykirill at yahoo.com>
> > > 
> > > 
> > > 
> > > --- On Thu, 11/11/10, Kevin Oberman <oberman at es.net> wrote:
> > > 
> > > > From: Kevin Oberman <oberman at es.net>
> > > > Subject: Re: icmp packets on em larger than 1472 [SEC=UNCLASSIFIED]
> > > > To: "Wilkinson, Alex" <alex.wilkinson at dsto.defence.gov.au>
> > > > Cc: freebsd-stable at freebsd.org
> > > > Date: Thursday, November 11, 2010, 8:26 AM
> > > > > Date: Thu, 11 Nov 2010 13:01:26 +0800
> > > > > From: "Wilkinson, Alex" <alex.wilkinson at dsto.defence.gov.au>
> > > > > Sender: owner-freebsd-stable at freebsd.org
> > > > > 
> > > > > 
> > > > >     0n Wed, Nov 10, 2010 at 04:21:12AM -0800, Kirill Yelizarov wrote:
> > > > > 
> > > > >     >All my em cards running 8.1 stable don't reply to icmp echo requests packets larger than 1472 bytes.
> > > > >     >
> > > > >     >On stable 7.2 the same hardware works as expected:
> > > > >     ># ping -s 1500 192.168.64.99
> > > > >     >PING 192.168.64.99 (192.168.64.99): 1500 data bytes
> > > > >     >1508 bytes from 192.168.64.99: icmp_seq=0 ttl=63 time=1.249 ms
> > > > >     >1508 bytes from 192.168.64.99: icmp_seq=1 ttl=63 time=1.158 ms
> > > > >     >
> > > > >     >Here is the dump on em interface
> > > > >     >15:06:31.452043 IP 192.168.66.65 > *****: ICMP echo request, id 28729, seq 5, length 1480
> > > > >     >15:06:31.452047 IP 192.168.66.65 > ****: icmp
> > > > >     >15:06:31.452069 IP **** > 192.168.66.65: ICMP echo reply, id 28729, seq 5, length 1480
> > > > >     >15:06:31.452071 IP *** > 192.168.66.65: icmp
> > > > >     >
> > > > >     >Same ping from same source (it's a 8.1 stable with fxp interface) to em card running 8.1 stable
> > > > >     >#pciconf -lv
> > > > >     >em0@pci0:3:4:0:    class=0x020000 card=0x10798086 chip=0x10798086 rev=0x03 hdr=0x00
> > > > >     >    vendor     = 'Intel Corporation'
> > > > >     >    device     = 'Dual Port Gigabit Ethernet Controller (82546EB)'
> > > > >     >    class      = network
> > > > >     >    subclass   = ethernet
> > > > >     >
> > > > >     ># ping -s 1472 192.168.64.200
> > > > >     >PING 192.168.64.200 (192.168.64.200): 1472 data bytes
> > > > >     >1480 bytes from 192.168.64.200: icmp_seq=0 ttl=63 time=0.848 ms
> > > > >     >^C
> > > > >     >
> > > > >     ># ping -s 1473 192.168.64.200
> > > > >     >PING 192.168.64.200 (192.168.64.200): 1473 data bytes
> > > > >     >^C
> > > > >     >--- 192.168.64.200 ping statistics ---
> > > > >     >4 packets transmitted, 0 packets received, 100.0% packet loss
> > > > > 
> > > > > works fine for me:
> > > > > 
> > > > > FreeBSD 8.1-STABLE #0 r213395
> > > > > 
> > > > > em0@pci0:0:25:0:    class=0x020000 card=0x3035103c chip=0x10de8086 rev=0x02 hdr=0x00
> > > > >     vendor     = 'Intel Corporation'
> > > > >     device     = 'Intel Gigabit network connection (82567LM-3 )'
> > > > >     class      = network
> > > > >     subclass   = ethernet
> > > > > 
> > > > > #ping -s 1473 host
> > > > > PING host(192.168.1.1): 1473 data bytes
> > > > > 1481 bytes from 192.168.1.1: icmp_seq=0 ttl=253 time=31.506 ms
> > > > > 1481 bytes from 192.168.1.1: icmp_seq=1 ttl=253 time=31.493 ms
> > > > > 1481 bytes from 192.168.1.1: icmp_seq=2 ttl=253 time=31.550 ms
> > > > > ^C
> > > > 
> > > > The reason the '-s 1500' worked was that the packets were fragmented. If
> > > > I add the '-D' option, '-s 1473' fails on v7 and v8. Are the V8 systems
> > > > where you see it failing without the '-D' on the same network segment?
> > > > If not, it is likely that an intervening device is refusing to fragment
> > > > the packet. (Some routers deliberately don't fragment ICMP Echo Request
> > > > packets.)
> > > 
> > > If I set -D -s 1473, the sender side refuses to ping, and that is
> > > correct. All of the machines mentioned above are behind the same router and
> > > switch. The same hardware running v7 works, while v8 does not. And I
> > > never saw such problems before. Also, correct me if I'm wrong, but the
> > > dump shows that the packet arrived. I'll try the driver from head and will
> > > post the results here.
> > 
> > I did a bit more looking at this today and I see that something bogus is
> > going on and it MAY be the em driver.
> > 
> > I tried 1473 data byte pings without the DF flag. I then captured the
> > packets on both ends (where the sending system has a bge (Broadcom GE)
> > and the responding end has an em (Intel) card).
> > 
> > What I saw was the fragmented IP packets all being received by the
> > system with the em interface and an ICMP Echo Reply being sent back,
> > again fragmented. I saw the reply on both ends, so both interfaces were
> > able to fragment an over-sized packet, transmit the two pieces, and
> > receive the two pieces. The em device could re-assemble them properly,
> > but the bge device does not seem to re-assemble them correctly or else
> > has a problem with ICMP packets bigger than MTU size.
> > 
> > When I send from the em system, I see the packets and fragments all
> > arrive in good form, but the system never sends out a reply. Since this
> > is a kernel function, it may be a driver, but I suspect that it is in
> > the IP stack since I am seeing the problem with a Broadcom card and I
> > see the data all arriving.
> > 
> 
> Most Ethernet controllers, including bge(4), let the driver specify
> how much RX buffer space is allocated to receive a frame. When the
> controller receives a frame larger than that RX buffer size, it
> drops the frame. Because the oversized frame is silently dropped in
> the driver layer, the upper stack has no chance to send back an ICMP
> response with the fragmentation-needed bit for frames that have the
> don't-fragment bit set. This is where correct MTU configuration
> plays an important role in the driver layer. If you want to handle
> oversized frames, you also have to set the correct MTU on the
> interface. However, all controllers should be able to receive
> standard-MTU-sized frames, including a VLAN tag, so no special
> configuration is needed for standard-MTU-sized frames. Some old
> controllers can't handle VLAN oversized frames, so with them you
> have no way to send or receive such frames.
> 
> em(4) controllers have different receive logic, which allows
> chaining multiple oversized frames into a single frame. So up to a
> certain point, which depends on the jumbo frame size the controller
> supports, em(4) can receive these oversized frames regardless of
> MTU configuration, with the help of the driver. The chaining is done
> in the driver layer, which adds overhead (chaining + multiple mbuf
> allocation), but it has its own advantages.
> 
> I was not able to reproduce the issue with em(4)/bge(4) on
> CURRENT and these drivers worked as expected. 

I don't have any systems running CURRENT at the moment, so I can't check
it out. I hope it is fixed there, but it needs to be fixed in
STABLE. Not fragmenting packets that will not fit in a standard frame is
a very serious issue because, when the frame is dropped, the source
re-transmits the same over-sized frame.
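
As a rough illustration of the fragmentation being discussed, here is a
minimal Python sketch of how an IPv4 payload would be split for a given
MTU. The function name and defaults are hypothetical, and it assumes a
20-byte IP header with no options; it is not taken from the FreeBSD stack.

```python
ICMP_HEADER_LEN = 8  # ICMP echo header preceding the ping data bytes


def fragment(ip_payload_len, mtu=1500, ip_header_len=20):
    """Return (offset, length, more_fragments) for each IPv4 fragment.

    Hypothetical helper: assumes a fixed IP header with no options.
    """
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the IPv4 fragment offset field counts 8-byte units.
    max_chunk = (mtu - ip_header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < ip_payload_len:
        length = min(max_chunk, ip_payload_len - offset)
        more = offset + length < ip_payload_len
        fragments.append((offset, length, more))
        offset += length
    return fragments


# 'ping -s 1473' produces 1481 bytes of IP payload: two fragments.
print(fragment(1473 + ICMP_HEADER_LEN))
# 'ping -s 1472' produces 1480 bytes of IP payload: a single packet.
print(fragment(1472 + ICMP_HEADER_LEN))
```

With a 1500-byte MTU, the 1473-byte ping should come out as a 1480-byte
first fragment plus a 1-byte second fragment, which is consistent with
the "length 1480" line followed by a bare "icmp" line in the tcpdump
output quoted earlier in the thread.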

Of course, this should not happen if the interface is set to an MTU of
1500 as the higher layers should never pass a block of data larger than
1480 bytes to the IP layer. That's the only reason this had not already
been noticed.
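
For reference, the 1472 and 1480 figures in this thread follow from
simple header arithmetic. The sketch below spells it out; the constant
and function names are made up for illustration, and it assumes an IPv4
header without options.

```python
IPV4_HEADER_LEN = 20  # assumption: no IP options
ICMP_HEADER_LEN = 8   # ICMP echo header


def max_ip_payload(mtu=1500):
    """Largest block the IP layer can send in one unfragmented packet."""
    return mtu - IPV4_HEADER_LEN


def max_unfragmented_ping_payload(mtu=1500):
    """Largest 'ping -s' value that still fits in a single packet."""
    return max_ip_payload(mtu) - ICMP_HEADER_LEN


def needs_fragmentation(ping_payload, mtu=1500):
    """True if a ping with this many data bytes exceeds one packet."""
    return ping_payload > max_unfragmented_ping_payload(mtu)


print(max_ip_payload())                 # 1480
print(max_unfragmented_ping_payload())  # 1472
print(needs_fragmentation(1473))        # True
```

So with an MTU of 1500, '-s 1472' is the largest ping that goes out as
one packet, and '-s 1473' is the first size that has to be fragmented
(or refused when '-D' sets the DF bit).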
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman at es.net			Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751

