bge0: discard frame w/o packet header

Bruce Evans bde at zeta.org.au
Thu Feb 15 23:22:30 UTC 2007


On Thu, 15 Feb 2007, Sam Leffler wrote:

> John Polstra wrote:
>> I have a Dell SC1435 server running an i386 -current system from
>> around the end of December, with a few selected updates applied.  It
>> had been running reliably until early this morning, when the
>> following sequence of events happened.  First, this message was
>> logged:
>>
>>   Feb 15 07:14:29 rock kernel: bge0: discard frame w/o packet header
>>
>> About 2 minutes later, at 07:16:30, the last /var/log/maillog entry
>> was logged.  (This machine is under constant assault from spambots
>> trying dictionary attacks.  It is rare for more than 15 seconds to
>> pass without something being logged in the maillog file.)
>>
>> 30 seconds after that came another bge message:
>>
>>   Feb 15 07:17:00 rock kernel: bge0: discard frame w/o packet header
>>
>> At that point, all network connectivity was gone.  The machine didn't
>> respond to pings.  Worse, its remote management controller, which uses
>> ASF and shares the same network interface, was also unresponsive to
>> pings.  To get the machine back, I had to ask somebody working at the
>> colocation facility to power-cycle it.
>>
>> The "discard frame w/o packet header" message comes from ether_input()
>> if it gets an mbuf that doesn't have the M_PKTHDR flag set.  That
>> can't happen unless something is very wrong with the system.  I'd like
>> to make it a panic.  At least then the machine would reboot instead of
>> just becoming unreachable.  Any objections?
>>
>> Some other nearby warnings should also be panics, in my opinion:
>>
>>   discard frame w/o leading ethernet header ...
>>
>>   discard frame w/o interface pointer ...
>>
>> and, maybe:
>>
>>   discard oversize frame ...

Old versions of sk with a Yukon Lite NIC spew the first and third of
these messages when blasted with tiny packets.  ISTR seeing just a few
of "discard frame w/o packet header" in combination with this.  It was
a driver bug.  I haven't seen any 1 Gbps NICs/buses that are UnLite
enough to actually keep up with 1 Gbps for small packets, and the Yukon
Lite is one of the Lite-ests.  With old versions of sk, when blasted
at 640 kpps, it claims to receive 270k good pps and drop a few
thousand bad pps, with most of the few thousand reported in the above
messages, at least when I hide the messages under bootverbose so that
the system doesn't spend most of its time spewing the messages (then
it only reports the errors by incrementing if_ierrors).  Newer
versions of sk check the correct hardware error bit and also check
packet lengths, and find errors in 112k of the packets previously
reported as good.  They still don't report errors for the
(640-270)k = 370k pps dropped before reaching the interrupt handler.
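
For reference, the checks behind the four messages quoted above can be
modelled roughly as below.  This is a userland sketch, not the real
ether_input(): struct frame, check_frame(), the verdict names, the
order of the last three checks and the fixed 1518-byte limit are all
assumptions made for illustration (the real limit depends on the MTU,
VLAN tags and whether the FCS is present).

```c
#include <stddef.h>

#define FRAME_HAS_PKTHDR 0x1    /* stand-in for the M_PKTHDR flag */
#define ETH_HDR_LEN      14     /* dst MAC + src MAC + ether type */
#define ETH_MAX_LEN      1518   /* assumed limit: no VLAN tag, no jumbo */

/* Hypothetical, much simplified stand-in for an mbuf. */
struct frame {
	unsigned    flags;      /* FRAME_HAS_PKTHDR or 0 */
	size_t      first_len;  /* bytes in the first buffer */
	size_t      pkt_len;    /* total length from the packet header */
	const void *rcvif;      /* receiving interface, or NULL */
};

enum verdict { FRAME_OK, NO_PKTHDR, NO_ETHER_HDR, NO_IFP, OVERSIZE };

/*
 * Classify a frame.  The packet header is checked first because
 * pkt_len and rcvif are only meaningful when it is present.
 */
enum verdict
check_frame(const struct frame *f)
{
	if ((f->flags & FRAME_HAS_PKTHDR) == 0)
		return (NO_PKTHDR);
	if (f->first_len < ETH_HDR_LEN)
		return (NO_ETHER_HDR);
	if (f->rcvif == NULL)
		return (NO_IFP);
	if (f->pkt_len > ETH_MAX_LEN)
		return (OVERSIZE);
	return (FRAME_OK);
}

/* Map a verdict to the message text quoted in this thread. */
const char *
verdict_msg(enum verdict v)
{
	switch (v) {
	case NO_PKTHDR:    return ("discard frame w/o packet header");
	case NO_ETHER_HDR: return ("discard frame w/o leading ethernet header");
	case NO_IFP:       return ("discard frame w/o interface pointer");
	case OVERSIZE:     return ("discard oversize frame");
	default:           return ("ok");
	}
}
```

Whether a non-OK verdict should print, count or panic is exactly the
policy question in this thread; the sketch only separates the
classification from that decision.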

>> Opinions?
>
> There are several diagnostics in ether_input I added mostly because
> drivers "shouldn't do that"; this is one of them.  However some are
> questionable.  I'm not sure about the panic but at the least we should
> rate limit the messages so they can't be used as a DOS mechanism.
> Replacing them with counters and sticking the printf's under IFF_DEBUG
> is another option.

sk would have been fixed sooner if the printfs were panics, but what I
wanted was just rate limiting, with some messages to remind me of the
problem since being silent about it except for incrementing if_ierrors
makes it far too easy to ignore.
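
Something like the following is roughly what I mean by rate limiting
with reminders.  It is a userland sketch under stated assumptions:
struct diag_limit and diag_report() are made-up names, the clock is
passed in as whole seconds, and the budget is a simple
messages-per-second cap.  In the kernel one would more likely build
this on ppsratecheck() than open-code it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-interface diagnostic state (not a real structure). */
struct diag_limit {
	uint64_t total;         /* every bad frame, printed or not */
	uint64_t suppressed;    /* messages dropped since the last print */
	int64_t  window_start;  /* second in which the current window began */
	unsigned printed;       /* messages printed in the current window */
	unsigned max_per_sec;   /* message budget per one-second window */
};

/*
 * Count one bad frame and print a message only while the budget for
 * the current second lasts.  Returns true if a message was printed.
 */
bool
diag_report(struct diag_limit *dl, int64_t now, const char *ifname,
    const char *msg)
{
	dl->total++;
	if (now != dl->window_start) {
		/* New one-second window: refill the budget. */
		dl->window_start = now;
		dl->printed = 0;
	}
	if (dl->printed >= dl->max_per_sec) {
		dl->suppressed++;
		return (false);
	}
	dl->printed++;
	if (dl->suppressed > 0) {
		/* Remind the reader how much was hidden. */
		printf("%s: %s (%llu similar messages suppressed)\n",
		    ifname, msg, (unsigned long long)dl->suppressed);
		dl->suppressed = 0;
	} else
		printf("%s: %s\n", ifname, msg);
	return (true);
}
```

The total is incremented unconditionally, so the counter stays honest
even when the console stays quiet, and the suppressed count in the
next printed message keeps the problem hard to ignore.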

Accumulating and printing counters for new and old types of errors is
another problem.  Handling the errors at low levels (preferably entirely
in hardware) is good for efficiency, but gives the problem of mapping
hardware error counters for each type of error to error counters designed
for software.  The fixed sk driver doesn't do this at all -- it just
increments the generic if_ierrors for all types of errors; it doesn't
know how to read hardware error counts or even if they exist, so it
doesn't even get if_ierrors correct (multiple dropped packets are counted
as 1 error).  The bge driver knows how to do this, but can't do it well
since there are dozens of hardware error counters and only a few software
error counters, and statistics programs can barely display the old ones.
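
To illustrate the mapping problem, here is a minimal sketch of folding
several hardware counters into two software ones.  All names are
hypothetical: struct hw_rx_stats does not match any real chip's
statistics block, and struct sw_rx_stats is only loosely modelled on
if_ierrors and if_iqdrops.  The sketch assumes free-running 32-bit
hardware counters that wrap at most once between polls.

```c
#include <stdint.h>

/* Hypothetical snapshot of a NIC's receive error counters. */
struct hw_rx_stats {
	uint32_t fcs_errors;
	uint32_t align_errors;
	uint32_t too_long;
	uint32_t too_short;
	uint32_t no_buffer_drops;
};

/* Coarse software counters, loosely like if_ierrors and if_iqdrops. */
struct sw_rx_stats {
	uint64_t ierrors;
	uint64_t iqdrops;
};

/* Unsigned subtraction gives the right delta across a single wrap. */
uint32_t
hw_delta(uint32_t now, uint32_t prev)
{
	return ((uint32_t)(now - prev));
}

/*
 * Add the change in each hardware counter since the previous poll,
 * so that N dropped packets count as N errors rather than 1.
 */
void
fold_rx_stats(struct sw_rx_stats *sw, const struct hw_rx_stats *now,
    struct hw_rx_stats *prev)
{
	sw->ierrors += (uint64_t)hw_delta(now->fcs_errors, prev->fcs_errors) +
	    hw_delta(now->align_errors, prev->align_errors) +
	    hw_delta(now->too_long, prev->too_long) +
	    hw_delta(now->too_short, prev->too_short);
	sw->iqdrops += hw_delta(now->no_buffer_drops, prev->no_buffer_drops);
	*prev = *now;   /* remember this snapshot for the next poll */
}
```

Four distinct hardware causes collapse into one ierrors number here,
which is the loss of detail described above; with dozens of hardware
counters it only gets worse.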

Bruce

