em network issues

Bill Paul wpaul at FreeBSD.ORG
Fri Oct 20 21:21:38 UTC 2006


> Bill Paul wrote:
> > 
> >>On 10/19/06, Kris Kennaway <kris at obsecurity.org> wrote:
> >>
> >>>On Thu, Oct 19, 2006 at 02:18:13PM -0700, Jack Vogel wrote:
> >>>
> >>>>The engineer in our test group has installed 6.2 BETA2 and attempted via a
> >>>>number of tests to reproduce this problem, the machine even shares the em
> >>>>interrupt with usb, and yet so far he has been unsuccessful.
> >>>
> >>>What tests is he running?
> >>
> >>He tried doing something Kip said reliably repro'd the issue, building a big
> >>source archive over NFS. Then he has been running a continuous NFS data
> >>back and forth copy since, that is still ongoing.
> >>
> >>Other suggestions?
> >>
> >>Jack
> >>
> > 
> > 
> > Just out of curiosity, what sort of torture tests does Intel do, in
> > general, on the em driver on FreeBSD? One thing that I've found which
> > works wonders at exposing race conditions is the Smartbits bi-directional
> > IP forwarding test. Put two NICs in a system, configure it for IP
> > forwarding, then connect the Smartbits to each port and run the
> > SmartApps router test in bi-directional mode. At 64 bytes per frame,
> > it will try to push 2.96 million packets/second through both ports
> > simultaneously (1.48 million in each direction). Of course, you won't
> > actually be able to forward all the traffic, but the interfaces (not
> > to mention the OS) should continue running regardless.
> > 
> > This test exercises both the RX and TX paths and generates hundreds of
> > thousands of interrupts per second. You'd be amazed at the sort of
> > things you can discover with it. The downside of course is that a
> > Smartbits with gigE ports isn't cheap, but I'd be surprised if Intel
> > didn't have one kicking around somewhere.
> > 
> > -Bill
> > 
> 
> This is exactly the test that Andre and I were running, though only in
> one direction (I think due to lack of hardware for a full test).

Yes, but did you do it with a Smartbits though, or just with a couple of
other FreeBSD machines? Unfortunately, a typical FreeBSD system on its own
won't generate frames anywhere near fast enough to really torture test a
gigE interface. At best you might hit around 200000 to 300000 frames/sec.
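
For comparison, here is a rough sketch of where the 1.48 million
packets/second figure comes from. The helper name max_frames_per_sec()
is made up for illustration. It assumes standard Ethernet framing: each
frame occupies its own length on the wire plus 20 bytes of overhead
(preamble, start-of-frame delimiter and inter-frame gap).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Wire overhead per frame that is not part of the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte
 * inter-frame gap.
 */
#define ETHER_WIRE_OVERHEAD	20u

/*
 * Hypothetical helper: theoretical maximum frames/sec in one direction
 * for a link of link_bps bits/sec. frame_bytes includes the 4-byte FCS.
 */
static uint64_t
max_frames_per_sec(uint64_t link_bps, uint32_t frame_bytes)
{
	/* 64 bytes is the minimum legal Ethernet frame size. */
	assert(frame_bytes >= 64);
	return (link_bps / ((uint64_t)(frame_bytes + ETHER_WIRE_OVERHEAD) * 8u));
}
```

Calling max_frames_per_sec(1000000000ULL, 64) gives the per-direction
gigE line rate for minimum-size frames. By that measure, 200000 to
300000 frames/sec is only about 13% to 20% of what the wire can carry.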

A given Smartbits system doesn't need special hardware to run a
bi-directional forwarding test. If you're using SmartApps, you just
have to click the "Bi-Directional" checkbox on the main setup window.
(At least, that's how it is with the ones at work.)

Being able to flood the link with the Smartbits is also handy for
provoking error conditions (RX overruns and TX underruns, mostly), which
shows you how well (or not) the driver's error recovery works.

In the past I considered creating a kernel module that would grab hold
of a given interface and blast traffic through it with as little software
overhead as possible (e.g. sending the same mbuf over and over) in order
to create my own test system that could hopefully rival the Smartbits,
but I never got around to it. I'm not sure that it's really possible
without custom hardware though.

> Prior to the INTR_FAST change, the machine would live-lock.  Now it
> survives, stays responsive, and drops packets as needed.

The wide range of failures people seem to be reporting might mean that
the driver code itself is not the issue, but that there's an interaction
with some other part of the system. This means torture testing the driver
itself might not be enough to provoke the problems.

Unfortunately, nobody seems to have nailed down a good test case for
any of these failures. I strongly suspect people are leaving out details
which seem obvious and/or trivial to them, but which are critical to
finding the problem. ("Oh, I was using SCHED_ULE... was I not supposed
to do that? Tee-hee." *curls finger in blonde hair*)

Another thing that might be handy is improving the watchdog timeout
message so that it dumps the state of the ICR and ICM registers (and
maybe some other interesting driver and/or device state). The timeout
implies no interrupts were delivered for a Long Time (tm). If the
ICM register indicates interrupts have been masked, then that means
em_intr_fast() was triggered by an interrupt and it scheduled work,
but that work never executed. If that really is what happened, then
I can understand the watchdog error occurring. If that's _not_ what
happened, then something else is screwed up.
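
The reasoning above could be captured in something like the following
sketch. None of this is actual em(4) code: wd_classify() and wd_format()
are hypothetical names, the register values are passed in as plain
integers rather than read from hardware, and it assumes that an
interrupt enable mask of zero means all interrupts are masked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum wd_verdict {
	WD_TASK_STALLED,	/* masked: handler ran, deferred work did not */
	WD_UNKNOWN		/* still enabled: something else is wrong */
};

/* Assumption: a mask value of 0 means every interrupt source is masked. */
static enum wd_verdict
wd_classify(uint32_t intr_mask)
{
	return ((intr_mask == 0) ? WD_TASK_STALLED : WD_UNKNOWN);
}

/*
 * Build a more informative watchdog message. Returns what snprintf()
 * returns: the number of characters the full message needs.
 */
static int
wd_format(char *buf, size_t len, int unit, uint32_t icr, uint32_t intr_mask)
{
	const char *why;

	assert(buf != NULL && len > 0);
	why = (wd_classify(intr_mask) == WD_TASK_STALLED) ?
	    "interrupts masked, deferred work never ran" :
	    "interrupts enabled, cause unknown";
	return (snprintf(buf, len,
	    "em%d: watchdog timeout: ICR=0x%08x mask=0x%08x (%s)",
	    unit, (unsigned int)icr, (unsigned int)intr_mask, why));
}
```

A real version would have to read the registers from the device at the
time of the timeout and would probably want to dump descriptor ring
state as well.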

-Bill

--
=============================================================================
-Bill Paul            (510) 749-2329 | Senior Engineer, Master of Unix-Fu
                 wpaul at windriver.com | Wind River Systems
=============================================================================
              <adamw> you're just BEGGING to face the moose
=============================================================================

