em network issues

Scott Long scottl at samsco.org
Fri Oct 20 21:32:49 UTC 2006


Bill Paul wrote:
> 
> Yes, but did you do it with a Smartbits though, or just with a couple of
> other FreeBSD machines? Unfortunately, a typical FreeBSD system on its own
> won't generate frames anywhere near fast enough to really torture test a
> gigE interface. At best you might hit around 200000 to 300000 frames/sec.
> 

Yes, it was some model of a Smartbits.
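
For scale, the 200000 to 300000 frames/sec figure above can be set against the theoretical gigE line rate. The sketch below is only arithmetic: it assumes standard Ethernet framing, where each frame costs an extra 20 bytes on the wire (7-byte preamble, 1-byte start delimiter, 12-byte inter-frame gap). The function name max_frames_per_sec is made up for illustration and is not part of any driver or tool.

```c
#include <assert.h>

/*
 * Per-frame wire overhead outside the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
 */
#define WIRE_OVERHEAD_BYTES 20UL

/*
 * Illustrative helper (hypothetical name): upper bound on frames per second
 * for a link of link_bps bits/sec carrying frames of frame_bytes bytes
 * (frame size includes the 4-byte FCS, e.g. 64 for a minimum-size frame).
 */
static unsigned long
max_frames_per_sec(unsigned long link_bps, unsigned long frame_bytes)
{
	assert(frame_bytes > 0);
	return (link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8UL));
}
```

With link_bps = 1000000000 and 64-byte frames this gives roughly 1.49 million frames/sec, so a sender topping out at 200000 to 300000 frames/sec of small packets exercises only a fraction of what the interface can be asked to handle.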

> A given Smartbits system doesn't need special hardware to run a
> bi-directional forwarding test. If you're using SmartApps, you just
> have to click the "Bi-Directional" checkbox on the main setup window.
> (At least, that's how it is with the ones at work.)

Didn't know the details here.

> 
> Being able to flood the link with the Smartbits is also handy for
> provoking error conditions (RX overruns and TX underruns, mostly), which
> shows you how well (or not) the driver's error recovery works.

Yup, tested that =-)

> 
> In the past I considered creating a kernel module that would grab hold
> of a given interface and blast traffic through it with as little software
> overhead as possible (e.g. sending the same mbuf over and over) in order
> to create my own test system that could hopefully rival the Smartbits,
> but I never got around to it. I'm not sure that it's really possible
> without custom hardware though.
> 

I tried this.  It was too crude.

> 
>>Prior to the INTR_FAST change, the machine would live-lock.  Now it
>>survives, stays responsive, and drops packets as needed.
> 
> 
> The wide range of failures people seem to be reporting might mean that
> the driver code itself is not the issue, but that there's an interaction
> with some other part of the system. This means torture testing the driver
> itself might not be enough to provoke the problems.
> 

It's indeed a complex problem, but I haven't ruled out the driver.
Shifting timing around in innocent ways seems to be the key.

> Unfortunately, nobody seems to have nailed down a good test case for
> any of these failures. I strongly suspect people are leaving out details
> which seem obvious and/or trivial to them, but which are critical to
> finding the problem. ("Oh, I was using SCHED_ULE... was I not supposed
> to do that? Tee-hee." *curls finger in blonde hair*)

The survey that Kris and I sent out specifically asked about ULE, as
well as other 'deceptively obvious' attributes.

> 
> Another thing that might be handy is improving the watchdog timeout
> message so that it dumps the state of the ICR and ICM registers (and
> maybe some other interesting driver and/or device state). The timeout
> implies no interrupts were delivered for a Long Time (tm). If the
> ICM register indicates interrupts have been masked, then that means
> em_intr_fast() was triggered by an interrupt and it scheduled work,
> but that work never executed. If that really is what happened, then
> I can understand the watchdog error occurring. If that's _not_ what
> happened, then something else is screwed up.

Yes, instrumenting em_watchdog is on my TODO list, and will hopefully
reveal a lot more information here.
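
As a rough illustration of the diagnosis described above, the decision could be sketched as below. This is not the em(4) driver code: the struct, the enum and every function name here are hypothetical, and the snapshot fields stand in for whatever the watchdog handler would actually read from the device. It only encodes the reasoning from the quoted paragraph: if interrupts are found masked at timeout, the fast handler ran and scheduled work that never executed; otherwise something else went wrong.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical classification of a watchdog timeout. */
enum wd_diagnosis {
	WD_TASK_NEVER_RAN,	/* interrupts masked: scheduled work did not run */
	WD_OTHER_FAULT		/* interrupts not masked: look elsewhere */
};

/* Hypothetical snapshot of device state captured when the watchdog fires. */
struct wd_snapshot {
	uint32_t icr;		/* interrupt cause bits as read at timeout */
	bool	 intr_masked;	/* true if the mask state shows interrupts disabled */
};

static enum wd_diagnosis
wd_classify(const struct wd_snapshot *s)
{
	assert(s != NULL);
	return (s->intr_masked ? WD_TASK_NEVER_RAN : WD_OTHER_FAULT);
}

static const char *
wd_describe(enum wd_diagnosis d)
{
	return (d == WD_TASK_NEVER_RAN ?
	    "interrupts masked, deferred work never ran" :
	    "interrupts not masked, cause is elsewhere");
}

/* Build a more informative timeout message than a bare "watchdog timeout". */
static int
wd_format(char *buf, size_t len, const struct wd_snapshot *s)
{
	return (snprintf(buf, len, "watchdog timeout: icr=0x%08x masked=%d (%s)",
	    (unsigned int)s->icr, s->intr_masked ? 1 : 0,
	    wd_describe(wd_classify(s))));
}
```

The point of keeping the classification separate from the message formatting is that the same snapshot could later be extended with other driver or device state without changing the basic masked/not-masked test.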

Scott


