NIC card problems....

Stefan Eßer se at FreeBSD.org
Sun Jan 23 15:27:42 PST 2005


On 2005-01-23 05:57 -0800, Net Virtual Mailing Lists <mailinglists at net-virtual.com> wrote:
> My latest problem is with a:
> 
> dc0: <ADMtek AN985 10/100BaseTX> port 0xe800-0xe8ff mem 0xe6080000-
> 0xe60803ff irq 11 at device 10.0 on pci0
> dc0: Ethernet address: 00:0c:41:ed:ae:85
> 
> ...  after several hours of *HEAVY* (I'm probably understating this)
> utilization I get:
> 
> dc0: TX underrun -- increasing TX threshold
> dc0: TX underrun -- increasing TX threshold
> .. repeats numerous times..
> dc0: TX underrun -- using store and forward mode

Well, that's nothing to worry too much about ...

The device has a data FIFO that is filled with data words fetched
by its bus-master DMA engine over the PCI bus. As an optimization,
transmission of the Ethernet frame may optionally start before all
data for the frame has been placed in the FIFO. Normally, data is
fetched by DMA faster than it is sent on the Ethernet, but if there
is a significant number of simultaneous PCI data transfers by other
devices, there is a risk that the FIFO runs out of data (buffer
underflow). In such a case, the current Ethernet frame can't be
finished; dummy data and a bogus CRC are appended so that the
receiving party discards the frame.
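
To make the race concrete, here is a toy model of the FIFO (a
sketch with illustrative numbers, not actual dc(4) code): the MAC
starts draining the FIFO once a threshold worth of bytes has
arrived, and an underrun occurs whenever DMA cannot stay ahead of
the 100 Mbit/s wire rate until the end of the frame.

    #include <stdio.h>

    /*
     * Toy model of the TX FIFO race -- illustrative numbers, not
     * actual dc(4) code.  The MAC starts sending once `threshold'
     * bytes are in the FIFO; if the effective DMA fill rate cannot
     * stay ahead of the 100 Mbit/s drain rate, the FIFO runs dry
     * before the end of the frame: a TX underrun.
     */
    static void
    try_frame(double fill_rate, double threshold)
    {
        const double frame = 1514.0;    /* max Ethernet frame, bytes */
        const double drain = 12.5;      /* 100 Mbit/s = 12.5 bytes/us */

        double t_start = threshold / fill_rate;  /* TX begins */
        double t_done = frame / fill_rate;       /* DMA finishes */
        /* FIFO level when the last byte has been fetched. */
        double level = frame - drain * (t_done - t_start);

        printf("DMA %4.1f B/us, threshold %4.0f B: %s\n",
            fill_rate, threshold, level > 0.0 ? "ok" : "UNDERRUN");
    }

    int
    main(void)
    {
        try_frame(25.0,  128.0);    /* idle bus: DMA outruns the wire */
        try_frame(10.0,  128.0);    /* congested bus: FIFO runs dry   */
        try_frame(10.0, 1514.0);    /* store-and-forward: always safe */
        return (0);
    }

Raising the threshold (or buffering the whole frame, see below)
trades a little extra latency for immunity against DMA stalls.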

If such a situation occurs, the driver increases the amount of data
required in the FIFO before transmission of the next Ethernet frame
starts. After multiple underruns, the driver configures the Ethernet
chip to buffer the full contents of each frame (store-and-forward
mode) and gives up on the early transmission start, since the
hardware apparently is not capable of providing data fast enough.
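
In pseudo-driver terms, the escalation behind those log messages
looks roughly like this (NOT the real if_dc.c code; the threshold
values and names are invented for illustration):

    #include <stdio.h>

    /*
     * Sketch of the escalation the messages above describe.  Each
     * underrun raises the early-transmit threshold; when the highest
     * threshold still underruns, the chip is switched to
     * store-and-forward mode.
     */
    static const int thresholds[] = { 128, 256, 512, 1024 };
    static int level = 0;
    static int store_and_forward = 0;

    static void
    tx_underrun(void)
    {
        if (store_and_forward)
            return;                 /* nothing left to escalate */
        if (level < 3) {
            level++;
            printf("dc0: TX underrun -- increasing TX threshold "
                "to %d bytes\n", thresholds[level]);
        } else {
            store_and_forward = 1;
            printf("dc0: TX underrun -- using store and forward mode\n");
        }
    }

    int
    main(void)
    {
        int i;

        /* Five underruns in a row, as under sustained PCI load. */
        for (i = 0; i < 5; i++)
            tx_underrun();
        return (0);
    }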

> .. at this point the system simply reboots.  I have attempted to apply a

Then there definitely is a bug somewhere, but not necessarily in the
dc driver. Instead of switching Ethernet cards, you may want to check
the PCI performance of your mainboard. The PCI latency timers (the
master latency timer and the individual timers in each device) may
play a role.

They determine the maximum "time slice" assigned to each bus-master
in turn. Too small a value makes the PCI bus throughput suffer
(because a few PCI bus clocks are lost each time the next bus-master
takes over), while too high a value may cause a device to starve
while waiting to be granted access to the bus.
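
If you want to see what your dc0 is actually configured to, the
master latency timer lives at config space offset 0x0d and can be
read through the pciio(4) interface (pciconf(8) reads the same
registers from the shell). A minimal sketch, with the bus/device/
function numbers taken from the probe line quoted above:

    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/pciio.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        struct pci_io io;
        int fd;

        if ((fd = open("/dev/pci", O_RDONLY)) < 0)
            err(1, "/dev/pci");

        memset(&io, 0, sizeof(io));
        io.pi_sel.pc_bus = 0;       /* pci0 ...          */
        io.pi_sel.pc_dev = 10;      /* ... device 10 ... */
        io.pi_sel.pc_func = 0;      /* ... function 0    */
        io.pi_reg = 0x0d;           /* master latency timer */
        io.pi_width = 1;            /* read a single byte */

        if (ioctl(fd, PCIOCREAD, &io) < 0)
            err(1, "PCIOCREAD");
        printf("latency timer: %u PCI clocks\n", io.pi_data);
        return (0);
    }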

A master latency timer value of 32 (0x20) should keep the bus-master
switch overhead down to 20% (i.e. 80% of the bus clocks are left for
data transfers) and should keep the latency in the range of 1
microsecond per bus-master (i.e. 5 microseconds if there are 2
Ethernet cards, 2 disk controllers and one host bridge active at the
same time). In that case, each PCI device could expect to transfer
about 100 bytes every 5 microseconds (a 32-bit bus at 33 MHz moves 4
bytes per clock), and a buffer of 128 bytes ought to suffice for a
Fast Ethernet card.
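
Redoing that estimate in code (the ~6 clocks of arbitration overhead
per bus-master switch is an assumption; everything else follows from
a 33 MHz, 32-bit bus):

    #include <stdio.h>

    int
    main(void)
    {
        const double clock = 33.0;      /* PCI clock, MHz */
        const double width = 4.0;       /* 32-bit bus: 4 bytes/clock */
        const int latency = 32;         /* master latency timer, clocks */
        const int overhead = 6;         /* clocks lost per switch (assumed) */
        const int masters = 5;          /* 2 NICs, 2 disk ctrls, host bridge */

        double slice = latency / clock;     /* us per time slice */
        double rotation = masters * slice;  /* us per full round */
        double payload = (latency - overhead) * width;

        printf("slice %.2f us, full rotation %.2f us\n", slice, rotation);
        printf("switch overhead %.0f%%, ~%.0f bytes/device/rotation\n",
            100.0 * overhead / latency, payload);
        /* Fast Ethernet drains 12.5 bytes/us while the card waits. */
        printf("wire drains %.0f bytes per rotation (FIFO is 128)\n",
            12.5 * rotation);
        return (0);
    }

This prints a slice of about 0.97 us, a full rotation of about 4.85
us, roughly 100 bytes transferred per device per rotation, and about
61 bytes drained by the wire in the meantime, comfortably within a
128-byte FIFO.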

But this is a simplified view and calculation. Devices may keep the
PCI bus longer than the granted time slice if slow devices insert
(what used to be called) "wait cycles". (And some sound cards have
been reported to cause high PCI load even when idle.)

The TX threshold messages issued by the dc driver look more like an
indication that the PCI bus is under severe load than like a hint
that the dc driver is causing the reboots, IMHO.

Regards, STefan

