em driver input errors

Wed Aug 19 12:52:27 UTC 2009

Greetings.

--- On Mon, 8/17/09, Дмитрий Замураев <gigabyte.tmn at gmail.com> wrote:

> From: Дмитрий Замураев <gigabyte.tmn at gmail.com>
> Subject: RE: em driver input errors
> To: alexpalias-bsdnet at yahoo.com
> Cc: freebsd-net at freebsd.org
> Date: Monday, August 17, 2009, 6:17 PM
> 
>  
> >/boot/loader.conf:
> >hw.em.rxd=4096
> >hw.em.txd=4096
> Why are you using these values? Try the defaults (without
> these lines in loader.conf).

As I said in my original email, I was getting far more errors with the defaults.

> > Without the above we were seeing way more errors; now they are
> > reduced, but still come in bursts of over 1000 errors on em0.
> > Still seeing errors, after some searching the mailing lists we
> > also added:
> ># the four lines below are repeated for em1, em2, em3
> >dev.em.0.rx_int_delay=0
> >dev.em.0.rx_abs_int_delay=0
> >dev.em.0.tx_int_delay=0
> >dev.em.0.tx_abs_int_delay=0
> Try increasing rx_int_delay to 600 and rx_abs_int_delay to 1000;
> leave tx_*_delay unchanged, at the default (100?).

Thanks for the suggestion.
From a "clean" box:
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66

I reset all the values (errors were still appearing), then tried your suggestion (rx_int_delay=600, rx_abs_int_delay=1000).  This reduced the number of interrupts for em0 (from about 7200/sec to around 6500/sec).  After some time, I started getting errors again.  But that made me try this as well:

dev.em.0.tx_int_delay=600
dev.em.0.tx_abs_int_delay=1000

That is, I used your suggested values for tx too.  Now em0 is seeing about 1800 interrupts/second, which is much better, but after some time I saw errors again...
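
To put the interrupt figures above in proportion, here is a small Python sketch that turns the three approximate em0 rates quoted in this message into percentage reductions. The variable and function names are made up for illustration; only the three numbers come from the message.

```python
# Approximate em0 interrupt rates (per second) quoted in this message.
baseline_irqs = 7200    # before changing the delays
rx_only_irqs = 6500     # rx_int_delay=600, rx_abs_int_delay=1000
rx_and_tx_irqs = 1800   # same values applied to the tx delays as well


def reduction_pct(before, after):
    """Return how much lower `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


print(f"rx only: {reduction_pct(baseline_irqs, rx_only_irqs):.1f}% fewer interrupts")
print(f"rx + tx: {reduction_pct(baseline_irqs, rx_and_tx_irqs):.1f}% fewer interrupts")
```

With these rough figures, the rx-only change removes about a tenth of the interrupts, while applying the delays to tx as well removes about three quarters.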

From the output of "netstat -nI em0 -w 5":

            input          (em0)           output
   packets  errs      bytes    packets  errs      bytes colls
     87267     0   50372599     106931     0   81598993     0
     86496     0   50990332     105467     0   80064657     0
     81726  3056   49876613      99080     0   73273640     0
     90425     0   59172531     105299     0   77110096     0
    120292     0   70369292     109597     0   78626248     0
... a few minutes pass with zero errors ...
     89646     0   56951878     111240     0   86493393     0
     86031     0   53549721     108695     0   83592747     0
     77760  3054   48505562      96912     0   73185576     0
     87508     0   56116394     106094     0   79130608     0
     89031     0   56490982     103039     0   77398567     0

What's interesting is that I'm seeing errors at around 80k packets per 5 seconds (about 16k packets/s), but no errors at 120k packets per 5 seconds (24 kpps).
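
The per-second figures can be checked with a short Python sketch. The sample values are the input packets and input errs columns copied from the netstat output above; the 5-second interval comes from the "-w 5" argument. The function name is hypothetical, and this is only a rough check of the arithmetic, not a diagnosis.

```python
INTERVAL_S = 5  # from "netstat -nI em0 -w 5"

# (input packets, input errs) per interval, copied from the output above.
samples = [
    (87267, 0), (86496, 0), (81726, 3056), (90425, 0), (120292, 0),
    (89646, 0), (86031, 0), (77760, 3054), (87508, 0), (89031, 0),
]


def per_second(samples, interval_s=INTERVAL_S):
    """Convert per-interval counters to (packets/s, errors/s) pairs."""
    return [(pkts / interval_s, errs / interval_s) for pkts, errs in samples]


rates = per_second(samples)
for pps, eps in rates:
    flag = "  <-- errors" if eps > 0 else ""
    print(f"{pps:9.1f} pkts/s  {eps:7.1f} errs/s{flag}")

# Highest input rate seen with and without errors in these samples.
peak_clean = max(pps for pps, eps in rates if eps == 0)
peak_with_errors = max(pps for pps, eps in rates if eps > 0)
print(f"peak without errors: {peak_clean:.0f} pkts/s")
print(f"peak with errors:    {peak_with_errors:.0f} pkts/s")
```

In these samples the busiest interval (about 24k packets/s) is error-free, while both intervals with errors sit near 16k packets/s, which suggests the errors are not simply a function of the average packet rate over 5 seconds.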

Currently, I've set the delay to 600 and abs_delay to 1000 on all interfaces (em0, em1, em2, em3), thus reducing the number of interrupts.
I'm currently seeing (in systat -vmstat 2):
Around 1800 irqs/s for em0, 1800 for em1, 1800 for em2, under 10/s for em3
Around 2000 irqs/s for cpu0:time, 2000 more for cpu1:time, 2000 for cpu2:time and 2000 for cpu3:time.

Interrupts total (as reported by systat):  around 13500/second.  I would estimate the old IRQ load at around 30000-35000/second, which doesn't seem like too much to me for a dual Xeon machine.
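
As a sanity check on the total, the per-source rates listed above can be summed in a few lines of Python. The rates are the approximate values read from systat; em3 is listed as "under 10/s", so 10 is used here as an upper bound (an assumption for the sake of the sum).

```python
# Approximate interrupt rates (per second) read from "systat -vmstat 2".
irq_rates = {
    "em0": 1800,
    "em1": 1800,
    "em2": 1800,
    "em3": 10,          # "under 10/s"; upper bound assumed here
    "cpu0:time": 2000,
    "cpu1:time": 2000,
    "cpu2:time": 2000,
    "cpu3:time": 2000,
}

total = sum(irq_rates.values())
nic_total = sum(rate for name, rate in irq_rates.items() if name.startswith("em"))

print(f"NIC interrupts:   {nic_total}/s")
print(f"timer interrupts: {total - nic_total}/s")
print(f"total:            {total}/s")
```

The sum comes to roughly 13400/s, consistent with the "around 13500/second" that systat reports, and more than half of it is timer interrupts rather than network interrupts.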

> >kern.ipc.nmbclusters=655360
> No need. See netstat -m

Thanks, but as I said, I did try almost *EVERYTHING* I could without rebooting.  Including this.

Speaking of which, I did compile the kernel with "options DEVICE_POLLING", but enabling polling only made the errors appear more often, and in greater numbers.

> P.S. Change the copper cable and turn off flow control
> (if it is on).

There are 4 em interfaces on this machine, with new cat6 cables.  There are 2 more em interfaces on another machine that was seeing the same errors (the old router), on different cables, and 2 more em interfaces on another machine that's in production, also with new cables.  Of the input errors (as reported by sysctl dev.em.0.stats=1, then reading dmesg), only 2 are due to CRC errors, as opposed to around 2,500,000 from other causes.  I tend to feel the cable isn't the problem.

Flow control is off, I just checked.  I forgot about that one, thanks for reminding me.

Thank you for your help
Alex