Reproducible problems with re(4) on RELENG_7 and HEAD

Sun Nov 11 17:08:43 PST 2007

On Sun, Nov 11, 2007 at 07:07:06PM +0100, Daniel Gerzo wrote:
 > Hello people,
 > 
 >   I would like to report problems which are most probably related to
 >   the re(4) driver. The problem is reproducible after some time (i.e.
 >   after some amount of data has been sent/received) and disappears
 >   again after reboot. You can try to reproduce it by extracting a big
 >   tar archive containing thousands of small files with verbose
 >   output (e.g. tar -v) over an ssh session. It will reset your connection
 >   after some time with something like:
 > 
 >   Disconnecting: Bad packet length 4070316545.
 > 
 >   After it's been provoked for the first time, it is even easier
 >   to provoke it again, simply by ssh-ing to the box and running "yes".
 >   The problem will occur in a few seconds, and you will be
 >   disconnected with the error mentioned above, or sometimes with the
 >   following error:
 > 
 >   Disconnecting: Corrupted MAC on input.
 > 
 >   Ok, these were the symptoms, now the device in question:
 > 
 >   I suppose that it is an integrated card (the machine is in
 >   colocation and I've never seen it myself). This is the
 >   respective line from dmesg:
 >   
 > re0: <RealTek 8168/8111B PCIe Gigabit Ethernet> port 0xd800-0xd8ff mem 0xfdfff000-0xfdffffff irq 19 at device 0.0 on pci2
 > 
 >   pciconf -lv output:
 > 
 > re0 at pci0:2:0:0: class=0x020000 card=0x368c1462 chip=0x816810ec rev=0x01 hdr=0x00
 >     vendor     = 'Realtek Semiconductor'
 >     device     = 'RTL8168/8111 PCI-E Gigabit Ethernet NIC'
 >     class      = network
 >     subclass   = ethernet
 >   
 >   I would swear that this isn't bad hardware, as the machine is
 >   brand new, and we have 4 of these boxes, all of them showing the
 >   same symptoms. I also have a friend who has been experiencing the same
 >   problem for quite some time on HEAD (I am running on recent
 >   RELENG_7).
 > 
 >   I will gladly provide any additional data that might be
 >   required. I can also arrange remote ssh access to the machine so it
 >   can be debugged.
 > 
 >   The problem is that the system itself doesn't hang, there is no
 >   panic and no additional information in /var/log/messages. If there
 >   is any way I can debug this, please let me know and I will do so
 >   ASAP (as we are migrating our servers to this hardware). Also, I
 >   wasn't able to reproduce it by transferring a 10 GB file over ftp,
 >   but once the problem starts to occur, it's not limited to ssh
 >   connections. Even mysql connections are being reset.
 > 
 >   Any help will be greatly appreciated! Also, if you are able to
 >   confirm this problem with an re(4) Ethernet card, please let us know!
 > 

I haven't encountered this problem, and I think you're the first one
to report this issue. The problem description suggests that data
corruption happens somewhere during large transfers of data.

It seems that you can reproduce it on demand, so how about disabling
checksum offload on re(4)? If that fixes the issue, would you check the
number of bad checksums reported by "netstat -s" before and after the test?
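
A minimal sketch of that test, assuming the interface is re0 as in the
dmesg output quoted above. The helper names and the IFACE variable are
made up for illustration, and the grep pattern is only a rough filter
for the checksum-related lines that "netstat -s" prints:

```shell
# Sketch only: IFACE and the helper names below are illustrative assumptions.
IFACE="${IFACE:-re0}"

# Keep only the checksum-related counter lines from "netstat -s" output
# read on stdin.
csum_counters() {
    grep -i 'checksum'
}

# Turn off RX/TX checksum offload on the interface (needs root).
disable_csum_offload() {
    ifconfig "$IFACE" -rxcsum -txcsum
}

# Save the current checksum counters to the file named by $1.
snapshot_csum() {
    netstat -s | csum_counters > "$1"
}
```

One possible sequence: run "snapshot_csum /tmp/csum.before", then
"disable_csum_offload", reproduce the problem (e.g. the verbose tar
extraction over ssh), run "snapshot_csum /tmp/csum.after", and compare
the two files with diff. Counters that grew during the test would point
at corrupted frames reaching the stack.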

-- 
Regards,
Pyun YongHyeon