NFE adapter 'hangs'

Fri Oct 15 12:59:59 UTC 2010

On 4 Sep 2010, at 01:53, Pyun YongHyeon wrote:

> On Fri, Sep 03, 2010 at 07:59:26AM +0100, Melissa Jenkins wrote:
>> 
>> Thank you for your very quick response :)
>> 
> 
> [...]
> 
>>> Also I'd like to know whether both RX and TX are dead or only one
>>> RX/TX path is hung. Can you see incoming traffic with tcpdump when
>>> you think the controller is stuck?
>> 
>> Yes, though not very much. The traffic to 4800 is every second so you can see in the following trace when it stops
>> 
>> 07:10:42.287163 IP 192.168.1.203 > 224.0.0.240:  pfsync 108
>> 07:10:42.911995
>> 07:10:43.112073 STP 802.1d, Config, Flags [Topology change], bridge-id 8000.c4:7d:4f:a9:ac:30.8008, length 43
>> 07:10:43.148659 IP 192.168.1.203.57026 > 192.168.1.255.4800: UDP, length 60
>> 07:10:43.148684 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.129.4800: UDP, length 60
>> 07:10:43.148689 IP 172.31.1.203 > 172.31.1.129: GREv0, length 92: IP 192.168.1.203.57026 > 192.168.1.1.4800: UDP, length 60
>> 07:10:43.148918 IP 192.168.1.213.40677 > 192.168.1.255.4800: UDP, length 48
> 
> [...]
> 
>> a bit later on, still broken, a slightly odd message:
>> 07:11:43.079720 IP 172.31.1.129 > 172.31.1.213: GREv0, length 52: IP 192.168.1.129.60446 > 192.168.1.213.179:  tcp 12 [bad hdr length 16 - too short, < 20]
>> 07:11:44.210794 IP 172.31.1.129 > 172.31.1.203: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.203.4800: UDP, length 52
>> 07:11:44.210831 IP 172.31.1.129 > 172.31.1.213: GREv0, length 84: IP 192.168.1.129.64744 > 192.168.1.213.4800: UDP, length 52
>> 
>> Now this really is odd, I don't recognise either of those MAC addresses, though the SQL shown is used on this machine:
>> 07:12:13.054393 45:43:54:20:41:63 > 00:00:03:53:45:4c, ethertype Unknown (0x6374), length 60:
>>        0x0000:  556e 6971 7565 4964 2046 524f 4d20 7261  UniqueId.FROM.ra
>>        0x0010:  6461 6363 7420 2057 4845 5245 2043 616c  dacct..WHERE.Cal
>>        0x0020:  6c69 6e67 5374 6174 696f 6e49 6420       lingStationId.
> 
> Hmm, it seems you're using a really complex setup. It's very hard to
> narrow down the guilty parts in such an environment. Could you set up
> a simple network configuration that reproduces the issue? One
> possible cause would be wrong (garbled) data being passed up to the
> upper stack. But I have no idea why you see GRE packets with a
> truncated TCP header (172.31.1.129 > 172.31.1.213).
> How about disabling TX/RX checksum offloading as well as TSO?
> 
> [...]
> 
>> 
>> I then restarted the interface (nfe down/up, route restart)
>> 
>> From dmesg at the time (slightly obfuscated)
>> Sep  3 07:10:19 manch2 bgpd[89612]: neighbor XX: received notification: HoldTimer expired, unknown subcode 0
>> Sep  3 07:10:49 manch2 bgpd[89612]: neighbor XX connect: Host is down
>> # at this point I took the interface down & up and reloaded the routing tables
>> Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:07 manch2 kernel: nfe0: link state changed to DOWN
>> Sep  3 07:12:07 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:11 manch2 kernel: nfe0: link state changed to UP
>> Sep  3 07:12:11 manch2 kernel: carp0: link state changed to DOWN
>> Sep  3 07:12:14 manch2 kernel: carp0: link state changed to UP
> 
> Hmm, it does not look right, carp0 showed link DOWN message four
> times in a row.
> By the way, are you using IPMI on MCP55? nfe(4) is not ready to
> handle MAC operation with IPMI.

Turning off TX and RX checksum offloading seems to have resolved the problem:

ifconfig nfe0 -txcsum -rxcsum

This seems to have stopped both the corruption and the interface hangs.  I ran it for about 16 hours on the FreeBSD 8 box, and it also appears to have fixed the problem on my FreeBSD 7 machine.

I didn't try turning off TSO.
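
For anyone checking their own box, here is a rough sketch of how to see which offloads are still enabled after a change like the one above.  The function name enabled_offloads is made up for illustration, and it assumes ifconfig prints the capabilities on an "options=...<...>" line, as FreeBSD's ifconfig normally does.  I have not tested the -tso step mentioned in the comments on this hardware.

```shell
#!/bin/sh
# enabled_offloads (hypothetical helper): read `ifconfig <interface>`
# output on stdin and print the offload capabilities that are still
# enabled, one per line.  Prints nothing if none of them are enabled.
enabled_offloads() {
    # Extract the comma-separated list from the first "options=...<...>"
    # line, split it into one capability per line, then keep only the
    # checksum-offload and TSO entries.
    sed -n 's/^[[:space:]]*options=[0-9a-fA-Fx]*<\(.*\)>.*$/\1/p' |
        head -n 1 |
        tr ',' '\n' |
        grep -E '^(RXCSUM|TXCSUM|TSO4|TSO6)$'
}

# Usage on the affected host (nfe0 is the interface from this thread):
#   ifconfig nfe0 | enabled_offloads
#
# If TSO4 still shows up and you want to rule it out as well, TSO can
# be turned off the same way as the checksum offloads (untested here):
#   ifconfig nfe0 -tso
```

With only the checksum offloads disabled, I would expect TSO4 to still be listed if the interface had it enabled beforehand.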

Thank you for your suggestion & help!
Mel