nfs-server silent data corruption

Wed Apr 23 09:52:11 UTC 2008

2008/4/23 Pyun YongHyeon <pyunyh at gmail.com>:
>
> On Wed, Apr 23, 2008 at 12:13:44AM +0400, pluknet wrote:
>   > On 22/04/2008, Mike Tancsa <mike at sentex.net> wrote:
>   > > At 02:00 PM 4/22/2008, Arno J. Klaassen wrote:
>   > >
>   > > > >
>   > > > > Are you using the latest RELENG_7, or at least the latest version of
>   > > > > nfe that's in RELENG_7?
>   > > >
>   > > >
>   > > > Think so :
>   > > >
>   > >
>   > >  OK, and is it the latest RELENG_7? Or has just the if_nfe.c file been
>   > > manually updated? Also, are you using the ULE or the 4BSD scheduler?  I still
>   > > have 4BSD on the box I am testing on.
>   >
>   > Hi, I have the same problem with data corruption (with nfe on the nfs server side),
>   > particularly when transferring large files.
>   > Maybe this is somehow related to this topic.
>   >
>   > My simple test case:
>   > truncate -s 1000m bigfile
>   > ^^ here I get a zero-filled file
>   > cp bigfile /nfs/mounted
>   > ^^ here I get a file that is not all zeros after uploading to the nfs server
>   >
>   > I looked at the corrupted file. It contains a few ranges filled with
>   > non-zero bytes:
>   > equal to zero?  real 4-byte value   offset
>   > ======================================
>   > not equal       1200355616     at pos=38797316
>   > ... <-- this range contains garbage in every 4-byte word, omitted
>   > not equal       3879749905     at pos=38813696
>   >
>   > not equal       161160732      at pos=45613060
>   > ... <-- ditto
>   > not equal       575257183      at pos=45629440
>   >
>   > not equal       1943682165     at pos=59768836
>   > ... <-- ditto
>   > not equal       2843639625     at pos=59785216
>   >
>   > not equal       2653910121     at pos=60293124
>   > ... <-- ditto
>   > not equal       3462830780     at pos=60309504
>   >
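
The scan behind the table above can be sketched roughly as follows. This
is only a minimal illustration, not the tool that produced those numbers:
it assumes that "pos" is the byte offset of a 4-byte word inside the file,
and the function name nonzero_ranges and its chunk_size parameter are
made up for this example.

```python
WORD = 4  # the table above reports 4-byte values


def nonzero_ranges(path, chunk_size=1 << 20):
    """Yield (first_pos, last_pos) for each run of non-zero 4-byte words.

    Positions are byte offsets of the first and last non-zero word in a
    run. chunk_size must be a multiple of WORD.
    """
    start = None   # offset of the first word in the current run
    last = None    # offset of the most recent non-zero word
    offset = 0     # file offset of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            if not any(chunk):
                # Whole chunk is zero: close any run that was open.
                if start is not None:
                    yield (start, last)
                    start = None
            else:
                for i in range(0, len(chunk), WORD):
                    pos = offset + i
                    if any(chunk[i:i + WORD]):
                        if start is None:
                            start = pos
                        last = pos
                    elif start is not None:
                        yield (start, last)
                        start = None
            offset += len(chunk)
    # A run that extends to end of file is still reported.
    if start is not None:
        yield (start, last)
```

Run against the copy on the server, e.g.
list(nonzero_ranges("/nfs/mounted/bigfile")), this should return an empty
list for an intact zero-filled file and one (first, last) pair per
corrupted range otherwise.
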
>   > Some info:
>   >
>   > nfs server on 8-CURRENT as of Apr 17
>   > nfs client on 7.0-STABLE as of Apr 12
>   >
>   > dmesg | grep nfe
>   > nfe0: <NVIDIA nForce2 MCP2 Networking Adapter> port 0xe000-0xe007 mem
>   > 0xe2001000-0xe2001fff irq 20 at device 4.0 on pci0
>   > miibus0: <MII bus> on nfe0
>   > nfe0: Ethernet address: 00:04:61:6c:76:b1
>   > nfe0: [FILTER]
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > nfe0: tx v1 error 0x6001
>   > ^^^
>
>  I'm not sure it's related to the data corruption issue, but 0x6001
>  would mean a Tx underflow error. I recall these Tx errors were seen
>  on nfe(4) when the negotiated speed/duplex did not match the link
>  partner or MACs.
>  Does the link partner also agree on the speed/duplex settings of nfe(4)?

There is one unmanaged 10/100 switch between them (both hosts are 100baseTX),
so I cannot say exactly :( I can reach speeds up to 100 Mbps, though.
I can test with a direct connection later if needed.

>  What PHY driver does nfe(4) use?
>

$ kldload if_nfe
nfe0: <NVIDIA nForce2 MCP2 Networking Adapter> port 0xe000-0xe007 mem
0xe2001000-0xe2001fff irq 20 at device 4.0 on pci0
nfe0: Ethernet address: 00:04:61:6c:76:b1
nfe0: [FILTER]
miibus0: <MII bus> on nfe0
rlphy0: <RTL8201L 10/100 media interface> PHY 1 on miibus0
rlphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
nfe0: link state changed to DOWN
nfe0: link state changed to UP

So, it seems to be rlphy.

>
>   > These appear while cp'ing the file to the server.
>   > (btw they do not appear with polling disabled, so it's probably another issue)
>   >
>   > vmstat -i | grep nfe
>   > irq20: nfe0 ohci0                      1          0
>   >
>   > nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>   >         options=48<VLAN_MTU,POLLING>
>   >         ether 00:04:61:6c:76:b1
>   >         inet 192.168.200.137 netmask 0xffffff00 broadcast 192.168.200.255
>   >         media: Ethernet autoselect (100baseTX <full-duplex>)
>   >         status: active
>   > I can reproduce it regardless of whether polling is enabled.
>   >
>   > nfe0 at pci0:0:4:0:        class=0x020000 card=0x10001695 chip=0x006610de
>   > rev=0xa1 hdr=0x00
>   >

wbr,
pluknet