nfs-server silent data corruption

Arno J. Klaassen arno at heho.snv.jussieu.fr
Mon Apr 21 21:57:39 UTC 2008


Hello,

Mike Tancsa <mike at sentex.net> writes:

> At 10:52 AM 4/21/2008, Arno J. Klaassen wrote:
> 
> >Device is :
> >
> > nfe0 at pci0:0:10:0:       class=0x068000 card=0x289510f1
> > chip=0x005710de rev=0xa3 hdr=0x00
> >     vendor     = 'Nvidia Corp'
> >     device     = 'nForce4 Ultra NVidia Network Bus Enumerator'
> >     class      = bridge
> >     cap 01[44] = powerspec 2  supports D0 D1 D2 D3  current D0
> >
> >(this is with the default BIOS setting " LAN Bridge Enabled", disabling
> >  that setting makes pciconf say "class = network" but does not influence
> >  my problem)
> >
> >I will restart my tests now by populating all 4G to only CPU1 and
> >say whether that matters.
> 
> Hi,
> How long does it take for the problem to show up ?


Less than an hour in general (running the same client script
simultaneously on a 100Mbps linux box and 1Gbps bds6-x86)

> I have what appears
> to be a very similar Tyan board (I have an Socket 939 X2 cpu) with the
> same NIC, but this one is running RELENG_7 from April 17th.  There
> have been a few fixes for the nfe driver since 7.0
> 
> I am running this small script below on a nfs client (em nic) against
> the server (nfe) ( mount options on the client 192.168.245.1:/backup
> /backup nfs rw,-r=32768,-w=32768,tcp,noauto )
> 
> #!/bin/sh
> i=0
> while true
> do
>   i=`expr $i + 1`
>   dd if=/dev/urandom of=/tmp/junk.txt bs=1024 count=81920  > /dev/null 2>&1
>   cp -p /tmp/junk.txt /backup/
>   orig=`md5 -q /tmp/junk.txt`
>   umount /backup
>   sleep 2
>   mount /backup
>   copy=`md5 -q /backup/junk.txt`
>   echo "$orig and $copy on $i"
>   if [ "$orig" != "$copy" ]; then
>          echo "\a copy not ok on $i"
>          exit 255
>   fi
> done


This is much the same as what I do, except that I skip the
umount/sleep/mount step and use the same partition for write and copy:

SIZE=$1

COUNTER=${2:-20}

until [  $COUNTER -lt 1 ]; do
    echo "**** Still $COUNTER iterations to go *** "
    echo
    echo -n Creating random file of $SIZE MBytes ...
    dd if=/dev/random of=BIG bs=1048576 count=${SIZE} > /dev/null 2>&1
    echo Done
    echo -n Calculating md5 checksum ...
    CS1=`md5 -q BIG`
    echo Done
    echo -n Copying file ...
    cp -fp BIG BIG2
    echo Done
    echo -n Calculating md5 checksum ...
    CS2=`md5 -q BIG2`
    echo Done
    if [ "${CS1}" != "${CS2}" ]; then
     echo CHECKSUM MISMATCH
     exit 1
    else
     echo
    fi
    COUNTER=$((COUNTER - 1))
done
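
When the checksums differ, it may help to see where the two files
diverge rather than only that they do. A minimal sketch, assuming a
POSIX cmp and awk are available; the function names first_diff and
diff_count are made up for illustration and are not part of the
script above:

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original test script.

# Print the 1-based byte offset of the first differing byte,
# or nothing if the files are identical.
first_diff() {
    cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1; exit }'
}

# Print how many byte positions differ between the two files.
diff_count() {
    cmp -l "$1" "$2" 2>/dev/null | awk 'END { print NR }'
}
```

After a mismatch, something like "first_diff BIG BIG2" would give the
offset of the first bad byte. For a mount using the 32768-byte read and
write sizes quoted earlier, dividing that offset by 32768 might show
whether the damage lines up with whole transfer blocks.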


For info, I test with args '38 999' (38 MB, up to 999 iterations) on
linux (with a slightly adapted script, BTW) and '138 999' on bsd. The
best 'score' I got was 'still 871 iterations to go'.

> On the server, I have
> 
> nfe0 at pci0:0:10:0:       class=0x068000 card=0x286510f1 chip=0x005710de
> rev=0xa3 hdr=0x00
>      vendor     = 'Nvidia Corp'
>      device     = 'nForce4 Ultra NVidia Network Bus Enumerator'
>      class      = bridge
>      cap 01[44] = powerspec 2  supports D0 D1 D2 D3  current D0


idem

> # ifconfig nfe0
> nfe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>          options=10b<RXCSUM,TXCSUM,VLAN_MTU,TSO4>
>          ether 00:e0:81:58:91:6a
>          inet 192.168.245.1 netmask 0xffffff00 broadcast 192.168.245.255
>          media: Ethernet autoselect (1000baseTX <full-duplex,flag0,flag1>)
>          status: active

idem
 
> How long does it take for the problem to come up ?

As said: approximately half an hour; never more than 4 hours.


Best, Arno


More information about the freebsd-stable mailing list