kern/50201: [twe] 3ware RAID 5 resulting in data corruption

Sat Nov 26 16:00:20 GMT 2005

The following reply was made to PR kern/50201; it has been noted by GNATS.

From: Jan Srzednicki <w at wrzask.pl>
To: bug-followup at freebsd.org, bruce at engmail.uwaterloo.ca, dpk at dpk.net
Cc:  
Subject: Re: kern/50201: [twe] 3ware RAID 5 resulting in data corruption
Date: Sat, 26 Nov 2005 16:58:36 +0100

 I'm experiencing a similar problem, though with a few notable
 differences.

 First of all, I'm running FreeBSD 5.4-RELEASE (with RELENG_5_4 fixes) on
 my machine. Here's a brief output from my dmesg related to the 3ware
 controller:

 [16:32] hostname:~ # dmesg | grep twe
 twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0xcc00-0xcc0f mem 0xfe000000-0xfe7fffff irq 21 at device 0.0 on pci2
 twe0: 8 ports, Firmware FE7X 1.05.00.065, BIOS BE7X 1.08.00.048
 twed0: <Unit 0, RAID5, Normal> on twe0
 twed0: 1192370MB (2441975040 sectors)

 The controller is a 7000-class 8-way RAID controller with PATA
 interfaces.

 I'm experiencing repeatable data corruption, but it was far more
 difficult to pin down. I'm using the array for backups, which I'm
 doing via ssh over the network (100Mbit ethernet) in the following way:

 dump | gzip | md5checker | network(ssh) | md5checker | split twe0/files

 md5checker is my small utility that calculates the MD5 sum of each 1MB
 chunk of data piped through it. It assured me that the data corruption
 does not occur on the network, as the MD5 sums on both sides match. The
 total size of the backed-up data after gzipping is about 43GB.
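
 For illustration only, a filter along the lines of md5checker could be
 sketched like this in Python. It is not the actual utility: the function
 names, the chunk_size parameter and the log format are all assumptions,
 and it only shows the idea of copying the stream through unchanged while
 logging one MD5 sum per 1MB chunk.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described above


def read_exact(src, size):
    """Read up to `size` bytes, looping over short reads (common on pipes)."""
    parts = []
    remaining = size
    while remaining > 0:
        data = src.read(remaining)
        if not data:
            break  # end of stream
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


def md5_passthrough(src, dst, log, chunk_size=CHUNK_SIZE):
    """Copy src to dst unchanged and write one 'index md5' line per chunk.

    Hypothetical stand-in for md5checker; returns the number of chunks seen.
    """
    index = 0
    while True:
        chunk = read_exact(src, chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        log.write("%d %s\n" % (index, hashlib.md5(chunk).hexdigest()))
        index += 1
    dst.flush()
    return index
```

 Called with sys.stdin.buffer, sys.stdout.buffer and sys.stderr, such a
 function would sit in a pipeline on either side of ssh, and the two logs
 could then be compared with diff to see whether any chunk changed in
 transit.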

 The strange thing was that performing _the same_ backup in the following
 way:

 dump | gzip > file
 cat file | md5checker | network(ssh) | md5checker | split twe0/files

 .. did not produce any errors (I repeated both "ways" several times, to
 make sure). So it appears that the data corruption is somehow related
 to the speed of the data transmission, as dump output is quite irregular
 and becomes rather slow when it hits a bunch of small files. The whole
 dump process takes about 6 hours.

 I tried dumping the data onto an IDE disk in the same machine as the
 controller, which resulted in no errors. I also tried turning off
 softupdates on the filesystem on the 3ware array, with no effect. So it
 appears that the data corruption is somehow related to the 3ware
 controller.

 After some investigation, I've discovered the following facts:
  - data is corrupted in exact 128kB chunks; the whole 128kB is bad and
    appears to be random (that is, I could not find any similar chunk in
    other files on the partition).
  - errors are pretty rare; in the whole 43GB stream I'm getting about 3
    or 4 errors.
  - I'm not able to reproduce the data corruption locally. Things like:

 	cat /dev/(zero|urandom) | md5checker | split array/files

    .. did not produce _any_ errors, after piping about a terabyte of
    data.
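
 A check of the kind behind the first point could be sketched as follows.
 This is not the procedure actually used; it is a minimal Python sketch
 that assumes a known-good copy of the stream is available next to the
 copy read back from the array, and the function name find_bad_blocks and
 both path arguments are hypothetical.

```python
BLOCK_SIZE = 128 * 1024  # 128kB, the corruption granularity reported above


def find_bad_blocks(good_path, suspect_path, block_size=BLOCK_SIZE):
    """Return indices of block_size-aligned blocks that differ between files."""
    bad = []
    index = 0
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        while True:
            a = good.read(block_size)
            b = suspect.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

 Multiplying a returned index by the block size gives the byte offset of
 the damaged region in the file.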

 It also appears that turning off the write cache on the controller fixed
 the problem, but writes are very slow now.

 I don't have another 3ware controller, so I cannot check whether this is
 a hardware fault in this particular card.

 I'm of course willing to provide any feedback needed on this issue, but
 because each test run takes so long, testing is rather slow.