kern/50201: [twe] 3ware RAID 5 resulting in data corruption
Jan Srzednicki
w at wrzask.pl
Sat Nov 26 16:00:20 GMT 2005
The following reply was made to PR kern/50201; it has been noted by GNATS.
From: Jan Srzednicki <w at wrzask.pl>
To: bug-followup at freebsd.org, bruce at engmail.uwaterloo.ca, dpk at dpk.net
Cc:
Subject: Re: kern/50201: [twe] 3ware RAID 5 resulting in data corruption
Date: Sat, 26 Nov 2005 16:58:36 +0100
I'm experiencing a similar problem, though with a few notable
differences.
First of all, I'm running FreeBSD 5.4-RELEASE (with RELENG_5_4 fixes) on
my machine. Here's a brief output from my dmesg related to the 3ware
controller:
[16:32] hostname:~ # dmesg | grep twe
twe0: <3ware Storage Controller. Driver version 1.50.01.002> port 0xcc00-0xcc0f mem 0xfe000000-0xfe7fffff irq 21 at device 0.0 on pci2
twe0: 8 ports, Firmware FE7X 1.05.00.065, BIOS BE7X 1.08.00.048
twed0: <Unit 0, RAID5, Normal> on twe0
twed0: 1192370MB (2441975040 sectors)
The controller is a 7000-class 8-way RAID controller with PATA
interfaces.
I'm experiencing repeatable data corruption, but it was far more
difficult to pin down. I'm using the array for backups, which I'm
doing via ssh over the network (100Mbit ethernet) in the following way:
dump | gzip | md5checker | network(ssh) | md5checker | split twe0/files
md5checker is my small utility that calculates the MD5 sum of each 1MB
chunk of data piped through it. It assured me that the data corruption
does not occur on the network, as the MD5 sums on both sides match. The
total size of the backed-up data after gzipping comes to about 43GB.
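For readers who want to reproduce this kind of check, here is a rough
sketch of what a tool like md5checker could look like. This is not the
actual utility: the names md5_passthrough, read_exact and main, the log
format, and the choice of stderr for the log are all assumptions made
for illustration.

```python
import hashlib
import sys

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described above


def read_exact(stream, size):
    """Read up to `size` bytes, looping over short reads from a pipe."""
    parts = []
    remaining = size
    while remaining > 0:
        data = stream.read(remaining)
        if not data:  # EOF
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


def md5_passthrough(src, dst, log, chunk_size=CHUNK_SIZE):
    """Copy src to dst unchanged and log one MD5 line per chunk.

    Returns the list of hex digests, one per chunk (hypothetical helper).
    """
    digests = []
    index = 0
    while True:
        chunk = read_exact(src, chunk_size)
        if not chunk:
            break
        digest = hashlib.md5(chunk).hexdigest()
        log.write(f"{index} {digest}\n")  # assumed log format: "<chunk no> <md5>"
        dst.write(chunk)                  # data passes through unmodified
        digests.append(digest)
        index += 1
    return digests


def main():
    # Pipeline use: data on stdin/stdout, checksums on stderr (an assumption).
    md5_passthrough(sys.stdin.buffer, sys.stdout.buffer, sys.stderr)
    sys.stdout.buffer.flush()
```

Calling main() at the end of the script lets it sit in the middle of a
pipe. Running one copy on each side of the ssh link and diffing the two
logs shows whether any chunk changed in transit.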
The strange thing was that performing _the same_ backup in the following
way:
dump | gzip > file
cat file | md5checker | network(ssh) | md5checker | split twe0/files
.. did not produce any errors (I repeated both "ways" several times, to
make sure). So it appears that the data corruption is somehow related
to the speed of the data transmission, as dump output is quite irregular
and becomes rather slow when it hits a bunch of small files. The whole
dump process takes about 6 hours.
I tried dumping the data onto an IDE disk in the machine with the
controller, which resulted in no errors. I also tried turning off
softupdates on the filesystem on the 3ware array, with no effect. So
the data corruption does appear to be related to the 3ware controller.
After some investigation, I've discovered the following facts:
- data is corrupted in chunks of exactly 128kB; the whole 128kB is bad
  and appears to be random (that is, I could not find any similar chunk
  in other files on the partition).
- errors are pretty rare; in the whole 43GB stream I'm getting about 3
  or 4 errors.
- I'm not able to reproduce the data corruption locally. Things like:
    cat /dev/(zero|urandom) | md5checker | split array/files
  .. did not produce _any_ errors, after piping about a terabyte of
  data.
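One way to locate damage like this is to compare a file from the array
against a known-good copy, 128kB at a time. The sketch below is only an
illustration and not necessarily how the chunks above were found. It
assumes a good copy exists (for example the local output of
"dump | gzip > file"), and differing_blocks is a hypothetical helper.

```python
BLOCK_SIZE = 128 * 1024  # the 128kB corruption size observed above


def differing_blocks(ref, copy, block_size=BLOCK_SIZE):
    """Return the indices of block_size-sized blocks that differ.

    `ref` and `copy` are binary file objects (regular files opened with
    "rb"); a length mismatch shows up as a differing final block.
    """
    bad = []
    index = 0
    while True:
        a = ref.read(block_size)
        b = copy.read(block_size)
        if not a and not b:  # both files exhausted
            break
        if a != b:
            bad.append(index)
        index += 1
    return bad
```

Multiplying each returned index by 128kB gives the byte offset of the
bad block. If the mismatches always cover whole blocks at those
offsets, that matches the first observation in the list above.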
It also appears that turning off the write cache on the controller fixed
the problem, but writes are very slow now.
I don't have another 3ware controller, so I cannot check whether this is
a hardware fault in this particular card.
I'm of course willing to provide any feedback needed on this issue, but
because each test run takes so long, testing is rather slow.