filesystem corruption with 1TB filesystem, 4.9-STABLE, twe

Fri Apr 30 10:52:32 PDT 2004

Is your card plugged into a riser card? We had similar problems (random 
corruption) with a 7506-8 card. The workaround was to set the speed for 
that PCI slot to 33MHz (rather than Auto or 66MHz). I think this tech 
note describes our problem:

http://www.3ware.com/kb/article.aspx?id=10848

(Read the PDF file attached to the tech note.)

Now the box is as solid as a rock.

Matt

Doug White wrote:
> On Sun, 18 Apr 2004, Ollie Cook wrote:
> 
> 
>>I am experiencing filesystem corruption while using an approximately 1TB
>>partition under 4.9-STABLE (sources from Mar 17) with an 8-port 3ware ATA RAID
>>card (twe device driver). The RAID set comprises 5x250GB ATA disks.
> 
> 
> [...]
> 
> The type of corruption you're seeing would be consistent with one of the
> disks not accepting writes or some other sort of array corruption. I
> realize it'll take forever, but can you run an array verify? I wonder if
> the BIOS is failing to pick up a disk failure because the disk isn't
> throwing errors but isn't doing any useful work either.
> 
> 
> 
>>The kernel logs such messages as:
>>
>>Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks
>>Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks
>>Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks
>>
>>The operations it was performing at the time involved copying a lot of small
>>(email messages) files from a busy NFS mount to the RAID5 array. A number of
>>processes were all copying different files and the throughput was around 3MB/s
>>to disk.
>>
>>As far as I can tell from sys/ufs/ffs/ffs_alloc.c this error indicates that a
>>kernel data structure contains unexpected data, but I'm not confident enough to
>>be able to tell what might be causing that.
>>
>>After such messages, if I cleanly unmount the filesystem and run fsck, errors
>>are detected. Such errors are:
>>
>>  directory corrupted
>>  directory contains empty blocks
>>  unallocated inode
>>  wrong link counts
>>
>>There are many more distinct error messages, but those are the ones I recall.
>>After a number of passes through fsck, the filesystem is eventually marked
>>clean but quite a number of files wind up in lost+found.
>>
>>Has anyone seen behaviour similar to this with twe RAID sets or large
>>partitions in the past? I've not been able to find reports of similar symptoms
>>using Google.
>>
>>Can anyone offer advice on how I might further debug this problem?
>>
>>Yours,
>>
>>Ollie
>>
>>Apr 16 11:34:12 heman /kernel: twe0: <3ware Storage Controller> port 0xc800-0xc80f mem 0xfe000000-0xfe7fffff,0xfe8ffc00-0xfe8ffc0f irq 10 at device 4.0 on pci3
>>Apr 16 11:34:12 heman /kernel: twe0: 8 ports, Firmware FE7X 1.05.00.065, BIOS BE7X 1.08.00.048
>>Apr 16 11:34:12 heman /kernel: twed0: <Unit 0, JBOD, Normal> on twe0
>>Apr 16 11:34:12 heman /kernel: twed0: 4126MB (8452080 sectors)
>>Apr 16 11:34:12 heman /kernel: twed1: <Unit 1, RAID5, Normal> on twe0
>>Apr 16 11:34:12 heman /kernel: twed1: 953896MB (1953580032 sectors)
>>Apr 16 11:34:12 heman /kernel: twe0: command interrupt
>>
>>
> 
>
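
For anyone wanting to sanity-check the numbers in the quoted logs, here is a
rough Python sketch. It recomputes the twed capacities from the sector counts
and converts the "free inode ... had N blocks" counts into sizes. It assumes
512-byte sectors and that the block count is in 512-byte units; neither is
stated in the logs. The function names are made up for illustration and are
not part of any FreeBSD tool.

```python
import re

SECTOR_BYTES = 512  # assumption: 512-byte sectors / block units

# Matches e.g. "twed1: 953896MB (1953580032 sectors)"
TWED_LINE = re.compile(r"(twed\d+): (\d+)MB \((\d+) sectors\)")
# Matches e.g. "free inode /clara/170175645 had 137391860 blocks"
FREE_INODE_LINE = re.compile(r"free inode (\S+)/(\d+) had (\d+) blocks")


def check_capacity(line):
    """Return (unit, reported_mb, computed_mb, sectors), or None if no match."""
    m = TWED_LINE.search(line)
    if m is None:
        return None
    unit, reported_mb, sectors = m.group(1), int(m.group(2)), int(m.group(3))
    computed_mb = sectors * SECTOR_BYTES // (1024 * 1024)
    return unit, reported_mb, computed_mb, sectors


def parse_free_inode(line):
    """Return (mount_point, inode_number, block_count), or None if no match."""
    m = FREE_INODE_LINE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


log = [
    "Apr 16 11:34:12 heman /kernel: twed0: 4126MB (8452080 sectors)",
    "Apr 16 11:34:12 heman /kernel: twed1: 953896MB (1953580032 sectors)",
    "Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks",
    "Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks",
    "Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks",
]

for line in log:
    cap = check_capacity(line)
    if cap is not None:
        unit, reported_mb, computed_mb, sectors = cap
        print(f"{unit}: reported {reported_mb}MB, computed {computed_mb}MB")
        continue
    inode = parse_free_inode(line)
    if inode is not None:
        mount, number, blocks = inode
        size_gib = blocks * SECTOR_BYTES / 2**30
        print(f"{mount} inode {number}: {blocks} blocks, about {size_gib:.1f} GiB")
```

The capacity lines come out self-consistent under the 512-byte assumption.
The freed inodes, under the same assumption, claim sizes far larger than an
email message would plausibly need, which fits the picture of garbage in the
on-disk inode rather than a real allocation.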