RAID and NFS exports (Possible Data Corruption)

Sumit Shah shah at ucla.edu
Tue Jul 15 13:59:11 PDT 2003


Thanks for the reply.

>> ad4: hard error reading fsbn  242727552
>
> The error means that the disk said that there was an error
> trying to read this block. You say that when you rebooted the
> controller said a disk had gone bad, so this would sort of confirm
> this. (I could believe that restarting mountd might upset raid stuff
> if there were a kernel bug, but it seems very unlikely it could
> cause a disk to go bad.)

The full error was something like this on _both_ of the identical 
systems, even _before_ the reboot.  After this message we could not 
read, write, or fsck /dev/ar0:

ad7: hard error reading fsbn 291786506 of 0-127 (ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40
ar0: ERROR - array broken

There were also a variety of messages like this one:
Jul 14 02:55:39 thorimage1 /kernel: ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40

where the ad7 in the message could be any of the six devices in the 
array, seemingly at random.
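
For reference, below is a minimal sketch of a sector probe that can 
confirm whether one of the reported blocks is readable at all.  It 
assumes the fsbn in the kernel message can be treated as a 512-byte 
sector index on the raw array device, which may not hold for every 
layout; the device path and sector number are just examples taken from 
the messages above, not something we actually ran.

/*
 * Sketch: probe one 512-byte sector on a device with pread(2) to see
 * whether the block the kernel complained about is readable.
 * Assumes fsbn maps to a 512-byte sector index (an assumption);
 * device path and sector number are command-line arguments.
 */
#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char *argv[])
{
	char buf[512];
	off_t sector;
	ssize_t n;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s device sector\n", argv[0]);
		exit(1);
	}
	sector = strtoll(argv[2], NULL, 10);

	if ((fd = open(argv[1], O_RDONLY)) == -1)
		err(1, "open %s", argv[1]);

	/* A hard error on the underlying disk shows up here as EIO. */
	n = pread(fd, buf, sizeof(buf), sector * 512);
	if (n == -1)
		err(1, "pread sector %jd", (intmax_t)sector);

	printf("sector %jd: read %zd bytes OK\n", (intmax_t)sector, n);
	close(fd);
	return (0);
}

Something like "dd if=/dev/ar0 of=/dev/null bs=512 skip=291786506 
count=1" does the same probe; the C version just makes the error 
reporting explicit.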

>
> My best guess would be that you have a bad batch of disks that
> happen to have failed in similar ways. It is possible that restarting
> mountd uncovered the errors, 'cos I think mountd internally does
> a remount of the filesystem in question and that might cause a chunk
> of stuff to be flushed out on to the disk, highlighting an error.
>
> (I had a bunch of the IBM "deathstar" disks fail on me within the
> space of a week or so, after they'd been in use for about six
> months.)
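
That flush-on-remount idea is easy to picture with a minimal sketch 
(purely illustrative, not mountd's actual code, and the file path is 
made up): data written into the buffer cache reports no error until 
something forces it out to the disk, at which point a bad block can 
finally surface as EIO.

/*
 * Illustrative sketch only: a write that lands in the buffer cache
 * succeeds, and a latent bad block may only be reported once the data
 * is forced out to disk -- here with fsync(2) standing in for whatever
 * flush a remount by mountd triggers.  The path is hypothetical.
 */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	const char msg[] = "flush test\n";
	int fd;

	if ((fd = open("/export/flush-test", O_WRONLY | O_CREAT, 0644)) == -1)
		err(1, "open");

	/* May succeed even over a bad block: the data sits in cache. */
	if (write(fd, msg, sizeof(msg) - 1) == -1)
		err(1, "write");

	/* Forcing it to disk is what surfaces the error (EIO). */
	if (fsync(fd) == -1)
		err(1, "fsync");

	close(fd);
	return (0);
}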

It certainly sounds reasonable that this problem was simply exposed by 
restarting mountd.  It's just strange, and too much of a coincidence, 
that two sets of six disks on two different but identical machines 
would fail in exactly the same way within an hour.  I guess, given the 
decline in hard drive quality, things like this might be more likely.

Thanks,
Sumit


