RAID and NFS exports (Possible Data Corruption)
Sumit Shah
shah at ucla.edu
Tue Jul 15 13:59:11 PDT 2003
Thanks for the reply.
>> ad4: hard error reading fsbn 242727552
>
> The error means that the disk reported a failure while trying to read
> this block. You say that when you rebooted, the controller said a disk
> had gone bad, so this would sort of confirm this. (I could believe
> that restarting mountd might upset RAID stuff if there were a kernel
> bug, but it seems very unlikely it could cause a disk to go bad.)
The full error was something like this on _both_ of the identical
systems, even _before_ the reboot. After this message we could not
read, write, or fsck /dev/ar0:
ad7: hard error reading fsbn 291786506 of 0-127 (ad7 bn 291786506; cn 289470 tn 11 sn 53) trying PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: DMA problem fallback to PIO mode
ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40
ar0: ERROR - array broken
There was also a variety of messages like these:
Jul 14 02:55:39 thorimage1 /kernel: ad7: hard error reading fsbn 291786586 of 0-127 (ad7 bn 291786586; cn 289470 tn 13 sn 7) status=59 error=40
where the "ad7:" device name varied, somewhat randomly, across any of
the six devices in the array.
>
> My best guess would be that you have a bad batch of disks that
> happen to have failed in similar ways. It is possible that restarting
> mountd uncovered the errors, because I think mountd internally does
> a remount of the filesystem in question, and that might cause a chunk
> of stuff to be flushed out onto the disk, highlighting an error.
>
> (I had a bunch of the IBM "deathstar" disks fail on me within the
> space of a week or so, after they'd been in use for about six
> months.)
It certainly sounds reasonable that restarting mountd merely exposed a
pre-existing problem. It's just strange, and too much of a coincidence,
that two sets of six disks on two different but identical machines
would fail in exactly the same way within an hour. I guess, given the
decline in hard drive quality, things like this might be more likely.
Thanks,
Sumit