drive failure during rebuild causes page fault
Joe Rhett
jrhett at meer.net
Mon Dec 13 11:21:53 PST 2004
> > On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
> > > That's a nice shotgun you have there.
> On Sun, 12 Dec 2004, Joe Rhett wrote:
> > Yessir. And that's what testing is designed to uncover. The question is
> > why this works, and how do we prevent it?
On Mon, Dec 13, 2004 at 10:28:53AM -0800, Doug White wrote:
> I'm sure Soren appreciates you donating your feet to the cause :)
That's what sandbox feet are for ;-)
> Why it works: the system assumes the administrator is competent enough to
> not yank a disk that is being rebuilt to.
Yes, I and most others are. But that's a bad assumption. The issue is
fairly simple -- what happens if the disk goes offline due to a hardware
failure? For example, the SATA interface starts having problems. We
replace the drive, assuming the drive is at fault. The rebuild starts, and
the interface dies again. Bam! There goes the system. Not good.
Or, perhaps it's a DOA drive and it fails during the rebuild?
> > Is there a proper way to handle these sort of events? If so, where is it
> > documented?
> >
> > And FYI, just pulling the drives causes the same failure, which means
> > RAID1 buys you nothing because your system will crash anyway.
>
> This is why I don't trust ATA RAID for fault tolerance -- it'll save your
> data, but the system will tank. Since the disk state is maintained by
> the OS and not abstracted by a separate processor, if a disk dies in a
> particularly bad way the system may not be able to cope.
Yes, but SATA shouldn't be limited by this problem -- it has a processor
per disk. (This is all SATA, if I didn't make that clear.)
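For what it's worth, the failure can be exercised in software instead of
yanking cables. A rough sketch of a sandbox test using atacontrol(8) --
the array name ar0, channel ata2, and disk ad4 are assumptions, so
substitute the devices on your own box:

```shell
# Confirm the array is READY before doing anything destructive.
atacontrol status ar0

# Simulate losing a disk by detaching its channel in software.
atacontrol detach ata2

# "Replace" the disk: reattach the channel, add the disk back as a
# spare, and kick off the rebuild.
atacontrol attach ata2
atacontrol addspare ar0 ad4
atacontrol rebuild ar0

# Poll this to watch rebuild progress; a second detach injected at this
# point is the failure-during-rebuild case discussed in this thread.
atacontrol status ar0
```

Whether a software detach mid-rebuild reproduces the page fault the same
way a physical pull does is exactly what the sandbox machine is for.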
--
Joe Rhett
Senior Geek
Meer.net
More information about the freebsd-stable mailing list