Adaptec 3210S, 4.9-STABLE, corruption when disk fails

Thu Mar 31 13:47:27 PST 2005

Don Bowman wrote:
> From: Uwe Doering [mailto:gemini at geminix.org] 
>>Don Bowman wrote:
>>
>>>[...]
>>>Another drive failed and the same thing happened.
>>>After the failure, the RAID worked in degraded mode just fine, but many
>>>files had been corrupted during the failure.
>>>
>>>So I would suggest that this merge did not help, and the cam timeout
>>>did not help either.
>>>
>>>This is very frustrating, again I rebuild my postgresql install from
>>>backup :(
>>
>>This is indeed unfortunate.  Maybe the problem is in fact 
>>located neither in PostgreSQL nor in FreeBSD but in the 
>>controller itself.  Does it have the latest firmware?  The 
>>necessary files should be available on Adaptec's website, and 
>>you can use the 'raidutil' program under FreeBSD to upload 
>>the firmware to the controller.  I have to concede, however, 
>>that I never did this under FreeBSD myself.  If I recall 
>>correctly I did the upload via a DOS diskette the last time.
>>
>>If this doesn't help either, you could ask Adaptec's support for help.
>>You need to register the controller first, if memory serves.
> 
> The latest firmware and BIOS are in the controller (upgraded the
> last time I had problems).
> 
> Tried Adaptec support, controller is registered.
> 
> The problem is definitely not in PostgreSQL. Files go missing
> in directories that are having new entries added (e.g. I lost
> a 'PG_VERSION' file). Data within the PostgreSQL files becomes
> corrupt. Since the only application running is PostgreSQL,
> and it reads/writes/fsyncs the data, it's not unexpected that
> it's the one that reaps the 'rewards' of the failure.
> 
> I have to believe this is either a bug in the controller,
> or a problem in cam or asr.

As far as I understand this family of controllers, the OS drivers aren't 
involved at all when a disk drive fails.  It's strictly the 
controller's business to deal with it internally.  The OS just sits 
there and waits until the controller is done with the retries and either 
drops into degraded mode or recovers from the disk error.

That's why I initially speculated that there might be a timeout 
somewhere in PostgreSQL or FreeBSD that leads to data loss if the 
controller is busy for too long.

A somewhat radical way to at least make these failures as rare as 
possible would be to deliberately fail all remaining old disk drives, 
one after the other of course, in order to get rid of them.  And if you 
are lucky, the problem won't happen with newer drives at all, provided the 
root cause is an incompatibility between the controller and the old drives.

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini at geminix.org  |  http://www.escapebox.net