Adaptec 3210S, 4.9-STABLE, corruption when disk fails

Mon Feb 28 21:58:27 GMT 2005

I have a machine running:

$ uname -a
FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0: Fri Mar 19 10:39:07 EST 2004 user@machine.phaedrus.sandvine.com:/usr/src/sys/compile/LABDB  i386

It has an Adaptec 3210S RAID controller with a single RAID-5 array, and
runs PostgreSQL 7.4.6 as its primary application.

Three times now a drive has failed, and each time files in the
PostgreSQL cluster were corrupted at the same time.

The timing is too closely correlated to be a coincidence. Each time, the
filesystem passed fsck when I got to the machine a couple of hours later,
and it seems to be OK (with a failed drive and the array in 'degraded'
mode).

The drive failure and the PostgreSQL failure appear to occur at the same
time (as far as I can tell: I monitor with nagios, which is accurate to
within 1 hour). It looks as though bad data was returned for some
file(s) when the drive failed.
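
To test whether reads really are returning bad data, one thing I could
try is a repeated-read check with the database stopped. This is only a
minimal sketch, assuming a reasonably recent Python is available on the
box. The function names (file_digest, find_unstable_files) and the
choice of three passes are my own illustration, not anything provided by
PostgreSQL or the Adaptec tools. Repeat reads may also be served from
the OS buffer cache, so matching digests would not prove the controller
is innocent.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_unstable_files(root, passes=3):
    """Walk `root` and return files whose content differs between reads.

    Hypothetical helper: each file is hashed `passes` times; a file is
    reported if the digests disagree or if any read raises an OS error.
    """
    unstable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                digests = {file_digest(path) for _ in range(passes)}
            except OSError:
                # Assumption: an unreadable file is worth reporting too.
                unstable.append(path)
                continue
            if len(digests) > 1:
                unstable.append(path)
    return sorted(unstable)
```

Pointing find_unstable_files at the PostgreSQL data directory (with the
postmaster shut down, so the files are not changing underneath it) would
list any file that does not read back the same way every time.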

Does anyone have any suggestions?

$ raidutil -L all
RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

#  b0 b1 b2  Controller     Cache  FW    NVRAM     Serial       Status
---------------------------------------------------------------------------
d0 -- --     ADAP3210S      16MB   370F  ADPT 1.0  BF0A21700J7  Optimal

Physical View
Address    Type              Manufacturer/Model         Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD) SEAGATE  ST318453LW        17501MB   Optimal
d0b0t1d0   Disk Drive (DASD) SEAGATE  ST318453LW        17501MB   Optimal
d0b0t2d0   Disk Drive (DASD) IBM      DNES-318350W      17501MB   Optimal
d0b1t3d0   Disk Drive (DASD) IBM      DNES-318350W      17501MB   Optimal
d0b1t4d0   Disk Drive (DASD) SEAGATE  ST318452LW        17501MB   Optimal
d0b1t5d0   Disk Drive (DASD) IBM      DNES-318350W      17501MB   Optimal