A little story of failed raid5 (3ware 8000 series)

Artem Kuchin matrix at itlegion.ru
Tue Aug 21 02:59:10 PDT 2007


>You can run smartmontools on disks behind 3ware controllers, eg
>/dev/twe0 -d 3ware,0 -a -o on -S on -m root at localhost
>/dev/twe0 -d 3ware,1 -a -o on -S on -m root at localhost

I did this:

smartctl /dev/twe0 -d 3ware,1 -a

for each drive on another server. Two of the drives are pretty old; the drive
on port 2 is less than a month old.
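
Running the same command once per port can be scripted. Below is a minimal Python sketch, not anything shipped with smartmontools: the function names and the default port list (0, 1, 2) are assumptions made for illustration, and the device name /dev/twe0 is simply the one used in this message.

```python
import subprocess


def smartctl_argv(port, device="/dev/twe0", flag="-A"):
    """Build the smartctl command line for one port of a 3ware controller."""
    return ["smartctl", flag, "-d", f"3ware,{port}", device]


def read_attributes(ports=(0, 1, 2)):
    """Run smartctl once per port and return {port: output text}.

    smartctl uses its exit status as a bit mask, so a non-zero status
    is not treated as an error here.
    """
    output = {}
    for port in ports:
        result = subprocess.run(
            smartctl_argv(port), capture_output=True, text=True
        )
        output[port] = result.stdout
    return output
```

Calling read_attributes() needs smartctl installed and enough privileges to talk to the controller; smartctl_argv() only builds the argument list and can be checked without any hardware.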

However, ALL of the drives have the same values for this attribute:

5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

How come the numbers are the same? Even more, what does this 100 mean? That 100% of the backup sector space
is free, or that just 100 backup sectors are available? How many of them are there in total?
Why does it say "Pre-fail" if the value is WAY above the threshold? This data seems to be
useless.

Now, I did the same for the RAID which failed, got me into so much trouble, and has bad
sectors now (some files are unreadable):

smartctl /dev/twe0 -d 3ware,0 -A
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0

smartctl /dev/twe0 -d 3ware,1 -A
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       39

smartctl /dev/twe0 -d 3ware,2 -A
5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       9

Now this is BS!!! Again, according to SMART I should look at VALUE (100) and
see if it is below THRESH (36). If it is, then I am in trouble. No, it does not work this way.
Now, if we look at the raw numbers, we see 39 for disk 1 and 9 for disk 2.
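
The check described above can be written out as a minimal Python sketch. It assumes the column order of the attribute lines quoted in this message (ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE) and the smartmontools convention that an attribute counts as failed when VALUE is less than or equal to THRESH. The names Attr, parse_attr and has_failed are made up for illustration.

```python
from typing import NamedTuple


class Attr(NamedTuple):
    attr_id: int
    name: str
    value: int      # normalized VALUE column
    worst: int      # normalized WORST column
    thresh: int     # THRESH column
    attr_type: str  # "Pre-fail" or "Old_age"
    raw: int        # RAW_VALUE column


def parse_attr(line):
    """Parse one attribute line of `smartctl -A` output (assumed layout)."""
    f = line.split()
    return Attr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                f[6], int(f[9]))


def has_failed(attr):
    """Threshold check: normalized value at or below the threshold."""
    return attr.value <= attr.thresh


line = ("5 Reallocated_Sector_Ct   0x0033   100   100   036    "
        "Pre-fail  Always       -       39")
attr = parse_attr(line)
print(attr.name, "failed:", has_failed(attr), "raw:", attr.raw)
```

For the disk 1 line this reports the attribute as not failed, because VALUE is still 100 against a threshold of 36, even though the raw count is 39. That is exactly the mismatch complained about here.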

Disk 1 (the one with 39) also shows:

198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       22
1 Raw_Read_Error_Rate     0x000f   058   055   006    Pre-fail  Always       -       170185544
195 Hardware_ECC_Recovered  0x001a   058   055   000    Old_age   Always       -       170185544
7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       524461066

Even for the newly inserted drive (24 hours ago, absolutely new):
7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       8525167
195 Hardware_ECC_Recovered  0x001a   069   066   000    Old_age   Always       -       8433725

Now, as I understand it, the main indicator is
"Offline_Uncorrectable": if its raw value is any more than 0 - REPLACE THE DRIVE ASAP (or
maybe that is too late and it should be "replace the drive ASAP" as soon as Reallocated_Sector_Ct > 0?)
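
Written out as a minimal Python sketch, that rule of thumb looks like the following. It is only the guess made in this message, not a rule documented by smartmontools or any drive vendor, and the function name should_replace is made up for illustration. The raw values are the ones quoted above for the failed array.

```python
# Attributes whose raw value is watched (the guess from this message).
WATCHED = ("Reallocated_Sector_Ct", "Offline_Uncorrectable")


def should_replace(raw_values):
    """Return True if any watched attribute has a raw value above 0.

    raw_values maps attribute name to raw value; attributes that were
    not reported are treated as 0.
    """
    return any(raw_values.get(name, 0) > 0 for name in WATCHED)


# Raw values from the smartctl output quoted in this message.
disks = {
    0: {"Reallocated_Sector_Ct": 0},
    1: {"Reallocated_Sector_Ct": 39, "Offline_Uncorrectable": 22},
    2: {"Reallocated_Sector_Ct": 9},
}

flagged = [port for port, raw in sorted(disks.items()) if should_replace(raw)]
print(flagged)  # prints [1, 2]
```

Under this rule, the disks on ports 1 and 2 would both be flagged, while the normalized VALUE/THRESH comparison flags none of them.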

Now, what I don't understand is why Hardware_ECC_Recovered and
Seek_Error_Rate are so high. The first one is maybe related to a cabling problem.
The drives are all in the hot-swap bays of a Supermicro 2U case. Maybe the backplane is not so good?

Seek_Error_Rate is a mystery to me. Any ideas?


--
Artem






More information about the freebsd-stable mailing list