Trouble with SSD

Wed Feb 1 16:35:58 UTC 2012

On Wed, Feb 01, 2012 at 04:56:23PM +0100, Willem Jan Withagen wrote:
> On 2012-02-01 15:39, Jeremy Chadwick wrote:
> > On Wed, Feb 01, 2012 at 02:40:17PM +0100, Willem Jan Withagen wrote:
> >> The device is a Corsair 60Gb Force GT. And thus far I have not found any
> >> suggestions that that series of devices is prone to doing this.
> > 
> > Can you please provide the following output when that SSD is attached
> > to the system?  You will need to install ports/sysutils/smartmontools
> > for this (please make sure it's version 5.42 or newer).
> > 
> > * smartctl -a /dev/whatever
> > * smartctl -l devstat /dev/whatever
> > * smartctl -l sataphy /dev/whatever
> > * smartctl -l ssd /dev/whatever
> 
> Eh, the last 3 look like they are not supported on the 3ware controller:
> ATA_READ_LOG_EXT (addr=0x00:0x00, page=0, n=1) failed: 48-bit ATA
> commands not supported
> Read GP Log Directory failed.

This indicates either a) the driver on FreeBSD, b) the controller
itself, or c) the controller firmware does not permit these specific
kinds of SMART sub-commands to be passed to the underlying device.

> So I'll have to put it back in my real fileserver.

Yes, please do.  In fact, I wish you had not moved the disk to another
machine at all.  I wish people would not do this around the time they
ask for help; wait until someone clueful has exhausted the analysis
possible on the original system.  Moving hardware adds great complexity
to the situation, because then I have to start asking questions like
"did you power off the machine before moving the drive?"  "Did you use
the same SATA cables?"  "Is the controller on the other machine
identical?"  You get the idea.

It is extremely taxing for me to track all of these things, because 99%
of people do not write down what they do when they start moving
hardware around.  I'm not necessarily lecturing you, I'm more or less
ranting -- I go through this situation two or three times a week with
people I help online, and it wastes a lot of time.  I have another
individual in a private email thread who asked me for help with 2 disks
(one SSD, one HD), and kept moving the drives around between 3 different
machines, giving me random output from each one (behaviour differed per
box).  I cannot deal with this kind of situation.

> Here is the output of the first command, but it contains some real weird
> values.....??

All the values below look fine to me.  I will try my best to explain.

> === START OF INFORMATION SECTION ===
> Device Model:     Corsair Force GT
> Serial Number:    11296503000005870891
> LU WWN Device Id: 0 000000 000000000
> Firmware Version: 1.2
> User Capacity:    60,022,480,896 bytes [60.0 GB]
> Sector Size:      512 bytes logical/physical
> Device is:        Not in smartctl database [for details use: -P showall]

First thing to note is the last line here.  smartmontools does not
appear to have knowledge of all the quirks/SMART attribute data for this
model of Corsair SSD, so some of the decoded data may be inaccurate;
smartctl does the best it can.

Reformatting output to not force newlines/wrapping:

> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   082   082   050    Pre-fail  Always       -       897651373777
>   5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       121242631799621
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       31
> 171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
> 172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
> 174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       19
> 177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       0
> 181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
> 182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 194 Temperature_Celsius     0x0022   026   035   000    Old_age   Always       -       26 (Min/Max 21/35)
> 195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       897651373777
> 196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
> 201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline      -       897651373777
> 204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       897651373777
> 230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       429496729700
> 231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
> 233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       1260
> 234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       1925
> 241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       1925
> 242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       1032

These values all look acceptable/excellent as best I can tell.  The only
attribute above that interests me is attribute 174.  smartmontools
doesn't know what this is, but I am curious to know what value "19"
(which to me appears to be a counter or gauge) actually represents.
Also, just for note: I think it's cool that Corsair put a thermistor or
DTS inside of their drive for temperature readings.  Wise of them!

What you probably meant by "real weird values" are the extremely high
numbers in the RAW_VALUE column.  This is a sign of an individual who
lacks familiarity with SMART and does not know how to properly interpret
attributes.  :-)

I will make it crystal clear (since this is a mailing list and I'm sure
someone will read this in the future): you cannot look at RAW_VALUE and
assume it is a raw integer/counter or gauge.
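
What can be read safely is the normalised VALUE column compared against
THRESH.  A minimal sketch of that check follows, using a few rows from
the table above with the "> " quoting stripped.  The function name is
made up for illustration, and treating a threshold of 0 as "no threshold
defined" is an assumption about how smartctl behaves, not something
stated in this thread.

```python
def failing_attributes(table_text):
    """Return names of attributes whose normalised VALUE is at or below THRESH."""
    failing = []
    for line in table_text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip the header row and blank lines
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        # Assumption: a threshold of 0 means no failure threshold is defined.
        if thresh > 0 and value <= thresh:
            failing.append(name)
    return failing


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   082   050    Pre-fail  Always       -       897651373777
  5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   026   035   000    Old_age   Always       -       26 (Min/Max 21/35)
231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
"""

print(failing_attributes(SAMPLE))  # nothing in this sample is at or below its threshold
```

The huge RAW_VALUE on attribute 1 plays no part in the result; only the
082 versus 050 comparison does.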

SMART attributes and their associated 6-byte data values are not defined
by the ATA standard.  Thus, each vendor can implement them, and store
the data in the RAW_VALUE portion, in any format they wish.
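
To see why a 6-byte field printed as one integer looks alarming, here is
a small sketch that splits some of the raw values from the table above
into a high 16-bit part and a low 32-bit part.  That particular split is
purely an illustrative guess, not a documented Corsair format, and the
helper names are invented for this example.

```python
def split_raw48(raw):
    """Split a 48-bit raw value into (high 16 bits, low 32 bits)."""
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 6 bytes")
    return raw >> 32, raw & 0xFFFFFFFF


def raw_bytes(raw):
    """Return the six raw bytes, most significant first, as a hex string."""
    return raw.to_bytes(6, "big").hex(" ")


for attr_id, raw in [(9, 121242631799621), (230, 429496729700), (1, 897651373777)]:
    print(attr_id, split_raw48(raw), raw_bytes(raw))

# split_raw48(429496729700)    -> (100, 100)
# split_raw48(121242631799621) -> (28229, 837)
# split_raw48(897651373777)    -> (209, 3208913)
```

Under this guessed layout, attribute 230 becomes two small numbers and
the low part of attribute 9 becomes 837, which would at least be a
plausible hour count.  That is suggestive only; without vendor
documentation the real meaning stays unknown.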

Vendors known to use their own encodings include Seagate and Hitachi,
and apparently Corsair.  The behaviour varies from vendor to vendor,
drive model to drive model, and firmware to firmware.

Vendor-encoded values often appear very large or "look scary" to the
uneducated eye.  smartmontools can decode some of these, but the drive
has to be in the smartmontools database (drivedb.h), **and** the code
has to be written in smartmontools to properly decode the data.

Since the attributes are proprietary, figuring out the format is
virtually impossible without help from the vendor.  Some (most) vendors
choose not to disclose this information.  In the case of some Seagate
drives, the smartmontools folks either got "tips" from someone within
Seagate, or somehow managed to figure out how to decode some (not all)
of the attributes on their own.

You should probably start digging around on the Corsair forums, or
within any online documentation you can find from Corsair, to see if
they document what their SMART attributes are in their drives.  For
example, Intel documents all of their SMART attributes in an official
PDF.

> SMART Error Log not supported

Well that's disappointing.  That means that any kind of LBA (read/write)
error inside of the drive will not be logged within the drive itself.
Thus, the only kind of I/O errors or anomalies you'll be able to verify
are purely OS-level.  Oh well, there isn't anything anyone can do about
this.

So let's recap the original OS errors you saw in FreeBSD:

> Jan  7 10:04:24 zfs kernel: ahcich3: Timeout on slot 27 port 0
> Jan  7 10:04:24 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 38000000 rs 38000000 tfd c0 serr 00000000 cmd 0004dd17
> Jan  7 10:04:56 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan  7 10:05:26 zfs kernel: ahcich3: Timeout on slot 29 port 0
> Jan  7 10:05:26 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
> Jan  7 10:05:57 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan  7 10:06:27 zfs kernel: ahcich3: Timeout on slot 29 port 0
> Jan  7 10:06:27 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd 80 serr 00000000 cmd 0004dd17
> Jan  7 10:06:27 zfs kernel: (ada2:ahcich3:0:0:0): lost device
> Jan  7 10:06:58 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan  7 10:07:28 zfs kernel: ahcich3: Timeout on slot 29 port 0
> Jan  7 10:07:28 zfs kernel: ahcich3: is 00000000 cs e0000000 ss e0000000 rs e0000000 tfd 80 serr 00000000 cmd 0004dd17
> Jan  7 10:08:16 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan  7 10:08:16 zfs kernel: ahcich3: Poll timeout on slot 31 port 0
> Jan  7 10:08:16 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
> Jan  7 10:08:46 zfs kernel: ahcich3: Timeout on slot 31 port 0
> Jan  7 10:08:46 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17
> Jan  7 10:08:48 zfs kernel: (ada2:ahcich3:0:0:0): removing device entry
> Jan  7 10:09:33 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
> Jan  7 10:09:33 zfs kernel: ahcich3: Poll timeout on slot 31 port 0
> Jan  7 10:09:33 zfs kernel: ahcich3: is 00000000 cs 80000000 ss 00000000 rs 80000000 tfd 80 serr 00000000 cmd 0004df17

What is shown here appears to be the SSD simply falling off the SATA
bus.  Do not get confused about the "slots" and how the numbers there
change; that has nothing to do with SATA ports or anything like that.
A slot is an AHCI command slot (each AHCI port has up to 32 of them),
and I believe FreeBSD spreads outstanding commands across multiple
slots for added benefits.
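
The slot numbers can be tied back to the hex words on the register dump
lines.  The sketch below assumes that "cs", "ss" and "rs" are 32-bit
masks with one bit per command slot (commands issued, SATA-active
commands, and the driver's own running slots, respectively).  That
reading is an assumption on my part about the ahci(4) log format, and
the function name is made up for illustration.

```python
def slots_in_mask(mask_hex):
    """Return the slot numbers whose bits are set in a 32-bit hex mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(32) if (mask >> bit) & 1]


# Values copied from the log lines quoted above.
for label, mask_hex in [("cs", "20000000"), ("ss", "38000000"),
                        ("cs", "e0000000"), ("cs", "80000000")]:
    print(label, mask_hex, slots_in_mask(mask_hex))

# slots_in_mask("20000000") -> [29]
# slots_in_mask("38000000") -> [27, 28, 29]
# slots_in_mask("e0000000") -> [29, 30, 31]
# slots_in_mask("80000000") -> [31]
```

Read this way, the masks name the same slots as the "Timeout on slot N"
messages next to them (27, 29 and 31), which is consistent with the
slots being per-command bookkeeping rather than anything physical.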

Everything above indicates that after 30 seconds (well, 31 seconds
exactly, but I imagine it's 30 seconds plus 1 extra second due to how
the timeout loop might be written) the drive stopped responding to
commands on the AHCI protocol level.
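
The "tfd" field in the same log lines points the same way.  The sketch
below assumes tfd is the AHCI port task file data register, with the ATA
status byte in bits 7:0 and the error byte in bits 15:8; that
interpretation and the helper name are assumptions, not something
confirmed in this thread.  The status bit names used are the standard
ATA ones.

```python
# Standard ATA status register bits (obsolete bits omitted).
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def decode_tfd(tfd_hex):
    """Return (list of status flag names, error byte) for a tfd hex value."""
    tfd = int(tfd_hex, 16)
    status = tfd & 0xFF
    error = (tfd >> 8) & 0xFF
    flags = [name for bit, name in ATA_STATUS_BITS.items() if status & bit]
    return flags, error


for tfd_hex in ["c0", "80", "00000080"]:
    print(tfd_hex, decode_tfd(tfd_hex))

# decode_tfd("c0") -> (['BSY', 'DRDY'], 0)
# decode_tfd("80") -> (['BSY'], 0)
```

If that reading is right, the drive reported BSY at every timeout and
after every reset attempt, with no error bits set, which fits a device
that has stopped answering rather than one that is returning errors.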

This could be caused by a multitude of things, and it is very difficult
for me to diagnose any of these remotely:

- Power supply issues (voltage ripple, not enough amps on that port,
  shoddy or loose SATA power connector)
- SATA cable issues (cable too long, possibly some broken copper within
  the cable itself (very unlikely though), etc.)
- SATA port (physical) problems; dust in connectors, etc.
- SSD-level issues.  There are so many possibilities here (more than
  on a MHDD) that it's almost impossible to list them all off:
  -- Internal garbage collection mechanism (this is different than TRIM)
     on drive may be overly aggressive and stalls all I/O to drive
     during heavy GC.  This would be classified as a firmware bug
  -- Power circuitry on PCB may be flaky
  -- Drive may have locked up hard due to other firmware bugs or some
     form of very low-level electrical/electronic error
  -- Internal SSD SATA + NAND flash I/O controller failure

For those considering the remote possibility of interoperability issues
between the Corsair SSD and the AHCI controller -- it's possible, but
highly unlikely.  The controller itself is an Intel ICH9, which FreeBSD has
excellent support for and is very reliable.  So, the controller here is
probably not at fault.  I imagine if there were incompatibilities of
this sort (between ICH9 and Corsair), we'd have heard about it.

I have seen many drives in my time (many means hundreds, no
exaggeration) "lock up" or fall off the bus, both on SCSI and SATA.
It's very difficult to troubleshoot these kinds of issues as I said, and
usually requires someone with extensive knowledge to figure it out.
General "Tier 1" Technical Support at most companies does not have this
level of expertise, so don't expect that from Corsair.

I look forward to seeing the output from the 3 commands below, as they
may provide more insight into what actually transpired.  Whether or not
Corsair chose to implement these in the General Purpose Log area of
SMART is unknown, however.  Furthermore, they may actually implement
them, but stick them in a non-common place (e.g. different GPLog
offsets), but PLEASE DO NOT go tinkering around with -l gplog,0xXX
values.

> > * smartctl -l devstat /dev/whatever
> > * smartctl -l sataphy /dev/whatever
> > * smartctl -l ssd /dev/whatever

Thanks.

-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                     http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, US |
| Making life hard for others since 1977.             PGP 4BD6C0CB |