smartctl / mpt on 9.0-RC1

Mon Nov 7 11:42:34 UTC 2011

On Mon, Nov 07, 2011 at 03:24:03PM +0400, Marat N.Afanasyev wrote:
> Alex Samorukov wrote:
> >On 11/06/2011 09:37 PM, Alex Samorukov wrote:
> >>>Command failed, ata.status=(0x00), ata.command=(0xec), ata.flags=(0x01)
> >>>WARNING - NO DEVICE FOUND ON 3WARE CONTROLLER (disk 0)
> >>>Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)
> >>>
> >>>A mandatory SMART command failed: exiting. To continue, add one or
> >>>more '-T permissive' options.
> >>>
> >>
> >>
> >>Ok, looking at the code I found that for the "3ware" device only
> >>"ata_command_interface" is implemented (with
> >>TW_OSL_IOCTL_FIRMWARE_PASS_THROUGH). The question is whether that
> >>interface actually supports SAS drives at all. From a quick look at the
> >>sources I found the TWE_Command_ATA packet description, but nothing
> >>related to SCSI/SATA packets. So I am not sure that it is possible at
> >>all. If you know of any tool which is able to get health information
> >>for SAS drives, we can try to debug the ioctl it uses to find a way to
> >>talk to the disk.
> >>
> >One more update - there is a TWA_FW_CMD_EXECUTE_SCSI command in the twa
> >driver, so it should be possible to get the required data. I have no
> >access to such hardware, but if anyone is going to provide it, I could
> >at least try.
> >
> this is an output on mfi controller with mfip loaded:
> 
> # smartctl -a /dev/pass1
> smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> Vendor:               SEAGATE
> Product:              ST3146356SS
> Revision:             0007
> User Capacity:        146,815,737,856 bytes [146 GB]
> Logical block size:   512 bytes
> Logical Unit id:      0x5000c50028f8a56f
> Serial number:        3QN4PWHS00009130JLKB
> Device type:          <31>
> Transport protocol:   SAS
> Local Time is:        Mon Nov  7 15:20:27 2011 MSK
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
> 
> Current Drive Temperature:     26 C
> Drive Trip Temperature:        68 C
> 
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:    9382124        0         0   9382124    9382124       3436.782           0
> write:         0        0         0         0          0       8978.360           0
> verify:   663433        0         0    663433     663433        332.651           0
> 
> Non-medium error count:        7
> 
> [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
> No self-tests have been logged
> Long (extended) Self Test duration: 1740 seconds [29.0 minutes]
> 
> btw, 3dm can tell about reallocated sector count on sas somehow,
> while smartctl cannot, even on supported controller :(

I think this is getting into a separate discussion topic.

I realise we're discussing SAS, but what's shown above looks like plain
SCSI output from smartmontools.  I'm very familiar with it (we
predominantly used SCSI disks at my workplace until about a year ago).

SCSI disks only support two kinds of "reallocations": grown defects and
physical defects (the "primary" list in the format output further down).
Physical defects are "factory-known bad sectors", while grown defects are
ones learned over time.  Both defect lists are manageable via SCSI CDBs
(meaning you can literally tell the disk "consider LBA N a grown
defect").  Furthermore, an actual low-level format will in effect "merge"
the grown defect list into the physical defect list (e.g. prior to
format, physical defect list = 225 sectors, grown = 10; after format,
physical = 235, grown = 0).  The defect lists are also viewable.
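
To make the merge arithmetic concrete, here is a minimal Python sketch of
the bookkeeping described above.  It is only a toy model: the class and
method names (DefectLists, reassign_block, low_level_format) are made up
for illustration and do not correspond to any real tool or SCSI library.

```python
from dataclasses import dataclass


@dataclass
class DefectLists:
    """Toy model of a SCSI disk's two defect lists (counts only)."""
    physical: int  # factory-known bad sectors
    grown: int     # defects learned over the disk's lifetime

    def reassign_block(self) -> None:
        # Hypothetical stand-in for telling the disk to treat one more
        # LBA as a grown defect.
        self.grown += 1

    def low_level_format(self) -> None:
        # As described in the text: a low-level format folds the grown
        # list into the physical list and leaves the grown list empty.
        self.physical += self.grown
        self.grown = 0


# The numbers from the example in the text.
disk = DefectLists(physical=225, grown=10)
disk.low_level_format()
print(disk.physical, disk.grown)
```

The total number of defective sectors does not change across the format;
only the list they are counted in does.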

smartmontools does support display of both defect counts.  So, either
SAS support in smartmontools lacks code for getting this, SAS (because
it's SAS) does something different, or the controller itself (or
pass(4)) is intercepting the response data.  I simply do not know
because I have no experience with SAS.  I really don't know what people
are expecting, SMART-wise, with SAS.  To me, the above output looks
perfectly normal sans some details and defect counts.

Proof of my statements, re: smartmontools on SCSI disks (taken from a
Solaris 10 system):

# smartctl -a /dev/rdsk/c0t0d0s0
smartctl 5.40 2010-10-16 r3189 [i386-pc-solaris2.10] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Device: FUJITSU  MAW3073NC        Version: 0104
Serial number: DAL0P6802E1Y
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Nov  7 03:38:52 2011 PST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     26 C
Drive Trip Temperature:        65 C
Manufactured in week 31 of year 2006
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  10
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0       5262.342           0
write:         0        0         0         0          0       1704.590           0

Non-medium error count:       39

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background long   Self test in progress ...   -     NOW                 - [-   -    -]

Long (extended) Self Test duration: 1919 seconds [32.0 minutes]
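
If you want to check for that grown defect line from a script, something
like the following Python sketch would do.  It assumes the line reads
exactly "Elements in grown defect list: N" as in the output above; the
function name grown_defect_count is my own, and other smartmontools
versions may word the line differently.

```python
import re
from typing import Optional

# Matches the line as printed in the SCSI output above, optionally
# preceded by mail quote markers such as "> ".
_GROWN_RE = re.compile(
    r"^(?:>\s*)*Elements in grown defect list:\s*(\d+)\s*$",
    re.MULTILINE,
)


def grown_defect_count(smartctl_output: str) -> Optional[int]:
    """Return the grown defect count, or None if the line is absent."""
    match = _GROWN_RE.search(smartctl_output)
    return int(match.group(1)) if match else None


# Excerpt from the parallel SCSI disk output above.
scsi_excerpt = """\
Accumulated start-stop cycles:  10
Elements in grown defect list: 0
"""

# Excerpt from the quoted SAS output, which has no such line.
sas_excerpt = """\
> SMART Health Status: OK
>
> Current Drive Temperature:     26 C
"""

print(grown_defect_count(scsi_excerpt))
print(grown_defect_count(sas_excerpt))
```

Returning None rather than 0 when the line is missing keeps "no defects"
distinct from "the count was not reported", which is exactly the
difference between the two outputs in this thread.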

And as I said, display of such defect lists:

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <DEFAULT cyl 8938 alt 2 hd 255 sec 63>
          /pci@0,0/pci1022,7450@a/pci9005,ffff@a/sd@0,0
Specify disk (enter its number): 0
selecting c0t0d0
{...}
format> defect
defect> grown
Extracting grown defects list...Extraction complete.
Defect List has a total of 0 defects.
defect> primary
Extracting primary defect list...Extraction complete.
Defect List has a total of 803 defects.
defect> print
 num     cyl     hd     bfi     len     sec     blk
   1     536      0  697945       0
   2     536      0  698515       0
   3     537      0  697945       0
   4     537      0  698515       0
   5     538      0  697945       0
   6     538      0  698515       0
   7    1665      0  499696       0
   8    1665      0  500266       0
   9    1666      0  499696       0
  10    1666      0  500266       0
  11    1667      0  499696       0
  12    1667      0  500266       0
  13    1668      0  473104       0
  14    1668      0  473674       0
  15    1668      0  499696       0
  16    1668      0  500266       0
  17    1669      0  473104       0
{...snipping for brevity...}
 801   10313      1  313415       0
 802   10314      1  312827       0
 803   10314      1  313415       0
total of 803 defects.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |