deadlock or bad disk? RELENG_8
Mike Tancsa
mike at sentex.net
Mon Jul 19 12:37:50 UTC 2010
At 11:34 PM 7/18/2010, Jeremy Chadwick wrote:
> >
> > yes, da0 is a RAID volume with 4 disks behind the scenes.
>
>Okay, so can you get full SMART statistics for all 4 of those disks?
>The adjusted/calculated values for SMART thresholds won't be helpful
>here, one will need the actual raw SMART data. I hope the Areca CLI can
>provide that.
I thought there was a way, but I can't seem to get the current smartctl to
work with the card.
-d TYPE, --device=TYPE
        Specifies the type of the device.  The valid arguments to this
        option are ata, scsi, sat, marvell, 3ware,N, areca,N,
        usbcypress, usbjmicron, usbsunplus, cciss,N, hpt,L/M (or
        hpt,L/M/N), and test.
# smartctl -a -d areca,0 /dev/arcmsr0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
/dev/arcmsr0: Unknown device type 'areca,0'
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE],
usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N,
cciss,N, atacam, test <=======
Use smartctl -h to get a usage summary
The latest CLI tool only gives this info
CLI> disk info drv=1
Drive Information
===============================================================
IDE Channel : 1
Model Name : ST31000340AS
Serial Number : 3QJ07F1N
Firmware Rev. : SD15
Disk Capacity : 1000.2GB
Device State : NORMAL
Timeout Count : 0
Media Error Count : 0
Device Temperature : 29 C
SMART Read Error Rate : 108(6)
SMART Spinup Time : 91(0)
SMART Reallocation Count : 100(36)
SMART Seek Error Rate : 81(30)
SMART Spinup Retries : 100(97)
SMART Calibration Retries : N.A.(N.A.)
===============================================================
GuiErrMsg<0x00>: Success.
CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
  #  Attribute Items                            Flag  Value  Thres  State
===============================================================================
  1  Raw Read Error Rate                        0x0f    108      6  OK
  3  Spin Up Time                               0x03     91      0  OK
  4  Start/Stop Count                           0x32    100     20  OK
  5  Reallocated Sector Count                   0x33    100     36  OK
  7  Seek Error Rate                            0x0f     81     30  OK
  9  Power-on Hours Count                       0x32     79      0  OK
 10  Spin Retry Count                           0x13    100     97  OK
 12  Device Power Cycle Count                   0x32    100     20  OK
194  Temperature                                0x22     29      0  OK
197  Current Pending Sector Count               0x12    100      0  OK
198  Off-line Scan Uncorrectable Sector Count   0x10    100      0  OK
199  Ultra DMA CRC Error Count                  0x3e    200      0  OK
===============================================================================
GuiErrMsg<0x00>: Success.
CLI>
The obvious ones (timeout, media error, etc.) are all zero.
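For what it's worth, those counters can be checked mechanically rather than by eye. A rough sketch (the awk pattern and the sample file are mine; in practice you would feed it the `disk info drv=N` output for each drive from the Areca CLI, whose binary name varies by install):

```shell
# Rough scan for nonzero Timeout / Media Error counters. The sample file
# below mirrors the CLI output in this mail; substitute real CLI output.
cat <<'EOF' > /tmp/drv1.info
Timeout Count       : 0
Media Error Count   : 0
EOF
# Print any counter line whose value is nonzero (field 2 after the colon).
awk -F: '/Timeout Count|Media Error Count/ && $2+0 != 0 { print "nonzero:", $0 }' /tmp/drv1.info
echo "scan complete"
```

Looping that over drv=1 through drv=4 would surface any drive the controller has already flagged.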
>Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
>actually share the same physical disks on the Areca controller. Is that
>correct?
Yes
>If so, think about what would happen if heavy I/O happened on
>both da0 and da1 at the same time. I talk about this a bit more below.
No different from any other single disk being heavily worked. Again,
this particular hardware configuration has been beaten on for a
couple of years, so I am not sure why this workload would suddenly
become a problem.
> >
> > Prior to someone rebooting it, it had been stuck in this state for a
> > good 90min. Apart from upgrading to a later RELENG_8 to get the
> > security patches, the machine had been running a few versions of
> > RELENG_8 doing the same workloads every week without issue.
>
>Then I would say you'd need to roll back kernel+world to a previous date
>and try to figure out when the issue began, if that is indeed the case.
Possibly. The box only gets a heavy workout periodically when it
does an rsync to our DR site.
>It would also help if you could provide timestamps of those messages;
>are they all happening at once, or gradual over time? If over time, do
>they all happen around the same time every day, etc.? You see where I'm
>going with this.
Every couple of seconds I think. If it happens again, I will time it.
>situation (since you'd then be dedicating an entire disk to just swap).
>Others may have other advice. You mention in a later mail that the
>ada[0-3] disks make up a ZFS pool of some sort. You might try splitting
>ada0 into two slices, one for swap and the other used as a pool member.
That seems like it would just move the problem you are trying to steer
me away from onto a different set of disks. If putting swap on a RAID
array is a bad thing, I am not sure how moving it to a ZFS RAID array
would help.
>Again: I don't think this is necessarily a bad disk problem. The only
>way you'd be able to determine that would be to monitor on a per-disk
>basis the I/O response time of each disk member on the Areca. If the
>CLI tools provide this, awesome. Otherwise you'll probably need to
>involve Areca Support.
In the past, when I have had bad disks on the Areca, it did catch and
flag device timeouts. There were no such alerts leading up to this situation.
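On the per-disk response-time monitoring suggested above, FreeBSD's gstat(8) or `iostat -x` can show per-device latency. A sketch that flags slow devices from iostat-style output (the sample data and the 100 ms threshold are mine; column layout varies by FreeBSD version, so adjust the field index before trusting it):

```shell
# Hypothetical sample of `iostat -x` output; in practice pipe the live
# command, e.g.:  iostat -x da0 da1 1 | awk '...'
cat <<'EOF' > /tmp/iostat.sample
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da0         10    50   120.0  3400.0    2   4.1  12
da1          2   180    16.0  9100.0   31 812.5  99
EOF
# Flag any da* device whose average service time (svc_t, ms, second-to-last
# column in this sample layout) exceeds 100 ms.
awk '$1 ~ /^da/ && $(NF-1) > 100 { print $1, "svc_t="$(NF-1)"ms" }' /tmp/iostat.sample
```

If one member disk behind the controller were dying slowly, you would expect its service times to stand out well before hard timeouts are logged.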
---Mike
--------------------------------------------------------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike at sentex.net
Providing Internet since 1994 www.sentex.net
Cambridge, Ontario Canada www.sentex.net/mike