deadlock or bad disk? RELENG_8
Mike Tancsa
mike at sentex.net
Mon Jul 19 12:37:50 UTC 2010
At 11:34 PM 7/18/2010, Jeremy Chadwick wrote:
> >
> > yes, da0 is a RAID volume with 4 disks behind the scenes.
>
>Okay, so can you get full SMART statistics for all 4 of those disks?
>The adjusted/calculated values for SMART thresholds won't be helpful
>here, one will need the actual raw SMART data. I hope the Areca CLI can
>provide that.
I thought there was a way, but I can't seem to get the current smartctl to
work with the card.
-d TYPE, --device=TYPE
        Specifies the type of the device.  The valid arguments to this
        option are ata, scsi, sat, marvell, 3ware,N, areca,N,
        usbcypress, usbjmicron, usbsunplus, cciss,N, hpt,L/M (or
        hpt,L/M/N), and test.
# smartctl -a -d areca,0 /dev/arcmsr0
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-PRERELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
/dev/arcmsr0: Unknown device type 'areca,0'
=======> VALID ARGUMENTS ARE: ata, scsi, sat[,N][+TYPE],
usbcypress[,X], usbjmicron[,x][,N], usbsunplus, 3ware,N, hpt,L/M/N,
cciss,N, atacam, test <=======
Use smartctl -h to get a usage summary
The latest CLI tool only gives this info
CLI> disk info drv=1
Drive Information
===============================================================
IDE Channel : 1
Model Name : ST31000340AS
Serial Number : 3QJ07F1N
Firmware Rev. : SD15
Disk Capacity : 1000.2GB
Device State : NORMAL
Timeout Count : 0
Media Error Count : 0
Device Temperature : 29 C
SMART Read Error Rate : 108(6)
SMART Spinup Time : 91(0)
SMART Reallocation Count : 100(36)
SMART Seek Error Rate : 81(30)
SMART Spinup Retries : 100(97)
SMART Calibration Retries : N.A.(N.A.)
===============================================================
GuiErrMsg<0x00>: Success.
CLI> disk smart drv=1
S.M.A.R.T Information For Drive[#01]
  #  Attribute Items                            Flag  Value  Thres  State
===============================================================================
  1  Raw Read Error Rate                        0x0f    108      6  OK
  3  Spin Up Time                               0x03     91      0  OK
  4  Start/Stop Count                           0x32    100     20  OK
  5  Reallocated Sector Count                   0x33    100     36  OK
  7  Seek Error Rate                            0x0f     81     30  OK
  9  Power-on Hours Count                       0x32     79      0  OK
 10  Spin Retry Count                           0x13    100     97  OK
 12  Device Power Cycle Count                   0x32    100     20  OK
194  Temperature                                0x22     29      0  OK
197  Current Pending Sector Count               0x12    100      0  OK
198  Off-line Scan Uncorrectable Sector Count   0x10    100      0  OK
199  Ultra DMA CRC Error Count                  0x3e    200      0  OK
===============================================================================
GuiErrMsg<0x00>: Success.
CLI>
The obvious ones (timeout, media error, etc.) are all zero.
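For what it's worth, those counters can be checked mechanically rather than by eye. A rough sketch (the awk pattern and the sample file are mine; in practice you would feed it the `disk info drv=N` output for each drive from the Areca CLI, whose binary name varies by install):

```shell
# Rough scan for nonzero Timeout / Media Error counters. The sample file
# below mirrors the CLI output in this mail; substitute real CLI output.
cat <<'EOF' > /tmp/drv1.info
Timeout Count       : 0
Media Error Count   : 0
EOF
# Print any counter line whose value is nonzero (field 2 after the colon).
awk -F: '/Timeout Count|Media Error Count/ && $2+0 != 0 { print "nonzero:", $0 }' /tmp/drv1.info
echo "scan complete"
```

Looping that over drv=1 through drv=4 would surface any drive the controller has already flagged.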
>Also, I'm willing to bet that the da0 "volume" and the da1 "volume"
>actually share the same physical disks on the Areca controller. Is that
>correct?
Yes
>If so, think about what would happen if heavy I/O happened on
>both da0 and da1 at the same time. I talk about this a bit more below.
No different from any other single disk being heavily worked. Again,
this particular hardware configuration has been beaten on for a
couple of years, so I am not sure why this workload would suddenly
become a problem.
> >
> > Prior to someone rebooting it, it had been stuck in this state for a
> > good 90min. Apart from upgrading to a later RELENG_8 to get the
> > security patches, the machine had been running a few versions of
> > RELENG_8 doing the same workloads every week without issue.
>
>Then I would say you'd need to roll back kernel+world to a previous date
>and try to figure out when the issue began, if that is indeed the case.
Possibly. The box only gets a heavy workout periodically when it
does an rsync to our DR site.
>It would also help if you could provide timestamps of those messages;
>are they all happening at once, or gradual over time? If over time, do
>they all happen around the same time every day, etc.? You see where I'm
>going with this.
Every couple of seconds I think. If it happens again, I will time it.
>situation (since you'd then be dedicating an entire disk to just swap).
>Others may have other advice. You mention in a later mail that the
>ada[0-3] disks make up a ZFS pool of some sort. You might try splitting
>ada0 into two slices, one for swap and the other used as a pool member.
That seems like it would just move the problem you are trying to steer
me away from onto a different set of disks. If putting swap on a RAID
array is a bad thing, I am not sure how moving it to a ZFS RAID array
would help.
>Again: I don't think this is necessarily a bad disk problem. The only
>way you'd be able to determine that would be to monitor on a per-disk
>basis the I/O response time of each disk member on the Areca. If the
>CLI tools provide this, awesome. Otherwise you'll probably need to
>involve Areca Support.
In the past, when I have had bad disks on the Areca, it did catch and
flag device timeouts. There were no such alerts leading up to this situation.
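On the per-disk response-time monitoring suggested above, FreeBSD's gstat(8) or `iostat -x` can show per-device latency. A sketch that flags slow devices from iostat-style output (the sample data and the 100 ms threshold are mine; column layout varies by FreeBSD version, so adjust the field index before trusting it):

```shell
# Hypothetical sample of `iostat -x` output; in practice pipe the live
# command, e.g.:  iostat -x da0 da1 1 | awk '...'
cat <<'EOF' > /tmp/iostat.sample
                        extended device statistics
device     r/s   w/s    kr/s    kw/s qlen svc_t  %b
da0         10    50   120.0  3400.0    2   4.1  12
da1          2   180    16.0  9100.0   31 812.5  99
EOF
# Flag any da* device whose average service time (svc_t, ms, second-to-last
# column in this sample layout) exceeds 100 ms.
awk '$1 ~ /^da/ && $(NF-1) > 100 { print $1, "svc_t="$(NF-1)"ms" }' /tmp/iostat.sample
```

If one member disk behind the controller were dying slowly, you would expect its service times to stand out well before hard timeouts are logged.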
---Mike
--------------------------------------------------------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike at sentex.net
Providing Internet since 1994 www.sentex.net
Cambridge, Ontario Canada www.sentex.net/mike