Short SMART check causes disk op timeouts

Mon Oct 27 09:08:32 PDT 2008

On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> Hi,
> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
> enabled SMART checking using the smartmontools as usual for the disk
> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
> is that each time the test runs I get messages like the following in
> /var/log/messages:
> 
> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=836986454
> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=836986454
> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454
> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464, length=16384)]error = 5
> 
> And the SMART test results log on the disk contains line like this:
> 
> # 1  Short offline       Interrupted (host reset)      00%       297         -

First and foremost, the -s flags in your smartd.conf conflict with each
other.  Your long offline test will never get run on Sunday; the short
test will start first, and the long one won't ever start (because the
short one is already running).  I would recommend telling the short
test to run only on days 1-6 (smartd numbers weekdays 1=Monday through
7=Sunday), leaving Sunday solely for the long test.  (I noticed this
because the "Interrupted" entry above is for a short test, not a long
one.)
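
To make the overlap concrete, here is a rough Python sketch of how the
-s pattern gets matched.  It assumes smartd compares the pattern
against a "T/MM/DD/d/HH" string (d = 1 for Monday through 7 for Sunday)
and requires the whole string to match; the due_tests() helper and the
SUGGESTED pattern are my own illustration, not anything shipped with
smartmontools.  It only shows which test types are due in a given
hour, not which one smartd would actually start.

```python
import re
from datetime import datetime

# Pattern exactly as quoted from the original smartd.conf line.
ORIGINAL = r"(S/../.././03|L/../../7/03)"
# Hypothetical fix: short test Monday-Saturday, long test on Sunday.
SUGGESTED = r"(S/../../[1-6]/03|L/../../7/03)"


def due_tests(pattern, when):
    """Return the test types ("L", "S") whose schedule matches at `when`."""
    regex = re.compile(pattern)
    due = []
    for test_type in ("L", "S"):
        # isoweekday() uses the same 1=Monday .. 7=Sunday numbering.
        stamp = "%s/%02d/%02d/%d/%02d" % (
            test_type, when.month, when.day, when.isoweekday(), when.hour)
        if regex.fullmatch(stamp):
            due.append(test_type)
    return due


if __name__ == "__main__":
    sunday_3am = datetime(2008, 10, 26, 3)
    print("original: ", due_tests(ORIGINAL, sunday_3am))
    print("suggested:", due_tests(SUGGESTED, sunday_3am))
```

With the original pattern both test types come due in the same hour on
Sunday; with the suggested one only the long test does.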

Second, your short offline test starts at 0300, but the errors you're
seeing are at 0454.  A short offline test does not take 2 hours to
run -- it normally takes 2-10 minutes -- unless the system is also in
the middle of doing a lot of I/O, in which case the short test will be
suspended.

There are cronjobs (specifically periodic jobs) that run starting at
0301 in the morning ("periodic daily"), and many of those are I/O bound.
This could possibly extend the length of the short test until 0454.

Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
perform a lot of disk I/O, so it's possible that on Sunday specifically
the short SMART test gets pushed back quite some time.

Third, the DMA timeouts you're seeing are possibly caused by the drive
taking too long when internally suspending the SMART test.

In most cases, it's safe for SMART tests (short and long) to be run
while the machine is operational and disk I/O requests are being
performed.  When an I/O request comes in and the disk is in the middle
of performing a SMART test, the drive has to stop the SMART test (i.e.
"suspend" it), complete the I/O request, then resume the SMART test.

The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
doesn't receive an acknowledgement back from the controller (disk)
within 5 seconds, it'll report a timeout on whatever operation it was
performing.  I'm thinking the disk gets stuck in a "do the offline
test, no wait stop there's an I/O request, okay it's done continue the
test, no wait stop there's another I/O" loop.

Another possibility is that your drive really *does* have a bad block at
LBA 836986454, and that one of those cron/periodic jobs is what's
noticing it, and that upon noticing a bad block, the drive more or less
aborts the SMART test to perform internal remapping of the block.

To confirm or rule this out, you would need to boot the SeaTools
utilities from DOS or from a CD (see Seagate's site) and run a full
sector scan (NOT the "quick" test).  This takes a few hours.  If it
comes back clean, then my theory above (the drive taking too long to
suspend the offline test) is probably the explanation.

Possibly this is a firmware bug in the drive -- you might consider
mailing Seagate about this problem, although I doubt their Tier 1
support will understand what the issue is.

Is the block number always the same?  Do you only see this error on
Sundays?  These are two questions which might help narrow things down.
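
If it helps, both questions can be answered from the logs with a few
lines of Python.  This is only a sketch: the regex is modelled on the
kernel lines you quoted, the tally() helper is my own invention, and
since syslog timestamps carry no year, the year has to be supplied by
hand (2008 is assumed here).

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
#   Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454
LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<dev>ad\d+): (?:TIMEOUT|FAILURE) - (?P<cmd>\w+) .*LBA=(?P<lba>\d+)"
)


def tally(lines, year):
    """Count ATA timeout/failure messages per LBA and per weekday."""
    by_lba, by_weekday = Counter(), Counter()
    for line in lines:
        m = LINE.match(line)
        if m is None:
            continue  # not an ATA TIMEOUT/FAILURE message
        when = datetime.strptime(
            "%d %s" % (year, m.group("stamp")), "%Y %b %d %H:%M:%S")
        by_lba[int(m.group("lba"))] += 1
        by_weekday[when.strftime("%A")] += 1
    return by_lba, by_weekday


if __name__ == "__main__":
    # Path and year are assumptions; adjust for the system in question.
    with open("/var/log/messages") as log:
        lbas, weekdays = tally(log, 2008)
    print("per LBA:    ", dict(lbas))
    print("per weekday:", dict(weekdays))
```

One LBA dominating the first table would point towards a bad block;
everything landing on one weekday would point towards the scheduling.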

> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
> kernel.
> 
> Now, does the timeout cause loss of any data? Is there anything besides
> disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

If you're trying to accomplish a cheap version of disk scrubbing, i.e.
scanning the entire disk for bad blocks and reporting them or having
them automatically remapped by the drive, consider using
sysutils/diskcheckd, which was made for this purpose.  However, be
aware of a problem I've run into with it (still needs someone clueful
to figure out why this happens):
http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853

I do not advocate the use of periodic offline tests on disks, especially
at such aggressive intervals (daily).  In fact, I don't even know why
Bruce added that option to smartd.  There are only a few attributes in
SMART which get updated on offline tests, so I fail to see the point.

You shouldn't be doing what you're doing, IMHO.  If you want to do
these tests once every 2 weeks or once a month, that'd be a better idea.
Stick with the short test, and do it during a time when disk I/O is
very low (try something like 7am on a Saturday).  Don't go with 2am
if your system/environment honours Daylight Saving Time, because on
the day the clocks go back that hour happens twice, which could cause
the test to run twice.
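
The DST point is easy to check by hand.  The Python sketch below
assumes Central European rules (summer time ended at 01:00 UTC on
Sunday 2008-10-26) and uses fixed offsets so it needs no timezone
database; local_wall_clock() and local_hour_counts() are made-up
helper names.  It counts how many real hours carry each wall-clock
hour on that day.  Whether smartd would actually start a second test
depends on how it tracks the last run, which this does not model.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")  # summer time, UTC+2
CET = timezone(timedelta(hours=1), "CET")    # standard time, UTC+1
# Assumed rule: summer time ended at 01:00 UTC on 2008-10-26.
SWITCH_UTC = datetime(2008, 10, 26, 1, 0, tzinfo=timezone.utc)


def local_wall_clock(utc_instant):
    """Convert a UTC instant to Central European wall-clock time."""
    zone = CEST if utc_instant < SWITCH_UTC else CET
    return utc_instant.astimezone(zone)


def local_hour_counts():
    """Count the real hours behind each wall-clock hour of 2008-10-26."""
    counts = Counter()
    # Local midnight on the 26th is 22:00 UTC on the 25th (still CEST);
    # the local day is 25 hours long because of the switch.
    start = datetime(2008, 10, 25, 22, 0, tzinfo=timezone.utc)
    for i in range(25):
        local = local_wall_clock(start + timedelta(hours=i))
        counts[local.hour] += 1
    return counts


if __name__ == "__main__":
    counts = local_hour_counts()
    for hour in (2, 3, 7):
        print("%02d:00 occurs %d time(s)" % (hour, counts[hour]))
```

Under these rules only the 02:00 hour repeats, which is why a schedule
pinned to 2am is the one to avoid.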

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |