Short SMART check causes disk op timeouts
v.haisman at sh.cvut.cz
Mon Oct 27 10:22:44 PDT 2008
-----BEGIN PGP SIGNED MESSAGE-----
Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
>> enabled SMART checking using the smartmontools as usual for the disk
>> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
>> is that each time the test runs I get messages like the following in
>> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
>> left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
>> retries left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
>> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
>> length=16384)]error = 5
>> And the SMART test results log on the disk contains line like this:
>> # 1 Short offline Interrupted (host reset) 00% 297
> First and foremost, your above smartd.conf -s flags are conflicting.
> Your long offline test will never get run on Sunday; the short will run
> first, and the long won't ever start (because the short is already
> running). I would recommend telling the short test to run only between
> days 0-6, leaving Sunday solely for the long test. (I noticed this
> because the above "Interrupted" test indicates a short test was
> interrupted and not a long).
Thanks, I have not noticed the overlap at all.
> Second, your short offline test runs at 0300, but the errors you're
> seeing are at 0454 in the morning. A short offline test does not
> take 2 hours to run -- they take between 2-10 minutes -- unless the
> system is also in the middle of doing a lot of I/O, in which case the
> short test will be suspended.
> There are cronjobs (specifically periodic jobs) that run starting at
> 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> This could possibly extend the length of the short test until 0454.
> Weekly periodic jobs run at 0415 in the morning, on Sundays. These also
> perform a lot of disk I/O, so it's possible that on Sunday specifically
> the short SMART test gets pushed back quite some time.
> Third, the DMA timeouts you're seeing are possibly caused by the drive
> taking too long when internally suspending the SMART test.
> In most cases, it's safe for SMART tests (short and long) to be run
> while the machine is operational, and disk I/O requests are being
> performed. When an I/O request comes and the disk is in the middle of
> performing a SMART test, the drive has to stop the SMART test (e.g.
> "suspend" it), complete the I/O request, then resume the SMART test.
> The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> doesn't receive an acknowledgement back from the controller (disk)
> within 5 seconds, it'll report a timeout on whatever operation it was
> performing. I'm thinking the disk gets stuck in a "do the offline
> test, no wait stop there's an I/O request, okay its done continue the
> test, no way stop there's another I/O" loop.
Can I make the timeout higher? For the sake of elimination.
> Another possibility is that your drive really *does* have a bad block at
> LBA 836986454, and that one of those cron/periodic jobs is what's
> noticing it, and that upon noticing a bad block, the drive more or less
> aborts the SMART test to perform internal remapping of the block.
> To confirm this, you would need to boot the SeaTools utilities from DOS
> or from a CD (see Seagate's site) and run a full sector scan (NOT the
> "quick" test). This takes a few hours. Assuming it comes back clean,
> then my above claim of the offline test taking too long to suspend is
> probably the case.
> Possibly this is a firmware bug in the drive -- you might consider
> mailing Seagate about this problem, although I'm doubting their Tier 1
> support will understand what the issue is.
> Is the block number always the same? Do you only see this error on
> Sundays? These are two questions which might help narrow things down.
Nope, the LBA is always different and I see it in the logs once every day.
>> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>> Now, does the timeout cause loss of any data? Is there anything besides
>> disabling the testing that I can do about it?
> Do you understand what short and long offline tests actually do and what
> they're used for? :-) If so, you'd know that running them periodically
> is more or less silly (IMHO).
I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.
> If you're trying to accomplish a cheap version of disk scrubbing, e.g.
> scanning the entire disk for bad blocks and report them or have them
> automatically remapped by the drive, consider using sysutils/diskcheckd,
> which was made for this purpose. However, be aware of a problem I've
> run into with it (still needs someone clueful to figure out why this
> I do not advocate the use of periodic offline tests on disks, especially
> at such aggressive intervals (daily). In fact, I don't even know why
> Bruce added that option to smartd. There are only a few attributes in
> SMART which get updated on offline tests, so I cease to see the point.
> You shouldn't be doing what you're doing, IMHO. If you want to do
> these tests once every 2 weeks or once a month, that'd be a better idea.
> Stick with the short test, and do it during a time when disk I/O is
> very low (try something like 7am on a Saturday). Don't go with 2am
> if your system/environment honours Daylight Saving Time, because that
> could cause the test to run twice.
Ok, I am taking the advice and I have set longer intervals of checking.
Thanks for such extensive answer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
-----END PGP SIGNATURE-----
More information about the freebsd-stable