Short SMART check causes disk op timeouts

Mon Oct 27 10:53:45 PDT 2008

On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
> Jeremy Chadwick wrote:
> > On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
> > Second, your short offline test runs at 0300, but the errors you're
> > seeing are at 0454 in the morning.  A short offline test does not
> > take 2 hours to run -- it takes between 2 and 10 minutes -- unless the
> > system is also in the middle of doing a lot of I/O, in which case the
> > short test will be suspended.
> > 
> > There are cronjobs (specifically periodic jobs) that run starting at
> > 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> > This could possibly extend the length of the short test until 0454.
> > 
> > Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
> > perform a lot of disk I/O, so it's possible that on Sunday specifically
> > the short SMART test gets pushed back quite some time.
> > 
> > Third, the DMA timeouts you're seeing are possibly caused by the drive
> > taking too long when internally suspending the SMART test.
> > 
> > In most cases, it's safe for SMART tests (short and long) to be run
> > while the machine is operational, and disk I/O requests are being
> > performed.  When an I/O request comes and the disk is in the middle of
> > performing a SMART test, the drive has to stop the SMART test (i.e.
> > "suspend" it), complete the I/O request, then resume the SMART test.
> > 
> > The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> > doesn't receive an acknowledgement back from the controller (disk)
> > within 5 seconds, it'll report a timeout on whatever operation it was
> > performing.  I'm thinking the disk gets stuck in a "do the offline
> > test, no wait, stop, there's an I/O request, okay it's done, continue
> > the test, no wait, stop, there's another I/O" loop.
> Can I make the timeout higher? For the sake of elimination.

You will have to make modifications to the ata(4) driver code, and
rebuild+reinstall your kernel.

There is a patch from the FreeNAS folks which turns the command timeout
value into a sysctl for tuning, but that patch has not been brought into
FreeBSD (any version) at this time.  You can find it referenced below
(see one of the "Workarounds" sections).  You will probably have to
apply the patch "by hand" rather than blindly using patch < patchfile,
because the ATA code has changed since the patch was created.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
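
To illustrate the suspend/resume theory quoted above, here is a toy
Python model of how a short test could stretch out on a busy disk.
Everything in it is an assumption made for illustration: the function
name, the idea that each resume costs a fixed overhead, and the example
numbers.  It does not describe how any particular drive firmware
behaves.

```python
def self_test_wall_clock(test_work_s, cycles, overhead_s):
    """Estimate wall-clock seconds until a self-test completes.

    test_work_s -- seconds of uninterrupted work the test needs
    cycles      -- iterable of (busy_s, idle_s) pairs: a burst of host
                   I/O followed by an idle gap in which the test may run
    overhead_s  -- assumed cost of resuming the test in each idle gap

    Returns None if the test does not finish within the given cycles.
    """
    elapsed = 0.0
    remaining = float(test_work_s)
    for busy_s, idle_s in cycles:
        # The test is suspended while the host I/O is being serviced.
        elapsed += busy_s
        # Only the part of the gap left after the resume overhead is useful.
        usable = max(0.0, idle_s - overhead_s)
        if usable >= remaining:
            return elapsed + overhead_s + remaining
        remaining -= usable
        elapsed += idle_s
    return None


# Hypothetical load: 10 s of I/O, then a 3 s gap, repeated 200 times.
busy_disk = [(10, 3)] * 200
stretched = self_test_wall_clock(120, busy_disk, overhead_s=2)
```

With these made-up numbers a test needing 120 seconds of work only gets
1 useful second per 13 second cycle, so it finishes after 1560 seconds
instead of 2 minutes.  If the idle gaps are shorter than the resume
overhead, the model never finishes at all.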

> > Another possibility is that your drive really *does* have a bad block at
> > LBA 836986454, and that one of those cron/periodic jobs is what's
> > noticing it, and that upon noticing a bad block, the drive more or less
> > aborts the SMART test to perform internal remapping of the block.
> > 
> > To confirm this, you would need to boot the SeaTools utilities from DOS
> > or from a CD (see Seagate's site) and run a full sector scan (NOT the
> > "quick" test).  This takes a few hours.  Assuming it comes back clean,
> > then my above claim of the offline test taking too long to suspend is
> > probably the case.
> > 
> > Possibly this is a firmware bug in the drive -- you might consider
> > mailing Seagate about this problem, although I doubt their Tier 1
> > support will understand what the issue is.
> > 
> > Is the block number always the same?  Do you only see this error on
> > Sundays?  These are two questions which might help narrow things down.
> Nope, the LBA is always different and I see it in the logs once every day.

Okay, so that greatly diminishes the possibility of it being a bad
block.  I'd still advocate running SeaTools on the disk to ensure
everything is 100% okay (re: "sake of elimination"); chances are it will
pass with flying colours.
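
One rough way to answer the "is the block number always the same?"
question from the logs is to pull the LBA out of each timeout message
and count repeats.  The Python sketch below assumes the messages look
like typical ata(4) timeout lines; the regular expression, the helper
names and the sample lines are assumptions (only LBA 836986454 comes
from this thread), so adjust them to match what is actually in
/var/log/messages.

```python
import re
from collections import Counter

# Assumed message shape: "... TIMEOUT - READ_DMA48 retrying (...) LBA=<n>"
LBA_RE = re.compile(r"TIMEOUT - \S+ retrying .*LBA=(\d+)")


def lba_counts(lines):
    """Count how often each LBA appears in timeout messages."""
    counts = Counter()
    for line in lines:
        match = LBA_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def repeated_lbas(lines):
    """Return the LBAs reported more than once, in ascending order."""
    return sorted(lba for lba, n in lba_counts(lines).items() if n > 1)


# Made-up sample lines for illustration only.
sample = [
    "Oct 26 04:54:10 host kernel: ad4: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=836986454",
    "Oct 27 04:54:31 host kernel: ad4: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=512003117",
    "Oct 27 03:01:00 host kernel: some unrelated message",
]
```

An empty result from repeated_lbas() fits the "LBA is always different"
observation; the same LBA turning up again and again would point back
towards a genuinely bad block.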

> >> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
> >> kernel.
> >>
> >> Now, does the timeout cause loss of any data? Is there anything besides
> >> disabling the testing that I can do about it?
> > 
> > Do you understand what short and long offline tests actually do and what
> > they're used for?  :-)  If so, you'd know that running them periodically
> > is more or less silly (IMHO).
> I do not, not completely :) I think I have just copied the settings from
> somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource recommended
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up assumed that the tests
perform full sector scans of the entire disk, which is simply not the
case.

> > If you're trying to accomplish a cheap version of disk scrubbing, e.g.
> > scanning the entire disk for bad blocks and reporting them or having them
> > automatically remapped by the drive, consider using sysutils/diskcheckd,
> > which was made for this purpose.  However, be aware of a problem I've
> > run into with it (still needs someone clueful to figure out why this
> > happens):
> > http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853
> > 
> > I do not advocate the use of periodic offline tests on disks, especially
> > at such aggressive intervals (daily).  In fact, I don't even know why
> > Bruce added that option to smartd.  There are only a few attributes in
> > SMART which get updated on offline tests, so I fail to see the point.
> > 
> > You shouldn't be doing what you're doing, IMHO.  If you want to do
> > these tests once every 2 weeks or once a month, that'd be a better idea.
> > Stick with the short test, and do it during a time when disk I/O is
> > very low (try something like 7am on a Saturday).  Don't go with 2am
> > if your system/environment honours Daylight Saving Time, because that
> > could cause the test to run twice.
> Ok, I am taking the advice and I have set longer intervals of checking.
> 
> Thanks for such an extensive answer.

You're welcome!  Let's see if we can figure out what the root cause of
this is; so far, my money is on the SMART tests taking too long to
suspend/resume when an I/O operation interrupts them.
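
For the rescheduling itself, smartd's "-s" directive takes a regular
expression that is matched against a string of the form T/MM/DD/d/HH
(test type, month, day of month, day of week with Monday as 1, hour).
The Python sketch below only imitates that matching so a pattern can be
sanity-checked before it goes into smartd.conf.  The helper names are
made up, and full-string matching is an assumption here; smartd.conf(5)
is the authority on the real behaviour.

```python
import re
from datetime import datetime


def schedule_string(test_type, when):
    """Build the T/MM/DD/d/HH string for a given moment."""
    return "%s/%02d/%02d/%d/%02d" % (
        test_type, when.month, when.day, when.isoweekday(), when.hour)


def is_due(pattern, test_type, when):
    """True if the schedule pattern matches the given moment."""
    return re.fullmatch(pattern, schedule_string(test_type, when)) is not None


DAILY_0300 = r"S/../.././03"            # short test every day at 03:00
SATURDAY_0700 = r"S/../../6/07"         # short test every Saturday at 07:00
FIRST_SATURDAY = r"S/../0[1-7]/6/07"    # first Saturday of each month, 07:00

monday = datetime(2008, 10, 27, 3, 0)    # a Monday
saturday = datetime(2008, 11, 1, 7, 0)   # first Saturday of November 2008
```

Here is_due(DAILY_0300, "S", monday) is true, while the two Saturday
patterns only match on a Saturday at 07:00, and FIRST_SATURDAY no longer
matches a week later on 8 November.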

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |