Short SMART check causes disk op timeouts
000.fbsd at quip.cz
Mon Oct 27 12:38:51 PDT 2008
Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
>>Jeremy Chadwick wrote:
>>>On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>>>Second, your short offline test runs at 0300, but the errors you're
>>>seeing are at 0454 in the morning. A short offline test does not
>>>take 2 hours to run -- they take between 2-10 minutes -- unless the
>>>system is also in the middle of doing a lot of I/O, in which case the
>>>short test will be suspended.
>>>There are cronjobs (specifically periodic jobs) that run starting at
>>>0301 in the morning ("periodic daily"), and many of those are I/O bound.
>>>This could possibly extend the length of the short test until 0454.
>>>Weekly periodic jobs run at 0415 in the morning, on Sundays. These also
>>>perform a lot of disk I/O, so it's possible that on Sunday specifically
>>>the short SMART test gets pushed back quite some time.
>>>Third, the DMA timeouts you're seeing are possibly caused by the drive
>>>taking too long when internally suspending the SMART test.
>>>In most cases, it's safe for SMART tests (short and long) to be run
>>>while the machine is operational, and disk I/O requests are being
>>>performed. When an I/O request comes and the disk is in the middle of
>>>performing a SMART test, the drive has to stop the SMART test (i.e.
>>>"suspend" it), complete the I/O request, then resume the SMART test.
>>>The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
>>>doesn't receive an acknowledgement back from the controller (disk)
>>>within 5 seconds, it'll report a timeout on whatever operation it was
>>>performing. I'm thinking the disk gets stuck in a "do the offline
>>>test, no wait stop there's an I/O request, okay it's done, continue the
>>>test, no way stop there's another I/O" loop.
>>Can I make the timeout higher? For the sake of elimination.
> You will have to make modifications to the ata(4) driver code, and
> rebuild+reinstall your kernel.
> There is a patch from the FreeNAS folks which turns the command timeout
> value into a sysctl for tuning, but that patch has not been brought into
> FreeBSD (any version) at this time. You can find it referenced below
> (see one of the "Workarounds" sections). You will probably have to
> apply the patch "by hand" rather than blindly using patch < patchfile,
> because the ATA code has changed since the patch was created.
>>>Another possibility is that your drive really *does* have a bad block at
>>>LBA 836986454, and that one of those cron/periodic jobs is what's
>>>noticing it, and that upon noticing a bad block, the drive more or less
>>>aborts the SMART test to perform internal remapping of the block.
>>>To confirm this, you would need to boot the SeaTools utilities from DOS
>>>or from a CD (see Seagate's site) and run a full sector scan (NOT the
>>>"quick" test). This takes a few hours. Assuming it comes back clean,
>>>then my above claim of the offline test taking too long to suspend is
>>>probably the case.
>>>Possibly this is a firmware bug in the drive -- you might consider
>>>mailing Seagate about this problem, although I'm doubting their Tier 1
>>>support will understand what the issue is.
>>>Is the block number always the same? Do you only see this error on
>>>Sundays? These are two questions which might help narrow things down.
>>Nope, the LBA is always different and I see it in the logs once every day.
> Okay, so that greatly diminishes the possibility of it being a bad
> block. I'd still advocate running SeaTools on the disk to ensure
> everything is 100% okay (re: "sake of elimination"); chances are it will
> pass with flying colours.
>>>>This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>>>>Now, does the timeout cause loss of any data? Is there anything besides
>>>>disabling the testing that I can do about it?
>>>Do you understand what short and long offline tests actually do and what
>>>they're used for? :-) If so, you'd know that running them periodically
>>>is more or less silly (IMHO).
>>I do not, not completely :) I think I have just copied the settings from
>>somewhere and only just tweaked it a bit whenever I have added a disk.
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision. I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.
It looks like a slightly modified example from smartd.conf.sample:
# First (primary) ATA/IDE hard disk. Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)
I am using a similar config without problems:
/dev/ad4 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/05) -t -I 194
/dev/ad6 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/04) -t -I 194
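
For anyone unsure what those -s strings schedule: as far as I understand smartd.conf(5), smartd builds a 12-character string of the form T/MM/DD/d/HH (test type, month, day, day of week with 1 = Monday, hour) and starts the test when the whole string matches the regular expression. Below is a rough Python sketch of that matching, useful only for checking when a given entry would fire. The function names are made up for illustration, and Python's re module stands in for the POSIX extended regex that smartd itself uses, which should behave the same for simple patterns like these.

```python
import re
from datetime import datetime


def smartd_schedule_string(test_type: str, when: datetime) -> str:
    """Build the T/MM/DD/d/HH string for a test type and a point in time.

    Hypothetical helper, not part of smartmontools. isoweekday() returns
    1 for Monday through 7 for Sunday, the numbering smartd.conf(5) describes.
    """
    return f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"


def is_due(schedule_regex: str, test_type: str, when: datetime) -> bool:
    """Return True if all 12 characters of the schedule string match the regex."""
    candidate = smartd_schedule_string(test_type, when)
    return re.fullmatch(schedule_regex, candidate) is not None


# The -s argument from the /dev/ad4 line above.
AD4_SCHEDULE = r"(S/../.././01|L/../../(3|6)/05)"

if __name__ == "__main__":
    checks = [
        ("S", datetime(2008, 10, 27, 1)),  # Monday 01:00, short test
        ("S", datetime(2008, 10, 27, 3)),  # Monday 03:00, short test
        ("L", datetime(2008, 10, 29, 5)),  # Wednesday 05:00, long test
        ("L", datetime(2008, 10, 27, 5)),  # Monday 05:00, long test
    ]
    for test_type, when in checks:
        print(smartd_schedule_string(test_type, when),
              is_due(AD4_SCHEDULE, test_type, when))
```

Read that way, the ad4 entry asks for a short test every day in the 01:00 hour and a long test on Wednesdays and Saturdays in the 05:00 hour, so it stays clear of the 03:01 "periodic daily" run mentioned earlier in the thread.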