smartd long self-test causes drives to hang
jrhett at svcolo.com
Mon Nov 24 13:18:36 PST 2008
I've spent about 3 months tracking down what was causing my personal
colo box to start getting "sluggish" right around dawn every Saturday
morning. It took so long because some mornings I simply couldn't pull
my head out of my tail enough to do proper debugging.
The cause was *really slow* filesystem response time. No cron jobs ran
in that period. No specific process ran any slower than another,
although I eventually learned that processes which did no file I/O
were fine. And finally I realized that even a plain "ls -la" was very
slow (~1 minute) after I had killed off every disk-using process in
the system. SMTP and HTTP in particular were basically fubar.
No data loss, just *real slow*. Nothing other than a soft reboot ever
solved the problem. Even leaving it running only minimal processes
for 24 hours didn't bring it back to normal.
Finally I was browsing through Jeremy Chadwick's list of known ATA
problems and spotted his comments about smartd self-tests causing
problems. Sure enough, my long self test was scheduled for 5am on
Saturday mornings. Rechecking the observed slow-down periods
confirmed that the problem never became visible before 5am
(sometimes it took up to 45 minutes before things slowed down enough
to set off monitoring alarms).
So, long story short, if you're having weirdness in system response
time - check the smartd configuration, and try disabling the self
tests. The short self test I was running daily didn't appear to
affect anything, but the long test left the system shuddering and
limping at best.
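
For anyone wanting to check this on their own box, here is a rough
sketch of what the relevant smartd.conf entry might look like. The
device name /dev/ad4 and the 2am short-test hour are assumptions for
illustration, not my actual config; only the Saturday 5am long test
matches what I described above. The schedule after -s is a regular
expression matched against T/MM/DD/d/HH, where T is the test type
(S = short, L = long) and d is the day of week (1 = Monday, so
6 = Saturday).

```
# smartd.conf (on FreeBSD usually /usr/local/etc/smartd.conf)
# /dev/ad4 is a hypothetical device name; substitute your own.

# Before: short test daily at 02:00, long test Saturdays at 05:00
#/dev/ad4 -a -s (S/../.././02|L/../../6/05)

# After: keep the daily short test, drop the long test
/dev/ad4 -a -s S/../.././02
```

smartd has to be restarted (or sent SIGHUP) to pick up the change. If
a long test is already running, "smartctl -X /dev/ad4" should abort
it, though in my case only a reboot actually cleared the slowdown.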