Western Digital hard disks and ATA timeouts

Jeremy Chadwick koitsu at FreeBSD.org
Thu Nov 6 23:17:55 PST 2008


A user and I on a broadband forum were discussing the possibility that
the quality of hard disks (particularly 1TB models) has diminished in
recent days (specifically October).

The user continually referenced something called "deep recovery cycle",
backed with claims from Newegg reviewers (who often know very little or
nothing at all -- grain of salt concept applies), which supposedly makes
Western Digital's desktop hard disks unfit for RAID or server usage.

I claimed shenanigans until the user pointed me to the following
document on Western Digital's site:

http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1397

The feature described apparently causes the hard disk to enter some form
of aggressive sector scan/sector remapping loop, which can take up to 2
minutes to complete, during which time, the hard disk is basically
unusable.  (I imagine ATA commands sent to the disk will simply time out
or stall indefinitely, which would result in all sorts of timeout
errors).

Note that Western Digital's "RAID edition" drives claim to take up to 7
seconds to reallocate sectors, using something they call TLER, which
force-limits the amount of time the drive can spend reallocating.  TLER
cannot be disabled:

http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?p_faqid=1478

What baffles me is why Western Digital thinks that 2 minutes of the
drive being unusable is acceptable "but only for desktops".  Any FreeBSD
desktop will start reporting ATA timeouts if the drive wedges for more
than 5 seconds -- two minutes would just spew errors and hard-lock the
system.

What also baffles me is why Western Digital thinks the term "RAID"
always means a hardware RAID controller is involved as a buffer between
the OS and the disks.  Bzzzt, bad assumption on their part.

So why do we care?

As stated, FreeBSD's ATA command timeout is hard-set to 5 seconds, and
is not adjustable without editing the ATA code yourself and increasing
the value.  The FreeNAS folks have made patches available to turn the
timeout value into a sysctl.
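
To put numbers on it, here is a rough sketch in Python that compares the
timeout against the worst-case recovery windows quoted above.  It is
purely illustrative: the names are made up for this example, the figures
are the ones from this mail and WD's documents, and it is not how the
ATA driver itself decides anything (it ignores retries, for instance).

```python
# Illustrative only: figures come from the text above, names are invented.
ATA_TIMEOUT_S = 5  # FreeBSD ATA command timeout, as described above

# Worst-case time a drive may be unresponsive while recovering a sector.
RECOVERY_WINDOWS_S = {
    "RAID edition (TLER)": 7,        # "up to 7 seconds"
    "desktop (deep recovery)": 120,  # "up to 2 minutes"
}


def exceeds_timeout(recovery_s, timeout_s=ATA_TIMEOUT_S):
    """Return True if a stall of recovery_s seconds outlasts the timeout."""
    return recovery_s > timeout_s


def report(timeout_s=ATA_TIMEOUT_S):
    """Map each drive class to whether its worst case outlasts the timeout."""
    return {
        name: exceeds_timeout(secs, timeout_s)
        for name, secs in RECOVERY_WINDOWS_S.items()
    }


if __name__ == "__main__":
    for name, timed_out in report().items():
        verdict = "would time out" if timed_out else "fits within timeout"
        print(f"{name}: {verdict}")
```

With the 5-second value, both cases come back True: even the 7-second
TLER worst case is longer than the timeout.  Calling report(10) would
cover the TLER case but still not the desktop one.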

Soren and/or others, please increase this timeout value.  Five seconds
has now been deemed too aggressive a default.  And please consider
migrating the timeout value into a sysctl.

P.S. -- I do not consider any of this a reason to avoid Western Digital
drives.  But I would warn users to be a little more cautious before
reporting ATA timeouts when newer (circa 2007 and later) WD drives
are in use.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



More information about the freebsd-stable mailing list