Every 12-hrs -- "ad0: TIMEOUT - WRITE DMA"

Sun Oct 4 10:24:17 UTC 2009

This is a reply to a very old thread.

I decided to reply because

 1. nobody has mentioned the real cause of the problem yet
    (some answers were misleading or even outright wrong),

 2. I've experienced the same problem in the past few weeks,

 3. my findings might be useful for other people who are
    googling for the symptoms (like me) and stumble across
    this thread.

The drive in question seems to be very popular, especially
in low-end private servers and home machines.  It is very
reliable; I still have these and similar ones in production.
The drive of mine that exhibited the problem recently is
this:

ad0: 24405MB <IBM DJNA-352500 J51OA30K> at ata0-master UDMA66

It is powering a small server running DNS, SMTP, WWW and
other things for several private domains.  The load is very
low, most of the time.

Now for the actual problem:

V.I.Victor <idmc_vivr at intgdev.com> wrote:
 > For the last 4-days, our (otherwise OK) 5.4-RELEASE machine has been
 > reporting:
 > 
 > Feb 12 12:08:05 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2701279
 > Feb 13 00:08:51 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2701279
 > Feb 13 12:09:38 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2963331
 > Feb 14 00:10:24 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2705947
 > Feb 14 12:11:09 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2706335
 > Feb 15 00:12:02 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2832383
 > Feb 15 12:12:57 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=139839
 > Feb 16 00:13:50 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=131391
 > Feb 16 12:14:36 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=131391
 > 
 > The system was created Jan 08 and, prior to the above, the ad0: timeout had
 > only been reported twice:
 > 
 > Jan 25 11:43:34 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=17920255
 > Feb 6 11:59:42 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2832383
 > [...]
 > ad0: 14664MB <IBM-DJNA-351520/J56OA30K> [29795/16/63] at ata0-master UDMA66

First of all:  The disk is *not* dying.  "SMART" won't reveal anything.
The behaviour is perfectly normal for IBM-DJNA-3* type disks.

When those disks are used in continuous operation (24/7), they
will go into automatic maintenance mode after 6 days.  This is
a kind of short self-test and recalibration to ensure reliable
continuous operation.  It is repeated every 6 days after that,
ad infinitum.

Note that there are 12 days (plus a few minutes) between your
Jan 25 and Feb 6 incidents, and 6 days (plus a few minutes)
between the Feb 6 and Feb 12 incidents.
An automatic maintenance on Jan 31 apparently finished successfully
without a timeout message.
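
The gaps can be checked against the quoted log lines.  What follows
is only a minimal Python sketch: the log does not state a year, so
2006 is an arbitrary assumption (no Feb 29 falls inside the range, so
the differences do not depend on it), and the helper name parse() is
made up for illustration.

```python
from datetime import datetime

# Assumption: the syslog lines carry no year; 2006 is an arbitrary choice.
YEAR = 2006


def parse(stamp):
    """Parse a syslog-style timestamp such as 'Feb 6 11:59:42'."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


# The two isolated incidents and the first of the 12-hourly series,
# copied from the quoted message.
isolated = [parse(s) for s in ("Jan 25 11:43:34",
                               "Feb 6 11:59:42",
                               "Feb 12 12:08:05")]

# Differences between consecutive incidents.
gaps = [later - earlier for earlier, later in zip(isolated, isolated[1:])]
for gap in gaps:
    print(gap)
```

The two gaps come out as 12 days plus about 16 minutes, and 6 days
plus about 8 minutes, i.e. multiples of 6 days to within minutes.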

Normally the drive will wait until it detects an idle period,
then perform the maintenance, then continue normal operation.
Maintenance mode involves a short spin down / spin up cycle.

However, if the drive receives a command during spin down, it
will abort maintenance mode, spin up (which takes a few seconds
and might cause a "timeout" to the operating system), then
perform the command, and RETRY MAINTENANCE AFTER 12 HOURS.

So that's where your timeout messages every 12 hours come from.
This is not in any way harmful.  Eventually the maintenance
will succeed (i.e. the idle period is long enough to finish),
then you won't get timeout messages anymore for at least 6 days.
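
The 12-hour spacing is visible in the quoted log itself.  A small
Python sketch along the same lines (again, the year 2006 is an
assumption, since the log lines do not carry one):

```python
from datetime import datetime, timedelta

YEAR = 2006  # assumption: the year is not part of the log lines

# Timestamps of the nine consecutive messages from Feb 12 to Feb 16.
stamps = [
    "Feb 12 12:08:05", "Feb 13 00:08:51", "Feb 13 12:09:38",
    "Feb 14 00:10:24", "Feb 14 12:11:09", "Feb 15 00:12:02",
    "Feb 15 12:12:57", "Feb 16 00:13:50", "Feb 16 12:14:36",
]
times = [datetime.strptime(f"{YEAR} {s}", "%Y %b %d %H:%M:%S")
         for s in stamps]

# Interval between each message and the next one.
intervals = [b - a for a, b in zip(times, times[1:])]

# How far each interval exceeds 12 hours, in seconds.
excess = [(iv - timedelta(hours=12)).total_seconds() for iv in intervals]

for iv, ex in zip(intervals, excess):
    print(iv, f"(12 h + {ex:.0f} s)")
```

Every interval is 12 hours plus 45 to 55 seconds, which fits a fixed
12-hour retry timer much better than any daily cron schedule would.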

You mentioned that the problem appeared (and disappeared) when
you set the machine's clock.  This is easy to explain, too.
The hard disk has its own clock which is not synchronized with
the system clock.  It starts counting from zero when the disk
is powered up.  By changing the system's clock, you shift the
offset between it and the drive's clock.

That means that periodic activity will happen at different times,
relative to the drive's clock.  Such periodic activity includes
cron jobs and other things.  For example, sendmail's queue runner
wakes up every 30 minutes by default.  Many other daemons also
perform periodic activity.  Any of that may happen to start in
the middle of the idle period that the drive chose for its
maintenance, thus interrupting it, as described above.

If the offset between the system's clock and the drive's clock
changes, such periodic activity will happen at different times
from the drive's point of view, so the likelihood that the drive
can complete its maintenance changes (for better or worse).
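
The effect of the offset can be illustrated with a toy model in
Python.  This is only a sketch under made-up assumptions: the start
and length of the drive's idle window are hypothetical (I don't know
how long the maintenance actually takes), is_interrupted() is a name
invented for this example, and only the 30-minute period corresponds
to something mentioned above (sendmail's queue runner).

```python
def is_interrupted(window_start, window_len, period, offset):
    """Return True if a periodic job fires inside the drive's idle window.

    All times are seconds on the drive's clock.  The job fires at
    offset + k * period for every integer k, where 'offset' stands for
    the difference between the system clock and the drive clock.
    """
    # Ceiling division: number of periods from 'offset' to the first
    # firing at or after window_start (may be negative).
    k = -((offset - window_start) // period)
    first_fire = offset + k * period
    return first_fire < window_start + window_len


PERIOD = 30 * 60      # queue runner interval, 30 minutes
WINDOW_START = 1000   # hypothetical start of the idle window
WINDOW_LEN = 20       # hypothetical maintenance duration in seconds

# Same job, same drive, only the clock offset differs.
before = is_interrupted(WINDOW_START, WINDOW_LEN, PERIOD, offset=0)
after = is_interrupted(WINDOW_START, WINDOW_LEN, PERIOD, offset=1010)
print(before, after)
```

With offset 0 the job misses the window; with offset 1010 it lands
10 seconds into it.  Nothing about the job or the drive changed,
only the relation between the two clocks.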

Unfortunately there is no way to configure or disable that
maintenance mode.  The only way to control it to some extent is
to periodically force a spin-down ("standby" ATA command) when
you know that the drive is idle.  This usually requires
unmounting the filesystems, though, because otherwise you can't
guarantee that the drive will stay idle for long enough.

You can read IBM's official documentation here:

http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A7900618DED/$file/djna_sp.pdf

If that link doesn't work anymore, google for this:
"OEM HARD DISK DRIVE SPECIFICATIONS for DJNA-3xxxxx"

The maintenance mode is described in chapter 10.12 (page 99).

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Registergericht Muenchen, HRA 74606; management:
secnetix Verwaltungsgesellsch. mbH, commercial register: Registergericht
München, HRB 125758; managing directors: Maik Bachmann, Olaf Erb,
Ralf Gebhart

FreeBSD services, products and more:  http://www.secnetix.de/bsd

"I started using PostgreSQL around a month ago, and the feeling is
similar to the switch from Linux to FreeBSD in '96 -- 'wow!'."
        -- Oddbjorn Steffensen