kern/170675: ata(4) hangs system, causing data loss

Dieter freebsd at sopwith.solgatos.com
Thu Aug 16 19:20:09 UTC 2012


>Number:         170675
>Category:       kern
>Synopsis:       ata(4) hangs system, causing data loss
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Thu Aug 16 19:20:08 UTC 2012
>Closed-Date:
>Last-Modified:
>Originator:     Dieter
>Release:        FreeBSD 8.2  amd64
>Organization:
>Environment:
FreeBSD 8.2  amd64
>Description:
FreeBSD 8.2  amd64
<nVidia nForce CK804 SATA300 controller>
ad6 is a vanilla SATA drive

/var/log/messages contains:
    ad6: FAILURE - device detached
No other clues are provided.  It would be useful if ata(4) told
us *why* it decided to detach the drive.

Over 24 hours later, the system suddenly hung, for no obvious reason.
Thinking that perhaps ata(4) was having some new problem with ad6,
I unplugged ad6's data cable.  The system then recovered.

However, the system was completely hung for 19 minutes, and perhaps
would have remained hung forever without manual intervention.
THIS RESULTED IN THE UNNECESSARY LOSS OF INCOMING DATA!  COMPLETELY
UNACCEPTABLE!

Other than the device detached message, ata(4) did not output
any info at all about this problem.

There is no reason that ata(4) should have to hang the entire
system for even a millisecond, much less 19 minutes, just because
it is having some problem with one disk drive.  (ad6 contained only
user data, no system partitions or swap.)

News Flash: hardware isn't perfect and never will be.  Hardware
sometimes hiccups or fails altogether.  FreeBSD needs to deal with
failures gracefully and continue servicing the remaining hardware.
The phrase "can't walk and chew gum at the same time" comes to mind.

I suspect that ata(4) turned off ALL interrupts (why all of them?
why not just turn off interrupts for the device being serviced?)
and then went into an infinite loop.

>How-To-Repeat:

>Fix:
(1) Find the offending infinite loop (or whatever) in ata(4) and fix it.

(2) Don't turn off all interrupts, just turn off interrupts for the
device being serviced.
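
The two suggestions above could look roughly like the following.  This
is only a userland sketch to illustrate the idea, not ata(4) code:
struct fake_dev, irq_mask, dev_ready() and wait_ready() are all
hypothetical names invented here, and the "device" is simulated with a
counter.  The wait loop is bounded by max_polls, and only the bit
belonging to the device being serviced is masked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model; none of these names come from ata(4). */
struct fake_dev {
    int  id;          /* index of this device's bit in irq_mask (0..31) */
    int  busy_polls;  /* polls left until ready; negative means wedged */
    bool detached;    /* set when the driver gives up on the device */
};

/* One bit per device: a set bit means that device's interrupt is masked. */
static uint32_t irq_mask;

static void mask_irq(const struct fake_dev *dev)
{
    irq_mask |= (UINT32_C(1) << dev->id);
}

static void unmask_irq(const struct fake_dev *dev)
{
    irq_mask &= ~(UINT32_C(1) << dev->id);
}

/* Simulated status poll: a wedged device never reports ready. */
static bool dev_ready(struct fake_dev *dev)
{
    if (dev->busy_polls < 0)
        return false;
    if (dev->busy_polls == 0)
        return true;
    dev->busy_polls--;
    return false;
}

/*
 * Wait for the device with a bounded number of polls.
 * Returns 0 when the device became ready, -1 on timeout.
 * Only this device's interrupt bit is touched; on timeout the device
 * is marked detached and its own bit is left masked.
 */
static int wait_ready(struct fake_dev *dev, int max_polls)
{
    assert(dev != NULL && dev->id >= 0 && dev->id < 32);

    mask_irq(dev);
    for (int i = 0; i < max_polls; i++) {
        if (dev_ready(dev)) {
            unmask_irq(dev);
            return 0;
        }
    }
    dev->detached = true;   /* give up instead of spinning forever */
    return -1;
}
```

A caller would invoke wait_ready(&dev, limit) and treat -1 as "this
one drive is gone" while continuing to service everything else.  A real
driver would presumably also log why it gave up, and would need a
timer-based rather than count-based limit.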


>Release-Note:
>Audit-Trail:
>Unformatted:

