gmirror or ata problem

Fluffles etc at fluffles.net
Wed Jan 31 23:11:38 UTC 2007


Pawel Jakub Dawidek wrote:
> On Wed, Jan 31, 2007 at 09:12:02PM +0100, Simon L. Nielsen wrote:
>   
>> On 2007.01.30 09:51:14 +0100, Oliver Fromme wrote:
>>
>>     
>>> This is strange.  gmirror just detached one of its disks
>>> for no apparent reason.  I've built a mirror consisting of
>>> the components ad0 and ad1 (both SATA drives).  It has
>>> been running fine.  This is RELENG_6 from 2006-12-20.
>>>
>>> Yesterday evening ad1 was detached.  There is no other
>>> error message logged on console or in the logs (i.e. no
>>> I/O error such as a bad sector or anything).  There was
>>> no particularly high load at that time.  In fact, the
>>> machine had been under much higher load before, without
>>> anything bad happening.
>>>
>>> This is from the logs:
>>>
>>> Jan 29 19:10:13 pluto -- MARK --
>>> Jan 29 19:20:26 pluto kernel: ad1: FAILURE - device detached
>>> Jan 29 19:20:26 pluto kernel: subdisk1: detached
>>> Jan 29 19:20:26 pluto kernel: ad1: detached
>>> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot write metadata on ad1 (device=gm0, error=6).
>>> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot update metadata on disk ad1 (error=6).
>>> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Cannot update metadata on disk ad1 (error=6).
>>> Jan 29 19:20:26 pluto kernel: GEOM_MIRROR: Device gm0: provider ad1 disconnected.
>>> Jan 29 19:50:13 pluto -- MARK --
>>>       
>> I have seen similar problems on my graid3.  I think it's simply the
>> disk which stops responding to commands, or at least ata(4) can't talk
>> to the disk anymore...
>>
>> I see it on:
>>
>> ad10: 305245MB <WDC WD3200SD-01KNB0 08.05J08> at ata5-master SATA150
>> ad12: 305245MB <WDC WD3200SD-01KNB0 08.05J08> at ata6-master SATA150
>> ad14: 305245MB <WDC WD3200YS-01PGB0 21.00M21> at ata7-master SATA150
>>
>> After a reboot everything seems fine again and my RAID is rebuilt.
>>
>> I don't know why it happens, but it sucks :-/.  I'm running 7-CURRENT
>> BTW.
>>     
>
> It seems that when gmirror/graid3 writes to more than one disk at a
> time, this puts too much load on ata channel or something and ata
> disconnects the disk. I don't really know how it works exactly, but
> maybe some timeout should be increased in the ata code?
>   

In my experience even a single disk will time out: 5 seconds is just
not enough for a disk to spin up. Most disks need at least 10 seconds.
In ata-disk.c the request timeout is set to 5 seconds. With it set to
15 seconds, the ataidle sleep mode works perfectly. I think this should
be patched. Right now I would say ataidle is broken on FreeBSD, at least
without patching the source code.

For those who cannot wait for an official patch, try this:
- edit /usr/src/sys/dev/ata/ata-disk.c
- search for "timeout" (case-insensitive)
- you will find:     request->timeout = 5;
- change the value 5 to 15
- save, then run: cd /usr/src; make kernel KERNCONF=GENERIC
- after rebooting you can test ataidle; it should work perfectly, with
any GEOM RAID layer or on a single disk
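
The manual edit above could be scripted roughly as follows. This is only
a sketch: it assumes the line reads exactly "request->timeout = 5;" as
quoted in the steps, and bump_ata_timeout is a made-up helper name, not
something that ships with FreeBSD. If the pattern occurs on more than one
line, every match is changed, so check the diff it prints before
building.

```shell
#!/bin/sh
# Sketch: raise the ata-disk request timeout from 5 to 15 seconds.
# bump_ata_timeout is a hypothetical helper; the file to patch is passed in.
bump_ata_timeout() {
    src=$1

    # Stop if the expected line is missing (the source may differ
    # between branches).
    if ! grep -q 'request->timeout = 5;' "$src"; then
        echo "pattern not found in $src" >&2
        return 1
    fi

    # Keep a backup, then rewrite the file from the backup. This avoids
    # "sed -i", whose syntax differs between sed implementations.
    cp "$src" "$src.orig" || return 1
    sed 's/request->timeout = 5;/request->timeout = 15;/' "$src.orig" > "$src"

    # Show the change for review; diff exits non-zero when files differ.
    diff -u "$src.orig" "$src" || true
    return 0
}

# Intended use (run as root, then rebuild as in the steps above):
#   bump_ata_timeout /usr/src/sys/dev/ata/ata-disk.c
#   cd /usr/src; make kernel KERNCONF=GENERIC
```

The untouched copy stays next to the source as ata-disk.c.orig, so the
change is easy to back out if the longer timeout causes trouble.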

- Veronica


More information about the freebsd-geom mailing list