gmirror Cannot add disk ad5 to gm0 (error=22)
Miroslav Lachman
000.fbsd at quip.cz
Thu Aug 3 09:12:46 UTC 2006
Rick C. Petty wrote:
> On Thu, Aug 03, 2006 at 12:43:12AM +0200, Miroslav Lachman wrote:
>
>>Something is definitely wrong. Gmirror status still shows 0% after a
>>couple of minutes (normally synchronization progresses about 1% per minute)
>
>
> Under what conditions do you define "normally"? I think you can tweak
> the numbers to make it go faster or slower, and I think it's dependent
> upon (disk) idle time.
normally = a few days ago, same HW, same BIOS settings, etc. A whole
synchronization of the 250GB disks finished in about 90 minutes.
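As a rough cross-check of those numbers (only a sketch, assuming decimal
units where 1 GB = 1000 MB; the function names are made up for
illustration): 250GB in 90 minutes is an average of roughly 46 MB/s, which
fits the usual 40MB/s, while at the 0.45 MB/s that systat reports now a
full pass would take several days.

```sh
#!/bin/sh
# Rough throughput arithmetic. Assumes decimal units (1 GB = 1000 MB).

# rate_mb_s SIZE_GB MINUTES -> average rate in MB/s
rate_mb_s() {
    LC_ALL=C awk -v gb="$1" -v min="$2" \
        'BEGIN { printf "%.1f\n", gb * 1000 / (min * 60) }'
}

# hours_at_rate SIZE_GB MB_PER_S -> hours needed for one full pass
hours_at_rate() {
    LC_ALL=C awk -v gb="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", gb * 1000 / rate / 3600 }'
}

rate_mb_s 250 90        # prints 46.3
hours_at_rate 250 0.45  # prints 154
```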
>>systat -vmstat shows less than 1MB/s instead of the usual 40MB/s, but 100% busy.
>>
>>Disks ad4 ad5
>>KB/t 121 128
>>tps 4 4
>>MB/s 0.45 0.45
>>% busy 83 103
>
>
> What other activity is happening on the box? Are you in the middle of a
> background fsck?
Almost no other activity. The system has apache, mysql, postfix etc.
installed, but is not serving any requests. Fsck was not running.
> What does the output of "atacontrol mode ad4" (and ad5) show? Are you
> sure your "normal" synchronization happened when you were in IDE mode
> instead of AHCI?
Yes, "normal" synchronization was in IDE mode. IDE mode was set more
than a week ago, and while playing with gmirror I have run the
synchronization many times since.
# atacontrol mode ad4
current mode = SATA150
# atacontrol mode ad5
current mode = SATA150
>>Is there any chance of finding the source of the problems without
>>step-by-step replacement of each component?
>
>
> That depends upon the problems. To diagnose anything, you need to be
> able to reliably bring down the mirror-- e.g. heavy disk activity.
>
>
>>I can't believe that I have bad cables in
>>4 new machines or bad hard drives in each machine... :o(
>
>
> I bought identical machines (cpus, boards, disks, cables, etc.) and had
> different results on each. Especially when you buy identical stuff,
> there is a small probability that they'll all have the same problems--
> for example, a bad batch of disks. In your case, I'd investigate which
> steps you have to perform to repeatably cause the failures. On my
> systems, the heavier the disk load, the higher the probability of failure.
> Upgrading to the latest 6.1-STABLE might help in some cases.
Same here - the heavier the disk load, the more frequent the failures.
After a few crashes, a disk disappeared in the middle of gmirror
synchronization (heavy disk load). The disk was replaced with a new one
without success, then the whole server was replaced and it ran fine for
about 1 week under heavy test load (concurrent copying of the ports tree
in an infinite loop). Now the problem mentioned above has occurred.
This time it really does seem to be a disk problem. Synchronization was
running the whole night with tens or hundreds of messages like this:
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=9719424
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: error issuing SETFEATURES SET TRANSFER MODE command
After six hours I got this message from smartd:
Device: /dev/ad5, FAILED SMART self-check. BACK UP DATA NOW!
Device: /dev/ad5, 52 Currently unreadable (pending) sectors
Device: /dev/ad5, 52 Offline uncorrectable sectors
90 minutes later the system rebooted itself and tried to rebuild provider
ad5, and /var/log/messages is full of:
ad5: FAILURE - SETFEATURES SET TRANSFER MODE status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: FAILURE - SETFEATURES ENABLE RCACHE status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: FAILURE - SETFEATURES ENABLE WCACHE status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: FAILURE - SET_MULTI status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED>
ad5: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
One hour later:
ad5: FAILURE - ATA_IDENTIFY status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED> LBA=0
ad5: FAILURE - ATAPI_IDENTIFY status=71<READY,DMA_READY,DSC,ERROR> error=4<ABORTED> LBA=0
smartd[506]: Device: /dev/ad5, failed to read SMART Attribute Data
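To put a number on "tens or hundreds" of such messages, something like the
following could be used. It is only a sketch: count_ata_errors is a
made-up helper name, and it assumes the kernel messages keep the
"adN: TIMEOUT - " and "adN: FAILURE - " wording shown above.

```sh
#!/bin/sh
# Count ATA TIMEOUT and FAILURE lines for one device in a log read on stdin.
# $1 = device name as it appears in the log, e.g. ad5.
count_ata_errors() {
    awk -v dev="$1" '
        index($0, dev ": TIMEOUT - ") { timeouts++ }
        index($0, dev ": FAILURE - ") { failures++ }
        END { printf "timeouts=%d failures=%d\n", timeouts, failures }
    '
}

# Hypothetical usage:
#   count_ata_errors ad5 < /var/log/messages
```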
My MRTG graphs show the disk temperature (38°C) and the Reallocated Sector
Count, which has been increasing since the synchronization started. After
5 hours the number of reallocated sectors went above 2000 (out of range
of the graph)!
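For reading the same counter without the graph, a minimal sketch follows.
It assumes smartctl from smartmontools (the package that provides smartd)
and its usual 10-column attribute table with the raw value in the last
column; reallocated_sectors is a made-up helper name.

```sh
#!/bin/sh
# Print the raw Reallocated_Sector_Ct value from "smartctl -A" output on stdin.
# Assumption: standard attribute table layout, raw value in column 10.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Hypothetical usage (needs root):
#   smartctl -A /dev/ad5 | reallocated_sectors
```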
After a manual reboot, there is no ad5 device. I hope a new drive helps,
but I am still nervous, because I have had similar trouble with 2 machines
(both were replaced with new ones, so I have played with 4 machines)...
Thank you for your help.
Miroslav Lachman