disk "flipped" - a known problem?
Christian Gusenbauer
c47g at gmx.at
Mon Jan 21 16:35:50 UTC 2013
Hi!
On Sunday 20 January 2013 20:00:15 Andriy Gapon wrote:
> Today something unusual happened on one of my machines:
> kernel: (ada0:ahcich0:0:0:0): lost device
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: (aprobe1:ahcich0:0:15:0): NOP. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
> kernel: (aprobe1:ahcich0:0:15:0): CAM status: Command timeout
> kernel: (aprobe1:ahcich0:0:15:0): Error 5, Retries exhausted
> kernel: cam_periph_alloc: attempt to re-allocate valid device ada0 rejected flags 0x18 refcount 1
> kernel: adaasync: Unable to attach to new device due to status 0x6
>
> It looks like the disk disappeared from the bus and then re-appeared on the
> bus, but not to the OS.
>
> One of the partitions that the disk hosted was a swap partition and it
> seems to be the cause of some of the following consequences.
>
> The consequences:
>
> * ZFS properly noticed disappearance of the disk, but its diagnostic was a
> little bit misleading:
>
> pool: pond
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
> Sufficient replicas exist for the pool to continue functioning in a
> degraded state.
> action: Online the device using 'zpool online' or replace the device with
> 'zpool replace'.
> scan: scrub repaired 0 in 8h55m with 0 errors on Sat Dec 22 12:06:30 2012
> config:
>
>         NAME                                            STATE     READ WRITE CKSUM
>         pond                                            DEGRADED     0     0     0
>           mirror-0                                      DEGRADED     0     0     0
>             12725235722288301230                        REMOVED      0     0     0  was /dev/gptid/fcf3558b-493b-11de-a8b9-001cc08221ff
>             gptid/48782c6e-8fbd-11de-b3e1-00241d20d446  ONLINE       0     0     0
>
> Yes, I agree that the disk got removed/lost, but disagree that "the
> administrator" did it.
>
> * geom_event thread started consuming 100% of CPU in g_wither_washer()
>
> * /dev/ada0 disappeared but camcontrol devlist still reported ada0:
> <ST3500410AS CC34> at scbus0 target 0 lun 0 (pass0,ada0)
>
> * As seen in the system messages, the CAM layer refused to re-attach the disk
>
> * The gpart command would just crash
>
>
> So, I can explain the behavior of the geom_event thread: apparently
> swapgeom_orphan doesn't do anything that is really meaningful to GEOM, and
> so g_wither_washer is stuck waiting until the swap consumer goes away
> (drops its access bits).
>
> (Another sad thing about this state is that I couldn't swapoff the device,
> because there was no device entry.)
>
> I am not sure if the "attempt to re-allocate valid device" failure was
> caused by this, but it could be, if something in CAM layer was waiting for
> GEOM layer to be done with the disk.
>
> It would be nice if the swap code properly supported disappearance of the
> underlying disks, especially in this case, where the swap was never
> actually used or touched at all (a few hours after reboot, on a completely
> idle system).
I don't know if it's related, but my new 2 TB WD Green hard disk has also
vanished three times during the last couple of weeks. Some people over at
hackers@ told me that this might be due to bad blocks on the disk, but
unfortunately (or luckily?) none of the SMART tests found any errors :-(. So
I wonder whether this is a hardware or a software problem. It happened on
9.1-STABLE while I was copying data from/to that hard disk (UFS).
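
To check whether such disappearances leave the same traces as in the quoted log, a minimal sketch along the following lines could count the relevant kernel messages per device. The two patterns are modelled only on the messages quoted above, so treating them as the general wording is an assumption, and `summarize` is a hypothetical helper name.

```python
import re

# Both patterns are assumptions modelled on the kernel messages quoted above;
# other drivers or releases may word these messages differently.
LOST = re.compile(r"\((?P<dev>[a-z]+\d+):[^)]*\): lost device")
REJECTED = re.compile(
    r"attempt to re-allocate valid device (?P<dev>[a-z]+\d+) rejected"
)


def summarize(lines):
    """Count 'lost device' and rejected re-attach messages per device name."""
    summary = {}
    for line in lines:
        for key, pattern in (("lost", LOST), ("reattach_rejected", REJECTED)):
            match = pattern.search(line)
            if match:
                counts = summary.setdefault(
                    match.group("dev"), {"lost": 0, "reattach_rejected": 0}
                )
                counts[key] += 1
    return summary
```

It could be fed any iterable of log lines, for example an open file object for wherever the kernel messages end up on the machine in question (often /var/log/messages, but that depends on the syslog configuration). A device that shows a "lost" count with no rejected re-attach would suggest a different failure path from the one described above.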
Ciao,
Christian.
More information about the freebsd-current
mailing list