Troubleshooting a gmirror disk marked broken

Thu Jun 27 03:09:34 UTC 2013

On Wed, Jun 26, 2013 at 9:38 PM, Nikola Pavlović <nzp at riseup.net> wrote:

> Hi,
>
> Last night during a massive (~1 year worth :| )
> portsnap fetch
>
> the server went unresponsive and ssh eventually disconnected.  I decided
> to leave it during the night, and, sure enough, the situation was the
> same in the morning, so I had to do a hard reset.  It came back up, but
> one of the two gmirror components was marked as broken and deactivated.
>
> The hang happened during the 'fetching new files or ports' (~24000 of
> them, there are currently ~10000 snapshots in /var/db/portsnap) phase
> of portsnap fetch.
>
> /var/log/messages was completely silent during the period between the
> hang and the reset.
>
> Googling around I found a mention that it's possible to sometimes get a
> 'blip'[*] during busy periods, so I decided to just bite the bullet and
> reinsert the component with
> # gmirror forget gm0
> # gmirror clean ad4
> # gmirror insert gm0 ad4
>
> Currently it's syncing and things *seem* OK.  My question is how much
> should I be worried and what could be the cause of this?  Is it possible
> that ports snapshot fetching caused this, or that perhaps it was the other
> way around (a failing disk causing the machine to choke during the huge
> portsnap fetch)?  How to proceed? :)
>

The messages log definitely shows problems with your I/O.  The SMART logs of
the disks are also at least mildly concerning and indicate that the drives
are in the early stages of dying.  Some hard disk deaths take years to
complete.  Expect random glitches and intermittently reduced performance as
the degradation continues.  You might be able to alleviate some of this by
switching to the AHCI driver and bumping up timeouts, but at the end of the
day two flaky disks in a mirror don't inspire confidence.
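
As a rough way to keep an eye on the SMART data mentioned above, something
like the following sketch could be used.  It assumes smartmontools is
installed and that the attribute table has the usual `smartctl -A` layout
(name in column 2, raw value in column 10).  The function name smart_warn and
the choice of attributes to watch are illustrative assumptions, not taken from
the logs discussed in this thread.

```shell
#!/bin/sh
# Sketch: flag SMART attributes that often accompany a failing disk.
# Reads a `smartctl -A` style attribute table on stdin.
# smart_warn and the attribute selection below are hypothetical choices
# made for illustration.

smart_warn() {
    awk '
        # Attribute rows start with a numeric ID; the header line does not.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            watched = ($2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/)
            # A non-zero raw value on a watched attribute is worth a look.
            if (watched && $10 + 0 > 0) {
                printf "%s raw=%s\n", $2, $10
                bad = 1
            }
        }
        # Exit 1 if anything was flagged, 0 otherwise.
        END { exit (bad ? 1 : 0) }
    '
}
```

With the function defined, a check of one mirror component might look like
`smartctl -A /dev/ad4 | smart_warn`, repeated for the other disk.  A clean
result does not prove a drive is healthy; it only means none of the watched
counters has moved yet.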

-- 
Adam Vande More