hast: can't restore after disk failure
Jeremy Chadwick
jdc at koitsu.org
Wed Jun 12 09:37:01 UTC 2013
On Wed, Jun 12, 2013 at 11:44:54AM +0300, Mikolaj Golub wrote:
> On Wed, Jun 12, 2013 at 12:23:52AM +0400, Dmitry Morozovsky wrote:
> > On Tue, 11 Jun 2013, Mikolaj Golub wrote:
> >
> > > On Tue, Jun 11, 2013 at 12:40:08AM +0400, Dmitry Morozovsky wrote:
> > > > On Mon, 10 Jun 2013, Mikolaj Golub wrote:
> > > >
> > > > [snipall]
> > > >
> > > > > > Jun 10 16:56:20 <console.info> cthulhu3 kernel: Jun 10 16:56:20 <daemon.err>
> > > > > > cthulhu3 hastd[765]: [d1] (secondary) Worker process exited ungracefully
> > > > > > (pid=14380, exitcode=66).
> > > > > >
> > > > > > Any hints? Thanks!
> > > > >
> > > > > Have you run hastctl create to initialize metadata?
> > > >
> > > > Yes, but did it naively:
> > > >
> > > > hastctl create d1
> > >
> > > No errors?
> >
> > None visible, but the hast instance exits ungracefully
> >
> > > > and status still reported 0 as provider size...
> > >
> > > I assume /dev/ada1p1 is present and readable/writable?
> > >
> > > The symptoms look as if it did not exist.
> >
> > nope, it does:
> >
> > root@cthulhu3:/# diskinfo /dev/ada1p1
> > /dev/ada1p1 512 999654686720 1952450560 0 1048576 1936954 16 63
> > root@cthulhu3:/# diskinfo /dev/ada0p1
> > /dev/ada0p1 512 999653638144 1952448512 0 1048576 1936952 16 63
> >
>
> Hm, looking in the source where this error is generated:
>
> cthulhu3 hastd[14379]: [d1] (secondary) Unable to read metadata from /dev/ada1p1: No such file or directory.
>
> it looks like hastd successfully read metadata from disk but failed to
> parse it (did not find an entry). This usually happens when metadata
> is not initialized by `hastctl create`.
>
> Does `hastctl dump d1` not work either?

Note up front: I have zero familiarity with hast stuff. I'm just
looking at source code, because your comment seems to indicate that
ENOENT (errno 2; No such file or directory) is actually false/incorrect.

I did spend almost 30 minutes digging through the hastd code. This is
hard to follow -- very specifically, the error/errno situational code.
It's a very deep rabbit hole. Variable names are common or re-used
(legitimately due to local scope), and the actual error that gets
printed comes directly from the global errno variable.

I honestly cannot see how nv->nv_error (which is what nv_error()
returns) gets set to ENOENT within the function call stack:

- metadata_read() is what prints the error (line 152 in nv.c)
- Error printing is done by pjdlog_errno(), which uses the global errno
  to print its errors
- nv = nv_ntoh(eb)
- nv_ntoh() sets nv->nv_error to 0 initially, but then calls
  nv_validate() later on, which can modify nv->nv_error
- nv_validate() explicitly sets error (which later can get assigned
  to nv->nv_error) to EINVAL in many cases, but not ENOENT.

Therefore, I am honestly not sure how ENOENT gets returned to the user
in this case. It looks like it's a misleading errno and is probably
meant to be something else. If it's correct, I would absolutely love
for someone to show me how/where.
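
For illustration only, here is a minimal C sketch of the general pattern
discussed above: a name/value container with a sticky error field whose
value is copied into the global errno just before the message is printed.
This is NOT hastd's code. Every name in it (demo_nv, demo_nv_get,
demo_nv_error, demo_metadata_read, the "resource" key) is hypothetical.
It only shows that, in a design like this, a lookup of a missing entry is
one place an ENOENT could originate, which would fit the "did not find an
entry" reading. Whether the real nv code does this is not confirmed here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical name/value container; NOT hastd's struct nv. */
struct demo_nv {
	const char *names[4];
	const char *values[4];
	size_t count;
	int error;		/* sticky: the first error recorded wins */
};

/* Look up a name; on a miss, record ENOENT in the sticky error field. */
static const char *
demo_nv_get(struct demo_nv *nv, const char *name)
{
	size_t i;

	assert(nv != NULL);
	for (i = 0; i < nv->count; i++) {
		if (strcmp(nv->names[i], name) == 0)
			return (nv->values[i]);
	}
	if (nv->error == 0)
		nv->error = ENOENT;	/* assumption: a missing entry maps to ENOENT */
	return (NULL);
}

/* Accessor for the sticky error. */
static int
demo_nv_error(const struct demo_nv *nv)
{
	return (nv->error);
}

/*
 * Fetch one required entry (the key "resource" is made up), then report
 * any recorded error through the global errno. Returns 0 or -1.
 */
static int
demo_metadata_read(struct demo_nv *nv, const char *path)
{
	(void)demo_nv_get(nv, "resource");
	if (demo_nv_error(nv) != 0) {
		errno = demo_nv_error(nv);
		fprintf(stderr, "Unable to read metadata from %s: %s.\n",
		    path, strerror(errno));
		return (-1);
	}
	return (0);
}
```

With an empty container, demo_metadata_read() fails and the message ends
in the system's ENOENT text, even though nothing is wrong with the path
itself. The errno describes the missing entry, not the device.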

The code is here:

http://svnweb.freebsd.org/base/stable/9/sbin/hastd/

--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |