hast: can't restore after disk failure

Wed Jun 12 09:37:01 UTC 2013

On Wed, Jun 12, 2013 at 11:44:54AM +0300, Mikolaj Golub wrote:
> On Wed, Jun 12, 2013 at 12:23:52AM +0400, Dmitry Morozovsky wrote:
> > On Tue, 11 Jun 2013, Mikolaj Golub wrote:
> > 
> > > On Tue, Jun 11, 2013 at 12:40:08AM +0400, Dmitry Morozovsky wrote:
> > > > On Mon, 10 Jun 2013, Mikolaj Golub wrote:
> > > > 
> > > > [snipall]
> > > > 
> > > > > > Jun 10 16:56:20 <console.info> cthulhu3 kernel: Jun 10 16:56:20 <daemon.err> 
> > > > > > cthulhu3 hastd[765]: [d1] (secondary) Worker process exited ungracefully 
> > > > > > (pid=14380, exitcode=66).
> > > > > > 
> > > > > > Any hints? Thanks!
> > > > > 
> > > > > Have you run hastctl create to initialize metadata?
> > > > 
> > > > Yes, but did it naively:
> > > > 
> > > > hastctl create d1
> > > 
> > > No errors?
> > 
> > no visible, but hast instance ungracefully exits
> > 
> > > > and status still reported 0 as provider size...
> > > 
> > > I assume /dev/ada1p1 is present and readable/writable?
> > > 
> > > Symptoms are like if it did not exist.
> > 
> > nope, it does:
> > 
> > root@cthulhu3:/# diskinfo /dev/ada1p1
> > /dev/ada1p1     512     999654686720    1952450560      0       1048576 1936954 16      63
> > root@cthulhu3:/# diskinfo /dev/ada0p1
> > /dev/ada0p1     512     999653638144    1952448512      0       1048576 1936952 16      63
> > 
> 
> Hm, looking in the source where this error is generated:
> 
>   cthulhu3 hastd[14379]: [d1] (secondary) Unable to read metadata from /dev/ada1p1: No such file or directory.
>
> it looks like hastd successfully read metadata from disk but failed to
> parse it (did not find an entry). This usually happens when metadata
> is not initialized by `hastctl create`.
> 
> Does `hastctl dump d1' not work too?

Note up front: I have zero familiarity with hast stuff.  I'm just
looking at source code, because your comment seems to indicate that
ENOENT (errno 2; No such file or directory) is actually false/incorrect.

I did spend almost 30 minutes digging through the hastd code.  It is
hard to follow, specifically the error/errno handling.  It's a very
deep rabbit hole.  Variable names are common or re-used (legitimately,
due to local scope), and the error that actually gets printed comes
directly from the global errno variable.

I honestly cannot see how nv->nv_error (which is what nv_error()
returns) gets set to ENOENT within the function call stack:

- metadata_read() is what prints the error (line 152 in nv.c)
- The error is printed by pjdlog_errno(), which uses the global errno
  to build its message
- nv = nv_ntoh(eb)
- nv_ntoh() sets nv->nv_error to 0 initially, but later calls
  nv_validate(), which can modify nv->nv_error
- nv_validate() explicitly sets error (which later can get assigned
  to nv->nv_error) to EINVAL in many cases, but not ENOENT.
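
For illustration only, here is a minimal sketch of one way an ENOENT
could surface through this kind of "sticky error plus global errno"
reporting.  It is NOT the hastd code: struct store, store_get(),
read_metadata() and the field names are all invented, and it simply
assumes (going by Mikolaj's "did not find an entry" remark) that a
lookup helper records ENOENT when a name is missing.  Whether hastd
really does this is exactly what I could not confirm.

```c
/* Hypothetical sketch, not hastd source; every name here is invented. */
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct kv {
	const char *name;
	const char *value;
};

struct store {
	const struct kv *entries;
	size_t count;
	int error;	/* first error recorded, 0 if none ("sticky") */
};

/* Look up a name; on a miss, record ENOENT (once) and return NULL. */
const char *
store_get(struct store *st, const char *name)
{
	for (size_t i = 0; i < st->count; i++) {
		if (strcmp(st->entries[i].name, name) == 0)
			return (st->entries[i].value);
	}
	if (st->error == 0)
		st->error = ENOENT;	/* assumption: missing entry => ENOENT */
	return (NULL);
}

/*
 * Fetch the required fields without checking each call, then check the
 * sticky error once and report it through the global errno.
 */
int
read_metadata(struct store *st, const char *path)
{
	(void)store_get(st, "resource");	/* illustrative field names */
	(void)store_get(st, "datasize");
	if (st->error != 0) {
		errno = st->error;
		fprintf(stderr, "Unable to read metadata from %s: %s.\n",
		    path, strerror(errno));
		return (-1);
	}
	return (0);
}
```

With an empty store, read_metadata() returns -1 and the message ends
in the strerror() text for ENOENT, even though no file was ever
missing.  If something along these lines happens in hastd, the ENOENT
would come from a lookup step after nv_ntoh(), not from nv_validate().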

Therefore, I am honestly not sure how ENOENT gets returned to the user
in this case.  It looks like it's a misleading errno and is probably
meant to be something else.  If it's correct, I would absolutely love
for someone to show me how/where.

The code is here:

http://svnweb.freebsd.org/base/stable/9/sbin/hastd/

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |