hast: can't restore after disk failure

Wed Jun 12 10:41:51 UTC 2013

On Wed, Jun 12, 2013 at 01:03:33PM +0300, Mikolaj Golub wrote:
> On Wed, Jun 12, 2013 at 02:36:39AM -0700, Jeremy Chadwick wrote:
> 
> > I honestly cannot see how nv->nv_error (which is what nv_error()
> > returns) gets set to ENOENT within the function call stack:
> > 
> > - metadata_read() is what prints the error (line 152 in nv.c)
> > - Error printing done by pjdlog_errno(), which uses the global errno
> >   to print its errors
> > - nv = nv_ntoh(eb)
> > - nv_ntoh() sets nv->nv_error to 0 initially, but then calls
> >   nv_validate() later on, which can modify nv->nv_error
> > - nv_validate() explicitly sets error (which later can get assigned
> >   to nv->nv_error) to EINVAL in many cases, but not ENOENT.
> > 
> > Therefore, I am honestly not sure how ENOENT gets returned to the user
> > in this case.  It looks like it's a misleading errno and is probably
> > meant to be something else.  If it's correct, I would absolutely love
> > for someone to show me how/where.
> 
> nv_find() (which is used by nv_get_* functions) sets ENOENT when it
> fails.

How wonderful -- when I reviewed the code, I thought "Oh surely those
can't be responsible...".  I did see nv_find(), but I did not think
nv_get_*() would call that.  My fault/failure.
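
For my own benefit, here is a minimal sketch of the pattern Mikolaj is
describing: a getter that quietly goes through a find routine, which is
where ENOENT comes from.  This is NOT the hastd source; struct pair,
struct pairlist, pairlist_find() and pairlist_get_int() are names I
made up, and the "sticky" error field is only my assumption about how
something like nv->nv_error could end up holding the value.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a name/value pair; not the real nv layout. */
struct pair {
	const char *name;
	int value;
};

struct pairlist {
	const struct pair *pairs;
	size_t count;
	int error;	/* sticky error, in the spirit of nv->nv_error */
};

/* Return the named pair, or NULL with errno set to ENOENT. */
static const struct pair *
pairlist_find(const struct pairlist *pl, const char *name)
{
	size_t i;

	assert(pl != NULL && name != NULL);
	for (i = 0; i < pl->count; i++) {
		if (strcmp(pl->pairs[i].name, name) == 0)
			return (&pl->pairs[i]);
	}
	errno = ENOENT;
	return (NULL);
}

/* Getter: a failed lookup is recorded once in the sticky error field. */
static int
pairlist_get_int(struct pairlist *pl, const char *name)
{
	const struct pair *p;

	p = pairlist_find(pl, name);
	if (p == NULL) {
		if (pl->error == 0)
			pl->error = errno;
		return (0);
	}
	return (p->value);
}
```

With something shaped like this, a caller that only checks the error
field afterwards sees ENOENT even though no file was ever involved,
which matches the confusing message.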

> "No such file or directory" really looks confusing in this case. I am
> not sure which code from errno.h would be better here, though. ENOATTR?

Sorry to make this longer than it needs to be, but I'm brain dumping:

What exactly is the error condition that is happening in the above case?
All I read was that the partition size differed between nodes and that
this caused the issue?

IMO, that condition should be checked and handled elegantly, and the
error message should not use an errno at all but instead just tell
the user about the device size mismatch between nodes (for that specific
device) -- the device sizes must match between both nodes, correct?
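
Roughly what I have in mind, as a sketch only.  It assumes the sizes
really must match and that both sizes are known at the point of the
check; check_provider_size() and its parameters are hypothetical and do
not exist in hastd.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: compare the provider size reported by each node
 * and format a human-readable message instead of relying on errno.
 * Returns 0 when the sizes match, -1 otherwise.
 */
static int
check_provider_size(const char *resource, uint64_t local_size,
    uint64_t remote_size, char *msg, size_t msglen)
{
	assert(resource != NULL && msg != NULL && msglen > 0);

	if (local_size == remote_size) {
		msg[0] = '\0';
		return (0);
	}
	(void)snprintf(msg, msglen,
	    "resource %s: provider size mismatch (local %" PRIu64
	    " bytes, remote %" PRIu64 " bytes)",
	    resource, local_size, remote_size);
	return (-1);
}
```

The caller would log msg as-is, so the user sees both sizes and the
resource name rather than a strerror() string.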

There must be some kind of communication protocol between the nodes that
can indicate something along those lines.

If an errno is really needed, ENOATTR isn't relevant (that's referring
to extended filesystem attributes).  See intro(2) for the official
explanation of all of them.

I would choose EIO, ENXIO, ENOSPC, EOPNOTSUPP, or EPROTO.
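
To judge those by what the user would actually read, something like the
following prints the strerror() text for each one next to the current
ENOENT.  print_candidates() is just a throwaway helper of mine, and the
exact strings vary between systems, so I am not quoting any output.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* ENOENT is what the user sees today; the rest are the candidates. */
static const int candidates[] = {
	ENOENT, EIO, ENXIO, ENOSPC, EOPNOTSUPP, EPROTO
};

/* Print "number: message" for each code; return how many were printed. */
static size_t
print_candidates(FILE *fp)
{
	size_t i, n;

	assert(fp != NULL);
	n = sizeof(candidates) / sizeof(candidates[0]);
	for (i = 0; i < n; i++)
		fprintf(fp, "%d: %s\n", candidates[i],
		    strerror(candidates[i]));
	return (n);
}
```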

I have not looked at what OpenBSD and NetBSD have for errno.h.  That
might be good to do first.

Otherwise, Linux has some errno.h entries which look like they might be
more relevant, such as EBADFD, EREMOTEIO, or EMEDIUMTYPE (this one might
be a bit misleading).

http://www.virtsync.com/c-error-codes-include-errno

Some of these are even part of our recent BSM audit(2) stuff; check out
include/bsm/audit_errno.h (some are Solaris specific but look like they
might help, and I see some duplicates between those and what Linux has
too).

Important: I do not know the implications of adding/enhancing errno.
POSIX is involved, thus it would be wise to ask Bruce Evans.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |