HAST - detect failure and restore avoiding an outage?

Sun Feb 24 10:05:17 UTC 2013

On Sat, Feb 23, 2013 at 09:51:03PM +0100, Pawel Jakub Dawidek wrote:

> I'm fine with the patch except for the missing breaks in the switch
> added to hastd/primary.c.

Oops. Fixed. Thanks!
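
For illustration, here is a minimal sketch of the kind of switch in
question, with a break closing each case. The enum, struct and function
names are made up for this example and are not the identifiers used in
hastd/primary.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request types, for illustration only. */
enum io_cmd { IO_READ, IO_WRITE, IO_DELETE, IO_FLUSH };

/* Hypothetical per-type error counters. */
struct io_error_stats {
	uint64_t read_error;
	uint64_t write_error;
	uint64_t delete_error;
	uint64_t flush_error;
};

/* Count one failed request of the given type. */
static void
count_io_error(struct io_error_stats *st, enum io_cmd cmd)
{

	assert(st != NULL);
	switch (cmd) {
	case IO_READ:
		st->read_error++;
		break;	/* Without this, a read error would also count as a write error. */
	case IO_WRITE:
		st->write_error++;
		break;
	case IO_DELETE:
		st->delete_error++;
		break;
	case IO_FLUSH:
		st->flush_error++;
		break;
	}
}
```

Without the breaks, a single read error would fall through and bump
every counter below it.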

> I'm also wondering... You count all those errors separately just to
> print them as one number. If we count them separately already, let's
> print them separately, e.g.:
> 
> 	local i/o errors: read(0), write(3), delete(5), flush(9)

The idea was that hastd would provide all available counters, and
hastctl would show only an aggregated counter just to save screen
space. Anyone who wanted to write their own utility to monitor hastd,
talking directly to hastd via the socket, would still be able to see
all the counters separately.

But your idea of writing the errors in one string looks better, as it
both saves screen space and provides more detailed info. I would
prefer a slightly different output though:

  role: secondary
  provname: test
  localpath: /dev/md102
  extentsize: 2097152 (2.0MB)
  keepdirty: 0
  remoteaddr: kopusha:7771
  replication: memsync
  status: complete
  dirty: 0 (0B)
  statistics:
    reads: 13
    writes: 521
    deletes: 0
    flushes: 0
    activemap updates: 0
    local i/o errors:
      read: 13, write: 425, delete: 0, flush: 0

but I don't have a strong opinion and will be OK with yours if you
don't like my version.
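
To make the proposed layout concrete, here is a minimal sketch that
appends the two "local i/o errors" lines to status text already held in
a buffer. The struct and function names are hypothetical, chosen for
this example only, and this is not the code from the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four counters shown above. */
struct error_counters {
	uint64_t read_error;
	uint64_t write_error;
	uint64_t delete_error;
	uint64_t flush_error;
};

/*
 * Append the error lines to the NUL-terminated text in buf.
 * Returns 0 on success, -1 if the result would not fit.
 */
static int
append_io_errors(char *buf, size_t size, const struct error_counters *ec)
{
	size_t used;
	int n;

	assert(buf != NULL && ec != NULL);
	used = strlen(buf);
	assert(used < size);
	n = snprintf(buf + used, size - used,
	    "    local i/o errors:\n"
	    "      read: %" PRIu64 ", write: %" PRIu64
	    ", delete: %" PRIu64 ", flush: %" PRIu64 "\n",
	    ec->read_error, ec->write_error,
	    ec->delete_error, ec->flush_error);
	/* snprintf returns the length it needed, so this detects truncation. */
	if (n < 0 || (size_t)n >= size - used)
		return (-1);
	return (0);
}
```

Called with the counters read: 13, write: 425, delete: 0, flush: 0, it
produces the last two lines of the sample output above.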

> 
> BTW, why not count activemap update errors as write and flush errors?

Internally I need separate counters for activemap errors because they
are updated by a different thread, and I don't want to introduce
locking for error counter updates. As hastctl was supposed to show an
aggregated counter, I didn't bother much about making activemap update
errors count as write and flush errors. I improved this too in the
updated patch:

http://people.freebsd.org/~trociny/hast.stat_error.2.patch
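
As a rough sketch of the idea (not the code from the patch): each
thread increments only its own counters, and the activemap ones are
folded into the write and flush totals when the statistics are
reported. All names below are hypothetical, and the example is
single-threaded; it only shows the folding step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical internal counters, split by the thread that updates them. */
struct worker_errors {
	/* Incremented only by the local I/O thread. */
	uint64_t write_error;
	uint64_t flush_error;
	/* Incremented only by the activemap update thread. */
	uint64_t activemap_write_error;
	uint64_t activemap_flush_error;
};

/* Hypothetical totals as they would be reported to the user. */
struct reported_errors {
	uint64_t write_error;
	uint64_t flush_error;
};

/* Fold the activemap counters into the reported write/flush totals. */
static struct reported_errors
fold_activemap_errors(const struct worker_errors *we)
{
	struct reported_errors re;

	assert(we != NULL);
	re.write_error = we->write_error + we->activemap_write_error;
	re.flush_error = we->flush_error + we->activemap_flush_error;
	return (re);
}
```

Since no counter has more than one writer, the increments themselves
need no lock; only the reporting side has to add the pairs together.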

-- 
Mikolaj Golub