Hast locking up under 9.2

Pete French petefrench at ingresso.co.uk
Fri Nov 22 11:18:41 UTC 2013


> I remember already asking you which replication mode you were using and
> don't remember whether you answered. One of the significant changes is memsync
> mode, which is the default in 9.2 (it was fullsync in earlier versions).
> So if you are using default settings you can try switching to fullsync
> as a workaround.

Yes, I am using the default settings, so that is something I
can try. After three days of downtime last week I will not try it
in the immediate future though, for fear of my colleagues wanting
to strangle me :-) I will enable it on the test system however, and try it
on live in a couple of weeks if I can.
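
For reference, a minimal sketch of what the suggested workaround would look
like in /etc/hast.conf. The resource name, host names, device and addresses
below are placeholders, not my real configuration; only the "replication"
line is the actual change being discussed. As far as I understand
hast.conf(5), the keyword can also be set per resource rather than globally.

```
# /etc/hast.conf (fragment; all names and addresses are placeholders)
# Override the 9.2 default (memsync) for every resource.
replication fullsync

resource shared {
	on nodea {
		local /dev/da1
		remote 10.0.0.2
	}
	on nodeb {
		local /dev/da1
		remote 10.0.0.1
	}
}
```

The same setting would presumably need to go on both nodes, and hastd
has to reread its configuration before the change takes effect.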

> signal=6 means that hastd crashed because an assertion failed.
> Usually "Assertion failed ..." message precedes this line in the
> logs. Don't you see such a message? It might be very helpful.

Yes, I do actually!

"Assertion failed: (!hio->hio_done), function write_complete, file /usr/src/sbin/hastd/primary.c, line 1130."

> Do you always see this error when it gets stuck?

That I do not know, I am afraid - I was too busy getting the systems back online
to have time to try and reconcile the downtimes with what is in the logfiles.
It was only yesterday that I started trying to trace what might have
happened.

> Unfortunately the crash did not generate a core (due to capsicum). When
> I want to get a coredump I rebuild hastd with CFLAGS+=-DHAVE_CAPSICUM
> removed in Makefile (and with debugging symbols). There might be an
> easier method but I don't know.
>
> If you don't find the assertion message and the crashes are
> reproducible, it would be helpful to rebuild hastd with symbols and
> capsicum disabled to make it coredump and provide the backtrace.
>
> Also, when hastd gets stuck you can generate a core of the
> live process with gcore(1).

I didn't know about gcore - that's a very useful feature! The crash
is reproducible, but not on any machine that I could actually
crash without causing extensive downtime to the rest of the business,
unfortunately. I can't deliberately crash our master database and
it doesn't crash on the test setup we have. But what I can do is run it
live again with your suggested change to the config, and if it gets stuck
try to generate some more useful debugging information then.
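
A rough sketch of how I would expect to grab those cores when it next
gets stuck, assuming gcore(1) and pgrep(1) are available and it is run as
root. The function name and the output directory are my own choices for
illustration, nothing hastd-specific.

```shell
# Sketch: dump a core of every running hastd process without killing it.
# dump_hastd_cores is a hypothetical helper name; the default output
# directory /var/tmp is an assumption.
dump_hastd_cores() {
    outdir="${1:-/var/tmp}"
    for pid in $(pgrep hastd); do
        # FreeBSD gcore(1): -c names the core file, last argument is the pid
        gcore -c "${outdir}/hastd.${pid}.core" "${pid}"
    done
}
```

Calling "dump_hastd_cores /var/tmp" should leave one hastd.PID.core file
per process, which can then be loaded into gdb together with the hastd
binary to get a backtrace. Without debugging symbols the backtrace will
probably be of limited use.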

> What revision are you using? Recently there was a fix for crashes
> triggered by this failed assertion:
>
>  Assertion failed: (amp->am_memtab[ext] > 0), function
>  activemap_write_complete, file activemap.c, line 351.

I'm using r257795 - I did an upgrade to get the fix for the above assertion,
and in general I keep an eye on the commits, and anything involving hast
or zfs I take as soon as I can to try and improve stability.

Thanks for the help - if I get any more info I will let
you know, or if the above assertion helps you track something down
then I may be able to try some patches.

cheers,

-pete.


More information about the freebsd-stable mailing list