RFC: enhanced watchdog.

Ian Lepore ian at FreeBSD.org
Mon Jan 21 04:38:11 UTC 2013


On Fri, 2013-01-18 at 22:29 -0800, Alfred Perlstein wrote:
> We at iX are trying to enhance the watchdog and we think some of the 
> changes may benefit the community as a whole.
> 
> Basically we want to make it easy for developers to prototype watchdog 
> scripts in a "test-only" mode that basically logs if the watchdog had 
> failed.
> 
> I have most of the code done, but could really use help on three things:
> 
> 1) review
> 2) suggestion for inserting the warning messages from the userland 
> watchdogd into the kernel message buffer.
> 3) suggestion for logging/warning of pending death.
> 
> In detail:
> 1) The reason for review should be obvious, we want to make sure that 
> this works for everyone.
> 2) The reason for inserting messages into the kernel log is because that 
> is the easiest place for us to recover the diagnostics when we do have a 
> crash due to watchdog.  Maybe there is a smarter thing to do?

I've recently wished for a way that a sufficiently-credentialed userland
process could, in effect, kernel-printf.  I've been burned a number of
times by init(8) failing to start up for various reasons such as
no /dev, and it has no way to say what's wrong.  It's surprisingly hard
to figure out what the problem is.

For your need, a possibility I guess would be to have the watchdog device
do it for you, since you're already talking to it.  Who knows, maybe
some special watchdog hardware would be able to do something useful with
a short message.  I've worked with hardware that has a few registers
designed to survive a reboot, for communicating with your reincarnated
self; nothing big enough for arbitrary strings yet, but hardware just
keeps getting cooler all the time.

> 3) What is a good way to warn of impending death?  I was thinking of just 
> another thread in the process that would be signalled before the 
> watchdog script was run and would log when the timer is about to expire 
> or based on a configurable threshold.
> 

SIGALRM that fires shortly before death?

> 
> Finally, there is some thought about adding a kernel daemon to the 
> watchdog facility that would allow us to strobe watchdogs with low max 
> values while our userland watchdog was polling the system.
> 
> Why??? Well because the ICH driver has a max timeout of ~2 minutes.  We 
> really want to be able to leverage this watchdog, but also go higher 
> than this.  The way to do this is to drive the system almost like a step 
> up electrical relay.
> 

I very much like this.  A new ARM SoC I'm about to start working with
has a max 16 second watchdog, and I'm afraid things like firmware
updaters might lock out userland for longer than that on such a wimpy
chip.
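
The relay idea can be sketched as plain bookkeeping, leaving out the
hardware entirely.  This is an illustration under assumptions, not the
proposed kernel code: struct relay_wd and the relay_* functions are
hypothetical names, time is counted in whole seconds, and the spot where
a real driver would re-arm the hardware timer is only a comment.

```c
#include <assert.h>
#include <stdbool.h>

struct relay_wd {
	unsigned int hw_max;	 /* longest timeout the hardware accepts */
	unsigned int sw_timeout; /* longer timeout that userland asked for */
	unsigned int since_pat;	 /* seconds since userland last patted */
	unsigned int hw_pats;	 /* hardware pats issued, for illustration */
};

static void
relay_init(struct relay_wd *wd, unsigned int hw_max, unsigned int sw_timeout)
{
	assert(hw_max > 0 && sw_timeout > 0);
	wd->hw_max = hw_max;
	wd->sw_timeout = sw_timeout;
	wd->since_pat = 0;
	wd->hw_pats = 0;
}

/* Userland checked in: restart the long software timeout. */
static void
relay_user_pat(struct relay_wd *wd)
{
	wd->since_pat = 0;
}

/*
 * Called by the kernel-side strober every `elapsed` seconds, where
 * `elapsed` must be shorter than hw_max.  Returns true if the hardware
 * was patted.  Returns false once the software timeout has passed; the
 * hardware is then left alone so that it expires and resets the box.
 */
static bool
relay_tick(struct relay_wd *wd, unsigned int elapsed)
{
	assert(elapsed < wd->hw_max);
	wd->since_pat += elapsed;
	if (wd->since_pat >= wd->sw_timeout)
		return (false);
	wd->hw_pats++;		/* a real driver re-arms the timer here */
	return (true);
}
```

With hw_max = 120 and sw_timeout = 600, ticking every 60 seconds pats
the hardware nine times and then stops on the tenth tick.  The reset
itself arrives only after the hardware's own timeout also runs out, so
the effective deadline is somewhat longer than sw_timeout.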

> [... code ...]

I skimmed through the code, but it's been a long day of reading code for
me, so I'm not gonna pretend it was a thorough review.  The main thing
that popped out at me was 'carp'.  Shouldn't a watchdog bark? :)

I'm also curious why you chose CLOCK_UPTIME_FAST, which I'm not familiar
with (gonna be reading a manpage in a minute).  Not knowing about some
of the newer choices, I probably would've used CLOCK_MONOTONIC.

-- Ian




More information about the freebsd-arch mailing list