can the hardware watchdog reboot a hung kernel?

Daniel Braniss danny at cs.huji.ac.il
Sat Nov 16 09:09:56 UTC 2019



> On 15 Nov 2019, at 19:11, Ian Lepore <ian at freebsd.org> wrote:
> 
> On Fri, 2019-11-15 at 18:58 +0200, Daniel Braniss wrote:
>>> On 14 Nov 2019, at 20:19, Ian Lepore <ian at freebsd.org> wrote:
>>> 
>>> 
> [...]
>>> 
>>> One thing to be careful of here is multicore systems.  If you have a
>>> critical app running on a multicore system, that app can hang (maybe
>>> it tries to read from a device that has malfunctioned and essentially
>>> gets hung forever in a device driver that doesn't implement timeouts
>>> very well or something).  In that case, only one core is hung, so
>>> watchdogd will be able to keep petting the dog to prevent a reboot,
>>> but since your app is hung on a different core, you aren't really
>>> getting the protection you need.
>>> 
>>> The fix for that is to either turn your app into watchdogd (have it
>>> make the periodic ioctl() calls to pet the dog), or use the '-e cmd'
>>> option with watchdogd, and make 'cmd' be a script that somehow
>>> verifies that your critical application is still running properly.
>>> 
>>> -- Ian
>> 
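A minimal sketch of the kind of check the '-e cmd' suggestion above describes, assuming the critical app touches a heartbeat file on every pass through its main loop. The path /var/run/myapp.heartbeat, the 10 second limit and the function name are all hypothetical; nothing in this thread defines them. The idea is that the command only succeeds while the app is demonstrably alive, so watchdogd stops petting the dog once the app stalls.

```python
import os
import time

# Hypothetical values: the app is assumed to touch this file regularly.
HEARTBEAT_PATH = "/var/run/myapp.heartbeat"
MAX_AGE_SEC = 10.0


def heartbeat_is_fresh(path, max_age, now=None):
    """Return True if `path` exists and was modified within `max_age` seconds."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # A missing or unreadable heartbeat file counts as "app not healthy".
        return False
    if now is None:
        now = time.time()
    return (now - mtime) <= max_age


def main():
    """Exit status for the check: 0 means healthy, 1 means stale or missing."""
    return 0 if heartbeat_is_fresh(HEARTBEAT_PATH, MAX_AGE_SEC) else 1
```

To use this as the 'cmd', the script would end with `raise SystemExit(main())` so that its exit status reflects the check. Note that this only helps when the app is hung and the rest of the system is still running; it does nothing for a fully hung kernel.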
>> in my case the kernel is hung, probably by my app, which is using 2
>> i2c devices. BTW, this does not happen very often,
>> maybe once a month, but it is annoying.
>> 
>> now the watchdog stuff:
>> 1- the Allwinner/NanoPi NEO can only handle up to 8 sec timeout (the
>> next is 16 sec (2^34))
>>    watchdogd complains if >8sec:
>> 	aw_wdog0: Can't arm, timeout is more than 16 sec
>>   and continues trying - IMHO it should exit.
>> 
> 
> This basically comes down to "know your hardware and don't ask for
> things it can't do".  There is a lot of variance in watchdog hardware,
> and unfortunately our watchdog software interface is kinda braindead. 
> It uses a power-of-2 timeout which is great if you need a large variety
> of subsecond timeouts ranging from a few nanoseconds to a half second. 
> But it's absolutely horrible for what the real world usually wants: 
> some medium-sized integer number of seconds.  Your choices are pretty
> much just 8, 16, 32, 64, 128.  Lots of hardware maxes at 16 or 32
> seconds.
> 
> If aw maxes at 16 it's probably best to set it for that, with petting
> at either 4 or 8 second intervals.
> 
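To make the power-of-2 arithmetic above concrete, here is a small sketch, assuming the interface expresses the timeout as 2^n nanoseconds (which matches the "2^34" for the 16 sec step mentioned earlier). The function names are made up for illustration and are not part of any FreeBSD API.

```python
NS_PER_SEC = 1_000_000_000


def timeout_seconds(exponent):
    """Duration in seconds of a timeout of 2**exponent nanoseconds."""
    return (2 ** exponent) / NS_PER_SEC


def smallest_exponent_covering(seconds):
    """Smallest n such that 2**n nanoseconds is at least `seconds`."""
    n = 0
    while timeout_seconds(n) < seconds:
        n += 1
    return n


if __name__ == "__main__":
    # Print the steps around the "medium-sized number of seconds" range.
    for n in range(30, 38):
        print(f"2^{n} ns = {timeout_seconds(n):.3f} s")
```

Under that assumption the nominal "8" and "16" second steps are really about 8.59 and 17.18 seconds, and anything between them, say a 10 second request, has to round up to the 2^34 step.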
>> 2- this is a bit more annoying:
>> 	entering the debugger will trigger the timeout and it will then
>> perform a clean reboot (*)
> 
> In the debugger, enter "watchdog" without any parameter to disable the
> watchdog.  (Or give a parameter to change the timeout.)
> 
> Some watchdog hardware cannot be disabled once you've enabled it.
> 
>> 	doing a shutdown -r leaves the watchdog in some weird state so
>> the reboot hangs when starting the watchdog
>> 	  win some, lose some :-)
>> 
> 
> This is likely another flavor of "some watchdog hardware cannot be
> disabled".  But it might just be a bug in the aw watchdog driver too.
> 
> -- Ian
> 
> 

I have a workaround:
start watchdogd by hand (not via rc.conf); then shutdown does not stop the watchdog, and all is ok.
I guess there must be some bug in the reset logic in aw_wdog.c

danny



