panic on one cpu leaves others running...

Thu Apr 8 15:22:10 PDT 2004

On Thu, Apr 08, 2004 at 04:27:43PM +0200, Bernd Walter wrote:
>On Thu, Apr 08, 2004 at 09:44:41PM +1000, Peter Jeremy wrote:
>> >  A panic usually means that
>> >something unrecoverable happened, and that continuing on is not safe.
>> 
>> I realise that.  Hence actually being able to continue after a panic
>> would be extremely difficult to do safely.  (Probably not possible in
>> general, though it might be in some special cases).
>
>If it's safe to continue then there's no need to panic at all.
>Just stopping the faulting parts would be enough in that case.

Except FreeBSD (and most Unices) don't do this in general.

I was thinking of hardware failures - if a CPU fails and it wasn't
holding any locks then it would seem feasible to just abort the
thread/process that was using the CPU and limp along on the remaining
CPU(s).
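
To make the CPU-failure case concrete, here is a minimal sketch of the
decision in C.  Every name in it (struct failed_cpu, cpu_failure_policy,
the enum values) is hypothetical and made up for illustration; none of
it is an existing FreeBSD interface, and a real kernel would have to
check far more state than a lock count.

```c
#include <stdbool.h>

/* Hypothetical outcomes; these are not FreeBSD kernel identifiers. */
enum cpu_fail_action {
    CPU_FAIL_ABORT_THREAD,  /* kill the running thread, offline the CPU */
    CPU_FAIL_PANIC          /* state may be inconsistent: give up */
};

/* Hypothetical snapshot of a CPU at the moment it failed. */
struct failed_cpu {
    int locks_held;      /* locks owned by the thread on that CPU */
    int cpus_remaining;  /* healthy CPUs left after this one is removed */
};

/*
 * Continue only if the failed CPU held no locks (so no shared data was
 * left half-updated) and at least one other CPU is left to run on.
 */
static enum cpu_fail_action
cpu_failure_policy(const struct failed_cpu *cpu)
{
    if (cpu->locks_held > 0)
        return CPU_FAIL_PANIC;
    if (cpu->cpus_remaining < 1)
        return CPU_FAIL_PANIC;
    return CPU_FAIL_ABORT_THREAD;
}
```

With locks_held = 0 and cpus_remaining = 1 this returns
CPU_FAIL_ABORT_THREAD; any held lock, or no remaining CPU, falls back
to a panic.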

Likewise an unrecoverable memory error in a clean page should (in most
cases) be recoverable by marking that page unusable and loading
another copy of the data into another page.  (Obviously this is
problematic if the page in question is part of the kernel VM subsystem
or the device driver for the relevant backing store).  Even a dirty
page may be recoverable by aborting the affected process or treating
the error similarly to an I/O error on a filesystem.
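
The memory-error cases above amount to a small decision table, which
can be sketched in C as follows.  The types and names (struct bad_page,
memory_error_policy, the enum values) are assumptions invented for this
example and do not correspond to anything in the FreeBSD VM code.

```c
#include <stdbool.h>

/* Hypothetical classification of who owns the damaged page. */
enum page_owner {
    PAGE_OWNER_USER,            /* ordinary process data or file cache */
    PAGE_OWNER_KERNEL_CRITICAL  /* kernel VM or backing-store driver */
};

/* Hypothetical recovery actions. */
enum mem_recovery {
    MEM_RELOAD_ELSEWHERE,  /* retire the page, re-read a clean copy */
    MEM_ABORT_PROCESS,     /* data is lost: kill or signal the owner */
    MEM_PANIC              /* cannot recover safely */
};

struct bad_page {
    enum page_owner owner;
    bool dirty;  /* true if the only up-to-date copy was in this page */
};

static enum mem_recovery
memory_error_policy(const struct bad_page *pg)
{
    /* The code needed to perform the recovery is itself damaged. */
    if (pg->owner == PAGE_OWNER_KERNEL_CRITICAL)
        return MEM_PANIC;
    /* Clean page: an intact copy still exists on backing store. */
    if (!pg->dirty)
        return MEM_RELOAD_ELSEWHERE;
    /* Dirty page: treat it like an I/O error and abort the owner. */
    return MEM_ABORT_PROCESS;
}
```

A clean user page maps to MEM_RELOAD_ELSEWHERE, a dirty user page to
MEM_ABORT_PROCESS, and anything the kernel needs in order to carry out
the recovery itself maps to MEM_PANIC.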

The marketing spin from at least one vendor suggests that their
high-end systems can manage this sort of fault recovery.  I'm not sure
whether this is an area that FreeBSD should aspire to - I suspect that
the effort needed to implement and test this would not be justified by
the small size of the additional potential market.

Peter