OS support for fault tolerance

Julian Elischer julian at freebsd.org
Wed Feb 15 05:41:11 UTC 2012


On 2/14/12 3:51 PM, Jan Mikkelsen wrote:
>
> Coming back to the multicore issue:
>
> The problem when a core fails is that it has affected more than its own state. It will be holding locks on shared resources and may have corrupted shared memory or asked a device to do the wrong thing. By the time you detect a fault in a core, it is too late. Checkpointing to main memory means that you need to be able to roll back to a checkpoint, and replay operations you know about. That involves more than CPU core state; it includes process, file, and device state.
>
I think that's more or less what I was saying, but with more concrete
examples.
And yes, I remember the Tandem boxes from computer shows in Perth and
Sydney, but never saw one in the field.



More information about the freebsd-hackers mailing list