OS support for fault tolerance
Julian Elischer
julian at freebsd.org
Wed Feb 15 05:41:11 UTC 2012
On 2/14/12 3:51 PM, Jan Mikkelsen wrote:
>
> Coming back to the multicore issue:
>
> The problem when a core fails is that it has affected more than its own state. It will be holding locks on shared resources and may have corrupted shared memory or asked a device to do the wrong thing. By the time you detect a fault in a core, it is too late. Checkpointing to main memory means that you need to be able to roll back to a checkpoint and replay the operations you know about. That involves more than CPU core state; it also includes process, file, and device state.
>
I think that's more or less what I was saying, but with more concrete
examples.
And yes, I remember the Tandem boxes from computer shows in Perth and
Sydney, but never saw one in the field.
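
For illustration only, the checkpoint, roll back, and replay idea from the
quoted text can be sketched as a toy in Python. This is not FreeBSD code and
not anyone's actual design: the class CheckpointedState, its methods, and the
"blocks_written" counter are all hypothetical names. It models only in-memory
state and deliberately ignores locks, file state, and device state, which the
quoted message points out are the hard part.

```python
import copy


class CheckpointedState:
    """Toy model (hypothetical): in-memory state with checkpoint, rollback, replay."""

    def __init__(self, initial):
        self.state = dict(initial)
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []  # operations known to be good since the last checkpoint

    def apply(self, op, validated=True):
        # op is a (key, delta) pair. Only validated operations are logged,
        # so an update from a faulty core is never replayed.
        key, delta = op
        self.state[key] = self.state.get(key, 0) + delta
        if validated:
            self._log.append(op)

    def checkpoint(self):
        # Snapshot the current state and start a fresh operation log.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        # Roll back to the last checkpoint, then replay the logged operations.
        self.state = copy.deepcopy(self._checkpoint)
        for key, delta in self._log:
            self.state[key] = self.state.get(key, 0) + delta


def demo():
    s = CheckpointedState({"blocks_written": 0})
    s.apply(("blocks_written", 4))
    s.checkpoint()
    s.apply(("blocks_written", 2))                      # good update, logged
    s.apply(("blocks_written", 1000), validated=False)  # stands in for a faulty core's write
    s.recover()
    return s.state
```

Calling demo() rolls the counter back to the checkpointed value and replays
only the logged update, so the unlogged write is discarded. Extending this to
state held by devices or the filesystem is where it stops being simple.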