OS support for fault tolerance
julian at freebsd.org
Tue Feb 21 08:23:02 UTC 2012
On 2/20/12 6:32 AM, Da Rock wrote:
> On 02/15/12 03:25, Brandon Falk wrote:
>> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>>> For multicore desktop computers, suppose one of the cores fails and the
>>>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>>>> this hardware fault.
>>>>> The strategy is to checkpoint the state of each core at specific intervals
>>>>> of time in main memory. Once a core fails, its previous state is restored
>>>>> from main memory, and the processes that were running on it are
>>>>> rescheduled on the remaining cores.
>>>>> I read that the OS tolerates faults in large servers. I need to make it do
>>>>> this for a desktop OS. I assume I would have to change the
>>>>> program. I am using FreeBSD 9.0 on an Intel Core i5 quad-core processor.
>>>>> How do I go about doing this? What exactly do I need to save for the
>>>>> "state" of the core? What else do I need to know?
>>>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>>>> Any pointers to good sources about modifying the source code of FreeBSD
>>>>> would be greatly appreciated.
>>>> This question has always intrigued me, because I'm always amazed
>>>> that people actually try.
>>>> From my viewpoint, there's really not much you can do if the core
>>>> that is currently holding the scheduler lock fails.
>>>> And what do you mean by "fails"? Do you run constant diagnostics?
>>>> How do you tell when it has failed? It'd be hard to detect that it
>>>> has suddenly started giving bad results now and then.
>>>> If it just "stops" then you might be able to have a watchdog that
>>>> notices, but what do you do when it was halfway through
>>>> a list of items? First, you have to find out that it held
>>>> the lock for the module, and then you have to find out what it had
>>>> done and clean up the mess.
>>>> This requires rewriting many, many parts of the kernel to remove
>>>> "transient inconsistent states". And even then, what do you do if it
>>>> was halfway through manipulating some hardware?
>>>> And when you've figured that all out, how do you cope with the
>>>> mess it made because it was dying?
>>>> Say, for example, it had started calculating bad memory offsets
>>>> before writing out some stuff and written data out over random memory.
>>>> But I'm interested in any answers people may have.
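The watchdog idea mentioned above can be sketched in user space. This is only an illustration, not kernel code and not anything FreeBSD provides: the `Watchdog` class, the `max_missed` threshold, and the per-core heartbeat counters are all hypothetical names. It assumes each core bumps a counter periodically and that a monitor samples those counters.

```python
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Flags a core as stalled when its heartbeat counter stops advancing."""

    max_missed: int = 3                             # hypothetical threshold
    last_seen: dict = field(default_factory=dict)   # core id -> last counter value
    missed: dict = field(default_factory=dict)      # core id -> consecutive stale samples

    def check(self, heartbeats):
        """Take a snapshot {core_id: counter} and return the set of stalled cores."""
        stalled = set()
        for core, count in heartbeats.items():
            if self.last_seen.get(core) == count:
                # Counter did not move since the previous sample.
                self.missed[core] = self.missed.get(core, 0) + 1
            else:
                self.missed[core] = 0
            self.last_seen[core] = count
            if self.missed[core] >= self.max_missed:
                stalled.add(core)
        return stalled
```

A sketch like this only covers the "it just stops" case. It says nothing about which locks the stalled core held, what state it left half-updated, or whether a core that is still ticking is producing wrong results, which are the harder problems raised in the message.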
>>> How about core redundancy? Effectively this would cut the number of
>>> available cores in half if you spread a process to run on two cores at
>>> the same time, but with an option to adjust this per process, etc. I
>>> don't see it as infeasible.
>> The overhead for all of the error checking and redundancy makes this idea
>> pretty impractical. You'd have to have 2 cores do the exact same thing,
>> then some 'master' core that makes sure they're doing the right stuff, and
>> if you really want to think about it... what if the core monitoring the
>> cores fails? There's a threshold at which redundancy gets pointless.
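To make the redundancy cost concrete, here is a minimal sketch of the comparison step a 'master' would perform. The `vote` function is a hypothetical helper, not an existing API, and it assumes each replica hands back a hashable result and that the list is non-empty.

```python
from collections import Counter


def vote(results):
    """Return (value, True) if a strict majority of replicas agree, else (None, False)."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value, True
    return None, False
```

With two replicas, `vote([42, 41])` can only report a disagreement; it cannot say which core was wrong. Picking a winner takes at least three replicas, as in `vote([42, 41, 42])`, so the price is two extra cores per unit of work rather than one. The code that runs `vote` is itself unreplicated in this sketch, which is the "core monitoring the cores" problem.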
> Make no mistake here, I'm not really up with the guts of what this
> would require (the dog may not hunt at all). Consider me the
> little boy throwing rocks at a hornets' nest :)
> That out of the way, how about this scenario: why can't the master
> be dynamic amongst the cores? One core would be the master of any two
> cores (not itself).
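One possible reading of the "dynamic master" scenario is a ring in which every core monitors the next two cores and never itself. The sketch below is an assumption about what such an assignment could look like; `checker_ring` is a hypothetical name and the ring layout is not something proposed in the thread.

```python
def checker_ring(n_cores):
    """Map each core to the two cores it monitors, never itself (needs n_cores >= 3)."""
    if n_cores < 3:
        raise ValueError("need at least 3 cores so no core monitors itself")
    return {c: ((c + 1) % n_cores, (c + 2) % n_cores) for c in range(n_cores)}
```

On a quad-core machine, `checker_ring(4)` has core 0 watching cores 1 and 2, core 1 watching 2 and 3, and so on, so every core is watched by two others. This only distributes the monitoring role; it does not address how a monitor would recover the state left behind by a failed core.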
> Another thought (probably more sci-fi than anything else) is about
> using the cores as individuals which work as a team and fire a weak
> team member that is failing.
> I have absolutely no idea how to accomplish this, but I thought it
> might fire a few neurons in someone who does... :)
There are so many reasons this would be ineffective on standard hardware
that I have no idea where to begin, but see my email above.
>> Perhaps I'm missing out on something, but you can't check the checker
>> (without infinite redundancy).
>> Honestly, if you're worried about a core failing, please take your
>> cluster out of the 1000 deg C forge.
> freebsd-hackers at freebsd.org mailing list