OS support for fault tolerance

Julian Elischer julian at freebsd.org
Tue Feb 21 08:23:02 UTC 2012

On 2/20/12 6:32 AM, Da Rock wrote:
> On 02/15/12 03:25, Brandon Falk wrote:
>> On 2/14/2012 12:05 PM, Jason Hellenthal wrote:
>>> On Tue, Feb 14, 2012 at 08:57:10AM -0800, Julian Elischer wrote:
>>>> On 2/14/12 6:23 AM, Maninya M wrote:
>>>>> For multicore desktop computers, suppose one of the cores fails, the
>>>>> FreeBSD OS crashes. My question is about how I can make the OS tolerate
>>>>> this hardware fault.
>>>>> The strategy is to checkpoint the state of each core at specific intervals
>>>>> of time in main memory. Once a core fails, its previous state is retrieved
>>>>> from the main memory, and the processes that were running on it are
>>>>> rescheduled on the remaining cores.
>>>>> I read that the OS tolerates faults in large servers. I need to make it do
>>>>> this for a Desktop OS. I assume I would have to change the scheduler
>>>>> program. I am using FreeBSD 9.0 on an Intel core i5 quad core machine.
>>>>> How do I go about doing this? What exactly do I need to save for the
>>>>> "state" of the core? What else do I need to know?
>>>>> I have absolutely no experience with kernel programming or with FreeBSD.
>>>>> Any pointers to good sources about modifying the source-code of FreeBSD
>>>>> would be greatly appreciated.
>>>> This question has always intrigued me, because I'm always amazed
>>>> that people actually try.
>>>> From my viewpoint, there's really not much you can do if the core
>>>> that is currently holding the scheduler lock fails.
>>>> And what do you mean by "fails"? Do you run constant diagnostics?
>>>> How do you tell when it has failed? It'd be hard to detect that "multiply"
>>>> has suddenly started giving bad results now and then.
>>>> If it just "stops" then you might be able to have a watchdog that
>>>> notices, but what do you do when it was halfway through rearranging
>>>> a list of items? First, you have to find out that it held
>>>> the lock for the module, and then you have to find out what it had
>>>> done and clean up the mess.
>>>> This requires rewriting many, many parts of the kernel to remove
>>>> "transient inconsistent states". And even then, what do you do if it
>>>> was halfway through manipulating some hardware?
>>>> And when you've figured that all out, how do you cope with the
>>>> mess it made because it was dying?
>>>> Say, for example, it had started calculating bad memory offsets
>>>> before writing out some stuff and had written data out over random memory?
>>>> But I'm interested in any answers people may have.
>>> How about core redundancy? Effectively this would cut the number of
>>> available cores in half if you spread a process to run on two cores at
>>> the same time, but with an option to adjust this per process etc... I
>>> don't see it as infeasible.
>> The overhead for all of the error checking and redundancy makes this idea pretty
>> impractical. You'd have to have 2 cores to do the exact same thing, then some
>> 'master' core that makes sure they're doing the right stuff, and if you really
>> want to think about it... what if the core monitoring the cores fails... there's
>> a threshold of when redundancy gets pointless.
> Make no mistake here, I'm not really up with the guts of what this
> would require (the dog may not hunt at all). Consider me as the
> little boy throwing rocks at a hornets' nest :)
> That out of the way, how about this scenario: why can't the master
> be dynamic amongst the cores? One core could be the master of any 2 cores
> (not itself).
> Another thought (probably more sci-fi than anything else) is about
> using the cores as individuals which work as a team and fire a weak
> team member that is failing.
> I have absolutely no idea how to accomplish this, but I thought it
> might fire a few neurons in someone who does... :)

There are so many reasons this would be ineffective on standard hardware
that I have no idea where to begin, but see my email above.
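
To make the watchdog point from that earlier email concrete, here is a
minimal Python sketch. It is a toy model only: Core, beat(),
stalled_cores(), the lock name "sched_lock" and the 1.0 timeout are all
made up for illustration and are not FreeBSD interfaces. It shows that a
heartbeat check can notice a core that has simply stopped, and that the
stopped core may still own a lock that nobody else can safely release.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical stand-in for a CPU core; not a FreeBSD structure."""
    core_id: int
    last_beat: float = 0.0                          # time of the last heartbeat
    held_locks: set = field(default_factory=set)    # locks owned right now


def beat(core, now):
    """Record that the core was alive at time `now`."""
    core.last_beat = now


def stalled_cores(cores, now, timeout):
    """Return the ids of cores whose last heartbeat is older than `timeout`."""
    return [c.core_id for c in cores if now - c.last_beat > timeout]


cores = [Core(0), Core(1), Core(2), Core(3)]
for c in cores:
    beat(c, now=10.0)

# Core 2 takes a lock and then stops; the other cores keep beating.
cores[2].held_locks.add("sched_lock")
for c in cores:
    if c.core_id != 2:
        beat(c, now=11.0)

dead = stalled_cores(cores, now=11.5, timeout=1.0)
orphaned = {c.core_id: c.held_locks for c in cores
            if c.core_id in dead and c.held_locks}

print(dead)      # ids of cores that missed their heartbeat
print(orphaned)  # locks still owned by the stalled cores
```

Note what the sketch does not do: it says nothing about what state the
data behind "sched_lock" was left in, and a core that keeps beating while
computing wrong answers is never flagged at all.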

>> Perhaps I'm missing out on something, but you can't check the checker (without
>> infinite redundancy).
>> Honestly, if you're worried about a core failing, please take your server
>> cluster out of the 1000 deg C forge.
>> -Brandon
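
For what it's worth, the "check the checker" problem can be seen in a few
lines of Python. This is only a sketch under assumptions: compare_pair()
and majority_vote() are hypothetical helpers, and the "faulty multiply"
is simulated by adding 1 to the right answer. Two copies can tell you
that something is wrong but not which copy is wrong; three copies can
outvote a single bad one; and in both cases the code doing the comparing
is itself trusted blindly.

```python
from collections import Counter


def compare_pair(a, b):
    """Two redundant results: detects a mismatch, cannot say which is wrong."""
    return a if a == b else None


def majority_vote(results):
    """Return the value a strict majority agrees on, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


good = 6 * 7     # result from a healthy core
bad = good + 1   # simulated result from a core whose multiply went wrong

pair = compare_pair(good, bad)                   # mismatch: no usable answer
voted = majority_vote([good, bad, good])         # one bad copy is outvoted
split = majority_vote([good, bad, bad + 1])      # no majority: no usable answer

print(pair, voted, split)
```

Whatever runs compare_pair() or majority_vote() is the "master" from the
quoted text: if that core is the one that fails, none of the above helps.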