From: Clark Dorman <clark@s3i.com>
Newsgroups: comp.parallel.pvm
Subject: Re: Fault tolerance
Date: 30 Dec 1998 08:16:49 -0500
Organization: PSINet
Message-Id: <d90fpkbwe.fsf@s3i.com>
References: <76b019$1d7$1@news.flashnet.it>
Cc: f.piantone@flashnet.it


"Faber" <f.piantone@flashnet.it> writes:
> I'd like to know more infos about:
> Fault tolerance in a distributed system under PVM and Linux386 .
> Reference and other.

I cannot give you a reference because I do not have one.  In my
experience, though, a PVM system can be made fault tolerant on every
host except the one running the master daemon.  If any other machine
or process in the parallel machine dies, the remaining processes can
eventually notice, recover from the error, and restart the lost
process.  My distributed system does this: the main program
periodically listens for messages that basically say "I'm still
alive".  If it does not get such a message from a process, it goes
looking for that process.  If the process no longer exists, or the
machine's daemon is gone, the main program starts the process over.
Note that you have to do this yourself.  PVM is a low-level system,
and while it gives you great flexibility, it also requires you to do
things like checking for faults yourself.
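
To make the "I'm still alive" scheme above concrete, here is a rough
sketch of the bookkeeping only, in Python rather than PVM's C, and
not the code from my system.  All names (HeartbeatMonitor,
check_and_recover, is_alive, restart) are made up for illustration.
In a real PVM program the is_alive and restart callbacks would
presumably wrap calls such as pvm_pstat, pvm_mstat and pvm_spawn;
that part is an assumption and is left out here.

```python
import time


class HeartbeatMonitor:
    """Remembers when each worker last said "I'm still alive"."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a worker is suspect
        self.clock = clock       # injectable so the logic can be tested
        self.last_seen = {}      # worker id -> time of last heartbeat

    def register(self, worker):
        """Start tracking a worker, counting from now."""
        self.last_seen[worker] = self.clock()

    def heartbeat(self, worker):
        """Record that a heartbeat message arrived from this worker."""
        self.last_seen[worker] = self.clock()

    def suspects(self):
        """Return the workers that have been silent longer than the timeout."""
        now = self.clock()
        return sorted(w for w, t in self.last_seen.items()
                      if now - t > self.timeout)


def check_and_recover(monitor, is_alive, restart):
    """Go and look for each silent worker; restart it if it is gone.

    is_alive(worker) -> bool and restart(worker) -> new worker id are
    hypothetical callbacks supplied by the caller.
    """
    restarted = []
    for worker in monitor.suspects():
        if is_alive(worker):
            # Slow but still present: give it a fresh grace period.
            monitor.heartbeat(worker)
        else:
            # Process or its daemon is gone: start a replacement.
            new_id = restart(worker)
            del monitor.last_seen[worker]
            monitor.register(new_id)
            restarted.append((worker, new_id))
    return restarted
```

The main program would call heartbeat() whenever an "alive" message
arrives and call check_and_recover() every so often.  The timeout has
to be comfortably longer than the interval at which workers send
their messages, or busy workers will be mistaken for dead ones.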

For the master daemon, however, the system is not fault tolerant at
all.  If the master daemon dies, or if the host running it crashes,
the rest of the daemons and processes are lost.  I do not know how to
solve this problem, because it seems to be inherent in the design of
the system.  I hope that Harness will have a system organization that
corrects this.

-- 
Clark

