Newsgroups: comp.parallel.pvm,comp.lang.perl.modules
From: gowen@forte.cs.tufts.edu (Gregory Owen)
Subject: Parallel::Pvm, pvm_notify, and fault tolerant programming
Organization: Tufts University Department of Computer Science
Date: 24 Mar 1997 16:22:16 GMT
Message-ID: <GOWEN.97Mar24112216@forte.cs.tufts.edu>


        I am writing an application using the Parallel::Pvm module for
Perl from http://www.nsrc.nus.sg/STAFF/edward/Pvm.html, and I'm having
difficulty adding fault tolerance.  In particular, I can't unpack the
message that is generated automatically when a host is rebooted after
I have registered a task on that host with Parallel::Pvm::notify
(pvm_notify) using the PvmTaskExit flag.

        I am using Parallel::Pvm version 1.1 with Perl 5.003 on
SunOS 4.1.4 (Solaris 1.1.2).  All machines used in my VM are Sparcs
running SunOS 4.

        What I am trying to do is to receive notification when a host
fails (reboot, network error, power outage, etc.) so that I can return
all the jobs allocated to tasks on that host to the "yet to be done"
queue.  For each host upon which I will have a task, I have called
Parallel::Pvm::notify with PvmTaskExit, a message identifier, and the
tid of the task for that host.  When I reboot a machine, a message is
indeed generated; I receive it with Parallel::Pvm::recv and then try
to unpack the message, and get an error (code and log below).
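
        A minimal sketch of the requeue step described above, in plain
Perl with no PVM calls.  The names %jobs_by_tid, @todo, and
requeue_jobs are hypothetical, chosen only for illustration; the tid
passed in is assumed to have been obtained from the notify message
once it can be unpacked.

```perl
use strict;
use warnings;

# Hypothetical bookkeeping: jobs currently assigned to each task id.
my %jobs_by_tid = (
    0x40003 => ['job-a', 'job-b'],
    0x80001 => ['job-c'],
);
my @todo = ('job-d');   # the "yet to be done" queue

# Move every job owned by the exited task back onto the queue,
# forget the task, and return how many jobs were moved.
sub requeue_jobs {
    my ($dead_tid, $jobs, $queue) = @_;
    my $lost = delete $jobs->{$dead_tid} or return 0;
    push @$queue, @$lost;
    return scalar @$lost;
}

my $dead_tid = 0x40003;   # assumed to come from the notify message
my $moved = requeue_jobs($dead_tid, \%jobs_by_tid, \@todo);
print "requeued $moved job(s); queue is now: @todo\n";
```

Calling requeue_jobs for a tid that has no recorded jobs returns 0 and
leaves the queue untouched, so a duplicate notification would be
harmless in this sketch.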

        There is a Parallel::Pvm::recv_notify function in Perl PVM
1.1, but there is little documentation for it, and no mention of it
anywhere in the PVM 3.3 distribution (source or docs).  One of my
questions is, could it be that only recv_notify can be used to receive
a notification message, even though the Parallel::Pvm docs don't
suggest that anywhere?

        My program at this point basically does a loop where
Parallel::Pvm::recv is the loop test -- in other words, it loops
forever, each iteration handling one received message.  As a result, it is
not easy to selectively call recv_notify versus recv.  I could call
Parallel::Pvm::probe to see if the next message is a notify message,
and call the appropriate recv, but my understanding is that probe is
non-blocking, so I'd have to do a busy wait until a message came in
(and I'd rather save resources by doing a blocking wait).

        Any help, either answers to these questions or directions to a
better forum for my question, would be greatly appreciated.

        Here is a relevant code fragment and a log of the problem:

The start of my program's loop:

while (($ginfo = Parallel::Pvm::recv) >= 0) {
    my ($bufid) = Parallel::Pvm::getrbuf;       # debug...
    my ($info,$bytes,$tag,$tid) = Parallel::Pvm::bufinfo($bufid); # debug...
    printf STDERR "Got mesg with tag $tag, $bytes bytes, tid t%x\n\n",
        $tid;
    if ($tag == 999) {   # 999 is the message tag we passed to pvm_notify
        print STDERR "got a tag 999\n";
        Parallel::Pvm::unpack;  # This causes an error, see log below
        print STDERR "unpacked\n";
        next;
    }
    # Non-notify messages carry the sender's tid and a command
    my ($sender, $command) = Parallel::Pvm::unpack;
    printf STDERR    "received command $command from tid t%x\n", $sender;
    # ...go off and do whatever the message requested
}


A log of stderr when I reboot a host that's in the VM:

...
Got mesg with tag 999, 4 bytes, tid t80000000

got a tag 999
libpvm [t40002]: pvm_upkstr(): End of buffer
unpacked


And the matching pvml.uid log:

[t80040000] beowulf (13.246.76.230:2411) SUN4 3.3.11
[t80040000] ready Mon Mar 24 10:33:49 1997
[t80040000] netinput() FIN|ACK from adrenaline
[t80040000]  hd_dump() ref 1 t100000 n "adrenaline" a "" ar "SUN4"
[t80040000]            lo "" so "" dx "" ep "" bx "" wd "" sp 1000
[t80040000]            sa 13.246.76.61:1081 mtu 4096 f 0x0 e 0 txq 0
[t80040000]            tx 6 rx 3 rtt 0.252347


Greg Owen { gowen@cs.tufts.edu,@xis.xerox.com } http://www.cs.tufts.edu/~gowen/
 1.01 GCS/GO d++ p+ c++ l++ u++ e+ -m+ s++/- n- h !(f)? g+ -w+ t+ r-- y?
 "I want to permeate the air you breathe/slide my way under your skin/place 
myself behind your eyes/and watch you, watch me, looking in." Katell Keineg