Newsgroups: comp.parallel.pvm,comp.lang.perl.modules
From: gowen@forte.cs.tufts.edu (Gregory Owen)
Subject: Parallel::Pvm, pvm_notify, and fault tolerant programming
Organization: Tufts University Department of Computer Science
Date: 24 Mar 1997 16:22:16 GMT
Message-ID:

I am writing an application using the Parallel::Pvm module for Perl
from http://www.nsrc.nus.sg/STAFF/edward/Pvm.html, and I'm having
difficulty adding fault tolerance. In particular, I'm having problems
unpacking the message that is automatically generated when a host is
rebooted, after a task on that host has been registered with
Parallel::Pvm::notify (pvm_notify) using the PvmTaskExit flag.

I am using Parallel::Pvm version 1.1 under Perl 5.003 under SunOS 4.1.4
(Solaris 1.1.2). All machines used in my VM are Sparcs running SunOS 4.

What I am trying to do is receive notification when a host fails
(reboot, network error, power outage, etc.) so that I can return all
the jobs allocated to tasks on that host to the "yet to be done" queue.
For each host on which I will have a task, I have called
Parallel::Pvm::notify with PvmTaskExit, a message identifier, and the
tid of the task for that host.

When I reboot a machine, a message is indeed generated; I receive it
with Parallel::Pvm::recv, then try to unpack the message, and get an
error (code and log below).

There is a Parallel::Pvm::recv_notify function in Perl PVM 1.1, but
there is little documentation for it, and no mention of it anywhere in
the PVM 3.3 distribution (source or docs). One of my questions is:
could it be that only recv_notify can be used to receive a notification
message, even though the Parallel::Pvm docs don't suggest that
anywhere?

My program at this point is basically a loop in which
Parallel::Pvm::recv is the loop test; in other words, it loops forever,
each iteration handling one received message. As a result, it is not
easy to selectively call recv_notify versus recv.
I could call Parallel::Pvm::probe to see if the next message is a
notify message, and call the appropriate recv, but my understanding is
that probe is non-blocking, so I'd have to busy-wait until a message
came in (and I'd rather save resources by doing a blocking wait).

Any help, either answers to these questions or directions to a better
forum for my question, would be greatly appreciated.

Here is a relevant code fragment and a log of the problem:

The start of my program's loop:

    while (($ginfo = Parallel::Pvm::recv) >= 0) {
        my ($bufid) = Parallel::Pvm::getrbuf;                          # debug...
        my ($info,$bytes,$tag,$tid) = Parallel::Pvm::bufinfo($bufid);  # debug...
        printf STDERR "Got mesg with tag $tag, $bytes bytes, tid t%x\n\n", $tid;
        if ($tag == 999) {  # If the message has that tag we passed pvm_notify...
            print STDERR "got a tag 999\n";
            Parallel::Pvm::unpack;  # This causes an error, see log below
            print STDERR "unpacked\n";
            next;
        }
        my ($tid, $command) = Parallel::Pvm::unpack;  # And for non-notify messages
        printf STDERR "received command $command from tid t%x\n", $tid;
        # ...go off and do whatever the message requested

A log of stderr when I reboot a host that's in the VM:

    ...
    Got mesg with tag 999, 4 bytes, tid t80000000
    got a tag 999
    libpvm [t40002]: pvm_upkstr(): End of buffer
    unpacked

And the matching pvml.uid log:

    [t80040000] beowulf (13.246.76.230:2411) SUN4 3.3.11
    [t80040000] ready Mon Mar 24 10:33:49 1997
    [t80040000] netinput() FIN|ACK from adrenaline
    [t80040000] hd_dump() ref 1 t100000 n "adrenaline" a "" ar "SUN4"
    [t80040000] lo "" so "" dx "" ep "" bx "" wd "" sp 1000
    [t80040000] sa 13.246.76.61:1081 mtu 4096 f 0x0 e 0 txq 0
    [t80040000] tx 6 rx 3 rtt 0.252347

Greg Owen { gowen@cs.tufts.edu,@xis.xerox.com }
http://www.cs.tufts.edu/~gowen/
1.01 GCS/GO d++ p+ c++ l++ u++ e+ -m+ s++/- n- h !(f)? g+ -w+ t+ r-- y?
"I want to permeate the air you breathe/slide my way under your skin/
place myself behind your eyes/and watch you, watch me, looking in."
Katell Keineg