Newsgroups: comp.parallel.pvm
From: JonesMB <jmb@uakron.edu>
Subject: pvm and SUNMP and xpvm
Organization: No, not organized !!
Date: Tue, 01 Jul 1997 11:21:21 -0500
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <33B92E81.5019@uakron.edu>

Hello all,

I have PVM running on a mix of Sun and Silicon Graphics machines.
The architectures are SGI5, SGI64, SUN4SOL2 and SUNMP.  We are having
problems with the SUNMP hosts (dual-processor Sparc20 and Ultra).  After
the programs have been running for about 2 or 3 hours, the pvmd on the
SUNMP machine goes to sleep and no longer communicates with the other
pvmds.  This is what shows up in a truss (system call trace) of the pvm
daemon on the SUNMP host:
-----------
5784:   poll(0xEFFFDA90, 8, 4)                          = 0
5784:   poll(0xEFFFDA90, 8, 0)                          = 0
5784:   putmsg(8, 0xEFFFF944, 0xEFFFF8A8, 0)            = 0
5784:   poll(0xEFFFDA90, 8, 4)                          = 1
5784:   getmsg(8, 0xEFFFF930, 0xEFFFF93C, 0xEFFFF91C)   = 0
5784:   poll(0xEFFFDA90, 8, 59996)      (sleeping...)
5784:   poll(0xEFFFDA90, 8, 59996)                      = 1
5784:   getmsg(8, 0xEFFFF930, 0xEFFFF93C, 0xEFFFF91C)   = 0
5784:   kill(5787, SIGINT)                              = 0
5784:   putmsg(8, 0xEFFFF944, 0xEFFFF8A8, 0)            = 0
5784:   poll(0xEFFFDA90, 8, 48186)                      = 1
5784:   read(9, " T F C   -   l o c a l  ".., 3999)     = 24
5784:   putmsg(8, 0xEFFFF944, 0xEFFFF8A8, 0)            = 0
5784:   lwp_mutex_lock(0xEE600000)      (sleeping...)
5784:   signotifywait()                                 = 15
5784:   lwp_sigredirect(1, SIGTERM)                     = 0
5784:       Received signal #15, SIGTERM, in lwp_mutex_lock() [caught]
5784:         siginfo: SIGTERM pid=6221 uid=1022
5784:   lwp_mutex_lock(0xEE600000)                      Err#4 EINTR
5784:   signotifywait()                                 = 25
5784:   sigaction(SIGTERM, 0xEFFFF228, 0x00000000)      = 0
5784:   sigprocmask(SIG_SETMASK, 0xEF6C36E4, 0x00000000) = 0
5784:   sigaction(SIGINT, 0xEFFFF098, 0xEFFFF218)       = 0
5784:   sigaction(SIGTERM, 0xEFFFF098, 0xEFFFF218)      = 0
5784:   write(6, " [ t 8 0 0 c 0 0 0 0 ]  ", 12)        = 12
5784:   write(6, " c a t c h ( )   c a u g".., 25)      = 25
5784:   write(6, " [ t 8 0 0 c 0 0 0 0 ]  ", 12)        = 12
5784:   write(6, " p v m b a i l o u t ( 1".., 15)      = 15
5784:   unlink("/tmp/pvmd.1022")                        = 0
5784:   shmctl(1300, 10, 0)                             = 0
5784:   shmdt(0xEE600000)                               = 0
5784:   shmctl(22, 10, 0)                               = 0
5784:   shmdt(0xEE780000)                               = 0
5784:   shmctl(21, 10, 0)                               = 0
5784:   shmdt(0xEE900000)                               = 0
5784:   shmctl(20, 10, 0)                               = 0
5784:   kill(5785, SIGTERM)                             = 0
5784:   kill(5786, SIGTERM)                             = 0
5784:   kill(5787, SIGTERM)                             = 0
5784:   putmsg(8, 0xEFFFF12C, 0xEFFFF090, 0)            = 0
5784:   lseek(0, 0, SEEK_CUR)                           = 0
5784:   _exit(15)
-----------
Notice the system call "lwp_mutex_lock(0xEE600000)".  This is the point
at which the daemon stops talking to anyone.  A pvm_halt() from the
master stopped all the other pvmds in the VM (5 of them) but timed out
trying to communicate with the daemon on the SUNMP machine.
I logged in to the offending machine and tried to start the pvm
console; this also failed.  The daemon finally went away after it was
sent a kill -15 (SIGTERM).
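
For anyone who wants to cross-check the trace, here is a minimal Python
sketch (not part of PVM; the function names are made up for
illustration).  It scans a few lines copied from the trace above and
reports any lwp_mutex_lock() address that is also passed to shmdt().
In this trace 0xEE600000 shows up in both calls, which suggests the
daemon was blocked on a lock that lives in a shared memory segment.
The sketch only compares exact addresses, so a lock at an offset inside
a segment would be missed.

```python
import re

# A few lines copied from the truss output above (pid column kept).
TRACE = """\
5784:   lwp_mutex_lock(0xEE600000)      (sleeping...)
5784:   lwp_mutex_lock(0xEE600000)                      Err#4 EINTR
5784:   shmdt(0xEE600000)                               = 0
5784:   shmdt(0xEE780000)                               = 0
5784:   shmdt(0xEE900000)                               = 0
"""

# "<pid>:   <syscall>(<first hex argument>"
CALL_RE = re.compile(r"^\s*\d+:\s+(\w+)\((0x[0-9A-Fa-f]+)")


def first_args(trace, syscall):
    """Return the set of first arguments (as ints) passed to `syscall`."""
    found = set()
    for line in trace.splitlines():
        m = CALL_RE.match(line)
        if m and m.group(1) == syscall:
            found.add(int(m.group(2), 16))
    return found


def mutexes_in_shared_memory(trace):
    """Mutex addresses equal to the base address of a detached segment."""
    return first_args(trace, "lwp_mutex_lock") & first_args(trace, "shmdt")


if __name__ == "__main__":
    for addr in sorted(mutexes_in_shared_memory(TRACE)):
        print(hex(addr))
```

Run as a script against the sample lines, it prints 0xee600000.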

Below are the contents of /tmp/pvml.UID on the master host:
-----------
# cat /tmp/pvml.1022
[t80040000] ratbert (172.16.10.50:51770) SUN4SOL2 3.3.11
[t80040000] ready Mon Jun 30 15:15:07 1997
[t80040000] netoutput() sendto: No child processes
[t80040000] this message brought to you by solaris
[t80040000] [tc0003] TFC - local log enabled
[t80040000] netoutput() timed out sending to catbert after 20, 184.307732
[t80040000]  hd_dump() ref 1 tc0000 n "catbert" a "" ar "SUNMP"
[t80040000]            lo "" so "" dx "" ep "" bx "" wd "" sp 1000
[t80040000]            sa 172.16.10.30:49056 mtu 8160 f 0x0 e 0 txq 4
[t80040000]            tx 556 rx 571 rtt 0.008182
[t80040000] dm_halt() from (ratbert), halting...
[t80040000] work() pvmd halting
[t80040000] pvmbailout(0)
-----------
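
When several hosts are involved, it can help to pull the timeout and
hd_dump() lines out of the pvmd log mechanically.  The following is a
rough Python sketch under stated assumptions: the regular expressions
are guessed from the log lines shown above, not taken from any PVM
documentation, and the two numbers after "after" are returned as opaque
values because their meaning is not stated in the log.  The function
names are hypothetical.

```python
import re

# Sample lines copied from the log above, including the wrapped line.
LOG = """\
[t80040000] netoutput() sendto: No child processes
[t80040000] netoutput() timed out sending to catbert after 20,
184.307732
[t80040000]  hd_dump() ref 1 tc0000 n "catbert" a "" ar "SUNMP"
[t80040000] dm_halt() from (ratbert), halting...
"""

# \s* after the comma tolerates a line wrap added by a mail/news client.
TIMEOUT_RE = re.compile(
    r"netoutput\(\) timed out sending to (\S+) after (\d+),\s*([\d.]+)"
)
# Host name (n "...") and architecture (ar "...") from an hd_dump() line.
HD_DUMP_RE = re.compile(r'hd_dump\(\).*?\bn "([^"]*)".*?\bar "([^"]*)"')


def timed_out_hosts(log):
    """Return (host, first_number, second_number) for each timeout line."""
    return [
        (m.group(1), int(m.group(2)), float(m.group(3)))
        for m in TIMEOUT_RE.finditer(log)
    ]


def host_architectures(log):
    """Map host name to architecture string, as reported by hd_dump()."""
    return {m.group(1): m.group(2) for m in HD_DUMP_RE.finditer(log)}


if __name__ == "__main__":
    archs = host_architectures(LOG)
    for host, first, second in timed_out_hosts(LOG):
        print(host, archs.get(host, "unknown"), first, second)
```

With the sample lines, this reports catbert as the host that timed out
and SUNMP as its architecture.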

Any ideas, anyone?

I am also looking for xpvm binaries for SUN4SOL2 or SUNMP (running on a
Sparc 5, 20 or Ultra).  I have tried to compile it myself, but my Tcl/Tk
libraries are broken and it will take some work to fix them.  I figure
it will be easier to get precompiled binaries now and fix my libs later.

Regards
JonesMB

-- 
Jones MB : <jonesmb@ziplink.net>
           http://www.ecgf.uakron.edu/~el6501/index.html 

Reason #191 to fear technology...

   o      o     o    o     o    <o     <o>    o>    o
  .|.    \|.   \|/   //    X     \      |    <|    <|>
   /\     >\   /<    >\   /<     >\    /<     >\    /<

Mr. Asciihead learns the Macarena.

