Newsgroups: comp.parallel.pvm
From: diego@cs.ualberta.ca (Diego Novillo)
Subject: Cannot execute PVM programs under LoadLeveler 1.3.0.2
Organization: Computing Science, U of Alberta, Edmonton, Canada
Date: Wed, 19 Feb 1997 10:25:37 -0700 (MST)
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Message-ID: <5efd3f$s22@scapa.cs.ualberta.ca>


We have been having problems trying to execute PVM programs under
LoadLeveler 1.3.0.2. We are using PVM 3.3.11 (non-SP) on an 8-node SP2
running AIX 4.1. We can run PVM programs manually without a problem, but
when we try to submit PVM jobs, LoadLeveler (LL) rejects them.

I went through all the configuration files and the submission file to make
sure that they included all the appropriate keywords described in the
manual and they look fine.

I also created a submission file for a sequential job that started PVM and
added the hosts itself. The job was accepted and ran fine. But every time I
ask LL to start PVM itself, it rejects the job. The troubleshooting section
of the manual does not seem to have anything related to this problem.

Has anyone experienced this problem before? I have included below the portion
of the StarterLog file that shows the error. I'd appreciate any suggestions
you may have.

Thanks in advance. Diego.

------------------------------------------------------------------------------
2/12 12:34:07 <Unitialized Job> ********** STARTER starting up ***********
[ ... deleted ... ]
2/12 12:34:07 husky2.22.0 This host (husky2.ucs.ualberta.ca) is the parallel system master.
[ ... deleted ... ]
2/12 12:34:09 Starting pvm:
2/12 12:34:09   exec /usr/local/pvm3311/lib/pvmd -nhusky2-en0 /tmp/ll_pvm.bootfile.35688
[ ... deleted ... ]
2/12 12:34:11 husky2.22.0 Master pvmd started. tid = t40001
2/12 12:34:11 husky2.22.0 Starter registered as PVM hoster.
2/12 12:34:11 husky2.22.0 next_state(ParallelStaging,Pvm3MasterStarted)->Pvm3StartupWait
2/12 12:34:11 husky2.22.0 next_state(Pvm3StartupWait,Pvm3GotRemoteCommands)->Pvm3HostsStarting
2/12 12:34:11 husky2.22.0 hoster() 4 to start, from_tid t80000000
2/12 12:34:11 husky2.22.0 PVM add-hosts: 0. t80000 husky4-en0 cmd="$PVM_ROOT/lib/pvmd -s -d0 -nhusky4-en0 1 c0a80002:1384 4096 2 c0a80004:0000 -f -S"
2/12 12:34:11 husky2.22.0 PVM add-hosts: 1. tc0000 husky3-en0 cmd="$PVM_ROOT/lib/pvmd -s -d0 -nhusky3-en0 1 c0a80002:1384 4096 3 c0a80003:0000 -f -S"
2/12 12:34:11 husky2.22.0 PVM add-hosts: 2. t100000 husky8-en0 cmd="$PVM_ROOT/lib/pvmd -s -d0 -nhusky8-en0 1 c0a80002:1384 4096 4 c0a80008:0000 -f -S"
2/12 12:34:11 husky2.22.0 PVM add-hosts: 3. t140000 husky7-en0 cmd="$PVM_ROOT/lib/pvmd -s -d0 -nhusky7-en0 1 c0a80002:1384 4096 5 c0a80007:0000 -f -S"
2/12 12:34:11 husky2.22.0 next_state(Pvm3HostsStarting,Pvm3StartRemoteHosts)->Pvm3AddhostLoop
2/12 12:34:41 husky2.22.0 next_state(Pvm3AddhostLoop,Pvm3AddhostMustwait)->Pvm3AddhostWait
2/12 12:34:41 husky2.22.0 next_state(Pvm3AddhostWait,Pvm3ResponsePending)->Pvm3ReceiveResponse
2/12 12:34:41 PVM_ADD_HOSTS: Remote Tid = 0
2/12 12:34:41 husky2.22.0 PVM Add hosts: Cannot start slave on host husky4-en0.

2/12 12:34:41 queue_signal(): Signal = 15
libpvm [t40001]: mxfer() EOF on pvmd sock
2/12 12:34:41 husky2.22.0 next_state(Pvm3ReceiveResponse,ParallelStartupFailure)->ParallelVacateSubsysWait
2/12 12:34:41 husky2.22.0 received signal 15 while waiting for child to die 
2/12 12:34:41 husky2.22.0 sent signal SIGKILL to child
2/12 12:34:41 husky2.22.0 Pid 29802 exited, termsig = 0, coredump = 0, retcode = 0
2/12 12:34:41 husky2.22.0 next_state(ParallelVacateSubsysWait,ParallelProcessExited)->UEEpilog
2/12 12:34:41 husky2.22.0 next_state(UEEpilog,NoUEEpilog)->Epilog
2/12 12:34:41 husky2.22.0 next_state(Epilog,NoEpilog)->Done
2/12 12:34:41 husky2.22.0 send_job_status: startd contacted, status=JOB_REJECTED
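------------------------------------------------------------------------------
In case it helps anyone reading the add-hosts lines above, here is a minimal
Python sketch that decodes the addresses in the slave pvmd command lines. It
assumes that tokens such as c0a80002:1384 are an IPv4 address and a port, both
written in hexadecimal; that reading is an assumption on my part, not something
the log states. The names LINE, ADDR and decode_addrs are made up for this
illustration and do not come from PVM or LoadLeveler.

```python
import ipaddress
import re

# One "PVM add-hosts" line copied from the log excerpt, joined onto one line.
LINE = ('2/12 12:34:11 husky2.22.0 PVM add-hosts: 0. t80000 husky4-en0 '
        'cmd="$PVM_ROOT/lib/pvmd -s -d0 -nhusky4-en0 1 c0a80002:1384 '
        '4096 2 c0a80004:0000 -f -S"')

# ASSUMPTION: eight hex digits, a colon, four hex digits = IPv4 address:port.
ADDR = re.compile(r'\b([0-9a-f]{8}):([0-9a-f]{4})\b')


def decode_addrs(line):
    """Return (dotted-quad address, port) pairs for each hex token in line."""
    return [(str(ipaddress.IPv4Address(int(ip, 16))), int(port, 16))
            for ip, port in ADDR.findall(line)]


if __name__ == "__main__":
    for addr, port in decode_addrs(LINE):
        print(addr, port)
```

If that reading is right, the first token in each command is the master pvmd
on husky2-en0 and the second is the slave being started, so the decoded
addresses can be checked against the *-en0 names the nodes actually resolve.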



