Newsgroups: comp.parallel.mpi
From: Dominic Baines <rdab100@NOSPAM.hermes.cam.ac.uk>
Subject: Re: LAM MPI & Linux Install problem - Please help.
Organization: University of Cambridge, England
Date: Sun, 26 Jul 1998 13:01:31 +0100
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <35BB1A9B.4B4B89C3@NOSPAM.hermes.cam.ac.uk>

Dominic Baines wrote:

> I can build and install the LAM61 MPI distribution on Linux no problem.
>
> 8 PC's on their own network 192.168.3.x
> Linux 2.0.34 kernel
> Can ping each machine, telnet and rsh login ok (need to provide
> password)
>
> However, when I run recon -v lamhosts from one of the nodes I get the message:
>
> recon: testing n0 (enterprise.domscluster)
> recon: testing n1 (kirk.domscluster)
> Permission denied.
> recon: "kirk.domscluster" cannot be booted.
>
> Is there a simple solution ? Do I have to add paths, shell
> configurations etc ?
>
> Dominic

I may be a little closer, but it still falls short of success.
Could someone suggest what may be wrong?

Each machine has the users root (obviously) and rdab100.

In rdab100's home directory is a .rhosts file:

# 8 node LAM 'The Enterprise'
enterprise.domscluster
kirk.domscluster
scotty.domscluster
spock.domscluster
sulu.domscluster
uhra.domscluster
chekov.domscluster
mccoy.domscluster

The /etc/hosts.equiv file contains similar entries. The /etc/hosts file on
each machine maps each name and alias to its IP address.

I obtained LAM and, using the Linux configuration, built it as user root.

I have a separate drive on each machine, which I put at /data, with LAM at
/data/LAM/lam61.

rdab100 has the LAM binaries directory added to its PATH:
export PATH=$PATH:/data/LAM/lam61/bin
This line is in the user's .bash_profile.

The contents of /data/LAM/lam61/bin are all mode 755 except hcp* and
sweep*, which are 655 (?).
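
To double-check those permissions, something like the following could list every regular file in the bin directory that is not exactly mode 755. It is only a sketch: list_not_755 is a name made up here, and it assumes a POSIX find. Mode 655 means the owner has no execute bit; if the files belong to root and rdab100 falls under "other" (r-x), that is probably harmless, but it is easy to rule out.

```shell
# List regular files under a directory whose permission bits are not
# exactly 755. (list_not_755 is a hypothetical helper name.)
list_not_755() {
    dir=$1
    # "-perm 755" matches the mode exactly; "!" inverts the match.
    find "$dir" -type f ! -perm 755 -print | sort
}
```

It would be called as list_not_755 /data/LAM/lam61/bin on each node.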

Surprise, surprise, lamhosts has the same entries as the .rhosts file.
I keep a copy of it in rdab100's home directory, as it didn't
seem to be picked up from the /data/LAM/lam61 directory.
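
Since the two files are maintained by hand, a small check that every host in lamhosts also appears in .rhosts might be worth having. The sketch below assumes one hostname in the first field of each line and "#" comments, as in the files above; host_names and missing_hosts are names invented for illustration.

```shell
# Print the hostnames from a host file, one per line, sorted and unique.
# Strips "#" comments and blank lines and keeps only the first field.
host_names() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }' | sort -u
}

# Print hosts that appear in the first file but not in the second.
missing_hosts() {
    a=$(mktemp)
    b=$(mktemp)
    host_names "$1" > "$a"
    host_names "$2" > "$b"
    # comm -23 keeps lines unique to the first (sorted) input.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Running missing_hosts lamhosts ~/.rhosts should print nothing if the two lists agree.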

I can run lamboot -v as user rdab100 right after logging in to any
machine. If I rsh <newhost> it logs me in without a password, and once again I
can run lamboot -v and the local node starts. I run lamclean -v afterwards
just to make sure it is not left up. However, I can't initialise the cluster by
running recon -v lamhosts.

The screen output is:

recon: testing n0 (enterprise.domscluster)
recon: testing n1 (kirk.domscluster)
bash: tkill: command not found
recon: "kirk.domscluster" cannot be booted.

and the process stops.
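
One thing that could be checked here is whether a non-interactive remote shell can find tkill at all, since bash reads .bash_profile only for login shells and a command passed to rsh does not get one. The sketch below is an assumption-laden helper, not anything from LAM: check_remote_cmd and REMOTE_SHELL are names made up here, and it compares the remote output rather than the exit status because classic rsh does not pass the remote exit status back.

```shell
# Remote shell program to use; defaults to rsh but can be overridden.
: "${REMOTE_SHELL:=rsh}"

# For each host in a host file, ask a non-interactive remote shell whether
# it can find the given command. Prints "host: ok" or "host: MISSING".
check_remote_cmd() {
    hostfile=$1
    cmd=$2
    sed -e 's/#.*//' "$hostfile" | awk 'NF { print $1 }' |
    while read -r host; do
        # stdin is /dev/null so the remote shell cannot swallow the host list.
        result=$("$REMOTE_SHELL" "$host" \
            "type $cmd >/dev/null 2>&1 && echo found" </dev/null 2>/dev/null)
        if [ "$result" = "found" ]; then
            echo "$host: ok"
        else
            echo "$host: MISSING"
        fi
    done
}
```

It would be run as check_remote_cmd lamhosts tkill from the node where recon fails. A MISSING line for kirk.domscluster would suggest the PATH set in .bash_profile is not seen by commands started through rsh.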

Apart from IP address, hostname, etc., all the machines are identical
installations.

Can anyone advise?

Dominic


