From: don schad <dschad@frontiernet.net>
Newsgroups: comp.parallel.mpi
Subject: Linux LAM6.1 hboot timeout error
Date: 28 Sep 1998 21:51:02 GMT
Organization: Frontier Internet Rochester N.Y. (716)-777-SURF
Message-Id: <6up0c6$1foe$1@node17.cwnet.frontiernet.net>
User-Agent: tin/pre-1.4-980226 (UNIX) (AIX/4-2)


Hi,

I am having trouble getting LAM 6.1 to lamboot my multi-computer
properly. I have 2 nodes: breca (a Linux 2-cpu PPro running
Red Hat 5.1) and osiris (a 1-cpu P-MMX running Slackware).
When I use osiris as the origin node, lamboot bombs out
and returns a timeout error.  I can do a recon, which indicates
that all the nodes are up and running, but when I actually try
to boot the system it doesn't work:

	osiris:/usr/local/linuxshare/tests%recon -v pp2hosts
	recon: testing n0 (breca)
	recon: testing n1 (osiris)
	osiris:/usr/local/linuxshare/tests%lamboot -v pp2hosts 

	LAM 6.1 - Ohio Supercomputer Center

	hboot n0 (breca)...
	lamboot: hboot failed on n0 (breca)
	lamboot: Connection timed out

	wipe ...
	tkill n0 (breca)...
	osiris:/usr/local/linuxshare/tests%

When I use breca as the origin computer, everything works fine:

	breca:/usr/local/linuxshare/tests>recon -v pp2hosts 
	recon: testing n0 (breca)
	recon: testing n1 (osiris)
	breca:/usr/local/linuxshare/tests>lamboot -v pp2hosts 

	LAM 6.1 - Ohio Supercomputer Center

	hboot n0 (breca)...
	hboot n1 (osiris)...
	topology done      
	breca:/usr/local/linuxshare/tests>


I can boot a LAM using osiris and another computer, with either
of the two as the origin node, and it works fine either way.

rsh works from both computers, so I don't think it is
a permission problem, and when I rsh from osiris I can find
hboot in my path.
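
The rsh check described above can be sketched roughly as follows. This
is an illustration only, not a transcript of what was run: the function
names check_hboot and check_all and the RSH override variable are made
up for the example, and it assumes that "which" is available in the
remote non-interactive shell.

```sh
#!/bin/sh
# Sketch: confirm that a non-interactive rsh to each node can find hboot.
# RSH is a hypothetical override so the commands can be dry-run (RSH=echo).
RSH=${RSH:-rsh}

# check_hboot HOST: run "which hboot" on HOST through $RSH.
check_hboot() {
    $RSH "$1" which hboot
}

# check_all HOST...: check every host given, printing one line per host.
# stdin is redirected so rsh cannot swallow input meant for the caller.
check_all() {
    for host in "$@"; do
        printf '%s: ' "$host"
        check_hboot "$host" </dev/null || echo "hboot not found (or rsh failed)"
    done
}
```

Running "check_all breca osiris" once on each node covers both
directions. A non-interactive rsh may see a different PATH than a
login session does, which is why the check goes through rsh rather
than a local shell.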

I would appreciate any thoughts/help very much.

Thanks, 

don

Hydroqual, Inc.
-- 
-------------------------------------------------------
Brought to you sloooooooowly by Frontier Communications,
the Intermittent Service Provider. 

