Newsgroups: comp.parallel.mpi
From: Bruce Allen <ballen@dirac.phys.uwm.edu>
Subject: having some troubles with mpich on our beowulf system
Organization: University of Wisconsin - Milwaukee, Physics Department
Date: Mon, 08 Jun 1998 21:46:45 -0500
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="------------904CB5ED14ABF97CD52CEC63"
Message-ID: <357CA215.2856E656@dirac.phys.uwm.edu>


--------------904CB5ED14ABF97CD52CEC63
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

We have a Beowulf system at UWM, composed of 48 Alpha-based nodes
running Linux 2.0.30.

Normally mpich works fine on the system.  We generally start processes
using the mpirun script, which uses rsh to start processes on the
different nodes.  [I.e., we normally do NOT use the secure server.]

We have noticed, however, that if we submit another job immediately
after an mpirun-submitted job terminates, the new job often fails.
Doing ps -x shows that there are lots of rsh connections to the
different nodes, but a few entries that say
    (rsh  <zombie>)
where apparently the connection failed.
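
For concreteness, the check described above amounts to something like
the following sketch (Python).  The helper name count_zombie_rsh is made
up for illustration, and it assumes ps output formatted as shown above;
it is not part of mpich or mpirun.

```python
def count_zombie_rsh(ps_output: str) -> int:
    """Count lines of ps output that show a dead rsh process.

    Hypothetical helper: it only does substring matching on text that
    has already been captured from a command such as "ps -x".
    """
    count = 0
    for line in ps_output.splitlines():
        # Older Linux ps marks these entries "<zombie>"; newer procps
        # versions print "<defunct>" instead, so accept either marker.
        is_dead = "<zombie>" in line or "<defunct>" in line
        if is_dead and "rsh" in line:
            count += 1
    return count


# Made-up sample in the spirit of the listing above, not real output.
sample = (
    "  101  ?  S  0:00 rsh n001 -n someprogram\n"
    "  102  ?  Z  0:00 (rsh  <zombie>)\n"
)
zombies = count_zombie_rsh(sample)
```

A nonzero count right after a job ends would correspond to the state in
which the next mpirun submission tends to fail.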

Could anyone suggest why this happens, or how to prevent it?  Details
of our machine configuration may be found at
www.lsg-group.phys.uwm.edu/~www/docs/beowulf/index.html

We also tried using the secure server to replace the normal way of
spawning processes.  A curious thing happens: it apparently fails for
two reasons:

    1.  Although the server is running on every node, and we have the
        environment variables set as specified in the manual, we STILL
        see an rsh starting up for every process.  Is this right?  It
        often fails for the same reason as above!

    2.  If we specify more than 32 nodes, the code fails.  netstat shows
        that we have connections open to port 1234 for nodes n000->n031.
        Is there some parameter that needs to be changed in mpich or in
        the Linux 2.0.30 kernel?

By the way, I would appreciate copies of any suggestions sent to
ballen@dirac.phys.uwm.edu, as I don't normally browse this group.

Cheers,
    Bruce Allen


--------------904CB5ED14ABF97CD52CEC63--


