Newsgroups: comp.parallel.mpi
From: Bruce Allen <ballen@dirac.phys.uwm.edu>
Subject: Re: having some troubles with mpich on our beowulf system
Organization: University of Wisconsin - Milwaukee, Physics Department
Date: Tue, 09 Jun 1998 18:59:09 -0500
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="------------ACF4FB81FFA158DE33DC3CB6"
Message-ID: <357DCC4D.86FAEF5A@dirac.phys.uwm.edu>


--------------ACF4FB81FFA158DE33DC3CB6
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Greg Lindahl wrote:

> Bruce Allen <ballen@dirac.phys.uwm.edu> writes:
>
> > We have noticed however that if immediately following the termination of
> > an mpirun-submitted job, we submit another one, then it often fails.
> > Doing ps -x shows that there are lots of rsh connections to the
> > different nodes, but a few that say
> >     (rsh  <zombie>)
> > where apparently the connection failed.
>
> If you also do "netstat -t" you'll see connections stuck forever in
> the CLOSE_WAIT state. It seems that for some combinations of kernels
> and ethernet drivers, they don't always time out properly. This is
> compounded by the fact that rsh uses a fixed starting point looking
> for ports, so it tends to reuse the same ones. If you re-order the
> hosts in your hosts file it will allow you to work around the problem.
>
> You can report the kernel bug to Alan Cox; I'm not sure if he believes
> I'm the only person who sees it, or what. I can get it on my x86 boxes
> as well.
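
For checking the "netstat -t" symptom described above across many nodes, a
small script can count the connections sitting in CLOSE_WAIT per remote
endpoint. This is only a minimal sketch: the function name count_close_wait
is hypothetical, the column layout assumed is the usual Linux one (Proto,
Recv-Q, Send-Q, Local Address, Foreign Address, State), and the sample text
is made up for illustration, not captured from a real cluster.

```python
def count_close_wait(netstat_output):
    """Count TCP connections in CLOSE_WAIT, grouped by remote address."""
    counts = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if (len(fields) >= 6
                and fields[0].startswith("tcp")
                and fields[5] == "CLOSE_WAIT"):
            remote = fields[4]
            counts[remote] = counts.get(remote, 0) + 1
    return counts


# Illustrative sample only (hypothetical hosts and ports).
SAMPLE = """\
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 master:1023             node01:shell            CLOSE_WAIT
tcp        0      0 master:1022             node02:shell            ESTABLISHED
tcp        0      0 master:1021             node01:shell            CLOSE_WAIT
"""

if __name__ == "__main__":
    for remote, n in sorted(count_close_wait(SAMPLE).items()):
        print(remote, n)
```

In practice the text would come from running "netstat -t" on the machine
that launches mpirun, right after a job ends and before the next one starts.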

Greg, thanks, this is very helpful.  A few questions:

   * Could you report a few kernel/driver combos that do/don't work
     properly?
   * What do you mean by "reordering hosts"?  Do you mean to re-order the
     hosts in the argument to -machinefile?  Or in /etc/hosts?  On each
     machine?  When starting each new MPI job?
   * I'd be glad to send a note to Cox -- could you send me his email
     address?
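
To make the second question concrete: if "reordering hosts" turns out to
mean the file given to -machinefile (only one possible reading; the thread
has not settled it), the reordering could be as simple as rotating the host
list before each run. A minimal sketch under that assumption follows; the
helper names rotate_hosts and parse_machinefile and the host names are
hypothetical.

```python
def parse_machinefile(text):
    """Return the non-empty lines of a machinefile, one host entry per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rotate_hosts(hosts, shift):
    """Return a copy of the host list rotated left by `shift` positions."""
    if not hosts:
        return []
    shift %= len(hosts)
    return hosts[shift:] + hosts[:shift]


if __name__ == "__main__":
    # Hypothetical machinefile contents.
    original = parse_machinefile("node01\nnode02\nnode03\n")
    # Rotate by one position per job so consecutive runs contact the
    # nodes in a different order.
    for job_number in range(3):
        print(job_number, rotate_hosts(original, job_number))
```

The rotated list would be written to a fresh file and passed to mpirun
with -machinefile for the next job.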

Cheers,
    Bruce

--------------ACF4FB81FFA158DE33DC3CB6--


