From: Anders Jorgensen <ajorg@lanl.gov>
Newsgroups: comp.parallel.pvm
Subject: Re: BUG in pvmd?
Date: Mon, 26 Apr 1999 15:49:07 -0600
Organization: Los Alamos National Laboratory
Message-Id: <3724DF53.C1993125@lanl.gov>
References: <3724DCE1.B8A3F87F@lanl.gov>
Reply-To: anders.jorgensen.1998@alum.bu.edu
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Xref: ukc comp.parallel.pvm:8311


A followup thought:
the pvmd.c code is used for both the slave pvmds and the master pvmd,
so rather than replacing the pvmbailout(0) with error logging outright,
we need to replace it with error logging only if this pvmd is the
master. How do we know if it is?

if (slave)
    pvmbailout(0);
else if (master)
    pvmlogerror("Yucky yucky ... etc....\n");

Does this make sense? But what do I put for slave and master?
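
A rough sketch of the shape I have in mind, written as standalone C so it
compiles on its own. Everything here is made up for illustration: the
struct, its field names, and the function names are NOT taken from the PVM
source. The quoted code does reference a host table (hosts->ht_hosts), so
that is probably the place to look for the real equivalent, but I have not
confirmed which fields identify the master.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the pvmd host table. The field names are
 * invented for this sketch and are not taken from the PVM source. */
struct host_table {
    int master_index;   /* index of the master pvmd's host entry */
    int local_index;    /* index of this pvmd's own host entry */
};

/* What to do after sendto() fails with an error we do not retry. */
enum send_action { ACTION_BAIL, ACTION_LOG_AND_CONTINUE };

/* This pvmd is the master when its own entry is the master entry. */
static int is_master(const struct host_table *ht)
{
    assert(ht != NULL);
    return ht->local_index == ht->master_index;
}

/* Slaves keep the existing behaviour (bail out); the master only logs. */
static enum send_action on_send_failure(const struct host_table *ht)
{
    return is_master(ht) ? ACTION_LOG_AND_CONTINUE : ACTION_BAIL;
}
```

In netoutput() the result would pick between pvmbailout(0) and the
pvmlogerror() call, assuming the real host table can answer the
"am I the master" question this way.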
                    ----anders

Anders Jorgensen wrote:
> 
> I'm having some problems with my master pvmd bailing out. Sometimes it
> happens after hours of operation, sometimes after days of operation. Here
> is the end of /tmp/pvml.xxxx:
> 
> [t80000000] 04/26 10:47:23 stdout@tontoa3: EOF
> [t80040000] 04/26 10:47:23 startack() host tontoa3 expected version, got "PvmCantStart"
> [t80040000] 04/26 11:08:18 netinput() FIN|ACK from radio
> [t80040000] 04/26 11:08:18  hd_dump() ref 1 t 0x12c0000 n "radio" a "" ar "SUN4SOL2" dsig 0x658eb59
> [t80040000] 04/26 11:08:18            lo "" so "" dx "/n/fedtmule/home/ajorg/packages/pvm3/lib/pvmd" ep "/n/fedtmule/home/ajorg/pvm3/bin/SUN4SOL2/" bx "" wd "" sp 1000
> [t80040000] 04/26 11:08:18            sa 128.165.207.236:33902 mtu 4080 f 0x0 e 0 txq 0
> [t80040000] 04/26 11:08:18            tx 78 rx 43 rtt 0.005853
> [t80040000] 04/26 11:08:18 netoutput() sendto: Connection refused
> [t80040000] 04/26 11:08:18 pvmbailout(0)
> 
> As you can see, netoutput bails out. Here is the code around the spot
> where the bailout is called:
> 
> #if 0
>                 /* drop (don't send) random packets */
>                 if (!(random() & 3)) {
>                         pvmlogerror("netoutput() darn, dropped one\n");
>                         cc = -1;
>                 } else
> #endif
>                         if ((cc = sendto(netsock, cp, len, 0,
>                                         (struct sockaddr*)&hp->hd_sad, sizeof(hp->hd_sad))) == -1
>                         && errno != EINTR
> #ifndef WIN32
>                         && errno != ENOBUFS
> #endif
>                         /* hope this works for all archs, not just linux */
>                         && errno != ENOMEM
>                         ) {
>                                 pvmlogperror("netoutput() sendto");
> #if defined(IMA_SUN4SOL2) || defined(IMA_X86SOL2) || defined(IMA_SUNMP) || defined(IMA_UXPM) || defined(IMA_UXPV)
>         /* life, don't talk to me about life... */
>                                 if (errno == ECHILD)
>                                         pvmlogerror("this message brought to you by solaris\n");
>                                 else
> #endif
>                                 pvmbailout(0);
>                         }
> #ifdef  STATISTICS
>                 if (cc == -1)
>                         stats.sdneg++;
>                 else
>                         stats.sdok++;
> #endif
> 
>                 BCOPY(dummy, cp, sizeof(dummy));        /* restore under header */
> 
>         /*
>         * set timer for next retry
>         */
>                 if (cc != -1) {
>                         if ((pp->pk_flag & (FFFIN|FFACK)) == (FFFIN|FFACK)) {
>                                 pk_free(pp);
>                                 if (hp != hosts->ht_hosts[0]) {
> 
> In 3.4.0b7 it is around line 1939, and in 3.4.0 it is around line 1981.
> 
> It appears to me that pvmd bails out because it gets a connection
> refused. Is that a correct interpretation? Why not have it retry? From
> what I read, it appears that pvmd will retry its send IF the error was
> one of a select group (EINTR, ENOBUFS, ENOMEM). If it is not, then it
> will bail. So the connection refused error is not part of that group?
> Would there be any problem with adding it to that group? Or would there
> be any problem with having it retry no matter what the error? I don't
> really want the master pvmd to bail on me just because it can't contact
> a host.
> 
> Would there be any problem in simply replacing the pvmbailout(0) call
> with:
> 
> pvmlogerror("Yucky Yucky error - but I'm charging on!\n");
> 
> Any comments?
> 
> Bailing master pvmds suck.  ---anders / ajorg@lanl.gov
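
To make the retry question from the quoted message concrete, here is a
small standalone sketch of the errno test pulled out into a function. The
three "retry" errors are the ones the quoted netoutput() code already
exempts from the bailout. The enum, the function name, and the separate
handling of ECONNREFUSED are my own invention, not anything from the PVM
source, and whether it is actually safe to keep going after a refused
connection is exactly the open question.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical verdicts for a failed sendto(); names invented here. */
enum send_verdict {
    SEND_RETRY,      /* transient: let the retry timer resend */
    SEND_PEER_DOWN,  /* peer unreachable: log it, do not bail */
    SEND_FATAL       /* anything else: existing bailout path */
};

/* Classify the errno left behind by a failed sendto().
 * err must be a real error code, so 0 is rejected. */
static enum send_verdict classify_send_errno(int err)
{
    assert(err != 0);
    switch (err) {
    case EINTR:         /* already exempted in the quoted code */
    case ENOBUFS:       /* already exempted (except on WIN32) */
    case ENOMEM:        /* already exempted */
        return SEND_RETRY;
    case ECONNREFUSED:  /* proposed addition, untested assumption */
        return SEND_PEER_DOWN;
    default:
        return SEND_FATAL;
    }
}
```

Keeping ECONNREFUSED as its own verdict instead of folding it into the
retry group leaves room to treat the host as dead and clean it up,
rather than resending to it forever.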

