Newsgroups: comp.parallel.pvm
From: Christian Worley
Subject: More info on: Re: How to handle multiple network interfaces?
Organization: Alta Technology
Date: Mon, 16 Mar 1998 17:26:42 -0700
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <350DC342.457B161E@altatech.com>

I've traced this problem into pvmd's "work" routine.

I'm using the "so=ms" option on my slave, and starting each pvmd
manually (on the host and client):

    pvmd -d0xff -n

It responds by telling me the way to start pvmd on the other node,
which I type in. That pvmd gives me the response to give back to the
first node, which I enter (and it says "Thanks").

I can use "netstat -a" to see both daemons sitting on their ports.

I can use "tcpdump" (on both machines) to see the host sending stuff
through its port to the slave's port using the desired interface (and
neither side is using the other interface for any PVM communications).

The slave never replies. It's sitting on a "select" in "work()"
(pvmd.c, approx line 1480)... every once in a while, it times out,
reports nothing to receive, and goes back to sitting in "select".

At the same time, "netstat -a" shows that there's nothing queued, on
either the sender or receiver side of the sockets. The data seems to
be dropped on the floor as if it were meant to be.

It's interesting to send a HUP to the slave "pvmd" and see it signal
the host, telling it of its demise, while the host never gets that
packet... even though the port numbers in the packets are correct for
both ends:

%netstat -i myri0
tcpdump: listening on myri0
17:06:04.325196 arp who-has root_m tell node1_m
17:06:04.325196 arp reply root_m is-at 0:60:dd:7f:f7:37
[Both sides are started... the slave is node1_m using port 1082, the
host is root_m using port 1136...]
17:06:35.616211 root_m.1136 > node1_m.1082: udp 48
17:06:37.649414 root_m.1136 > node1_m.1082: udp 48
17:06:41.713867 root_m.1136 > node1_m.1082: udp 48
17:06:49.958008 root_m.1136 > node1_m.1082: udp 48
[The previous message repeats every few seconds, while the slave
(node1_m) is sitting on a "select" receiving nothing, and "netstat -a"
is saying nothing is held up in a queue]
[So, I kill pvmd on the slave, and you see it wake and signal the root
of its demise...]
17:06:51.215821 node1_m.1082 > root_m.1136: udp 16
[But the root never gets the message, and keeps on trying to signal
the slave]
17:07:06.215821 root_m.1136 > node1_m.1082: udp 48
17:07:06.215821 node1_m > root_m: icmp: node1_m udp port 1082 unreachable [tos 0xc0]

The final message, reporting the port unreachable only after the
slave's pvmd has exited, indicates the system knew the port was open
until then... on both ends. But no messages are being received by
either side!

Does this ring any bells?

Also, I'm on PVM version 3.4.beta6.

Thanks,

Chris

P.S.: here's just a copy of the previous post:

Christian Worley wrote:
>
> Hi,
>
> I've got two machines running 100BT on the same network (call them N1
> and N2).
>
> The same two machines are connected with Myricom network boards (So,
> N1 and N2 also have the names X1 and X2). I can ping, ftp, netperf,
> etc... across either network. For the Myricom network, I use dummy IP
> addresses of 191.0.0.1 and 191.0.0.2, with a netmask of 255.255.255.0,
> and setup the routing appropriately.
>
> The PVM and home directories are mounted on both machines via NFS...
> same location. "rsh" works across either interface just fine.
>
> Running apps distributed by pvm across N1 and N2 is no problem.
>
> But, when I use the other network interface, pvm hangs.
>
> For example, I run "xpvm" from N1/X1 with the argument "-N X1", and a
> hosts file that includes X1 and X2.
>
> I look on X2, and a pvmd has been started...
> it's command line args
> show that it's using the myricom interface (the pvmd in the process
> queue on X1 looks good too).
>
> But, xpvm is hung at that point. After timeout, it gives the message:
>
> Connecting to PVMD already running... XPVM 1.2.4 connected as
> TID=0x40001.
> [globs.tcl][procs.tcl][util.tcl]
> Initializing XPVM.............................................. done.
> libpvm [t40001]: pvm_host_sync(): Host failed
> libpvm [t40001]: Host Sync: Host failed
>
> Pvm reacts similarly. Apps using pvm timeout after a bit, and only
> run processes on the host node.
>
> I'm running RedHat 5.0/Linux 2.0.30 on Alpha PC164's.
>
> Does anyone else have success/problems with PVM when multiple network
> interfaces exist? Could the myricom boards be the problem (does pvm
> do some network trick that ping, ftp, or netperf wouldn't)?
>
> Any ideas on what's going wrong?
>
> Thanks,
>
> Chris
> --
> When I die, please cast my ashes upon Bill Gates
> --for once, let him clean up after me!

--
"Would you buy a car with the hood welded shut?"
-- George Bonser
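
One way to separate PVM from the network path, given the symptom in
this thread (tcpdump shows UDP datagrams arriving on the Myricom
interface while the daemon's "select" never returns them), might be a
small probe that binds a UDP socket to one specific local address and
waits in select() with a timeout. This is only a sketch and is not
pvmd's code: open_udp and wait_for_datagram are hypothetical helper
names, and the demo uses loopback so it runs on a single machine. On
the two nodes, the bind addresses would presumably be the Myricom ones
(191.0.0.1 and 191.0.0.2 in the quoted post).

```python
import select
import socket


def open_udp(bind_addr, port=0):
    """Create a UDP socket bound to one specific local address.

    Binding to a single address (rather than all interfaces) means only
    datagrams sent to that address are delivered to this socket.
    port=0 lets the kernel pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def wait_for_datagram(sock, timeout):
    """Wait in select() for up to `timeout` seconds.

    Returns (data, sender_address) if a datagram arrived,
    or None if select() timed out with nothing to receive.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    return sock.recvfrom(4096)


if __name__ == "__main__":
    # Loopback demo (assumption: replace 127.0.0.1 with the address of
    # the interface under test when running across two machines).
    receiver = open_udp("127.0.0.1")
    sender = open_udp("127.0.0.1")

    sender.sendto(b"probe", receiver.getsockname())
    print("first wait: ", wait_for_datagram(receiver, 2.0))
    print("second wait:", wait_for_datagram(receiver, 0.2))

    receiver.close()
    sender.close()
```

If a probe like this receives datagrams over the Myricom addresses but
pvmd does not, the boards and driver are less likely to be at fault
than the way the daemons are addressed; if the probe also times out,
the UDP path itself is suspect.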