Newsgroups: comp.parallel.mpi
From: zz@dart.cc.purdue.edu (Tony Z.)
Subject: p4_error: create_rm_processes: more slaves than msg queues
Organization: Purdue University, West Lafayette, IN
Date: 2 Aug 1997 00:21:09 -0500
Message-ID: <5rug45$1106@dart.cc.purdue.edu>

Hi, could anyone help me with the following problem?

I run a job on a set of SGI workstations with the MPICH implementation.
When I run the job on 2 SGIs, it works fine and completes. But if I run
it on 3 or more workstations, I get the following error messages:

========================
p0_10401: p4_error: net_recv recv: EOF on socket: 732
rm_6106: p4_error: create_rm_processes: more slaves than msg queues : 2
rm_3800: p4_error: net_recv recv: EOF on socket: 555364
rm_l_1_3801: p4_error: interrupt SIGINT: 2
bm_list_10402: p4_error: interrupt SIGINT: 2
P4 procgroup file is /home/zz/test/hosts4.sgi.
==========================

The file '/home/zz/test/hosts4.sgi' lists 4 workstation names and the
executable path (all machines share the same file system), like this:

===========hosts4.sgi============
sgi1 0 /home/zz/test/a.out
sgi2 1 /home/zz/test/a.out
sgi3 2 /home/zz/test/a.out
sgi4 3 /home/zz/test/a.out
=================================

I start the job from machine 'sgi1':

==============
mpirun -p4pg /home/zz/test/hosts4.sgi /home/zz/test/a.out
=============

Right after it starts, the above error messages appear and the program
is interrupted. If I remove any two of the last three machines from the
host list file 'hosts4.sgi' and run the same command, it runs fine.

I tried both an old version of MPICH and the latest one, and the problem
is the same.

If you have any idea what the problem could be, please email me at
zz@cc.purdue.edu

Thanks.

-Tony