Newsgroups: comp.parallel.mpi
From: Luc.Vereecken@chem.kuleuven.ac.be (Vereecken Luc)
Subject: Re: MPICH Error Condition
Organization: KULeuvenNet
Date: Fri, 30 Jan 1998 15:54:20 GMT
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Message-ID: <886261825.704333@marvin.kulnet.kuleuven.ac.be>

Rob Cunningham <rkc@ll.mit.edu> wrote:


>I have an application where two processes talk to each other via the MPICH
>implementation of MPI.  Depending on the size of the messages that are sent, I
>can get one of the following to occur:

>Small size
>p1_893:  p4_error: interrupt SIGBUS: 10
>rm_l_1_894:  p4_error: interrupt SIGINT: 2

>Moderate size
>rm_l_1_1012:  p4_error: net_recv recv:  EOF on socket: 1

>Large size
>p1_1060:  p4_error: interrupt SIGSEGV: 11
>rm_l_1_1061:  p4_error: interrupt SIGINT: 2

>Extremely large size
>No error.
Or is the error just not visible?

This looks like a memory overwrite somewhere, with the offset of the
overwrite depending on the size of the message. SIGSEGV and SIGBUS
could indicate that the overwrite sometimes lands outside the segment
or causes other weird problems.
Can you reproduce this in a small test program, or does it only occur
in your main program?
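
One way to hunt for such an overwrite in a test program is to put guard
bytes around the message buffers and check them after each transfer.
What follows is only a rough sketch in plain C, not anything taken from
MPICH: the names guarded_alloc, guarded_check and guarded_free, the
guard length and the fill pattern are all made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  16     /* guard bytes on each side (arbitrary choice) */
#define GUARD_BYTE 0xA5   /* fill pattern for the guard zones (arbitrary) */

/* Hypothetical helper: allocate n payload bytes with a guard zone in
   front and behind. Returns the payload pointer, or NULL on failure. */
void *guarded_alloc(size_t n)
{
    unsigned char *raw = malloc(n + 2 * GUARD_LEN);
    if (raw == NULL)
        return NULL;
    memset(raw, GUARD_BYTE, GUARD_LEN);                 /* front guard */
    memset(raw + GUARD_LEN + n, GUARD_BYTE, GUARD_LEN); /* rear guard  */
    return raw + GUARD_LEN;
}

/* Hypothetical helper: returns 0 if both guard zones are intact,
   -1 if the front guard was modified, 1 if the rear guard was. */
int guarded_check(const void *payload, size_t n)
{
    const unsigned char *raw;
    size_t i;

    assert(payload != NULL);
    raw = (const unsigned char *)payload - GUARD_LEN;
    for (i = 0; i < GUARD_LEN; i++)
        if (raw[i] != GUARD_BYTE)
            return -1;
    for (i = 0; i < GUARD_LEN; i++)
        if (raw[GUARD_LEN + n + i] != GUARD_BYTE)
            return 1;
    return 0;
}

/* Hypothetical helper: release a buffer obtained from guarded_alloc. */
void guarded_free(void *payload)
{
    if (payload != NULL)
        free((unsigned char *)payload - GUARD_LEN);
}
```

In the test program you would hand the guarded buffer to MPI_Send and
MPI_Recv and call guarded_check after every call, once per message
size. A nonzero result says which side of the buffer got written to.
It only catches overwrites that land within GUARD_LEN bytes of the
buffer, so a clean result does not prove there is no overwrite.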

>I have noticed that MPICH seems to spawn an intermediate task to handle the
>communications, and it is this task that is reporting the problem, and in some
>cases, causing the child process (rank 1) to die.  
That seems logical: if the communication fails, a SIGINT is sent to
(all) the other processes.


>Details:

>Version: MPICH 1.1 (April, 1997)
>OS: Solaris 5.5.1
Is that Solaris 2 with SunOS 5?
>P4 communicator

Luc Vereecken


