Newsgroups: comp.parallel.mpi
From: Armando Lins Netto
Reply-To: armando@me.berkeley.edu
Subject: lam mpi error with linux
Organization: U.C. Berkeley
Date: Fri, 24 Jul 1998 15:09:29 -0700
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-2; x-mac-type="54455854"; x-mac-creator="4D4F5353"
Content-Transfer-Encoding: 7bit
Message-ID: <35B9060A.FBAE5E37@me.berkeley.edu>

I am running LAM MPI on a Linux system, and I have been getting an error after the code runs for a while (from hours up to a day). Basically, one or more messages get lost for no apparent reason. The lost messages cannot be found with mpimsg, and mpitask does not respond after this happens.

If I run the code again with exactly the same parameters (my code should give the same results each time), the problem does not repeat at the same point in the simulation. It happens again, but at a different point.

When I ran a debugger attached to all processes, I found that a segmentation fault occurs inside a LAM MPI routine (lam_send() and others) and crashes everything. All the parameters passed to MPI_Send were within the proper range and correct as far as my code expects.

The interesting thing is that only processes on the same machine get this segmentation fault. When I run again, it can happen on another machine, but then it only affects that machine's processes.

I have reduced the number of pending messages on the receive side, and the program ran longer than before, but it still crashes.

Does anyone know what could be happening and what I can do to fix it?

Thanks for any help!

Armando