From: Jim Tuccillo <jjt@radia1.com>
Newsgroups: comp.parallel.mpi
Subject: Re: Performance of overlapped communication and computation
Date: Sat, 09 Jan 1999 10:31:11 -0500
Organization: Posted via RemarQ, http://www.remarQ.com - Discussions start
    here!
Message-Id: <3697763F.3F54@radia1.com>
References: <76tkr7$efc$1@lux.doc.ic.ac.uk>
    <772v8p$8q7$1@nnrp1.dejanews.com>
    <3695C224.4557861A@mufasa.informatik.uni-mannheim.de>
    <77505v$749$1@lux.doc.ic.ac.uk> <775lro$t00$1@walter-fddi.cray.com>
    <7767ip$qq7$1@lux.doc.ic.ac.uk> <36974F60.59E2@radia1.com>
    <777qun$i35$2@lux.doc.ic.ac.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


Hi Bill,

Yes, you are correct. The SP will do communication during the
computational loop, so you don't wind up with all of the communication
actually taking place in the MPI_WAIT (which can leave the processor
stalled while it waits for communication to finish).


I hadn't tried threading your test code. The "nodes" on the SP are
actually SMPs, with either 2 or 4 CPUs per node. I'll try threading your
test program so that there are two threads of control: one will do the
MPI and the other will do the computational loop. The results to date
were generated using only a single CPU for each MPI task. The
communication will be from one 2-way node to another 2-way node. I would
expect the 3% overhead to disappear completely. That is perhaps not the
best use of the CPU, but I want to make sure that it works as it is
supposed to.
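
A minimal sketch of the two-threads-of-control arrangement, assuming
POSIX threads. This is not the actual test program: comm_thread is a
hypothetical stand-in for a blocking MPI_Send (it only walks the
buffer), and send_job, compute_loop, overlap_send_and_compute and
overhead_percent are names made up for illustration. A real version
would also need an MPI library that is safe to call from a second
thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the data handed to a blocking MPI_Send. */
typedef struct {
    const double *buf;   /* buffer to "send" */
    size_t count;        /* number of elements in buf */
    double checksum;     /* set by the comm thread so the caller can see it ran */
} send_job;

/* Communication thread: a real program would call MPI_Send here.
   This stand-in just reads every element of the buffer. */
static void *comm_thread(void *arg)
{
    send_job *job = arg;
    double sum = 0.0;
    for (size_t i = 0; i < job->count; i++)
        sum += job->buf[i];
    job->checksum = sum;
    return NULL;
}

/* Computational loop, run on the calling thread while the send proceeds. */
static double compute_loop(size_t iters)
{
    double acc = 0.0;
    for (size_t i = 1; i <= iters; i++)
        acc += 1.0 / (double)i;
    return acc;
}

/* Start the send on a second thread, compute, then join.
   The join plays the role that MPI_WAIT plays for a non-blocking send. */
double overlap_send_and_compute(send_job *job, size_t iters)
{
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, comm_thread, job);
    assert(rc == 0);
    double result = compute_loop(iters);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return result;
}

/* CPU overhead in percent, from tick counts measured for the
   computational loop without and with communication running. */
double overhead_percent(double ticks_without, double ticks_with)
{
    return 100.0 * (ticks_with - ticks_without) / ticks_without;
}
```

With one CPU per thread, the tick count of the computational loop
should be about the same with and without the send running, so
overhead_percent would come out near zero. As an illustration of the
arithmetic only, 100 ticks without communication and 103 ticks with
it correspond to a 3% overhead.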

Regards,
  Jim

William Knottenbelt wrote:
> 
> Hi Jim
> 
> : I'm not sure if you saw my previous post regarding the SP. The CPU
> : "overhead" is about 3% as determined by looking at the number of "ticks"
> : with and without communication. As indicated by others, despite the
> : presence of hardware to handle the communications in the SP, some work
> : must still be done by the CPU to process the communication protocol
> : which executes in user space.
> 
> Yes, one would not expect a total absence of CPU involvement. And at
> least the SP gives a proper overlap, so that any drop off in CPU
> utilization by the computation (say during a read of data from disk)
> can be taken up by the communication. I note that the overhead you
> measure is low compared to the overhead of 9% I get with a threaded
> MPI_Send on the AP3000 (which is the only way of getting a
> non-blocking send that I can see since MPI_Isend() doesn't do the
> job).
> 
> Cheers
> --
> William Knottenbelt
> Department of Computing
> Imperial College
> 180 Queens Gate
> South Kensington
> LONDON SW7 2BZ

-- 
=========================================================================
Jim Tuccillo			jjt@radia1.com
IBM				tuccillo@us.ibm.com
415 Loyd Road			voice: 770.487.6694, fax: same, call first
Peachtree City, GA 30269	http://www.radia1.com/jjt1
=========================================================================