From: salo@cray.com (Eric Salo)
Newsgroups: comp.parallel.mpi
Subject: Re: Performance of overlapped communication and computation
Date: 8 Jan 1999 19:21:28 GMT
Organization: Silicon Graphics, Inc.
Message-Id: <775lro$t00$1@walter-fddi.cray.com>
References: <76tkr7$efc$1@lux.doc.ic.ac.uk>
    <772v8p$8q7$1@nnrp1.dejanews.com>
    <3695C224.4557861A@mufasa.informatik.uni-mannheim.de>
    <77505v$749$1@lux.doc.ic.ac.uk>


Overlapping computation and MPI communication is a nice idea in theory
but is *extraordinarily* difficult to make happen in practice. There are
several reasons for this, not the least of which is that it requires
dedicated hardware, which can almost always be put to better use
elsewhere.

Consider: you've got a 16-cpu machine running a 16-process MPI job, and
one of the MPI processes calls MPI_Isend(). Where is the asynchronous
agent to make progress for you after the call returns? There's no free
lunch - you only have 16 cpus to play with. Sure, your MPI implementation
might decide to burn one/some of them "under the covers", but that has
several unpleasant drawbacks:

	1) It complicates the implementation considerably.
	2) It creates a bottleneck if there's fewer than one extra cpu
	   available per MPI process.
	3) It reduces the number of cpus available for computation.
	4) It increases the latency of your messages.
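
To put rough numbers on points 2 and 3, here is a back-of-the-envelope
sketch in Python. The function name split_cpus and the notion of a fixed
pool of "helper" cpus are hypothetical, chosen only for illustration;
this is not how any particular MPI implementation is organized.

```python
from fractions import Fraction


def split_cpus(total_cpus, helper_cpus):
    """Return (compute_cpus, helpers_per_process) for a hypothetical design
    that reserves helper_cpus as asynchronous progress agents."""
    if not 0 <= helper_cpus < total_cpus:
        raise ValueError("need at least one cpu left for computation")
    compute_cpus = total_cpus - helper_cpus
    return compute_cpus, Fraction(helper_cpus, compute_cpus)


for helpers in (1, 4, 8):
    compute, ratio = split_cpus(16, helpers)
    # Fewer than one helper per process means processes must share one.
    note = "shared helpers" if ratio < 1 else "one helper per process"
    print(f"{helpers} helper cpus -> {compute} MPI processes ({note})")
```

On the 16-cpu machine above, one helper per process is only reached by
cutting the job to 8 MPI processes; any smaller reservation leaves the
processes contending for a shared helper.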

So, given the above, MPI implementations are much more likely to simply
have the sending/receiving processes move the bits directly.
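
A toy model of what that implies for overlap. This is plain Python, not
MPI: ToyRequest, run, and the 4096-byte chunk size are hypothetical, and
real implementations may behave differently (small messages, for
instance, may complete inside the initial call). The only point is that
with no asynchronous agent, data moves only while the process itself is
inside a library call.

```python
class ToyRequest:
    """Toy stand-in for a nonblocking send. NOT real MPI: bytes move only
    while the owning process is inside test() or wait()."""

    def __init__(self, nbytes, chunk=4096):
        self.remaining = nbytes
        self.chunk = chunk

    def test(self):
        # Copy at most one chunk, then report whether the send is complete.
        self.remaining -= min(self.chunk, self.remaining)
        return self.remaining == 0

    def wait(self):
        # Finish the transfer; return how many bytes were moved in here.
        moved = self.remaining
        self.remaining = 0
        return moved


def run(nbytes, compute_steps, poll):
    """Start a send, 'compute' for compute_steps steps, then wait.
    Returns the number of bytes still left to move inside wait()."""
    req = ToyRequest(nbytes)
    for _ in range(compute_steps):
        # ... one step of application work would go here ...
        if poll:
            req.test()  # these cycles are taken from the computation
    return req.wait()


print(run(65536, 8, poll=False))  # nothing moved during the compute loop
print(run(65536, 8, poll=True))   # polling moved 8 chunks on the caller's cpu
```

Either way the copy is paid for with the calling process's own cycles:
without polling the whole message is moved inside the final wait, and
with polling the work is merely interleaved with the computation rather
than overlapped with it.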

Across a cluster the situation isn't really all that different. Yes,
you typically do have extra hardware available on the NIC, but MPI
point-to-point semantics are so baroque that it's usually highly
impractical to put enough intelligence into a NIC for it to provide
any useful degree of overlap.

--
Eric Salo      Silicon Graphics      salo@sgi.com