From: wjk@pluto.doc.ic.ac.uk (William Knottenbelt)
Newsgroups: comp.parallel.mpi
Subject: Re: Performance of overlapped communication and computation
Date: 9 Jan 1999 00:23:53 GMT
Organization: Dept. of Computing, Imperial College, University of London, UK.
Message-Id: <7767ip$qq7$1@lux.doc.ic.ac.uk>
References: <76tkr7$efc$1@lux.doc.ic.ac.uk>
    <772v8p$8q7$1@nnrp1.dejanews.com>
    <3695C224.4557861A@mufasa.informatik.uni-mannheim.de>
    <77505v$749$1@lux.doc.ic.ac.uk> <775lro$t00$1@walter-fddi.cray.com>


Hi Eric

: Overlapping computation and MPI communication is a nice idea in theory
: but is *extraordinarily* difficult to make happen in practice. 

[snip]

: Consider: you've got a 16-cpu machine running a 16-process MPI job, and
: one of the MPI processes calls MPI_Isend(). Where is the asynchronous
: agent to make progress for you after the call returns? There's no free
: lunch - you only have 16 cpus to play with. 

Thanks for your comments. But surely on a serious parallel machine you
would expect a send/receive mechanism involving DMA/interrupts and
dedicated network hardware which didn't require much CPU intervention
(can you tell I have absolutely no experience of actually trying to
build a parallel computer)? It doesn't seem to me that the CPU should
have to do anything more than copy the data into a DMA buffer and then
receive a hardware interrupt when the transfer is complete (so perhaps
the network hardware is where I think the asynchronous progress-making
agent should be). My supervisor claims to have been using
DMA/interrupts to overlap communication and computation on a
transputer many years ago, and can't understand why there should be
problems achieving the same effect now.

I have to say Jim's results on the IBM SP appear to be the best so
far - I'd be interested to see more results on the SP, to check whether
it does the transfer without much CPU overhead (by comparing the
reported tick count with non-blocking send/receive to the tick count
without any sending/receiving). If it does, then IBM would appear to
have got the right end of the stick many moons ago.
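
To make that comparison concrete, the two tick counts could be reduced
to a single relative figure. This is only a sketch in C: the function
name and the choice of a relative measure are my own assumptions, not
part of any benchmark posted in this thread.

```c
#include <assert.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Relative CPU overhead of communication, from two tick counts taken
 * for the same computation:
 *   ticks_with_comm    - ticks with non-blocking send/receive in flight
 *   ticks_compute_only - ticks without any sending/receiving
 * Doubles are used so that a slightly negative result (timer noise)
 * is representable. */
static double comm_overhead_fraction(double ticks_with_comm,
                                     double ticks_compute_only)
{
    assert(ticks_compute_only > 0.0);   /* baseline must be non-zero */
    return (ticks_with_comm - ticks_compute_only) / ticks_compute_only;
}
```

A result near 0 would suggest the transfer needed almost no CPU; a
result near 1 would mean communication cost about as many ticks as
the computation itself.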

And even if sending has to use a little CPU, I don't expect the
MPI_Isend() call to block and take as long as (or even longer than) an
MPI_Send() call (which is what is happening on the AP3000). At worst I
would expect a thread to take over the task so that the call could
return swiftly and so that computation and communication could both
proceed in parallel (albeit at reduced efficiency owing to the fact
that they need to share the same CPU). 
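
The thread hand-off I have in mind could look roughly like the C
sketch below. Everything in it is hypothetical: the sketch_* names are
invented, the memcpy() merely stands in for the real network transfer,
and it says nothing about how any actual MPI implementation works. It
assumes POSIX threads are available.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request handle; not an MPI type. */
typedef struct {
    pthread_t   worker;  /* helper thread performing the transfer     */
    const void *src;     /* caller's buffer: leave untouched till wait */
    void       *dst;     /* stand-in for the network / remote buffer  */
    size_t      len;     /* number of bytes to transfer               */
} sketch_request;

/* Runs in the helper thread; memcpy() stands in for the transfer. */
static void *sketch_transfer(void *arg)
{
    sketch_request *req = arg;
    memcpy(req->dst, req->src, req->len);
    return NULL;
}

/* Starts the transfer and returns at once. Returns 0 on success. */
static int sketch_isend(sketch_request *req, const void *src,
                        void *dst, size_t len)
{
    assert(req != NULL);
    req->src = src;
    req->dst = dst;
    req->len = len;
    return pthread_create(&req->worker, NULL, sketch_transfer, req);
}

/* Blocks until the transfer has finished. Returns 0 on success. */
static int sketch_wait(sketch_request *req)
{
    assert(req != NULL);
    return pthread_join(req->worker, NULL);
}

/* Example computation to overlap with the transfer: sum of 1..n. */
static long sketch_compute(long n)
{
    long sum = 0;
    for (long i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```

The caller would invoke sketch_isend(), carry on with sketch_compute()
while the helper thread runs, and call sketch_wait() before reusing
the buffer. On a single CPU the two still compete for cycles, but the
"send" call itself returns quickly.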

The ability to overlap communication and computation effectively is
such an important part of achieving good efficiency on large problems
involving large amounts of communication that I can't believe
designers of today's computers don't keep it in mind as a primary
goal! And the problems which need overlap are not obscure problems
either - I stumbled across the problem trying to do large-scale sparse
matrix-vector multiplication. 

Best regards
Will

P.S. Fujitsu have been very prompt and efficient in acknowledging that
there appears to be a problem with their MPI_Isend() call - they are
investigating and I will keep you up to date on developments.

