From: wjk@pluto.doc.ic.ac.uk (William Knottenbelt)
Newsgroups: comp.parallel.mpi
Subject: Re: Performance of overlapped communication and computation
Date: 6 Jan 1999 17:15:52 GMT
Organization: Dept. of Computing, Imperial College, University of London, UK.
Message-Id: <7705o8$jf$1@lux.doc.ic.ac.uk>
References: <76tkr7$efc$1@lux.doc.ic.ac.uk>
    <76tn46$ni3$1@pegasus.csx.cam.ac.uk>


Hi all

: >Is it reasonable to expect the communication to occur during the 
: >computation - or am I expecting too much? Or is it supposed to happen
: >but I have a lousy MPI implementation? I'd very much appreciate
: >any comments, and I attach my code below in case I'm doing something 
: >silly (or if you want to test it on your own machine/MPI version since 
: >I'd be very interested in the results)! 

: It is reasonable but unrealistic.  MPI communication is not simple,
: and cannot usually be done solely by DMA even when that hardware
: exists - it needs a CPU to reorganise the data, control the transfer
: and so on.  So almost all MPI implementations on machines with one
: processor per node are effectively "always blocking", and most are
: even when there is more than one processor per node.

That's a pity because I would have thought the I/O would be interrupt
driven - how does anyone do efficient parallel matrix-vector
multiplication (for example) if this doesn't work? Has anyone out
there actually achieved a successful computation/communication
overlap?

I have collected some actual statistics for the algorithm outlined in
my previous message, i.e.:

    start timing
    do for 30 iterations:
       barrier
       start a non-blocking receive of a vector from the other processor
       start a non-blocking send to transmit a vector to the other processor
       while 2 seconds have not elapsed (much more time than needed to 
                                         transmit the vector)
         do "computation"
       wait for completion of non-blocking send/receive
    stop timing
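
To make the expected outcome concrete, here is a minimal sketch (plain
Python arithmetic, not the benchmark code itself) of the total run time
the loop above should give in the two extreme cases. The function name
predicted_totals and its parameters are hypothetical; the inputs are the
30 iterations, the 2 second "computation" and the transfer time of about
0.4 seconds quoted in this post.

```python
def predicted_totals(iterations, compute_s, transfer_s):
    """Return (full_overlap, no_overlap) wall-clock predictions in seconds."""
    # Full overlap: the transfer hides behind the computation, so each
    # iteration costs whichever of the two takes longer.
    full_overlap = iterations * max(compute_s, transfer_s)
    # No overlap: the transfer only makes progress inside the final wait,
    # so each iteration pays for the computation and the transfer in turn.
    no_overlap = iterations * (compute_s + transfer_s)
    return full_overlap, no_overlap


# Figures taken from this post; the model itself is an assumption.
full, serial = predicted_totals(iterations=30, compute_s=2.0, transfer_s=0.4)
print(f"full overlap: {full:.1f} s, no overlap: {serial:.1f} s")
```

Under this simple model the total should sit near 60 seconds if the
transfer really proceeds during the "computation", and near 72 seconds
if it does not.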

Here are the stats for sending a 12MB vector between 2 nodes (transfer
time approximately 0.4 seconds, 30 iterations):

                                          average waiting    total time (s)
                                          time per           (ideal: 60 s)
                                          iteration (s)

no sending/receiving                      1.27077e-05        60.3553

blocking send/blocking receive            0.394289           72.1096
blocking combined send-receive            0.328132           70.2531
(before "computation")

non-blocking receive/non-blocking send    0.38673            71.7682
(supposed to happen during "computation")

As you can see, it hardly matters whether you use non-blocking or
blocking sends. Can some MPI implementors (or anyone else!) *please*
tell me if this is usual behaviour? If it is, then why do textbooks
talk about the wonders of non-blocking sending/receiving if it's
actually of no practical benefit?
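
As a cross-check on the figures in the table, the following sketch
(plain Python; the variable and function names are hypothetical)
compares the extra time each run took per iteration, relative to the
"no sending/receiving" baseline, with the average waiting time measured
for that run.

```python
ITERATIONS = 30
BASELINE_TOTAL = 60.3553  # total for the "no sending/receiving" run, seconds

# (measured average wait per iteration, measured total time), in seconds
runs = {
    "blocking send/recv": (0.394289, 72.1096),
    "blocking sendrecv": (0.328132, 70.2531),
    "non-blocking send/recv": (0.38673, 71.7682),
}


def extra_per_iteration(total_s):
    """Time per iteration beyond the communication-free baseline."""
    return (total_s - BASELINE_TOTAL) / ITERATIONS


for name, (wait_s, total_s) in runs.items():
    extra_s = extra_per_iteration(total_s)
    print(f"{name:24s} wait {wait_s:.3f} s   extra {extra_s:.3f} s")
```

In every case the extra time per iteration agrees with the measured
waiting time to within about 0.01 seconds, which is consistent with
essentially the whole transfer taking place inside the wait rather than
during the "computation".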

Thanks
--
William Knottenbelt
Department of Computing
Imperial College
180 Queens Gate
South Kensington
LONDON SW7 2BZ