From: salo@cray.com (Eric Salo)
Newsgroups: comp.parallel.mpi
Subject: Re: Nonblocking on shared memory computers
Date: 15 Jun 1999 17:28:36 GMT
Organization: Silicon Graphics, Inc.
Message-Id: <7k62g4$qm8$1@walter-fddi.cray.com>
References: <3766832B.56340DAB@t12.lanl.gov>
Xref: ukc comp.parallel.mpi:5202


> I'm trying to develop portable code for irregular scientific
> computation on a shared memory computer (SGI's ASCI blue mountain).
> To achieve this I am using non-blocking communication and trying to
> overlap communication and computation.
>
> The problem I have is that the SGI version of MPI_ISEND and MPI_IRECV
> actually appear to block. Thus, if I comment out the work that is supposed
> to mask communication latency, the communication time actually decreases.

One problem that always crops up when trying to discuss this issue is
that nearly everyone has a slightly different definition of what it means
for something to be either blocking or non-blocking. There are definitely
situations in which our MPI_ISEND() might not return "instantaneously" to
the caller, but it should always return in finite time - as required by
the standard. And in fact this "finite time" should generally be very
brief: never more than a handful of CPU microseconds if there is no traffic,
but potentially much longer if there is work that must be done by MPI (such
as, say, moving bits around). Ditto for MPI_IRECV().

> The SGI response is pretty much that non-blocking is not implemented in
> hardware, and I should drop MPI in favor of SHMEM or OpenMP.

I don't know which SGI employee(s) said that or what their exact context
may have been, but this sounds really idiotic to me. Clearly if you are
running on an ASCI blue mountain machine, it is reasonable to conclude
that at some point you may want to also try running on *multiple* machines,
in which case neither SHMEM nor OpenMP is going to be of much help; you'll
need MPI.

> It occurs to me that MPICH must be able to get non-blocking to work without
> hardware support, for example on Beowulf type clusters.

I don't think you're talking about "non-blocking" here in the MPI sense;
I think you're equating it with "overlap of computation and communication",
which is what application programmers *really* want, as you mention at the
top of your message. You simply cannot have the latter without extra
hardware, end of story - there is no free lunch. On a single SMP such as
an Origin, all of the bits must be moved by one of the CPUs. Deferring
the transfer until sometime after MPI_ISEND() is called won't make the bits
move any more quickly; it just shifts the place where your application will
need to wait. And it is generally viewed as a poor implementation choice
to dedicate entire CPUs as data movers for MPI jobs, because most of the
time those CPUs will be idle when they could instead be doing real work as,
say, an additional MPI process.
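
To put rough numbers on that argument, here is a toy cost model. It is
only a sketch, not a measurement of any real MPI implementation: the
function names and the 10 and 4 time units are made up for illustration,
and it assumes the copy costs the same no matter when it happens.

```python
# Toy cost model, not real MPI. Each function returns the time one CPU
# spends in (isend, compute, wait) for a single message.
# All names and numbers here are hypothetical.

def eager_copy(compute, copy):
    # The calling CPU moves the bits inside the send call itself.
    return (copy, compute, 0.0)

def deferred_copy(compute, copy):
    # The send call returns at once; the same CPU moves the bits later,
    # inside the wait call.
    return (0.0, compute, copy)

def separate_mover(compute, copy):
    # Extra hardware moves the bits while the CPU computes; the CPU
    # waits only for whatever part of the copy is still outstanding.
    return (0.0, compute, max(0.0, copy - compute))

if __name__ == "__main__":
    compute, copy = 10.0, 4.0  # arbitrary time units
    for model in (eager_copy, deferred_copy, separate_mover):
        isend, work, wait = model(compute, copy)
        print(f"{model.__name__:15s} isend={isend} wait={wait} "
              f"total={isend + work + wait}")
```

With these inputs the first two models both total 14 units; the copy time
just moves from the send call to the wait call. Only the third model, the
one with extra hardware, gets down to 10.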

Beowulf clusters are slightly different, because there you do have extra
hardware in the network controller. A similar situation exists on ASCI
blue with our HIPPI (and soon GSN) controllers, and in fact we do manage
to achieve a small overlap of computation and communication using our
HIPPI bypass protocol. But even there, the CPUs generally need to get
involved so that we can keep the network pipeline full and allow roughly
equal access to the controller for all MPI processes. (This is another
area where Beowulf clusters are different, since each machine generally
has only a very few CPUs.)

--
Eric Salo      Silicon Graphics      salo@sgi.com