From: "Alain Coetmeur" <alain.coetmeur@icdc.caissedesdepots.fr>
Newsgroups: comp.parallel.mpi
Subject: Re: Lack of Performance under WINNT 4.0
Date: Tue, 10 Nov 1998 18:59:40 +0100
Organization: Informatique-CDC
Message-Id: <729v3j$r8f1@puligny.idt.cdc.fr>
References: <364879BE.5B2C8837@lhm.mw.tu-muenchen.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8bit


Martin F. Schuster wrote in message <364879BE.5B2C8837@lhm.mw.tu-muenchen.de>...
>Hello everyone,
>
>I have implemented WMPI v1.01 under WINNT 4.0/DIGITAL VISUAL FORTRAN 5.0
>in my CFD CODE. This release is from a server in Portugal,
>which I can't trace back anymore.

See also:
http://mazsola.iit.uni-miskolc.hu/stuff/doc/mpi/wmpi/

Possibly related Win32 MPI:
http://dsg.dei.uc.pt/w32mpi/intro.html
Patent MPI 4.0
http://www.genias.de/products/patent/

>
>The problem I am working on contains
>multiple block connections, which total up to a amount of app. 40 block
>interfaces. The data rate is quite high. Running my programm exhibited
>several unpleasant problems listed below:
How high? Compare it with the available network bandwidth...

>
>1. The programm runs just fine running on a single or a dual processor
>PC . The data exchange seems acceptable.
WMPI may use shared memory between processes on the same machine,
if it is well programmed... and I think it is.

>As soon as distribute my
>problem
> on several PC units, the data exchange has got to be performed via our
>network (own CISCO 5000 switch, 2 100Bit hubs, 2 10 bit hubs). This

About the maximum network bandwidth:
on classic shared (repeater/hub) Ethernet, the usable throughput is about 30% of
the nominal bandwidth, and that total is shared by all the stations.
The network card limits the bandwidth of each individual machine anyway.

On switched Ethernet each link is separate. You can even configure
it to be full duplex instead of half duplex.
I have obtained 5 MByte/s on a TCP/IP FTP connection between 2 NT stations
on a full-duplex 100 Mb/s Xylan switch.
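
To put these figures side by side, here is a small sketch of the arithmetic
in Python. The function names are mine, and the 30% figure is only the rule
of thumb quoted above, not a measurement of your network.

```python
def mbit_to_mbyte_per_s(mbit_per_s):
    """Convert a nominal link speed in Mb/s to MByte/s (8 bits per byte)."""
    return mbit_per_s / 8.0


def usable_shared_bandwidth(link_mbit_per_s, utilization=0.30):
    """Rough usable throughput (MByte/s) of a shared (hub) Ethernet segment.

    utilization=0.30 is the rule of thumb from the text; it is an assumption.
    The result is shared by every station on the segment.
    """
    return mbit_to_mbyte_per_s(link_mbit_per_s) * utilization


def link_utilization(measured_mbyte_per_s, link_mbit_per_s):
    """Fraction of the nominal link speed that a measured transfer reached."""
    return measured_mbyte_per_s / mbit_to_mbyte_per_s(link_mbit_per_s)


if __name__ == "__main__":
    # 100 Mb/s hub and 10 Mb/s hub, total for all stations together
    print(usable_shared_bandwidth(100), "MByte/s usable on a 100 Mb/s hub")
    print(usable_shared_bandwidth(10), "MByte/s usable on a 10 Mb/s hub")
    # The 5 MByte/s FTP transfer on a full-duplex 100 Mb/s switched link
    print(link_utilization(5, 100), "of the nominal link speed")
```

If the data rate of the CFD code per step is close to these numbers, the
network is the likely bottleneck.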

>however,
>slows down the computational performance by factor 10, even though I
>have
>distributed it on e.g. 6 instead of 2 CPUs.

You seem to be communication bound
(bandwidth or latency).

For a good performance model, look at the BSP model.
This model defines performance
assuming you work in steps, during which you compute and
between which you send data, synchronize and gather data.
You can then evaluate the performance of your program from
3 numbers:
1- number of flops per step (MFlop)
2- number of synchronizations, i.e. of steps (Step/s)
3- maximum data volume into or out of any one node between steps (MB)

Then, for a given network architecture, you can
estimate:
latency (cost of a sync)
bandwidth (cost of exchanging bytes)
flop duration (cost of computing)

To estimate these, measure the flop rate
with a classic CPU benchmark,
the latency with an MPI short-message ping benchmark,
and the bandwidth with a large-block ping benchmark...
All you need is an order of magnitude (+/- 50%; don't hope to be precise 8)
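
As a rough sketch of how the three program numbers and the three machine
numbers combine, assuming the usual BSP form (step time = compute time +
exchange time + one sync). The function names and the example figures are
made up for illustration; put in your own measurements.

```python
def bsp_step_time(mflop_per_step, mbyte_per_step,
                  mflops, mbyte_per_s, latency_s):
    """Estimated wall time of one BSP step, in seconds.

    mflop_per_step : work done by the busiest node in one step (MFlop)
    mbyte_per_step : largest volume into or out of one node per step (MB)
    mflops         : measured CPU speed (MFlop/s)
    mbyte_per_s    : measured large-block bandwidth (MByte/s)
    latency_s      : measured cost of one synchronization (s)
    """
    compute = mflop_per_step / mflops        # cost of computing
    exchange = mbyte_per_step / mbyte_per_s  # cost of exchanging bytes
    return compute + exchange + latency_s    # plus the cost of one sync


def communication_fraction(mflop_per_step, mbyte_per_step,
                           mflops, mbyte_per_s, latency_s):
    """Share of the step time that is not spent computing."""
    total = bsp_step_time(mflop_per_step, mbyte_per_step,
                          mflops, mbyte_per_s, latency_s)
    return 1.0 - (mflop_per_step / mflops) / total


if __name__ == "__main__":
    # Hypothetical figures: 10 MFlop and 1 MB per step,
    # 50 MFlop/s CPU, 5 MByte/s network, 1 ms per sync.
    print(bsp_step_time(10, 1, 50, 5, 0.001))
    print(communication_fraction(10, 1, 50, 5, 0.001))
```

When the communication fraction gets well above one half, adding
processors is unlikely to help.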

With this model you can then
estimate how far to limit the number
of blocks in your CFD model
in order to limit communications...

Look at:
http://gruffle.comlab.ox.ac.uk/oucl/oxpara/bsp/bspmodel.htm

They propose a library, but you don't need to (should not?) use it.
All you need is to work out a BSP program equivalent to yours...

>
>2. Therefore I have made some assumtions, which might be related to
>the occured problem and I kindly ask you for your feedback:
>
>- we are using the TCP/IP protocoll under WINNT - is it possible that
>    data colissions occur? - is there a better way to go?

If you are using a shared (repeater) Ethernet, that is the kind of Ethernet that has collisions...
If you are using switches, then there are no collisions... Try to configure the links as full duplex
if possible. On half duplex, the return (acknowledge) messages of TCP lock
the channel when the sender wants to send... With full duplex, everything flows more smoothly.

Anyway, I think WMPI uses UDP, but it has acknowledgements anyway,
with the same problems.

>
>-  dependent on the data rate, mpi errors occur: obviously there's a
>     limit for the maximum rate of data transmission - might this be
>related
>     to the PC's network card ? Is there a buffer size to be extended ?


>
>- is the mpi implementation I am using obsolete? Somehow it seems to be
>    flawed or bugged. Where can I get a newer release?

Maybe... check the web addresses I gave above.

>
>- I detected the folowwing bug: Under certain conditions the rang order
>on
>    a remote machine does not necesserily follow the order of the .pg
>file.
>    It seemd that processes, requireing less storage resources where
>favoured
>    and were assigend a lower rank number -that was really strange and
>    is of course fatal in my case

Is that order guaranteed by the MPI standard? If not, the MPI library may
try to optimize... In fact I don't understand the problem, but
check whether your assumption (rank related to host) is legal.
Maybe MPI tries the faster hosts first and then the slower ones.
Maybe WMPI tries the slow-to-respond hosts in a second round, after the fast-responding ones...

>
>3. The lack of performance is significant and leads the whole procedure
>     ad adsurdum. Therefore I need a better solution - are there
>advanced
>    concepts. Is my problem known?


Limit the number of blocks...
You are basically discovering the thrashing zone of
the speedup versus number of processors curve...

Any algorithm has such a thrashing zone...
Finding a good, efficient operating point is the Art of Parallelism.
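
To illustrate the shape of that curve, here is a toy model in Python. It
assumes (my assumption, not a measurement) that the compute time divides
evenly among p processors while the communication cost per step grows
linearly with each extra node, as it roughly would on a shared network.
The names and numbers are hypothetical.

```python
def step_time(p, compute_s, comm_s_per_extra_node):
    """Toy step time on p processors.

    compute_s             : time of one step on a single processor (s)
    comm_s_per_extra_node : assumed communication cost added by each
                            processor beyond the first (s)
    """
    return compute_s / p + (p - 1) * comm_s_per_extra_node


def speedup(p, compute_s, comm_s_per_extra_node):
    """Speedup of p processors relative to one processor."""
    return (step_time(1, compute_s, comm_s_per_extra_node)
            / step_time(p, compute_s, comm_s_per_extra_node))


def best_processor_count(max_p, compute_s, comm_s_per_extra_node):
    """Processor count in 1..max_p with the highest speedup in this model."""
    return max(range(1, max_p + 1),
               key=lambda p: speedup(p, compute_s, comm_s_per_extra_node))


if __name__ == "__main__":
    # Hypothetical: 1 s of compute per step, 40 ms of communication
    # added per extra node.
    for p in (1, 2, 4, 5, 6, 8, 16):
        print(p, round(speedup(p, 1.0, 0.04), 2))
    print("best:", best_processor_count(16, 1.0, 0.04))
```

Past the best point the speedup falls again; with a large enough
communication cost it drops below 1, which is the slowdown you observe.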

>
>Since I am not exactly king of the hill in the field of parallel
>computing due
>to missing experience, I'll be greatful for every hint.
>
With such hard problems to solve,
you soon will be...