Newsgroups: comp.parallel
From: sakumar@magnus.acs.ohio-state.edu (Sanjay Kumar)
Subject: Dynamic load balancing/processor-farming : Summary (belated)
Organization: The Ohio State University
Date: 31 Aug 1993 09:06:02 GMT

Dear Netters,

A few days back I had posted a request for some help regarding 
load balancing/processor-farm concept. I got only a few but 
very good  reponses which I am including here. Thanks to following
for thier replies:

	"Angelo Bassanino"  <BASSANIN@odie.ee.wits.ac.za>
	A.D.Ben-Dyke@computer-science.birmingham.ac.uk
	rbs@aisb.ed.ac.uk
	vnug@phoenix.az.stratus.com (Vasu Nugala)
	artg@watson.ibm.com (Art Goldberg)


With regards,

Sanjay Kumar


=========================ORIGINAL POST=============================

 While implementing a network-based distributed genetic algorithm I
 came across  the following classic problem:

 There are q tasks to be performed on p processors. All tasks are
 indepenedent  and can be performed concurrently. Idea is to distribute 
 load subject to following conditions:

 - p << q
 - tasks take almost equal time but  not quite equal.
 - processors have variable performance rating (and being workstations
   may be shared/multitasked, but for simplicity we can assume them to be
   dedicated)
 - p does not divide q

 After this step,a small sequential part exists (to be done by the
 master task). Implementation is in familiar master-slave paradigm and 
 dynamic load  balancing based on 'pool of the tasks' or processors-farming 
 has been incorporated. So, the master task farms out the slave processes 
 and collects results from them. Then it does the sequantial part.

 My questions are:

 -What are other ways of implementing dynamic load distribution ?

 -Is it advisable to multitask the master host between a master and
  slave task. That is since most of the time master host is idle, why not 
  let it execute one of the slaves ?

 -Are there any references which describe the situation better
 (particularly the terminology of task/processor/host/process) ?

  and most importantly

 -where can I find some discussion about expected performance measures
  like speedup in heterogenous computing environment ?

 Thanks once again.

 Sanjay Kumar

 Graduate Student, Civil Engg.
 The Ohio State University

 email: skumar@magnus.acs.ohio-state.edu
 phone: 614-297-8352


=========================RESPONSES=============================

To: sakumar@magnus.acs.ohio-state.edu
From: "Angelo Bassanino"  <BASSANIN@odie.ee.wits.ac.za>
Date:         16 Aug 93 14:03:19 SAT
Subject:      Re: Dynamic load balancing/faming : Need help

Hello Sanjay,

I've been involved in parallel processing using task farming for some
time now - particularly for the parallelisation of engineering
simulation programs. I'm using tightly coupled transputers as the
processors, either in a PC plug-in card, or on the Parsyec system.
I've been using the command-line 3L parallel Pascal compiler, though
we have facilities for C++, FORTRAN and the Helios operating system.
I'd like to know of any other responses you get from the network, and
while I'm at it I'll add some comments to your message which you may
find useful.

>
> While implementing a network-based distributed genetic algorithm I
> came across
> the following classic problem:
>
> There are q tasks to be performed on p processors. All tasks are
> indepenedent
> and can be performed concurrently. Idea is to distribute load subject
> to
> following conditions:
>
> - p << q
> - tasks take almost equal time but  not quite equal.
> - processors have variable performance rating (and being workstations
> may be
>   shared/multitasked, but for simplicity we can assume them to be
> dedicated)
> - p does not divide q
>

The p << q does not HAVE to be the case - I think the potential
parallelism can be stifled a single major issue - grain size of the
worker tasks. If the workers can be made to spend >> calculation time
than communication time, the system will probably scale up quite well.
Of course if your worker tasks differ in the time they take to run to
completion, then having p of the order of q will not produce good
speedup because there will be processors sitting idle while the
processor with the more complicated task will chug away.

If (q DIV p <> 0), the last phase of the iteration will mean that not
all the processors are busy - this can be wasteful. If at all
possible, the number of processors should be increased to become a
multiple of p; if this is unreasonable, then your criteria of p << q
will ammortise the time lost in the last iteration.

>
> After this step,a small sequential part exists (to be done by the
> master
> task).
> Implementation is in familiar master-slave paradigm and dynamic load
> balancing
> based on 'pool of the tasks' or processors-farming has been
> incorporated. So,
> the master task farms out the slave processes and collects results
> from them.
> Then it does the sequantial part.
>
> My questions are:
>
> -What are other ways of implementing dynamic load distribution ?

My guess is that for your particular application, farming is probably
the best option because it's easy to port and gives pretty good
scalability over number of processors if some simple guidelines (as
you and I have described above) are taken into account.

> -Is it advisable to multitask the master host between a master and
> slave task.
>   That is since most of the time master host is idle, why not let it
> execute
>   one of the slaves ?

YES! In fact, the 3L built-in farming mechanism automatically
allocates a worker task to the Master processor. A word of warning
though, you may want to set priorities on the various Master tasks so
as not to unnecessarily delay the generation of data for the Workers
and the accumulation of results from the Workers.

> -Are there any references which describe the situation better
> (particularly
>  the terminology of task/processor/host/process) ?

I've found it difficult to find a definitive work on farming, but the
method is mentioned in most texts on parallel programming as one of
the ways in which to parallelise a program. If you're interested, I
can send you a few references.

>
>  and most importantly
>
> -where can I find some discussion about expected performance measures
>  like speedup in heterogenous computing environment ?

Sorry, can't help here, but I WOULD be interested in finding out of
you responses.

>
>
> Thanks a lot in advance. I haven't got a reponse to  my previous
> request
> for references about genetic algorithms. However many have requested
> to
> forward
> info I get. I have since found some myself, and will be sending you a
> copy.
>
> Thanks once again.
>
> Sanjay Kumar
>
> Graduate Student, Civil Engg.
> The Ohio State University
>
> email: skumar@magnus.acs.ohio-state.edu
> phone: 614-297-8352
>

Regards,
----------------------------------------------------------------------
Angelo P Bassanino                                 |  C'e`
Software Engineering Applications Laboratory       |     qualcuno
Department of Electrical Engineering,              |  che mi sa
University of the Witwatersrand                    |     dire
Private Bag 3, Johannesburg, 2000, South Africa.   |  che cos'e` ...
Tel: (27) (11) 716 5390   Fax: (27) (11) 403 1929  |
Internet: bassanin@odie.ee.wits.ac.za              |     E Ramazzotti
----------------------------------------------------------------------

=====================My reponse to Above reply==========================


Hello Angelo,

Thanks a lot for your response. The problem at hand is indeed and engineering
one for me too. It involes analysis and design though the algo can be 
used in process simulation also. I will use simulation only in following 
discussion.

>The p << q does not HAVE to be the case - I think the potential
>parallelism can be stifled a single major issue - grain size of the
>worker tasks. If the workers can be made to spend >> calculation time
>than communication time, the system will probably scale up quite well.
>of course if your worker tasks differ in the time they take to run to
>completion, then having p of the order of q will not produce good
>speedup because there will be processors sitting idle while the
>processor with the more complicated task will chug away.


>If (q DIV p <> 0), the last phase of the iteration will mean that not
>all the processors are busy - this can be wasteful. If at all
>possible, the number of processors should be increased to become a
>multiple of p; if this is unreasonable, then your criteria of p << q
>will ammortise the time lost in the last iteration.


I should have added in the original post

 -tasks are indeed large grained (engg. simulations run for long time)
 -p << q because p is in  dozens (workstations) and q maybe
   a few hundreds. In that case making q a multiple of p will not really 
  help the case, since taska re not quite equal and machines are mutitasked 
  anyway. SO the danger of some processor falling idele in final phase is 
  always there. 

  In fact, after these q simulations have been done the small sequential part 
  that refreed to was just a design step. So, A better way out for me seems to 
  be (and my algo allows for it) introducing asynchronosity (sp?) in 
  algorithm. Don't wait for all tasks to finish.
  Start with next iteration of algo (which again will involve q simulations 
  on same p slaves). In that case slave will fall idle only in the last 
  iteration of algo ( each iteration is a design iteration). No of design 
  iterations is usually 60 in my case.  So a few slaves ( < p)  will be  
  idle in 60th iteration and busy almost all the time before that (except 
  when master is doing seqential design).
	
 -On a machine like CM-5, p will be almost equal to q. In 
  fact one would make q a multiple of p or just same as p and 
  doa static load balancing. Since all tasks are not equal there will be
  slight load imbalance. But then you have large number of processor and hence
  hence should be happy.
 
>My guess is that for your particular application, farming is probably
>the best option because it's easy to port and gives pretty good
>scalability over number of processors if some simple guidelines (as
>you and I have described above) are taken into account.

Yes, I get a speedup of 10 on 11 workstations currently. Asynchronous
algo has not been implemented so far.

>> -Is it advisable to multitask the master host between a master and
>> slave task.
>>   That is since most of the time master host is idle, why not let it
>> execute
>>   one of the slaves ?

>YES! In fact, the 3L built-in farming mechanism automatically
>allocates a worker task to the Master processor. A word of warning
>though, you may want to set priorities on the various Master tasks so
>as not to unnecessarily delay the generation of data for the Workers
>and the accumulation of results from the Workers.

I remember having tried this (program was complete in spring.)
I don't know for what reason, though, timings doubled when I set the priority 
of master  to lowest possible (nice +20 on unix or was it -20 ? I am 
forgetting. Or may be I lowered the priority of slave ;-) ). I think I 
should try this again.

Anyway, its easy to explain. Since master does sequential part
all slaves are idle for a  longer time if its priorty is lowered. Whether 
that will affect execution so much needs to explored in a formal way. Since
my master does a busy wait for arrival of results from slaves, its wasting
some CPU cycles. Right now both slave and master take about 45% of CPU cycles
of master processors. May be master should be locked while waiting.

Sanjay Kumar


=========================II=============================

To: sakumar@magnus.acs.ohio-state.edu (Sanjay Kumar)
Subject: Re: Dynamic load balancing/farming : Need help
Date: Mon, 16 Aug 93 13:41:06 BST
From: A.D.Ben-Dyke@computer-science.birmingham.ac.uk
Status: RO


>What are other ways of implementing dynamic load distribution ?

Well, a couple of alternatives are:

1. If you can "rate" each workstation, then probably the simplest
scheme is to statically allocate tasks to machines - this is
an attractive option as you have no need for additional runtime
overhead, and the comms overhead is negligable - however, it is
inflexible and may have to be remapped each time you alter the program
parameters. 

2. A combination of static and dynamic mapping - using a simple
diffusion system (process stealing by "hungry" processors) to 
take the slack from the static mapping. 

3.  (farming) If q is vey large then it may be worth having several
arbitarty partitions, with a dedicated master for each set.  This will
reduce the comms overhead, but will make final coordination a little
more tricky.

-Is it advisable to multitask the master host between a master and slave task.
  That is since most of the time master host is idle, why not let it execute
  one of the slaves ?

It really depends on your slave to master ratio  - if it is relatively
high then the master will be active for a significant amount of time.
Try a couple of different configs and see which is the best for your
particular application.


-Are there any references which describe the situation better (particularly
 the terminology of task/processor/host/process) ?

I've included a couple of refs at the end of this - they cover the
concept in a more formal way (I.m not claiming they're better than
your own description - horses for courses and all that).

-where can I find some discussion about expected performance measures
 like speedup in heterogenous computing environment ? 

Sorry, I can't help with this one - I'm strictly a homogenous sort of 
guy :)

Hope this hasn't wasted your time,

		Andy.

P.S. All the following papers are available via ftp - the sites and
directories should be listed with each entry.

@InProceedings{Kelly:Skeletons,
  author =	{J. Darlington and A. J. Field and P. G. Harrison
		 and P. H. J. Kelly and D. W. N. Sharp and Q. Wu},
  title =	{Parallel Programming Using Skeleton Functions},
  institution =	{Imperial College, London},
  booktitle =	{PARLE},
  year =	{1993},
  email =	{jd, ajf, pgh, phjk, dwns, wq@doc.ic.ac.uk},
  ftp =		{santos.doc.ic.ac.uk:/pub/papers/P.Kelly/
		 SkeletonsParle93.ps.Z},
  annote =	{Describes the advantages of skeletons, identifying
		 the following three components: skeletons
		 (declarative meaning coupled with a specific
		 behaviour on a set of machines); performance model
		 (used to guide the decision making  process at all
		 levels); program transformation (for transforming 
		 between skeletons for systems that don't support
		 efficient implementations of certain models).  The
		 skeletons discussed are pipes, farms, ramps (map then
		 fold), DC, and dynamic programming.  A couple of 
		 example applications are given, but none that require
		 multiple skeletons.  The output of the performance
		 model can be used to decided the level of division
		 in the DC model, granule size etc.  A version has
		 been developed for Hope+ running on the meiko surface
		 - no results are given, but it is suggested that 
		 performance is satisfactory (the interaction with 
		 Hope+'s lazy eval scheme is not mentioned).  Future
		 work includes the development of application specific
		 skeletons (leading to the notion of languageless 
		 programming, based on a visual style of construction
		 (with solid modelling and data bases being the intial
		 target areas).},
  got =		{yes},

}

	
@Article{P3L:Skeletons, 
  author = 	{M. Danelutto and R. Di Meglio and S. Orlando and S.
		Pelagatti and M. Vanneschi},
  title = 	{A Methodology for the Development and the Support of
		 Massively Parallel Programs}, 
  institution =	{Universit\`{a} di Pisa and Hewlett Packard 
		 Laboratories, Italy},
  journal = 	{Future Generation Computer Systems},
  volume = 	{8},
  pages = 	{205--220},
  month = 	{August},
  year = 	{1992},
  email =	{marcod, susanna@di.unipi.it or sp@dcs.ed.ac.uk},
  ftp =		{ftp.di.unipi.it:/pub/Papers/danelutto/florence.ps.Z},
  annote =	{Describes the P3L system based on the skeletons
		 approach (noting that the majority of parallel
		 programs use a restricted set of patterns, resulting
		 in highly regular process graphs).  There are two
		 types of code: true algorithmic and machine dependent
		 (not assembler, but covering concepts such as number
		 of processors and the network toplogies etc).  The
		 machine dependent code is the problem, resulting in
		 difficulty of construction, porting.  Also, as the
		 two  types are intermixed there are problems
		 recognising machine dependent code which could be 
		 optimised for specific cases.  P3L is built from 
		 sequential parts (C++) and a set of constructors for
		 the exploitation of parallelism (including farms
		 (pure, dedicated and MISD), pipes, data parallelism,
		 loops,  trees and geometric).  The overall structure
		 of such a program is a hierarchical composition of
		 constructors which is strictlt functional, enabling
		 formal techniques to prove and transform such 
		 programs.  The main advantages of the skeleton
		 approach are listed as 1. programmer is not forced
		 to take into account machine depednent features such
		 as no processors, so easing programming and improving
		 portability; 2. regular application structure can be
		 used by the compiler to efficiently map the program
		 onto different MIMD architectures.  Normall, the
		 mapping and load balancing issues are NP hard, but
		 by using a restricted set of constructors, whose 
		 performance properties can be modelled, efficient 
		 algorithms (or at least heuristics) can be developed.
		 The various components are then outlined, and a 
		 server application described which is a farm F1 of
		 (farm F2 of (sequential S1) and farm F3 of (pipeline
		 P1)).  The code is compiled into an intermediate 
		 language which provides costed explicit functions.
		 Using this info the mapper chooses from a library
		 of weighted schemas and selects the most appropriate
		 (an abstract machine is used to provide certain
		 info).  To facilitate the combine phase as library
		 of transformations is supplied to enable different
		 layouts (with same cost) to be tried.  Load balancing
		 decisions may the result in the merger of some of the
		 graph into single node as well as other optimisations
		 (such as identifying the heavy weight processes and
		 assigning a large slice of system resource to
		 them).},
  got =		{yes},

}
@InProceedings{HOF:Skeletons,
  author =	{Tore A. Bratvold},
  title =	{Determining Useful Parallelism in Higher Order
		 Functions}, 
  institution =	{Heriot-Watt University, Edinburgh},
  booktitle =	{4th International Workshop on the Parallel
		 Implementation of Functional Languages},
  year =	{1992},
  editor =	{H. Kuchen and R. Loogen},
  email =	{tore@cs.hw.ac.uk},
  ftp =		{ftp.informatik.rwth-aachen.de:/pub/reports/1992/92-19.dir/
		 92-19-04.ps.Z},
  annote =	{The motivation behind the skeleton approach is to
		 achieve high performance without having to explicitly
		 allocate low level machine resources.  Also
		 portability at the source level is achievable (needs
		 a notion of transformations and resource maqpping to
		 fully realise this).  The aim is not to exploit ALL
		 parallelism within an algorith - just that that is
		 known to be efficiently exploitable.  The described
		 system uses a pure subset of strict SML, running on a
		 Meiko Computing Surface.  The compilation system
		 makes use of two sets - a set of patterns defined
		 for the target architectures (along withe associated
		 performance models), and the set of HOF's
		 corrsponding to the patterns.  The actual process
		 itself is divided into 5 steps: lexical and typing,
		 HOF id and extraction, determining useful
		 parallelism, mapping+load balancing+opto and code
		 production.  The id phase is rather simple, relying
		 on the explicit appearance of the name in the 
		 source.  Useful parallelism is defined to be the
		 subset of ideal/potential parallelism which, when
		 exploited on a given architecture will yield a 
		 positive absolute speedup.  It is therefore essential
		 to determine the degree of useful parallelism
		 represented by the application of the HOF's to the
		 arguments.  Three example HOF's, all manipulating 
		 multisets (bags) are then examined in this context.
		 For farms, the time is modelled by the time to
		 distribute a task and the time to process an element.
		 The comms time is relatively easy to model, but the
		 computation needs the following paramemters to be
		 determined:  average, minimum, max and the standard
		 deviation as well as distribution (how the average
		 varies over the input) and volume (the amount of data
		 to be processed).  Three startegies for acquiring this info
		 are considered:  static analysis, profiling and user
		 annotation.  Profiling is selected on the grounds
		 that static analysis won't be able to handle the 
		 general case, and users will not be able to estimate
		 the low level performance of their code.  Similar
		 reasonning is applied to map and filter, with various
		 performance graphs being given for different configs
		 of the basic template.  Next, the integration of f(x)
		 is used as an example, with comments being given at
		 each stage of the compilation (including data
		 distribution figures).  The solution is based upon
		 2 farms of map2 and filter ops on 100 element slices.
		 Load balancing and mapping is achieved by calculating
		 the ideal ratio between the 2 elements map2 and
		 filter, which defines how the solution would scale.  
		 The actual performance gives approximately linear
		 speedup, and only deviates by a couple of percent
		 from the calculated behaviour.},
  got =		{yes},

}


=========================III=============================

From: rbs@aisb.ed.ac.uk
Date: Sun, 15 Aug 93 23:58:33 BST
Message-Id: <19734.9308152258@mink.aisb.ed.ac.uk>
To: sakumar@magnus.acs.ohio-state.edu (Sanjay Kumar)
In-Reply-To: sakumar@magnus.acs.ohio-state.edu's message of 7 Aug 1993 21:42:02 GM
Subject: Dynamic load balancing/farming : Need help
Reply-To: rbs@aisb.ed.ac.uk
Status: RO


What do you mean by implementing?
Can I assume that you have a way of distributing tasks already?

You don't tell enough about the problem to help you specifically but I'll try.

If all the tasks are available before computation starts then it is best to
partition them up for execution on the various processors.
Fast processors would obviously have proportionally more tasks than slower processors.

If you create tasks dynamically as the program executes.
Then the best way would be for processors to ask around other processors when they 
run out of tasks.
This is very effective with a recursive divide-and-conquer algorithm.

If the problem is not recursive (which you seem to suggest) then a 
task farm would be a good idea.
A master process generates the tasks.
Each processor works on a task but when it becomes idle it asks the master for a new task.

All the above approaches are automatically load balancing.
You will get best results with losts of big grained tasks that have very sparse interaction.
This is because you want to minimize processors idle time, so big tasks mean the processors
will spend most of their time processing and not communicating or idle.
Also you want to minimize message traffic or the network near the master will
be congested and will kill performance.

   -Is it advisable to multitask the master host between a master and slave task.
     That is since most of the time master host is idle, why not let it execute
     one of the slaves ?

You have answered your own question.
If the master is mostly idle then it should be used as a slave as well
otherwise you are wasting it.
But it depends on how many processors you are going to use.
The more processors the more the master will be in demand.

   -Are there any references which describe the situation better (particularly
    the terminology of task/processor/host/process) ?

   -where can I find some discussion about expected performance measures
    like speedup in heterogenous computing environment ? 

I don't know.
If you get some responses on this please mail a list to the net.
I think it would be of interest to many people.

Rob Scott
Dept of AI                                    email: rbs@aisb.ed.ac.uk
Univ of Edinburgh                               Tel: 031-650-2713
Edinburgh, SCOTLAND, EH1 1HN


=========================IV=============================

Date: Mon, 16 Aug 1993 16:35:48 -0400
From: artg@watson.ibm.com (Art Goldberg)
Message-Id: <9308162035.AA39744@hobiecat.watson.ibm.com>
To: sakumar@magnus.acs.ohio-state.edu (Sanjay Kumar)
Subject: Re: Dynamic load balancing/farming : Need help
Status: RO

Randy Nelson, who works here, has done substantial work on queuing
models of multi-queue parallel systems.


=========================V=============================

From: vnug@phoenix.az.stratus.com (Vasu Nugala)
To: sakumar@magnus.acs.ohio-state.edu (Sanjay Kumar)
Subject: Re: Dynamic load balancing/farming : Need help

Hi Sanjay:

  I don't know whether you looked Linda implementation details. I think that 
  is exactly (concept-wise) what you are looking for. In Linda, there exists 
  a tuple space, where the tasks are loaded and distributed dynamically.

  IMHO, I think you are better off by looking thru Linda references. Good luck.

  -vasu

-----------------------------------------------------------------