From: Reinhard Foessmeier <reinhard.foessmeier@munich.ixos.de>
Subject: Amdahl's law and superlinear speedup (summary of answers)
Date: Thu, 27 Oct 94 14:42:17 MET
Organisation: iXOS Software GmbH

A few days ago, I posted a query about Amdahl's law to comp.parallel.
The following is a collection of the answers I got.
I am sending this to everybody who kindly reacted to my query,
or who asked about possible results.

I am going to post a more concise summary to comp.parallel.
I don't think that the full version would pass the moderator.
Especially I shall have to drop most of A. D. Ben-Dyke's
interesting material, since it has been posted before. All of
you, to whom this message is directed, are getting the full material.

Some of my compatriots kindly answered in my native language
(German), which I gratefully acknowledge.  In these cases
I include a short resume in English (or something similar); I hope
I did nobody wrong but I am not sure I always got the full meaning.

To summarize, I think Thomas Henties gave the best (and most concise)
analysis of the problem when he said (my translation):

> In principle, superlinear speedup always can be reduced to
> a situation where the parallel case involves more memory (in any form
> whatsoever).

In the case I quoted, this extra memory is constituted by the second
processor's extra index register (or analogon) it uses to traverse
the second linear list. If the monoprocessor had such a second index
register, it could use it to avoid any context switching.

Thanks to all who answered.

Reinhard Foessmeier    <Reinhard.Foessmeier.ixos.de>
------------------------------------------------------------

From: "Theodore C. Belding" <streak@engin.umich.edu>

> Ronald Shonkweiler wrote several papers on this in connection with
> parallelizing genetic algorithms.  His papers are available by
> ftp at 128.61.5.59 (that site will be moving sometime soon to 128.61.5.35).
> Hope that helps.
> -Ted


From: kuechlin@informatik.uni-tuebingen.de (Prof. W. Kuechlin)

> Es ist eigentlich ganz einfach:
> superlinear speed-up ist nur m"oglich, wenn das parallele programm P
> weniger Arbeit verrichtet als das sequentielle Programm S.
> Wenn die Arbeit gleich ist, gibt es maximal linearen speed-up.
> 
> Etwas undurchsichtig wird das Ganze nur, weil man die _ganze_ Arbeit,
> also den ganzen zur Laufzeit abgearbeiteten Instruktionsstrom, betrachten muss,
> inklusive der Teile, die das Betriebssystem und etwaige cache-hardware etc
> ausf"uhren. Rein das Quell programm kann dabei die z.B. gleiche Arbeit tun.
> Hier kommen also Kontextwechsel, Speicherallokation, Paging etc zum Tragen.
> 
> Auch in der Informatik gilt: Es gibt [leider] keine Wunder, nur Arbeit ...
> 
> 	-wk.

Resume: organisational overhead, such as context switches, memory allocation,
        paging make things less transparent. Basically there are no miracles.

> From: eisen@cc.gatech.edu (Greg Eisenhauer)

> I don't have any references for you.  I spent some time looking for
> such references some time ago (in particular I was looking for literature
> specifically rebutting this type of claim) and came up empty.  However 
> I don't think that in general Amdahls law can be violated here.  This sort of
> situation  pops up in genetic algorithms where the probability of spending huge
> amounts of time exploring deadends drops dramatically the more processors
> you use.  You can always emulate those N processors on a single processor.
> (As I infer from your note that you know...)  Yes, the single processor will do
> more context switching, but that's no more interesting than the situations
> where cache sizes play a role.  If context switch overhead were zero for the
> uniprocessor it is at most N times slower than the multiprocessor.  By
> increasing the time between context switches one can reduce the context switch
> overhead to negligible values.  Large time slices may skew your emulation of
> the multiprocessor somewhat, but it's all the result of non-idealized
> overheads.  No fundamental truths change.
> 

From: Stefan Haenssgen <haensgen@ira.uka.de>
> 
> Hm, superlinearen Speedup kriegt man auch z.B. bei Suchproblemen, wo
> der Suchbaum auf n Prozessoren verteilt wird und man unter Umstaenden
> bei Hinzufuegen eines weiteren Prozessors (zufaellig) die Loesung
> durch die neue Verteilung sehr viel schneller findet, weil der Baum
> eben anders abgearbeitet wird. Keine Magie, aber ein dicker Sprung
> von n nach n+1.
> 
Resume: Superlinear speedup also arises in search problems where an
        an extra processor might accidentally hit the solution sooner
        due to a different search strategy.


From: A.D.Ben-Dyke@computer-science.birmingham.ac.uk
> 
> 	This subject was briefly discussed about a year ago -- I'm
> sending you all the stuff I archived away -- I hope it helps
> 
> Cheers,
> 
> 	Andy.
> 
> ----------------------------------------------------------------------
> Andy Ben-Dyke,
> School of Computer Science,	
> University of Birmingham,	Email:	   a.d.ben-dyke@cs.bham.ac.uk
> B15 2TT, England.		Tel: 	+44-(0)21 414 3734 (Fax: 4821)
> 
> 
> -----Start of Include----
> From: jss@pebbles.jpl.nasa.gov (Jeff Steinman)
> Newsgroups: comp.parallel
> Subject: Superlinear Speedup
> Date: 18 Feb 1994 13:16:59 -0500
> Organization: Yale University, Department of Computer Science, New Haven, CT
> 
> Hi, One of the classic examples of superlinear speedup is found in event list
> management for parallel discrete event simulations. When running on one node,
> all of the time-tagged events are maintained in a sorted event queue with
> hopefully log2(n) performance. When running in parallel, each node has on the
> average n/N events (where N is the number of nodes). Each local event queue
> now has (on the average) log2(n/N) performance and because each of the nodes
> are using their event queues in parallel, the overall event list management
> goes like (1/N)log2(n/N) which has superlinear speedup. Some have thought that
> this is mysterious and that the laws of common sense in parallel computing
> are violated. Of course this is not the case. The superlinear speedup is not
> caused by some dumb data structure either.
> 
> Here is the explaination. When running sequentially, events are processed in
> correct time order. When running in parallel, you have N streams of events
> processed in time order. Suppose that you want to record the sequence of
> events that were processed in global time order. You would get that for free
> in the sequential case because all events are processed in time order. However,
> in the parallel case you must merge the streams of locally processed events
> back together. You might ask, "What is the work in merging these lists of
> processed events back together?" In other words, how much work is involved
> in merging N sorted lists into one sorted list. The answer is that is exactly
> cancels the superlinear speedup term found above. There is no contradiction
> to the notion of superlinear speedup. It is really a matter of the fact that
> the sequential simulation has more information (i.e., globally knowing
> the global sequence of processed events). Usually nobody cares to merge N
> lists of processed events back together so the superlinear speedup remains
> a part of parallel discrete-event simulations (although the effect is usually
> too small to detect if good data structures are used). On the other hand,
> maybe people doing analysis should really merge their event streams back
> together for analysis purposes (i.e., so that you would have one sorted
> output file to analyze instead of one file for each node, or one unsorted
> mixed up file).
> 
> Jeff Steinman
> 
> P.S. If you want more information on what I have just desribed, I can either
> send you a paper I have written on this, or you can obtain it yourself
> (which ever is more convenient).
> 
> Steinman, Jeff 1992, "SPEEDES: A Multiple-Synchronization Environment for
> Parallel Discrete-Event Simulation." The International Journal for Computer
> Simulation, Vol. 2, No. 3, Pages 241-286.
> 
> 
> From: sundu@cvp1.sun_dad.ARPA (HARISH)
> Subject: superlinear speedup algorithms ?
> Date: 16 Feb 1994 17:51:30 -0500
> Organization: Design Automation Division, Texas Instruments.
> 
> Hi,
>   We are looking for some literature/software on parallell algorithms for any problem
>   which exhibits super linear speedup ( speedup > no. of processors ) ? Any help will
>   be highly appreciated.
> 
> From: keryell@ensmp.fr (Ronan KERYELL)
> Subject: Re: superlinear speedup algorithms ?
> Date: 18 Feb 1994 13:21:40 -0500
> Organization: E'cole Nationale Supe'rieure des Mines de Paris,
>  Centre de Recherche en Informatique (CRI-ENSMP)
> 
> >From sundu@cvp1.sun_dad.ARPA (HARISH)
> 
>   We are looking for some literature/software  on parallell algorithms
> for any problem which exhibits super linear speedup ( speedup > no. of
> processors ) ? Any help will be highly appreciated.
>   
> Look at for example :
> 
> @conference{loi_Amdahl,
> 	title = {Validity of the Single Processor Approach to Achieving Large
> 	Scale Computing Capabilities\indexbib{loi!d'Amdahl@d'\fsc{Amdahl}
> 	}\indexbib{performance!
> 	machine scalaire contre machine paralle`le}\indexbib{e'valuation!
> 	des performances!a` taille de du proble`me fixe}},
> 	author = {Gene M. Amdahl},
> 	booktitle = {Proceedings of the Spring Joint Computer Conference},
> 	year = 1967,
> 	editor = {AFIPS},
> 	pages = {483-485},
> 	}
> 
> 
> 
> @article{reeval_amdahl,
> 	title = {Reevaluating Amdahl's Law
> 		\indexbib{e'valuation!des performances!re'e'valuation de la loi
> 		d'Amdahl}\indexbib{e'valuation!des performances!a` temps d'exe'cution
> 		fixe}},
> 	author = {John L. Gustafson},
> 	journal = cacm,
> 	pages = "532-533",
> 	volume = 31,
> 	number = 5,
> 	month = may,
> 	year = 1988
> }
> 
> @article{fixed_time_benchmark,
> 	title = {The Design of a Scalable, Fixed-Time Computer Benchmark
> 		\indexbib{e'valuation!des performances!a` temps d'exe'cution
> 		fixe}\indexbib{benchmark@{\em
> 		benchmark}!a` temps  constant}\indexbib{benchmark@{\em
> 		benchmark}!SLALOM}\indexbib{e'valuation!des performances!a` temps
> 		constant}\indexbib{superline'arite'!possible}},
> 	author = {John Gustafson and Diane Rover and Stephe Elbert
> 				and Michael Carter},
> 	journal = jpdc,
> 	pages = "388-401",
> 	volume = 12,
> 	number = 4,
> 	month = aug,
> 	year = 1991
> }
> 
> @conference{another_speedup,
> 	title = {Another View on Parallel Speedup\indexbib{e'valuation!des
> 	performances!a` temps d'exe'cution fixe}\indexbib{e'valuation!des
> 	performances!a` taille me'moire
> 	fixe}\indexbib{benchmark@{\em benchmark}!a` temps
> 	constant}\indexbib{benchmark@{\em benchmark}!a` me'moire
> 	constante}\indexbib{superline'arite'!possible}},
> 	author = {Xian-He Sun and Lionel M. Ni},
> 	booktitle = {Proceedings Supercomputing '90},
> 	publisher = ieee,
> 	year = 1990,
> 	month = nov,
> 	pages = "559-568",
> 	}
> 
> @article{folk_theorems_paral,
> 	author = {Selim G. Akl and Michel Cosnard and Alfonso G. Ferreira},
> 	title = {Data-Movement-Intensive Problems: Two Folk Theorems
> 	in Parallel Computation 
> 	Revisited\indexbib{superline'arite'!possible}\indexbib{e'valuation!des
> 	performances!proble`mes de de'placement de donne'es}},
> 	journal = tcs,
> 	volume = 95,
> 	number = 2,
> 	year = 1992,
> 	institution = lip
> 	}
> 
> @string(tcs = {Theoretical Computer Science})
> 
> @techreport{lip:folk_theorems_paral,
>         author = {Selim G. Akl and Michel Cosnard and Alfonso G. Ferreira},
>         title = {Data-Movement-Intensive Problems: Two Folk Theorems
> 		Revisited\indexbib{superline'arite'!
> 		possible}\indexbib{e'valuation!des
> 		performances!proble`mes de de'placement de donne'es}},
>         year = 1990,
> 		number = {90-18},
> 		institution = lip
> 	}
> 
> @string(pc = {Parallel Computing})
> 
> @article{impossible_superlinear,
> 	title = {Superlinear Speedup of an Efficient Sequential Algorithm is not
> 		Possible\indexbib{e'valuation!des
> 		performances}\indexbib{superline'arite'!impossible}},
> 	author = {V. Faber and Olaf M. Lubeck and White, Jr., Andrew B.},
> 	journal = pc,
> 	pages = "259-260",
> 	volume = 3,
> 	number = 3,
> 	year = 1986
> }
> 
> @article{impossible_superlinear_2,
> 	title = {Comments on the paper ``Parallel Efficiency can be Greater
> 		than Unity''\indexbib{e'valuation!des
> 		performances}\indexbib{superline'arite'!impossible}},
> 	author = {V. Faber and Olaf M. Lubeck and White, Jr., Andrew B.},
> 	journal = pc,
> 	pages = "209-210",
> 	volume = 4,
> 	year = 1987
> }
> 
> ---
>     Ronan KERYELL
>     Centre de Recherche en Informatique
>     de l'Icole Nationale Supirieure des Mines de Paris (CRI-ENSMP)
>     35, Rue Saint-Honori
>     77305 FONTAINEBLEAU CEDEX
>     FRANCE                               Tel:    (+33 1) 64.69.48.44
>                                          Fax:    (+33 1) 64.69.47.09
>                                          E-mail: keryell@cri.ensmp.fr
> 
> 
> 
> From: crispin@csd.uwo.ca (Crispin Cowan)
> Subject: Re: Superlinear Speedup
> Date: 21 Feb 1994 11:41:42 -0500
> Organization: Department of Computer Science, University of Western Ontario, London, Ontario, Canada
> 
> Jeff Steinman <jss@pebbles.jpl.nasa.gov> wrote:
> >goes like (1/N)log2(n/N) which has superlinear speedup. Some have thought that
> 
>   (1/N)*(log2(n/N))
> = (1/N)*(log2(n) - log2(N))
> = log2(n)/N - log2(N)/N
> which is linear.
> 
> It all depends on what you mean by "linear".  If you mean "speedup
> greater than N", then yes, the above wins by a constant additive
> factor.  If you mean "speedup greater than some linear constant", then
> it's not really possible.  It is always possible to simulate N
> processors on a single processor with a slowdown no worse than (N +
> epsilon), where the epsilon is the cost of context switching.
> 
> Since the original poster asked for the former definition, Jeff's
> response hits the mark.  However, this is not the usual meaning of
> "linear".
> 
> Crispin, who is most definitely *not* a theory-head
> -----
> Crispin Cowan, CS PhD student, searching for a research position
> University of Western Ontario
> Phyz-mail:  Middlesex College, MC28-C, London, Ontario, N6A 5B7
> E-mail:     crispin@csd.uwo.ca          Voice:  519-661-3342
> "A distributed system is one in which I cannot get something done
> because a machine I've never heard of is down"   --Leslie Lamport
> 
> From: nicol@cs.wm.edu (David Nicol)
> Subject: Speedups that are "too good" and simulation
> Date: 21 Feb 1994 11:44:21 -0500
> Organization: The College of William and Mary
> 
> Some comments on "superlinear speedup".
>  1. Eugene Miya makes a good argument for calling it "superunity speedup"
>     instead.  "Linearity" implicitly means some sort of functional
>     response at two more more data points.   The issue we talk about
>     need only occur at one data point---when using n processors we observe
>     a speedup (measured against a righteous serial algorithm) of more than n.  
>     A minor point, and probably a lost cause.
>    
>  2. Jeff Steinman recently posted here an explanation of why
>     parallel discrete-event simulation has the theoretical capability
>     to achieve super-unity speedup.  His explanation was that there
>     is an extra post-processing merge step that, if performed, would
>     balance the complexity books.  Let me offer a different explanation.
>     A parallel simulation distributes an event list across processors.
>     The parallel simulation exploits the fact that within this list
>     are events that can be processed concurrently, there not being
>     any causality relationship between them.  Notice that
>         (a) a traditional event list data structure orders 
>             events on the basis of time. Implicitly it is expending
>             work to maintain an order between events for which
>             no order is necessary. Parallel versions don't.
>             So maybe super-unity speedup might happen because the
> 	    serial implementation isn't as good as it should be.
> 
>         (b) Parallel simulations engage in "extra work" to discover
>             lack of causality relations between events, extra work
>             that the serial event list algorithm does not.  The cost of the 
> 	    synchronization protocol in parallel simulations is not
> 	    accounted for in Steinman's analysis. So maybe the complexity
>             books balance after all.
> 
>     The classic argument against super-unity speedup (assuming flat memory
>     access costs) is that you can always emulate on a serial machine
>     a parallel algorithm that is "too good". I think that's the case
>     for parallel simulation; a careful analysis is pretty hard due to
>     the dynamics of the problem.  The only time I've seen or heard
>     of actual super-unity speedup in parallel simulation is when
>     rotten event-list algorithms are used, or, due to something non-scalable
>     in the operating system.  I jest? No, I've personally tracked down
>     speedups that were too good to a non-scalable implementation of
>     malloc!  The basic problem is a linear scan of the free block list, looking
>     for a fit.  Read all about it in
> 
>        "Inflated Speedups in Parallel Simulations via malloc",
>        International Journal on Simulation, vol 2., Dec 92, 413-426.
> 
>     available at fine shops everywhere.  ICASE Tech report versions available
>     by contacting shelly@icase.edu.
>   
>     --David Nicol
> 
> From: vsam@cs.utexas.edu (Vasilis Samoladas)
> Subject: Re: Superlinear Speedup
> Date: 1 Mar 1994 10:54:58 -0500
> Organization: CS Dept, University of Texas at Austin
> 
> Jeff Steinman (jss@pebbles.jpl.nasa.gov) wrote:
> : Hi, One of the classic examples of superlinear speedup is found in event list
> : management for parallel discrete event simulations. When running on one node,
> : all of the time-tagged events are maintained in a sorted event queue with
> : hopefully log2(n) performance. When running in parallel, each node has on the
> : average n/N events (where N is the number of nodes). Each local event queue
> : now has (on the average) log2(n/N) performance and because each of the nodes
> : are using their event queues in parallel, the overall event list management
> : goes like (1/N)log2(n/N) which has superlinear speedup. Some have thought that
> ....
> Hi,
> your argument seems bogus to me. The two problems compared are not the same.
> If on my single processor I do not use ONE queue but N different ones, then
> I get exactly linear speedup. It seems to me that the super-linear part of
> the speedup is due to the fact that on a single node the algorithm performs
> some extra (and maybe uneccesary, I'm not informed enough in simulation to
> know) work.
> 
> Well, I sure would like to be corrected on this.
> 
> Yours, Vasilis Samoladas
> 
> 
> 
> From: andras@concave.cs.wits.ac.za (Andras Salamon)
> Subject: Re: Superlinear Speedup
> Date: 1 Mar 1994 10:55:08 -0500
> Organization: Computer Science, University of the Witwatersrand
> 
> Crispin Cowan <crispin@csd.uwo.ca> wrote:
> > It is always possible to simulate N processors on a single processor
> > with a slowdown no worse than (N + epsilon), where the epsilon is the
> > cost of context switching.
> 
> I thought the disproof of this was widely known.  See
> 
> @ARTICLE( akl:92,
>     author = "Selim G. Akl and Michel Cosnard and Afonso G. Fereira",
>     title = "Data-movement-intensive problems: two folk theorems in
>             parallel computation revisited",
>     journal = tcs,
>     year = 1992,
>     volume = 95,
>     pages = "323--337",
>     summary = "Shows speedup with n processors can be 2n-1 when data
>                 movement is intrinsically part of algorithm",
> )
> 
> OK, this is theory but it does show that the simple `so simulate a
> parallel machine by timeslicing a single processor' argument isn't all
> that simple.
> 
> > Crispin, who is most definitely *not* a theory-head
> 
> Andras, who is most definitely a theory-head ;-)
> -- 
> Andr\'as Salamon                   andras@concave.cs.wits.ac.za
> 
> From: rich@dcache.uucp (Richard J. Taylor)
> Subject: Re: Superlinear Speedup
> Date: 4 Mar 1994 10:05:56 -0500
> Organization: data-CACHE Corporation, Boise, Idaho
> 
> Here is an example of superlinear speedup that I have witnessed.  Is it
> legitimate?
> 
> Take a large data intensive problem.  As the size of the dataset increases, 
> it overflows the main memory cache, and accesses have to go to disk.  Disk 
> accesses are an order of magnitude slower so the time to perform an 
> operation slows down non-linearly.  In this situation we get a superlinear 
> speedup by increasing the number of processing elements to the point where 
> there is enough cache for everything to fit into, and an operation does not 
> cause any disk accesses.
> 
> In our case the example is database, and the interesting non-linear slowdown
> occurs when we are random accessing through an index and the index gets 
> large enough that it does not fit in cache.  This occurs when we get to the 
> order of millions of rows per processing element. 
> 
> We get a non-linear speedup by adding extra memory along with the
> processing elements.  There is an upper limit to the amount of memory on a
> processing element, so we cannot make a small configuration with enough 
> memory to avoid the disk accesses. 
> 
> In databases there is a second objection to enlarging the cache, and that 
> is that a larger cache slows down the checkpoint and makes the system less 
> responsive.  As with anything, there are ways to mitigate the problem, but
> in general a parallel database machine with multiple caches will checkpoint
> faster and be more responsive than a single processor.
> 
> - richard taylor
> 
> The problem with large systems is that they require big tests that take longer
> 
> From: jss@pebbles.jpl.nasa.gov (Jeff Steinman)
> Subject: Superlinear speedup
> Date: 4 Mar 1994 10:03:03 -0500
> Organization: Yale University, Department of Computer Science, New Haven, CT
> 
> Jeff Steinman (jss@pebbles.jpl.nasa.gov) wrote:
> : Hi, One of the classic examples of superlinear speedup is found in event list
> : management for parallel discrete event simulations. When running on one node,
> : all of the time-tagged events are maintained in a sorted event queue with
> : hopefully log2(n) performance. When running in parallel, each node has on the
> : average n/N events (where N is the number of nodes). Each local event queue
> : now has (on the average) log2(n/N) performance and because each of the nodes
> : are using their event queues in parallel, the overall event list management
> : goes like (1/N)log2(n/N) which has superlinear speedup. Some have thought that....
> 
> >Hi,
> >your argument seems bogus to me. The two problems compared are not the same.
> >If on my single processor I do not use ONE queue but N different ones, then
> >I get exactly linear speedup. It seems to me that the super-linear part of
> >the speedup is due to the fact that on a single node the algorithm performs
> >some extra (and maybe uneccesary, I'm not informed enough in simulation to
> >know) work.
> 
> >Well, I sure would like to be corrected on this.
> 
> >Yours, Vasilis Samoladas
> -------------------------------------------------------------------------------
> I think that there is some confusion on this, especially if you are thinking
> of conservative simulation methodologies where concurrency is explicitely
> extracted. Of course, if you knew which events could be processed concurrently
> you could separate them into separate event lists even on one node just as if
> they were distributed across many nodes. However, there are two exceptions to
> this.
> 
> 1. If each after processing each event, output data is written to a file
> in time tagged order, then you would have to reassemble the data streams for
> events that were processed concurrently into a single stream. Otherwise the
> data from the processed events would either have to go into separate files
> (which would have to be merged later) or your single file would not have
> sorted output. In this example (of outputing data from each event into a
> single file), event list management on one node already can do this work.
> However, on multiple nodes the step of merging data from events that have
> been processed is not done. To do this work exactly cancels the superlinear
> speedup that would have been present when ignoring the merge. It becomes
> (in a way) a question of "information loss" vs. "known concurrency in the
> problem".
> 
> 2. Optimistic approaches absolutely require each node to produce a stream
> of locally processed events in their time tagged order for GVT updates.
> When a new GVT is determined, events that were processed with time tags
> less than or equal to GVT should be cleaned up. One could go through every
> object on each node and search for those events (this is sometimes). However,
> This would not scale if there were 100,000 objects on each node and only
> 100 events were processed in a GVT cycle (or worse yet, what if GVT was
> constantly being updated!). Therefore, a much better way to solve this problem
> is to make sure that locally processed events are maintained in a stream of
> sorted events. This again gives rise to the superlinear speedup problem.
> If there is no step to merge the locally sorted processed event streams,
> then you will have superlinear speedup.
> 
> Jeff Steinman
> 
> From: reinhard.foessmeier@munich.ixos.de (Reinhard Foessmeier)
> Subject: Amdahl's law and superlinear speedup
> Summary: Any pointers about superlinear speedup?
> Keywords: Amdahl's law, speedup
> Organization: iXOS Software GmbH, Muenchen, Germany
> 
> On a conference I attended someone mentioned cases of superlinear
> speedup by parallelization. Apart from cases where certain
> threshold values (e. g. cache sizes) play an important part,
> he told me of an example where several linear lists are searched
> by so many processors, which turn out to be capable to do this
> in less than 1/n of the time needed by one processor (which has to
> do a lot of context switching). The problem must be formulated
> with care, and the statistical distribution of the list elements
> is important. Under such and such circumstances, the example
> seems to violate Amdahl's law on speedup.
> 
> Does anybody know about literature on this subject? I should like
> to know more about it but have found no pointers so far.
> 
> If you send me e-mail, I shall gladly post a summary if there's
> interest.
> 
> 
> <Reinhard.Foessmeier@ixos.de>        | "M'innervosiscono i semafori e
> iXOS GmbH, Bretonischer Ring 12      |  gli stop, e la sera ritorno con la
> DE-85630 Grasbrunn                   |  noia della stanchezza"  Franco Battiato
> -- 
> Reinhard Foessmeier  | Ein Baer traf im Ural ein Gnu
> <refo@ixos.de>       | Und blinzelte ihm zaertlich zu.
> iXOS GmbH            | Da sprach das Gnu zu dem Uralbaern:
> Bretonischer Ring 12 | "Du bist heut' wieder ziemlich albern!"
> DE-85630 Grasbrunn   |
> 


From: Thomas.Henties@zfe.siemens.de (Thomas Henties)

> Im Prinzip laeuft superlinearer Speedup immer darauf hinaus, dass im
> parallelen Fall mehr Speicher (in irgendeiner Form) zur Verfuegung steht.
> | ... quotation deleted
> Warum Context Switching? Laufen n Prozesse auf einem Prozessor?
> Das sollte nicht der Vergleichsmassstab sein!

Resume: In principle, superlinear speedup always can be reduced to
a situation where the parallel case involves more memory (in any form
whatsoever).


From: kam@cs.city.ac.uk (Kevin A Murray)
> 
> I believe David Sharp (dwns@doc.ic.ac.uk) was doing some work on
> this. You could try contacting him for more information.
> 
____________________________________________________________________
Newsgroups: comp.parallel
From: "Theodore C. Belding" <Ted.Belding@umich.edu>
Subject: superlinear speedup references
Organization: University of Michigan CSE Division
Date: Sat, 5 Aug 1995 23:05:30 GMT
Message-ID: <Ted.Belding-0508951905300001@pm054-11.dialip.mich.net>

Thanks to everyone who responded several weeks ago to my request
for information on superlinear speedup in parallel search.  I'm
currently writing paper on various types of superlinear speedup,
specifically in genetic algorithms.  Here's a bibliography of
what I've come across so far.  Corrections and additions are, of
course, gratefully accepted.  Pardon the kludgy LaTeX formatting;
BibTeX and I don't seem to get along very well...
-Ted


\section{definitions of speedup}

Foster, I. (1995). {\it Designing and Building Parallel Programs.}
Reading, MA: Addison-Wesley. URL: {http://www.mcs.anl.gov/dbpp}

Golub, G., and J. M. Ortega.  (1993). {\it Scientific Computing: An
Introduction with Parallel Computing.} Boston: Academic Press.

Herzog, U. (1988). Performance evaluation principles for vector- and
multiprocessor systems. {\it Parallel Computing} {\bf 7}:425--438.

Leighton, F. T. (1992). {\it Introduction to parallel algorithms and
architectures: arrays, trees, hypercubes.} San Mateo, CA: Morgan
Kaufmann.


\section{The great debate --- is superlinear speedup possible?}

Faber, V., O. M. Lubeck, and A. B. White, Jr. (1986). Superlinear
speedup of an efficient sequential algorithm is not possible. {\it
Parallel Computing} {\bf 3}:259--260.

Faber, V., O. M. Lubeck, and A. B. White, Jr. (1987). Comments on the
paper ``Parallel efficiency can be greater than unity''. {\it Parallel
Computing} {\bf 4}:209--210.

Fischer, D. (1991). On superlinear speedups. {\it Parallel Computing}
{\bf 17}:695--697.

Forrest, S. (1990).  Emergent computation: self-organizing, collective,
and cooperative phenomena in natural and artificial computing networks.
{\it Physica D} {\bf 42}:1--11.  Reprinted in Forrest, S. (Ed.), {\it
Emergent Computation,} pp. 1--11. Cambridge, MA: MIT Press, 1991.

Jan{\ss}en, R. (1987). A note on superlinear speedup. {\it Parallel
Computing} {\bf 4}:211--213.

Parkinson, D. (1986). Parallel efficiency can be greater than unity.
{\it Parallel Computing} {\bf 3}:261--262.

Schneck, P. B. (1986). Superlinear speed-up and the halting problem.
{\it Software -- Practice and Experience} 16(9):781--782.


\section{Cache-effect and related I/O or memory-hierarchy superlinear
speedup}

Abu-Sufah, A., H. E. Husmann, and D. J. Kuck. (1986). On input/output
speedup in tightly coupled multiprocessors.  {\it IEEE Transactions on
Computers} {\bf C-35}(6):520--530.

Belding, T. C. (1995). The distributed genetic algorithm revisited. In
Eshelman, L. J. (Ed.), {\it Proceedings of the Sixth International
Conference on Genetic Algorithms,} pp. 114--121. San Francisco, CA:
Morgan Kaufmann. URL: {http://www.engin.umich.edu/~streak/dga.html}


\section{Superlinear speedup in parallel search}

Imai, M., Y. Yoshida, and T. Fukumura. (1979).  A parallel searching
scheme for multiprocessor systems and its application to combinatorial
problems. {\it IJCAI} 1979:416--418.

Janakiram, V. K., E. F. Gehringer, D. P. Agrawal, and R. Mehrotra.
(1988). A randomized parallel branch-and-bound algorithm. {\it
International Journal of Parallel Programming} {\bf 17}(3):277-301.

Kornfeld, W. A. (1981). The use of parallelism to implement a heuristic
search. {\it IJCAI} 1981:575--580.

Kumar, V., A. Grama, A. Gupta, and G. Karypis. (1994). {\it
Introduction to parallel computing: design and analysis of parallel
algorithms.} Redwood City, CA: Benjamin/Cummings.

Lai, T.-H., and S. Sahni. (1984). Anomalies in parallel
branch-and-bound algorithms.  {\it Communications of the ACM,} {\bf
27}(6):594--602.

Li, G.-J., and B. W. Wah. (1986). Coping with anomalies in parallel
branch-and-bound algorithms.  {\it IEEE Transactions on Computers} {\bf
C-35}(6):568--573.

Mehrotra, R., and E. Gehringer. (1985). Superlinear speedup through
randomized algorithms.  In {\it Proceedings of the 1985 International
Conference on Parallel Processing,} St. Charles, IL, pp. 291--300.

Quinn, M. J., and N. Deo. (1986). An upper bound for the speedup of
parallel best-bound branch-and-bound algorithms. {\it {BIT}} {\bf
26}:35--43.

Rao, V. N., and V. Kumar. (1988). Superlinear speedup in parallel
state-space search. In Nori, K. V., and S. Kumar (Eds.), {\it
Foundations of Software Technology and Theoretical Computer Science,
8th Conference,} Pune, India, December 21--23, 1988, Proceedings.
Lecture Notes in Computer Science Vol. 338, pp. 161--174. Berlin:
Springer-Verlag.

Speckenmeyer, E. (1987). Superlinear speedup for parallel backtracking.
In Houstis, E. N., T. S. Papatheodorou, and C. D. Polychronopoulos
(Eds.), {\it Supercomputing: 1st International Conference,} Athens,
Greece, June 8--12, 1987, Proceedings.  Lecture Notes in Computer
Science Vol. 297, pp. 985--993. Berlin: Springer-Verlag.

Speckenmeyer, E. (1988). Is average superlinear speedup possible? In
B\"orger, E., H. K. B\"uning, and M. M. Richter (Eds.), {\it CSL '88:
2nd Workshop on Computer Science Logic,} Duisburg, FRG, October 3--7,
1988, Proceedings.  Lecture Notes in Computer Science Vol. 385, pp.
301--312. Berlin: Springer-Verlag.

Tinker, P. A. (1988).  Performance of an {OR}-parallel logic
programming system. {\it International Journal of Parallel Programming}
{\bf 17}(1):59--92.


\section{Other superlinear speedup}

Boreddy, J., and A. Paulraj. (1990). On the performance of transputer
arrays for dense linear systems. {\it Parallel Computing} {\bf
15}:107--117. %superlinear speedup due to the reduction in array
indexing overheads in %multitransputer arrays


\section{Search-related superlinear speedup in genetic algorithms,
genetic programming, and classifier systems}

Baiardi, F., P. Gambini, A. Giani, and A. Starita. (1995). Improving
learning through rule cooperation in parallel classifier systems. URL:
{ftp://di.unipi.it/pub/Papers/giani/icgaws.ps.Z}

Collins. R. (1992). {\it Studies in Artificial Evolution.} Ph.D.
thesis, University of California -- Los Angeles, Los Angeles, CA. URL:
{ftp://ftp.cognet.ucla.edu/pub/alife/papers/collins-phd?.ps.gz} %see
chap. 7.

Giani, A., F. Baiardi, and A. Starita. (1995). PANIC: A parallel
evolutionary rule based system.  In {\it EP95}. URL:
{ftp://di.unipi.it/pub/Papers/giani/ep95.ps.Z}

Goldberg, D. E., H. Kargupta, J. Horn, and E. Cantu-Paz. (1995).
Critical deme size for serial and parallel genetic algorithms. IlliGAL
Report 95002. Urbana, IL: University of Illinois at Urbana-Champaign.

Koza, J., and D. Andre. (1995). Parallel genetic programming on a
network of transputers.  Stanford University Technical Report
STAN-CS-TR-95-1542. Stanford, CA: Stanford University. URL:
{ftp://elib.stanford.edu/pub/reports/cs/tr/95/1542/}

Maresky, J. (1994). {\it On Efficient Communication in Distributed
Genetic Algorithms.} M.S. Dissertation, Institute of Computer Science,
The Hebrew University of Jerusalem. URL:
{ftp://ftp.huji.ac.il/users/jonathan/msc-maresky.ps.gz} %superlinear
speedup

M{\"u}hlenbein, H., M. Schomisch, and J. Born. (1991). The parallel
genetic algorithm as function optimizer. {\it Parallel Computing} {\bf
17}:619--632.

Punch, W. F. (1995). Re: Superlinear speedup.  {\it Genetic Algorithm
Digest} {\bf 9}(15). URL:
{http://www.aic.nrl.navy.mil:80/galist/digests/}

Shonkwiler, R. (1993). Parallel genetic algorithms. In Forrest, S.
(Ed.), {\it Proceedings of the Fifth International Conference on
Genetic Algorithms,} pp. 199--205. San Mateo, CA: Morgan Kaufmann.

Shonkwiler, R., and E. Van Vleck. (1994). Parallel speed-up of Monte
Carlo methods for global optimization. {\it Journal of Complexity} {\bf
10}(1):64--95.

Shonkwiler, R., F. Mendivil, and A. Deliu. (1991). Genetic algorithms
for the 1-D fractal inverse problem. In Belew, R., and L. Booker
(Eds.), {\it Proceedings of the Fourth International Conference on
Genetic Algorithms,} pp. 495--501. San Mateo, CA: Morgan Kaufmann.

Tanese, R. (1987). Parallel genetic algorithms for a hypercube. In
Grefenstette, J. J. (Ed.), {\it Genetic Algorithms and Their
Applications: Proceedings of the Second International Conference on
Genetic Algorithms,} pp. 177-183.  Hillsdale, NJ: Lawrence Erlbaum.

--
Ted Belding           Ted.Belding@umich.edu or streak@engin.umich.edu
University of Michigan   Division of Computer Science and Engineering