Newsgroups: comp.arch,comp.parallel,comp.benchmarks,comp.sys.super
From: zxu@aloha.usc.edu (Zhiwei Xu)
Subject: Overheads of Parallel Systems (Revised Summary)
Organization: University of Southern California, Los Angeles, CA
Date: Thu, 8 Sep 1994 20:44:05 GMT
Message-ID: <34nt2l$2de@aloha.usc.edu>

This is a revision of a posting I put on the net a few weeks ago, with
answers to some questions I received. To Greg, Alf, and other people, I am
sorry for the delay in answering your questions. I thought it would be best to
collect and answer all questions on the net. If you can not find the answer to
your question, it is because I don't have a simple answer.
Please email me if you want further info.

Again, I would like to thank the following people for their contribution:

potter@patpserv.epfl.ch
rosenkra@convex1.convex.com
prins@cs.unc.edu
bradc@oregon.cray.com
kowallik@sgi.ifh.de
thompson@inmos.co.uk
lm@CS.Stanford.EDU
A.D.Ben-Dyke@computer-science.birmingham.ac.uk
jan@neuroinformatik.ruhr-uni-bochum.de
grunwald@kilt.cs.colorado.edu
ssl@cs.berkeley.edu
R.W. Numrich at Cray Research
P.L. Springer and J.C. Peterson at JPL
R.W. Hockney at Southampton University
lig@waltz.ncic.ac.cn
stunkel@watson.ibm.com
danh@qnx.com
D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, B.-H. Lim of MIT

I would like to make some comments about the data:

(1)	These data are collected from the viewpoint of a programmer, not a
	system purchaser. Some data listed in the same tale are obtained under
	different test conditions. They should be taken only as a VERY VERY
	rough estimation. The data values will change if any
	of a number of testing conditions change. These data are not meant
	to be used in argument like "my machine is better than yours". Rather,
	such data may help users to choose appropriate parallel algorithms,
	styles, etc. E.g., if overhead is large, I may choose a coarse grained
	approach. If barrier overhead is small, I may choose a synchronous
	algorithm.

  (2)	Some data are (terribly) outdated.

  (3)	The best data value is listed when more than one values are available.

  (4)	This list is terribly imcomplete. Could the knowledgeable people
	fill in the missing information or provide more updated data?

  (5)	I have double checked the data values. But if you find any error,
	  please inform me, I'll post the corrections on the net.

Some observations:

  (1)	All those peak FLOPS, peak bandwidth data, SPEC, NAS benchmark, LINPAC
	benchmark, etc. are fine and useful. But they don't really help a
	programmer or algorithm developer. E.g., when I try to decide whether
	to use a synchronous algorithm or an asynchronous one to solve a
	problem, one of the most important factor is how fast the barrier is.

  (2)	There should be a law :-) requiring all parallel machine companies to
	provide the following data, besides traditional peaks and benchmarks:

	  (a) task creation (and termination) time:
		  - at the least, time to create a null task
		  - several data values for more realistic cases (e.g., with
			  different memory allocation sizes)
		  - if task, process, thread are different concepts, timing
			  values for each
	  (b) context switch time

	  (c) time for each of traditional synchronization operations:
		  - lock (and semaphore)
		  - event
		  - critical region
	  (d) time for each of collective interaction operations
		  - barrier
		  - reduction
		  - scan
		  - broadcast
		  - eureka (as in Cray T3D)
	  (e) latency and bandwidth for point-to-point communication 
		  - between two neighboring nodes
		  - between any two nodes
For (b)-(e), give the best values and several realistic, representative values



  ================= Overhead Data of Some Machines ======================

  (1) Cray X-MP Multitasking (1984!!! data):

  tskstart         tskwait         evpost      evclear   evwait

1,500,000        1,500           1,500       200       1,500     clock periods
  or 4,000         100

  (2) Process creation time in microseconds

  Platform                 Process Creation Operation	Process Creation Time 

  Encore Multimax 520       Unix Fork  		          about 16000
  Encore Multimax 520       Mach Thread Create    	  about 4000
  Encore Multimax 520       nUnix Shared Fork 		  about 1800
  Transputer T9000          startp instruction		  about 1
  J-Machine                 Estimated			  1-2

  Alewife (33MHz)	    remote thread invokation	  0.5 to 7.4

  (3) Context-switching times in microseconds

  Platform,                Operating System		Context Switching Time
  Processor Speed (Mhz)

  HP-735, 99               HP-UX 9.01			68
  486DX, 33                QNX 4.2			80
  DEC 4000/610, 160        OSF/1 v1.3   		87
  486DX2, 66               Linux 0.99.10		98
  IBM RS6000/580, 62       AIX 3.2                      102
  Sun SS10, 40             SunOS 4.1.3			128
  Sun SS10, 40             SunOS 5.2			225
  Sun SS2, 40              SunOS 5.2			454


The following is quoted from danh@qnx.com:
----------------------- beginning of quote ----------------
Note that these times are the measured result of a pair of processes 
signalling each other, which includes the additional kernel call overhead 
for the signalling calls.  A benefit of this approach is that the source is 
portable to a wide range of UNIX systems, howver, a more correct result, 
but currently less portable, can be obtained by using the sched_yield() 
call in a POSIX 1003.1b (previously 1003.4) OS.  Since this approach 
executes only one kernel call per context switch, with no signal processing 
being done, it is a more accurate indication of the actual context switch 
time.  Without the signal processing, the intervals are also much shorter. 
For example, the Pentium/90 number for QNX drops from 23 usec to 2 usecs, 
the 486/66 number drops from 44 to 5 usecs.  Here's a more complete listing 
of numbers collected from various responses on the net:

Platform,                Operating System        Context Switching Time
Processor Speed (Mhz)                              in microseconds

  90 MHz PCI Pentium     QNX 4.21                  23
  60 MHz ALR Pentium     QNX 4.2                   28
 100 MHz HP-747x         HP-RT 1.1                 34
  66 MHz 486DX2          QNX 4.2                   44
 100 Mhz HP-735          HP-UX 9.01                68
  50 MHz HP-742          HP-RT 1.1                 71
  33 MHz 486DX           QNX 4.2                   80
 150 MHz 21064           DEC OSF/1 v1.3            93
  80 MHz HP-712/80       HP-UX 9.05                96
  40 MHz DEC 5000/240    Ultrix 4.2                98
  66 MHz 486DX2          Linux 0.99.10             98
 150 MHz DEC 3000/500    ?                        100
  62 MHz IBM-RS6000/580  AIX 3.2                  102
  66 Mhz Snake           HP-UX 9.x                106
 150 MHz DEC 3000/400    ?                        127
  40 Mhz Viking          SunOS 4.1.3              128
  25 MHz DEC 5000/200    ?                        154
  50 MHz R4000 SGI       ?                        158
  33 MHz Sparc           SunOS 4.1.1              198
  40 MHz MIPS R3000      Ultrix 4.3               198
  33 Mhz 486             386BSD 0.1               210
  85 MHz Sparcstation-5  SunOS 4.1.3_U1           210
  50 Mhz RIOS            AIX 3.2                  212
  20 MHz DEC 5000/120    ?                        281
  33 MHz R3000/3010 SGI  Irix 4.0.5               345
  33 MHz 486             NetBSD                   380
  50 MHz Sparcstation-LX Solaris 2.3              620
  50 MHz 486DX2          Solaris                  636
  16 MHz 386SX           NetBSD                  2603
----------------- end-of-quote -------------------------


(4) Collective Interactions Overheads (in microseconds)

Platform	    Barrier          Broadcast       	Reduce     	Scan

MasPar MP-1  		0		4		100		150
TMC CM-5        	4		13		10   		12
HP NOW               	61		80		81		102
IBM SP1 		314		187		221		329
Intel Paragon		705		1597		951		2039

(5) Message Passing Timing for Several Parallel Computer Systems
(Latency in microseconds, Bandwidth in MB/s )

Platform,            Measured       Measured        Projected      Projected
		     Latency	    Bandwidth	    Latency        Bandwidth

Cray C90              0.1              1180
Cray T3D              1.5               106          0.2             300
TMC CM-5              3.3                10          3                20
HP NOW                8.8                39
IBM SP1              28                   8.3
IBM SP2	  	     39		   48.2        5               300
Meiko CS-2            8                  43
Intel Delta          61                   7
Intel Paragon        82                  38         25               175

Measured means the values are obtained by running user-level programs, such
as ping-pong. LATENCY refers to the time for a null (0-byte) message. In the
original post, the measured latency data for CM-5, HP, and Paragon were round-
trip latency, i.e., time for a node to send a null message and receive a reply
>from the destination node. These data values are cut to half in this revision
to be compatible with the rest, where one-way latency values are used.
Projected means what the company has projected/promised or what you might be
able to achieve by system/assembly/hardware level techniques.

The Paragon used 8 nodes and the NX message passing library, with the message
processors enabled.

  Zhiwei Xu	zxu@aloha.usc.edu
