PTLIB Recommended Reading

Parallel debugging and performance analysis

Books

Margaret Simmons, Ann Hayes, Jeffrey Brown, and Daniel Reed, eds. Debugging and Performance Tuning for Parallel Computing Systems. IEEE Computer Society Press, 1996.

Papers

R. Wismuller, M. Oberhuber, J. Krammer, and O. Hansen. Interactive debugging and performance analysis of massively parallel applications. Parallel Computing 22(3), March 1996, pp. 415-442.
This paper describes a tool environment for the debugging and performance tuning phase of parallel program development. The environment includes a parallel debugger (DETOP), a performance analyzer (PATOP), and a common distributed monitoring system, and is available for Parsytec's PowerPC based parallel computing systems.

John May and Francine Berman. Creating Views for Debugging Parallel Programs. Proc. SHPCC'94, Knoxville, TN, May 1994.
This paper describes the features of the Panorama parallel debugger that allow programmers to implement new graphical views of a program's state and to augment existing ones. The user portion of Panorama runs on a sequential computer and gathers information from the vendor-supplied debugger running on the parallel machine. New views are written in Tcl/Tk code, some of which is generated automatically by the Panorama Display Builder tool. While the debugger is running, a view can be updated manually, by the user pushing a button, or automatically, by callbacks that have been registered with the Panorama run server.

Doreen Cheng and Robert Hood. A Portable Debugger for Parallel and Distributed Programs. Proc. Supercomputing '94.
The goal of the p2d2 project at NASA Ames is to build a debugger that supports source-level debugging of parallel and distributed computations and that is portable across a variety of platforms and programming models. The p2d2 design uses a client-server model to partition the debugger: the server contains the platform-dependent code, while the client contains the portable user-interface code. The intent is that the server be implemented by the platform vendor and the client by the vendor, a user community, or another third party. In the prototype p2d2 implementation, the debugger server is based on gdb. The p2d2 project has defined the following three interaction protocols:
  1. the debugger server (Ds) protocol, which specifies how the user-interface client can access debugging abstractions provided by the server (e.g., setting a BreakpointTrap by registering a callback function to be executed when a specific code location is reached; the callback idea is sketched in the code after this list);
  2. the communication library protocol, for interaction between a message-passing library such as PVM and the debugger; the protocol specifies a collection of debugger library entry points and requires the communication library to register run-time events; and
  3. the distribution preprocessor protocol, which gives the debugger access to information about how data structures in an HPF program are distributed and aligned.
For debugging in a heterogeneous distributed computing environment, the distribution manager component, which executes on the client machine along with the user-interface component, maps user requests onto operations provided by a remote set of debugger servers and handles the aggregation of their responses.
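
To make the callback idea in the Ds protocol concrete, here is a minimal C sketch of registering a breakpoint callback with a debugger server. All names (ds_event, ds_set_breakpoint, and so on) are invented for illustration only; they are not the actual p2d2 interfaces.

    /* Hypothetical sketch: the names below are invented to illustrate the
     * callback-registration idea, not the real p2d2 Ds protocol. */
    #include <stdio.h>

    typedef struct {
        const char *file;    /* source file of the breakpoint             */
        int         line;    /* source line at which the trap fired       */
        int         process; /* id of the process that hit the trap       */
    } ds_event;

    typedef void (*ds_callback)(const ds_event *ev);

    /* A real client would send a request to the platform-specific server;
     * here the callback is simply recorded locally. */
    static ds_callback registered_cb;

    static int ds_set_breakpoint(const char *file, int line, ds_callback cb)
    {
        registered_cb = cb;
        printf("client: requested BreakpointTrap at %s:%d\n", file, line);
        return 0;
    }

    /* The server side would call this when the target reaches the location. */
    static void ds_deliver_event(const ds_event *ev)
    {
        if (registered_cb)
            registered_cb(ev);
    }

    static void on_break(const ds_event *ev)
    {
        printf("client: process %d stopped at %s:%d\n",
               ev->process, ev->file, ev->line);
    }

    int main(void)
    {
        ds_set_breakpoint("solver.c", 120, on_break);
        ds_event ev = { "solver.c", 120, 3 };
        ds_deliver_event(&ev);   /* simulate the server reporting the trap */
        return 0;
    }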

D. Abramson and R. Sosic. A Debugging Tool for Software Evolution. CASE-95: 7th International Workshop on Computer-Aided Software Engineering, Toronto, Ontario, Canada, July 1995.
GUARD (Griffith University Parallel Debugger) is a debugging tool that addresses the problems of debugging software that has evolved from an existing program or been ported from one system to another, for example a vector code ported to a parallel computer. GUARD facilitates comparison of a new program with an existing reference version that is assumed to produce correct results. The user specifies key points in the two programs at which various data structures should be equivalent within a predefined tolerance; errors in the new program are then located automatically by run-time comparison of the internal state of the new and reference versions. GUARD makes no assumptions about the control flow of the two programs, which may differ, and the programs may be written in different languages. GUARD is aimed at scientific and numerical computing and currently handles variables of the base data types integer, real, and character, and multidimensional arrays of these base types. A visualization system uses pixel maps to show graphically where corresponding array values differ, allowing visualization of the large data sets present in numeric programs. GUARD works in a distributed, heterogeneous environment; the reference code, the program being debugged, and GUARD itself can execute on three different computer systems connected by a network. Current work includes extending GUARD to support conversion of existing sequential applications into data-parallel form.
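
The core check that a relative debugger performs at each key point can be illustrated with a short C sketch; this is only a generic element-wise comparison within a tolerance, not GUARD's actual interface.

    /* Illustrative sketch only (not the GUARD API): comparing a data structure
     * from the new program against the reference version within a tolerance. */
    #include <math.h>
    #include <stdio.h>

    /* Returns the number of elements that differ by more than tol. */
    static int compare_arrays(const double *ref, const double *new_ver,
                              int n, double tol)
    {
        int mismatches = 0;
        for (int i = 0; i < n; i++)
            if (fabs(ref[i] - new_ver[i]) > tol)
                mismatches++;
        return mismatches;
    }

    int main(void)
    {
        double reference[4] = { 1.0, 2.0, 3.0, 4.0 };
        double candidate[4] = { 1.0, 2.0, 3.1, 4.0 };  /* one bad value */
        int bad = compare_arrays(reference, candidate, 4, 1e-6);
        printf("%d element(s) outside tolerance\n", bad);
        return 0;
    }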

Kranzmuller, Grabner, and Volkert. Event Graph Visualization for Debugging Large Applications. Proc. SPDT'96: ACM SIGMETRICS Symposium on Parallel and Distributed Tools, May 1996, Philadelphia, pp. 108--117.
This paper describes ATEMPT (A Tool for Event ManiPulaTion), which is part of the MAD (Monitoring And Debugging) environment for debugging and performance analysis of message-passing programs on distributed memory machines. ATEMPT provides facilities for quickly detecting communication errors such as isolated send and receive events and corresponding message events with different message lengths. ATEMPT also includes functionality for detecting race conditions, for reordering race-condition events, and for simulating the modified event graph. Another feature is the generation of a "cut", a set of breakpoints spanning more than one process. A source code connection relates a graphically displayed event symbol to the line of code that generated the event. For handling large applications, ATEMPT provides horizontal (time-axis) abstraction -- e.g., representation of a communication pattern with one symbol -- and vertical (process-axis) abstraction -- e.g., grouping of processes.

Vikram S. Adve, John Mellor-Crummey, Mark Anderson, Ken Kennedy, Jhy-Chun Wang, and Daniel A. Reed. An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs. Proc. Supercomputing '95.
This paper describes work on integrating data-parallel Fortran compilers (Fortran 77D and HPF) with the Illinois Pablo performance analysis environment. The Fortran 77D compiler translates a Fortran 77 program annotated with data layout directives into an explicitly parallel SPMD message-passing code and also performs complex optimizing transformations. In the integrated system, the Fortran D compiler decides where instrumentation should be inserted in the synthesized message-passing code and inserts calls to the appropriate Pablo routines. The compiler also records static mapping information about how source code constructs relate to one another and to the corresponding constructs in the compiler-synthesized code. Dynamic records generated during an instrumented procedure call include the tag of a static data correlation record. The Fortran D editor graphically presents the measured computation and communication costs of individual procedures and loops in the source program, as well as the dependences that cause communication. Information about remote data access patterns can also be collected, although dynamically recording the bounds of the array sections communicated in each message is costly; the authors are investigating how to reduce this overhead. The dPablo browser is an off-line performance tool that displays performance metrics and array reference patterns computed over user-specified code regions and execution intervals. The data correlation software and dPablo interface have been ported to the Portland Group HPF compiler.

Margaret R. Martonosi, David Ofelt, and Mark Heinrich. Integrating Performance Monitoring and Communication in Parallel Computers. ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1996.
Because memory latencies have increased significantly relative to processor speeds, memory system behavior is of great importance to application performance. The authors argue that hardware support for memory performance monitoring is needed to assess program behavior accurately and efficiently, and they observe that the mechanisms used to implement cache coherence are similar to what is desired for performance monitoring. Current hardware monitors (e.g., hardware miss counters) are efficient but are difficult to trigger selectively and can do little beyond aggregate counting. Software-based monitors can categorize miss counts according to the code and data structures that caused them but have considerably higher overhead. Within many cache-coherent shared-memory multiprocessors, the three components needed for performance monitoring -- trigger points, handlers, and state storage -- have already been implemented to support cache coherence. Furthermore, the memory events that cause major delays are almost always those requiring coherence actions. As an example of integrating performance monitoring with coherence support, the authors have implemented Flashpoint, a performance monitoring tool for the FLASH multiprocessor. Flashpoint is implemented as a modified cache coherence protocol and maintains per-data-structure, per-procedure statistics.

William Gropp. An Introduction to Performance Debugging for Parallel Computers. Technical report MCS-P500-0295, Argonne National Laboratory, April 1995.
This paper describes how to analyze a distributed-memory message-passing program for performance and discusses techniques for fixing performance problems. The dominant cost for most computations is the number and kind of memory references. A scalability analysis -- an analytic estimate of expected performance as a function of the number of processes -- is used to estimate the performance of a parallel code. Tools are then used to observe per-node and parallel performance in order to identify the parts of the code that are under-performing. Code should be tuned first for per-node performance and then for message-passing or remote-memory performance.
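
As a rough illustration of what such a scalability analysis looks like, the following C sketch evaluates a generic model of the form T(p) = serial time + parallel time / p + a communication term. The model form and coefficients are invented for illustration and are not taken from the paper.

    /* Generic analytic scalability model; all constants are assumptions. */
    #include <math.h>
    #include <stdio.h>

    static double predicted_time(int p)
    {
        const double t_serial   = 1.0;    /* seconds of non-parallel work   */
        const double t_parallel = 100.0;  /* seconds of parallelizable work */
        const double t_msg      = 0.01;   /* per-step communication cost    */
        return t_serial + t_parallel / p + t_msg * log2((double)p);
    }

    int main(void)
    {
        for (int p = 1; p <= 64; p *= 2)
            printf("p = %2d  predicted T = %6.2f s  speedup = %5.2f\n",
                   p, predicted_time(p), predicted_time(1) / predicted_time(p));
        return 0;
    }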

Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn Parallel Performance Measurement Tools. IEEE Computer 28(11), November 1995, pp. 37-46. Special issue on performance evaluation tools for parallel and distributed computer systems.
Paradyn is a tool for measuring the performance of large-scale parallel programs. Its goal is to provide detailed, flexible performance information without incurring the space and time overhead typically associated with trace-based tools. Paradyn achieves this by dynamically instrumenting the application and automatically controlling the instrumentation in search of performance problems. Paradyn also provides decision support for the user by helping to decide when and where to insert instrumentation, automatically locating performance bottlenecks, and explaining bottlenecks using descriptions and visualizations. Paradyn maps performance data to multiple layers of abstraction, and the user can choose to view it in terms of high-level language constructs or low-level machine structures. Paradyn Configuration Language (PCL) files specify metrics, instrumentation, visualizations, and daemons (defaults are provided, which the user may modify or replace). The Paradyn Instrumentation Manager encapsulates architecture-specific knowledge and translates abstract instrumentation specifications into machine-level instrumentation. Instrumentation instructions are inserted into the application's binary image using a variation of the Unix ptrace interface. Performance data from external sources, such as hardware-based counters, can also be integrated. Paradyn provides basic visualizations for time histograms, bar charts, and tables, and provides a library and remote procedure call interface for incorporating external data visualization packages (e.g., AVS, ParaGraph, or Pablo displays).

Daniel Reed, Keith Shields, Will Scullin, Luis Tavera, and Christopher Elford. Virtual Reality and Parallel Systems Performance Analysis. IEEE Computer 28(11), November 1995, pp. 57-67.

Jerry Yan, Sekhar Sarukkai, and Pankaj Mehra. Performance Measurement, Visualization and Modeling of Parallel and Distributed Programs using the AIMS toolkit. Software: Practice and Experience 25(4), April 1995, pp. 429-461.
The authors maintain that performance tuning tools should not be closely coupled to a specific architecture or to particular system software; instead, the architecture-specific components of a toolkit should be isolated and the interfaces to them clearly defined in order to enhance portability. AIMS is a toolkit for tuning and predicting the performance of message-passing programs on both tightly and loosely coupled multiprocessors. Instrumented source code is executed (with the help of a machine-specific monitor) to generate a tracefile, which is run through an intrusion compensation module before being input to various post-processing tools. The AIMS Xinstrument source code instrumentor is written using the Sage compiler frontend library and handles large scientific applications with complex directory structures that may include a combination of C and Fortran modules. The instrumentor uses Sage flow analysis to support tracking of data structure movements. The performance monitoring library is responsible for collecting data, flushing traced events to a tracefile, and generating a data structure lookup table used by the post-processing tools to present performance information in terms of source-level data structures.

G.A. Geist, James Kohl, and Philip Papadopoulos. Visualization, Debugging, and Performance in PVM. In Debugging and Performance Tuning for Parallel Computing Systems, ed. Simmons, Hayes, Brown, and Reed, IEEE Computer Society Press, 1996.
XPVM is a graphical user interface for PVM that includes a control console, a real-time performance monitor, a call-level debugger, and post-mortem analysis. A space-time view represents each PVM task by a bar, with different colors representing different states, and represents messages as arrows between bars. The utilization view is a graph of the number of tasks versus time and shows the number of tasks computing, communicating, and idle at each moment. The tracefile written by XPVM is in SDDF format and thus may be interpreted by other performance analysis tools such as Pablo. XPVM is written in Tcl/Tk, which makes it easy to change, customize, and extend the PVM interface. A PVM debugger interface has been developed so that debugger developers can integrate their products into the PVM environment without modifying the PVM source. A debugger can register with PVM as the task starter and controller for a host, in which case PVM defers all task spawning requests to the debugger. Debuggers can also attach to running PVM tasks.

Krishna Kunchithapadam and Barton P. Miller. Integrating a Debugger and a Performance Tool for Steering. In Debugging and Performance Tuning for Parallel Computing Systems, ed. Simmons, Hayes, Brown, and Reed, IEEE Computer Society Press, 1996.
Steering is the optimization and control of a program's performance using mechanisms external to the program. This paper presents the design of a generic steering tool that uses mechanisms already present in some performance measurement tools and in symbolic breakpoint debuggers. The programmer uses the results of performance analysis to decide on an optimization; the steering tool then invokes a symbolic debugger to allow the programmer to modify the program to effect the steering changes. Examples of steering include loop convergence control in iterative numerical computations, adjusting load balance and degree of parallelism, array redistribution, and changing algorithms (e.g., by loading different dynamic libraries). The steering configuration described in this paper uses the Paradyn tool, which uses dynamic instrumentation together with an automated search mechanism to identify performance bottlenecks, as the performance analysis tool. The steering tool can be used for either interactive or automated steering.

Peter Buhr, Martin Karsten, and Jun Shih. KDB: A Multi-threaded Debugger for Multi-threaded Applications. Proc. SPDT'96: ACM SIGMETRICS Symposium on Parallel and Distributed Tools, May 1996, Philadelphia, pp. 80--87.
The purpose of the KDB project is to demonstrate, using the uC++ shared-memory thread library, that it is possible to build powerful and flexible debugging support for user-level threads. The synchronous UNIX debugging primitives /proc and ptrace preclude independent and asynchronous control of individual threads within a UNIX process and restrict a distributed debugger to independent control of one thread per UNIX process. KDB is a concurrent debugger that runs on symmetric shared-memory multiprocessors and supports the uC++ execution environment, which shares all data and has multiple kernel- and user-level threads. Using KDB, the programmer can individually control and examine each user-level thread; alternatively, tasks can be grouped together and operations issued on the group.

Matthias Schumann. Automatic Performance Prediction to Support Cross Development of Parallel Programs. Proc. SPDT'96: ACM SIGMETRICS Symposium on Parallel and Distributed Tools, May 1996, Philadelphia, pp. 88--97.
This paper presents a performance prediction methodology for supporting cross development (e.g., from loosely coupled to tightly coupled multiprocessors) of deterministic message-passing programs. In the abstract parallel numerical machine (APNM) model, the parameters of the machine model for each system under consideration are determined automatically by a machine analyzer based on a micro benchmark. The workload parameters of the target programs are determined automatically by a static program analyzer/instrumentor tool and by a profile run or execution-driven simulation. The memory access behavior of a program is modeled by heuristics that attempt to determine the target of a memory access by examining the source code. A piecewise linear model of communication latencies accounts for discontinuities caused by packetization and for the effect of the number of participants in broadcasts and global synchronizations; the model ignores network topology and contention. The PE performance predictor tool assembles a weighted task graph representing the program's execution behavior on the target machine and performs a critical path analysis on the resulting graph. PE provides a machine-independent characterization of program execution by visualizing the execution count profile of abstract HLL instructions, and a machine-dependent prediction of execution time. Optionally, PE can generate a trace file for use with the TATOO performance analysis tool.
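
The piecewise linear latency idea can be illustrated with a short C sketch; the packet size and cost constants below are assumptions for illustration, not parameters measured by the APNM machine analyzer.

    /* Illustrative piecewise linear message-time model with a discontinuity
     * at each packet boundary; all constants are invented. */
    #include <stdio.h>

    static double message_time(int bytes)
    {
        const int    packet_size = 4096;   /* bytes per packet (assumed)       */
        const double startup     = 50e-6;  /* per-message startup, seconds     */
        const double per_packet  = 20e-6;  /* extra cost per additional packet */
        const double per_byte    = 10e-9;  /* transfer cost per byte           */
        int packets = (bytes + packet_size - 1) / packet_size;
        if (packets < 1)
            packets = 1;
        return startup + (packets - 1) * per_packet + bytes * per_byte;
    }

    int main(void)
    {
        int sizes[] = { 1, 4096, 4097, 8192, 65536 };
        for (int i = 0; i < 5; i++)
            printf("%6d bytes -> %8.1f microseconds\n",
                   sizes[i], message_time(sizes[i]) * 1e6);
        return 0;
    }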

Communication/Run-time Support

Books

Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference Volume 1, 2nd Edition. MIT Press, 1998.

William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. MPI: The Complete Reference Volume 2. MIT Press, 1998.

William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.

Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek, and Vaidy Sunderam. PVM: Parallel Virtual Machine--A User's Guide and Tutorial for Network Parallel Computing. MIT Press, 1994.

Papers

Oliver A. McBryan. An overview of message passing environments. Parallel Computing 20(4), April 1994, 417-444.

Andrew S. Grimshaw, Jon B. Weissman, and W. Timothy Strayer. Portable Run-Time Support for Dynamic Object-Oriented Parallel Processing. ACM Trans. on Computer Systems 14(2), May 1996, 139-170.
Mentat is an object-oriented parallel processing system. The Mentat compiler and run-time system work together to automatically manage the communication and synchronization between objects. This paper describes the Mentat run-time system, which marshals member function arguments, schedules objects on processors, and dynamically constructs and executes large-grain data dependence graphs.

Adam Beguelin, Jack Dongarra, Al Geist, and Vaidy Sunderam. Recent Enhancements to PVM. International Journal of Supercomputer Applications 9(2), Summer 1995.
This paper describes new features introduced in version 3 of PVM. A new communication function, pvm_psend(), combines the initialize, pack, and send steps of sending a message into a single call in order to improve performance; the complementary routine pvm_precv() combines the unpack and blocking receive steps. Another new communication function in PVM 3.3 is a timeout version of receive. Two data encoding options are available in PVM 3.3 to boost communication performance: 1) the user can tell PVM to skip the XDR encoding step if the message will only be sent to hosts with a compatible data format, and 2) the in-place option causes data to be copied directly from user memory to the network rather than being packed into a buffer. In addition to the existing broadcast to a group of tasks and barrier across a group of tasks, PVM 3.3 adds three new collective communication routines: 1) pvm_reduce, with predefined reduce functions for global max, min, sum, and product, and with support for user-defined reduce functions, 2) pvm_gather, and 3) pvm_scatter. New interfaces in PVM 3.3 allow PVM tasks to take over functions normally performed by the PVM daemons: starting hosts and tasks, and making scheduling decisions. These interfaces have been used in collaborative work with the Condor and Paradyn projects. New shared memory optimizations implement data movement with a shared buffer pool and lock primitives. A new tracing feature can be used to generate events for PVM calls.
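
For readers unfamiliar with the new calls, a minimal C sketch of a worker task using pvm_precv() and pvm_psend() follows. It assumes a running PVM 3.3 virtual machine and a parent task (not shown) that sends 100 doubles with message tag 1 and expects their sum back with tag 2.

    /* Minimal sketch (not from the paper) of the PVM 3.3 pack-free calls.
     * Assumes this task was spawned by a parent that sends 100 doubles. */
    #include <stdio.h>
    #include <pvm3.h>

    int main(void)
    {
        double work[100], sum = 0.0;
        int parent = pvm_parent();
        int atid, atag, alen;   /* actual source, tag, and length come back here */

        /* Combined blocking receive + unpack: no pvm_recv/pvm_upkdouble needed. */
        pvm_precv(parent, 1, (char *)work, 100, PVM_DOUBLE, &atid, &atag, &alen);

        for (int i = 0; i < 100; i++)
            sum += work[i];

        /* Combined initsend + pack + send in a single call. */
        pvm_psend(parent, 2, (char *)&sum, 1, PVM_DOUBLE);

        pvm_exit();
        return 0;
    }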

Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI.
Trollius is a multicomputer operating environment that runs natively on dedicated parallel machine nodes (In The Box, or ITB, nodes) and as a Unix daemon on general purpose (Out of The Box, or OTB) nodes. LAM is a subset of Trollius that runs only on a network of UNIX machines acting as OTB nodes. LAM has been used to implement PVM, MPI, and a tuple-based tool (although some features of PVM were not implemented). LAM has a powerful message buffering system that supports non-blocking message send, overlapped communication, and debugging. Processes can organize into groups, and group membership is dynamic. A full range of collective communication functions is available. LAM supports a loosely synchronous parallel I/O access method based on Cubix. A separate paper by Nick Nevin (OSC-TR-1996-4) compares the performance of LAM 6.0 and MPICH 1.0.12 on a workstation cluster for a suite of six benchmark programs: ping, ping-pong, barrier, broadcast, gather, and alltoall. LAM 6.0 is faster on all benchmarks for messages up to 8192 bytes in length, which is where LAM switches over to a long-message protocol (the MPICH default is to change protocols at 16384 bytes).
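
The ping-pong benchmark mentioned above is easy to sketch against the standard MPI interface (nothing LAM- or MPICH-specific is used); the message length and repetition count below are arbitrary choices for illustration. Run with at least two processes, e.g., mpirun -np 2.

    /* Minimal MPI ping-pong timing sketch between ranks 0 and 1. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const int len = 8192, reps = 1000;
        char buf[8192] = { 0 };   /* message contents do not matter for timing */
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("average round-trip time for %d bytes: %g seconds\n",
                   len, (t1 - t0) / reps);
        MPI_Finalize();
        return 0;
    }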

Tom Loos and Randall Bramley. MPI Performance Comparison on Distributed and Shared Memory Machines. MPI Developers Conference, University of Notre Dame, July 1996.

Compilers, Languages, Programming Models

Books

Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996. ISBN 0-8053-2730-4
See http://www.cse.ogi.edu/Sparse/book.html for more information.

Hans Zima and Barbara Chapman. Supercompilers for Parallel and Vector Computers. ACM Press, 1991. ISBN 0-201-17560-6.
See http://www.acm.org/catalog/books/homepage.html for more information.

Utpal Banerjee.
See Kluwer Academic Publishers for more information.

Papers

I. Foster, D. Kohn, R. Krishnaiyer, and A. Choudhary. Double Standards: Bringing Task Parallelism to HPF via the Message Passing Interface. Proc. Supercomputing '96.
This paper describes the HPF/MPI library (an HPF binding for MPI), a coordination library that allows the expression of mixed task/data-parallel computations and the coupling of separately compiled data-parallel modules. The HPF/MPI library supports an execution model in which disjoint process groups (corresponding to data-parallel tasks) interact with each other by calling group-oriented communication functions. This approach is in contrast to compiler-based approaches that attempt to identify task-parallel structures automatically and to language-based approaches that provide new language constructs for specifying task parallelism explicitly. The HPF/MPI library includes techniques for determining data distribution information and for communicating distributed data structures efficiently from sender to receiver. Results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. The HPF/MPI prototype uses the HPF compiler produced by the Portland Group, Inc. and the MPICH implementation of MPI.

David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler Transformations for High-Performance Computing. ACM Computing Surveys 26(4), December 1994, 345-420.
This survey describes transformations that optimize programs written in imperative languages such as Fortran and C for high-performance architectures, including superscalar, vector, and various classes of multiprocessor machines. Optimizations for high-performance machines maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis. Communication optimizations for distributed memory machines include message vectorization and aggregation, collective communication, overlapping communication and computation, and elimination of redundant communication.
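
As a small example of the kind of locality transformation the survey covers (not an example taken from it), loop interchange turns a strided traversal of a C array into a unit-stride one:

    /* Loop interchange for cache locality: the second nest walks the array
     * in row-major order, which is how C stores it. */
    #define N 1024
    double a[N][N];

    void column_order_traversal(void)      /* before: strided accesses */
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = a[i][j] * 2.0;
    }

    void row_order_traversal(void)         /* after loop interchange */
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = a[i][j] * 2.0;
    }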

B. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview, and perspective. J. Supercomputing 7(1/2), May 1993, 9-50.

Banerjee et al. The Paradigm Compiler for Distributed Memory Multicomputers. IEEE Computer 28(10), October 1995, 37-47.

Guy E. Blelloch. Programming Parallel Algorithms. Communications of the ACM 39(3), March 1996, 85-97.

S.P. Johnson, C.S. Ierotheou, and M. Cross. Automatic parallel code generation for message passing on distributed memory systems. Parallel Computing 22(2), February 1996, 227-258.
This paper describes algorithms and a toolkit (CAPTools) for automating as much of the parallelization process as possible for mesh-based numerical codes, with a focus on structured-mesh computational mechanics software, including for example CFD, heat transfer, structural analysis, electromagnetics, and acoustics. The authors argue that to satisfy the needs of application code authors, parallelization should commence from scalar code and be mainly automated; user interaction is considered essential for producing a very accurate dependence analysis of the code. The authors contrast their approach with that of High Performance Fortran (HPF) and describe a number of factors that inhibit the success of HPF compilers. A companion article on CAPTools (in the same issue of Parallel Computing) gives performance results for a 2-D diffusion code, the APPLU code from the NAS parallel benchmark suite, the TEAMKE1 2-D CFD code, and a 3-D CFD code. The authors claim that use of CAPTools allows the parallelization process to be completed in a fraction of the time required for a manual effort while achieving parallel code as efficient as that achieved by hand. Other companion articles (also in the same issue of Parallel Computing) describe the interprocedural dependence analysis techniques used in CAPTools and the CAPTools user interface.

Joy Reed, Kevin Parrott, and Tim Lanfear. Portability, predictability and performance for parallel computing: BSP in practice. November 1995.
This paper reports on practical experience using the Oxford BSP Library to parallelize a large electromagnetics code, the British Aerospace finite-difference time-domain code EMMA T:FD3D. The Oxford BSP Library is an implementation of the Bulk Synchronous Parallel computational model targeted at numerically intensive scientific (typically Fortran) computing. BSP cost-modeling techniques can be used to predict and optimize performance for single-source programs across different parallel platforms. Predicted and observed performance results are given for the BAe EMMA code on an 8-processor IBM SP1, a 4-processor SGI Power Challenge L, and a 128-processor Cray T3D. The authors claim that the BSP approach can achieve good application performance for numerically intensive codes across a complete range of parallel architectures with only a single maintainable code.
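
For readers unfamiliar with BSP cost modeling, the standard per-superstep cost formula is w + h*g + l, where w is the maximum local work, h the maximum number of words sent or received by any process, g the per-word communication cost, and l the barrier synchronization cost. The C sketch below evaluates this formula with invented machine parameters; it is not the Oxford BSP Library API, and the numbers are not from the paper.

    /* Evaluate the standard BSP superstep cost w + h*g + l with assumed
     * machine parameters (expressed in floating-point-operation units). */
    #include <stdio.h>

    static double superstep_cost(double w, double h, double g, double l)
    {
        return w + h * g + l;   /* local work + communication + barrier */
    }

    int main(void)
    {
        double g = 5.0;      /* cost per word communicated (assumed)  */
        double l = 2000.0;   /* barrier synchronization cost (assumed) */
        printf("cost = %.0f flop-equivalents\n",
               superstep_cost(1.0e6, 10000.0, g, l));
        return 0;
    }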

Cherri Pancake. Is Parallelism for You? IEEE Computational Science & Engineering 3(2), Summer 1996, 18-37.
This article gives a good introduction to parallel machine architectures, parallel programming models, and problem "architectures", and how the three match up. The article gives guidelines, or "rules of thumb", for figuring out which problem category an application fits into and whether the effort of parallelizing the application will be worth the cost. The author discusses how to measure a serial program's potential parallel content and calculate its theoretical speedup, including the effect of problem size on theoretical speedup. Factors that can cause observed speedup to differ from theoretical speedup are also discussed. To illustrate the rules of thumb, the article includes an example of analyzing and parallelizing a volume renderer application.
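
The theoretical-speedup calculation the article describes is essentially the familiar Amdahl's-law computation sketched below; the 5% serial fraction is an invented figure for illustration, not one taken from the article's renderer example.

    /* Amdahl's law: speedup = 1 / (f + (1 - f)/p), where f is the fraction
     * of the run time that cannot be parallelized. */
    #include <stdio.h>

    static double amdahl_speedup(double serial_fraction, int p)
    {
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p);
    }

    int main(void)
    {
        double f = 0.05;                      /* assume 5% serial content */
        int procs[] = { 2, 8, 32, 128 };
        for (int i = 0; i < 4; i++)
            printf("p = %3d  theoretical speedup = %5.2f\n",
                   procs[i], amdahl_speedup(f, procs[i]));
        return 0;
    }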

Architecture

Papers

Rosenblum, Chapin, Teodosiu, Devine, Lahiri, and Gupta. Implementing Efficient Fault Containment for Multiprocessors: Confining Faults in a Shared-Memory Multiprocessor Environment.
The performance advantages of future large-scale shared-memory multiprocessors may be offset by reliability problems if they run current operating systems. With current multiprocessor operating systems, the entire machine must be rebooted when a nontrivial hardware or software fault occurs. The effect is to terminate all applications even though the initial fault may have affected only a few of the applications. According to the authors, the key to improving the reliability of shared-memory multiprocessors is to provide fault containment. This article describes the authors' experience with implementing efficient fault containment in a prototype operating system called Hive. Hive is a version of Unix that is targeted to run on the Stanford FLASH multiprocessor. See also http://www-flash.stanford.edu/Hive/

Wood and Hill. Cost-Effective Parallel Computing. IEEE Computer, February 1995.
The authors' thesis is that a parallel system provides better cost/performance than a uniprocessor whenever speedup exceeds costup, where costup is the parallel system cost divided by the uniprocessor cost. When applications have large memory requirements (e.g., 512 Mbytes), the costup can be far less than linear, because parallelizing a job on p processors rarely multiplies the memory requirements by p. Thus, when memory cost is a significant fraction of system cost, parallel systems can be cost-effective at modest speedups.
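
A small worked example with invented prices shows how the criterion plays out: when memory dominates system cost, adding processors raises total cost only modestly.

    /* Speedup-versus-costup example; the component prices are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double cpu_cost = 5000.0;     /* per processor (assumed)            */
        double mem_cost = 20000.0;    /* 512 Mbytes, shared (assumed)       */
        double uni  = 1 * cpu_cost + mem_cost;
        double para = 4 * cpu_cost + mem_cost;   /* 4-processor system */
        double costup = para / uni;              /* 40000 / 25000 = 1.6 */
        printf("costup = %.2f; the 4-processor machine is cost-effective\n"
               "for any application that achieves speedup > %.2f\n",
               costup, costup);
        return 0;
    }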

Heinlein, Gharachorloo, Dresser, and Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 38-50, San Jose, CA, October 1994.
This paper presents hardware and software mechanisms in FLASH that support various message passing protocols and handle the interaction of the messaging protocols with virtual memory, protected multiprogramming, and cache coherence. The FLASH custom programmable node controller, called MAGIC (Memory and General Interconnect Controller), can be programmed to implement both cache coherence and message passing protocols. Message passing protocols are run on the controllers to keep the overhead on the compute processors low. Message passing latency and overhead are further reduced by providing protected user-level access to the programmable controllers. Data are transferred as a series of cache lines. The flexibility of FLASH allows support for either local or global cache coherence and for the interaction of message data with cache coherence. The MAGIC chip can also be used to support active-message-style operations that require computation to be invoked at remote nodes. Simulation studies indicate that the system can sustain message transfer rates of several hundred megabytes per second.

Holt, Heinrich, Singh, Rothberg, and Hennessy. Technical Report CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.

Chandra, Gharachorloo, Soundararajan, and Gupta. Performance Evaluation of Hybrid Hardware and Software Distributed Shared Memory Protocols. Proceedings of the 8th ACM International Conference on Supercomputing, pp. 274-288, Manchester, England, July 1994.

Parallel I/O

David Kotz. Applications of Parallel I/O. Dartmouth Technical Report PCS-TR96-297
The bibliography for this technical report is also available in BibTeX and HTML, with links to many of the cited papers.

P. Brezany and A. Choudhary. Techniques and Optimizations for Developing Irregular Out-of-Core Applications for Distributed-Memory Systems. Institute for Software Technology and Parallel Systems, University of Vienna, TR 96-4. November 1996.
This paper presents techniques for implementing out-of-core irregular problems. In particular, it presents the design of a runtime system to efficiently support out-of-core irregular problems and describes the transformations required to reduce the I/O overheads for staging data as well as for communication. Several optimizations are described for each step of the parallelization process. The proposed techniques can be used by a compiler that compiles a global name space program (e.g., written in HPF and its extensions) or by users writing programs in a node+message-passing style. The authors first describe the runtime support and the transformations for a restricted version of the problem, in which only part of the data (the large data structures) is assumed to be out-of-core, and then generalize these techniques to the case in which all the major datasets of an application are out-of-core. The main objective of the proposed techniques is to minimize I/O accesses in all steps while maintaining load balance and minimal communication. Experimental results from a template CFD code demonstrate the efficiency of the presented techniques.

ptlib_maintainers@nhse.org