HPC-Netlib Recommended Reading

Numerical linear algebra

General

Books

Gallivan, Heath, Ng, Ortega, Peyton, Plemmons, Romine, Sameh, and Voigt. Parallel Algorithms for Matrix Computations. SIAM, 1990.

Dongarra, Duff, Sorensen, and Van der Vorst. Solving Linear Systems on Vector and Shared Memory Computers. SIAM, 1991.

Papers

Demmel, Heath, and Van der Vorst. Parallel Numerical Linear Algebra. Acta Numerica, 1993.

Dense linear algebra

Papers

Jack J. Dongarra and David W. Walker. Software Libraries for Linear Algebra Computation on High-Performance Computers. ORNL TM-12404, August 1993.
This paper discusses the design of linear algebra libraries for high performance computers. Particular emphasis is placed on the development of scalable algorithms for MIMD distributed memory concurrent computers. A brief description of the EISPACK, LINPACK, and LAPACK libraries is given, followed by an outline of ScaLAPACK, which is a distributed memory version of LAPACK currently under development. The importance of block-partitioned algorithms in reducing the frequency of data movement between different levels of hierarchical memory is stressed. The use of such algorithms helps reduce the message startup costs on distributed memory concurrent computers. Other key ideas in our approach are the use of distributed versions of the Level 3 Basic Linear Algebra Subprograms (BLAS) as computational building blocks, and the use of Basic Linear Algebra Communication Subprograms (BLACS) as communication building blocks. Together the distributed BLAS and the BLACS can be used to construct higher-level algorithms, and hide many details of the parallelism from the application developer. The block-cyclic data distribution is described, and adopted as a good way of distributing block-partitioned matrices. Block-partitioned versions of the Cholesky and LU factorizations are presented, and optimization issues associated with the implementation of the LU factorization algorithm on distributed memory concurrent computers are discussed, together with its performance on the Intel Delta system. Finally, approaches to the design of library interfaces are reviewed.
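
To make the block-cyclic distribution mentioned above concrete, here is a minimal sketch in Python; the function name and parameters are illustrative and are not part of the ScaLAPACK or BLACS interfaces.

    # Sketch of a 2-D block-cyclic data distribution: block (i, j) of a
    # block-partitioned matrix is assigned to process (i mod P, j mod Q)
    # on a P x Q process grid.  Names are illustrative only.
    def owning_process(i_block, j_block, p_rows, p_cols):
        return (i_block % p_rows, j_block % p_cols)

    # Example: a 6 x 6 grid of blocks on a 2 x 3 process grid.
    for i in range(6):
        print([owning_process(i, j, 2, 3) for j in range(6)])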

Sparse linear algebra

Books

Direct methods

I.S. Duff, A.M. Erisman, and J.K. Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, 1986.

Iterative methods

Yousef Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, 1996. ISBN 0-534-94776-X

Barrett, Berry, Chan, Demmel, Donato, Dongarra, Eijkhout, Pozo, Romine, and Van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd edition. SIAM, 1994.

Papers

Direct methods

Iterative methods

Willi Schonauer and Rudiger Weiss. An engineering approach to generalized conjugate gradient methods and beyond. Applied Numerical Mathematics 19:3 (Dec 1995) 175-206.
The purpose of this paper is to give nonexperts a simple introduction to generalized conjugate gradient methods. The methods surveyed include ORTHORES-based methods, residual- and error-minimizing methods (such as GMRES), biconjugate gradient-based methods, and methods based on the normal equations. Practical aspects such as normalization and smoothing are discussed. For a model of the Navier-Stokes equations that varies between strong and weak ellipticity, the methods are compared and their typical properties are demonstrated.
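
For readers new to the area, a textbook (unpreconditioned) conjugate gradient iteration is sketched below in Python/NumPy for symmetric positive definite systems; it is only background for the generalized methods the paper surveys, not an implementation of them.

    # Classical conjugate gradient for a symmetric positive definite matrix A.
    # Illustrative sketch; tolerance and iteration limit are arbitrary choices.
    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        x = np.zeros_like(b)
        r = b - A @ x                    # initial residual
        p = r.copy()                     # initial search direction
        rs = r @ r
        for _ in range(maxit):
            Ap = A @ p
            alpha = rs / (p @ Ap)        # step length
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p    # new search direction
            rs = rs_new
        return x

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(cg(A, b))                      # agrees with np.linalg.solve(A, b)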

Rudiger Weiss. A theoretical overview of Krylov subspace methods. Applied Numerical Mathematics 19:3 (Dec 1995) 207-233.
This paper surveys Krylov subspace methods for the solution of linear systems with a focus on commonly used and recently developed methods. Convergence results are derived from a general theoretical framework and compiled and analyzed.

Gerard L.G. Sleijpen and Henk A. van der Vorst. An overview of approaches for the stable computation of hybrid BiCG methods. Applied Numerical Mathematics 19:3 (Dec 1995) 235-254.
BiCG can be adapted so that operations with A^T are avoided, and hybrid methods with computational complexity comparable to that of BiCG can be constructed to improve the convergence behavior. Examples of this are CGS, Bi-CGSTAB, and BiCGstab(l). In many applications, the speed of convergence of these methods depends strongly on the incorporated BiCG process, and the accuracy of the BiCG iteration coefficients depends on the particular choice of hybrid method. This paper discusses the accuracy of these coefficients and how it affects the speed of convergence, and shows that hybrid methods exist with better accuracy properties. The paper also discusses look-ahead strategies for determining appropriate values of l in BiCGstab(l).
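
As a point of reference, a plain Bi-CGSTAB iteration (one of the hybrid methods named above) is sketched in Python/NumPy below; the naive formulation shown here does not address the coefficient-accuracy issues that are the subject of the paper.

    # Bi-CGSTAB sketch for a general (nonsymmetric) matrix A, avoiding
    # products with A^T.  Variable names follow common textbook usage.
    import numpy as np

    def bicgstab(A, b, tol=1e-10, maxit=1000):
        x = np.zeros_like(b)
        r = b - A @ x
        r_hat = r.copy()                         # fixed shadow residual
        rho_old, alpha, omega = 1.0, 1.0, 1.0
        v = np.zeros_like(b)
        p = np.zeros_like(b)
        for _ in range(maxit):
            rho = r_hat @ r
            beta = (rho / rho_old) * (alpha / omega)
            p = r + beta * (p - omega * v)
            v = A @ p
            alpha = rho / (r_hat @ v)
            s = r - alpha * v                    # intermediate residual
            t = A @ s
            omega = (t @ s) / (t @ t)            # local residual minimization
            x = x + alpha * p + omega * s
            r = s - omega * t
            if np.linalg.norm(r) < tol:
                break
            rho_old = rho
        return x

    A = np.array([[3.0, 2.0], [1.0, 4.0]])       # small nonsymmetric test
    b = np.array([1.0, 2.0])
    print(bicgstab(A, b))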

Anshul Gupta, Vipin Kumar, and Ahmed Sameh. Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers. IEEE Transactions on Parallel and Distributed Systems, 1995. Earlier version: University of Minnesota Tech Report TR 92-64, Nov. 1992 (ftp://ftp.cs.umn.edu/users/kumar/conjugate-gradient.ps).

R.R.P. van Nooyen, C. Vuik, and P. Wesseling. Some parallelizable preconditioners for GMRES. Technische Universiteit Delft, Tech Report 96-50.

Randall Bramley and Vladimir Menkov. Low Rank Off-Diagonal Block Preconditioners for Solving Sparse Linear Systems on Parallel Computers. Department of Computer Science, Indiana University - Bloomington.

Graph and Mesh Generation, Refinement, and Partitioning

Papers

C. Walshaw, M. Cross, and M. Everett. Parallel Partitioning of Unstructured Meshes. Proc. Parallel CFD '96.
A method is outlined for optimising graph partitions which arise in mapping unstructured mesh calculations to parallel computers. The method employs a combination of iterative techniques to both evenly balance the workload and minimise the number and volume of interprocessor communications. The method is designed to work efficiently in parallel as well as sequentially. When combined with a fast direct partitioning technique (such as the Greedy algorithm) to give an initial partition, the resulting two-stage process proves itself to be both a powerful and flexible solution to the static graph-partitioning problem. The algorithms can also be used for dynamic load-balancing, and a clustering technique can additionally be employed to speed up the whole process. Experiments indicate that the resulting parallel code can provide high quality partitions, independent of the initial partition, within a few seconds.
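
As a rough illustration of the kind of fast direct technique used for the initial partition, a breadth-first "greedy growing" bisection is sketched in Python below; the Greedy algorithm referred to in the paper may differ in its details.

    # Greedy graph-growing bisection: grow one part breadth-first from a
    # seed vertex until it contains half the vertices.  Illustrative only.
    from collections import deque

    def greedy_bisect(adj, seed=0):
        """adj maps each vertex to a list of its neighbours."""
        target = len(adj) // 2
        part0, queue = set(), deque([seed])
        while queue and len(part0) < target:
            v = queue.popleft()
            if v in part0:
                continue
            part0.add(v)
            queue.extend(n for n in adj[v] if n not in part0)
        return part0, set(adj) - part0

    # A 2 x 3 grid graph, numbered row by row.
    adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
           3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
    print(greedy_bisect(adj))            # e.g. ({0, 1, 3}, {2, 4, 5})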

C. Walshaw and M. Berzins. Dynamic load-balancing for PDE solvers on adaptive unstructured meshes. Concurrency: Practice and Experience 7(1):17-28, 1995.
Modern PDE solvers written for time-dependent problems increasingly employ adaptive unstructured meshes in order to both increase efficiency and control the numerical error. If a distributed memory parallel computer is to be used, there arises the significant problem of dividing the domain equally amongst the processors whilst minimising the inter-subdomain dependencies. A number of graph-based algorithms have recently been proposed for steady state calculations. This paper considers an extension to such methods which renders them more suitable for time-dependent problems in which the mesh may be changed frequently.

Scott R. Kohn and Scott B. Baden. A Parallel Software Infrastructure for Structured Adaptive Mesh Methods. Proc. Supercomputing '95, San Diego CA, December 1995.
Structured adaptive mesh algorithms dynamically allocate computational resources to accurately resolve interesting portions of a numerical calculation. Such methods are difficult to implement and parallelize because they rely on dynamic, irregular data structures. We have developed an efficient, portable, parallel software infrastructure for adaptive mesh methods; our software provides computational scientists with high-level facilities that hide low-level details of parallelism and resource management. We have applied our software infrastructure to the solution of adaptive eigenvalue problems arising in materials design. We describe our software infrastructure and analyze its performance. We also present computational results which indicate that the uniformity restrictions imposed by a data parallel Fortran implementation of a structured adaptive mesh application would significantly impact performance.
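
The basic idea of concentrating work where it is needed can be illustrated with a toy one-dimensional example (in Python/NumPy); the structured, block-based refinement and the parallel infrastructure described in the paper are far more elaborate.

    # Toy 1-D adaptive refinement: flag cells whose crude error indicator
    # (the jump in the solution between neighbouring cells) is large, and
    # subdivide only those cells.  Illustrative sketch only.
    import numpy as np

    def refine(x, u, tol=0.1):
        indicator = np.abs(np.diff(u))
        new_x = [x[0]]
        for i, flagged in enumerate(indicator > tol):
            if flagged:
                new_x.append(0.5 * (x[i] + x[i + 1]))   # insert a midpoint
            new_x.append(x[i + 1])
        return np.array(new_x)

    x = np.linspace(0.0, 1.0, 11)
    u = np.tanh(20.0 * (x - 0.5))        # sharp layer near x = 0.5
    print(refine(x, u))                  # extra points appear near the layer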

Partial differential equations

Books

Smith, Bjorstad, and Gropp. Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, 1996. ISBN 0-521-49589-X

Optimization

Books

Jorge J. Moré and Stephen J. Wright. Optimization Software Guide. SIAM Publications, 1993.

Pascal Van Hentenryck. The OPL Optimization Programming Language. MIT Press, 1999.

Papers

Robert B. Schnabel. A view of the limitations, opportunities, and challenges in parallel nonlinear optimization. Parallel Computing 21(6), June 1995, 875-905.
This paper discusses how consideration of parallelism is affecting and is likely to affect the field of nonlinear optimization. It presents a set of examples that illustrate limitations, opportunities, and challenges. The examples include parallel methods for unconstrained optimization problems with a small to moderate number of variables, parallel methods for large block bordered systems of nonlinear equations, and parallel methods for small-scale and large-scale global optimization problems.

Brett M. Averick and Jorge J. Moré. Evaluation of Large-Scale Optimization Problems on Vector and Parallel Architectures.
This research uses limited memory variable metric algorithms to illustrate performance issues for large-scale optimization problems. The results indicate that the cost of evaluating functions and gradients dominates the overall solution time. For a vector architecture, vectorization of function and gradient evaluation using partitioning techniques which are applicable to any partially separable function improves computing times by at least a factor of 10. For distributed memory architectures, the authors show that it is possible to develop algorithms for function and gradient evaluation that are scalable in the sense that as the number of variables increases and the number of processors increases, the computing time remains reasonably constant.
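
To make "partially separable" concrete, the sketch below (Python/NumPy) evaluates a function that is a sum of element functions, each depending on only two consecutive variables; because the element evaluations are independent, the function and gradient computation vectorises naturally. The partitioning schemes studied in the paper are more general than this hand-coded example.

    # A partially separable function: f(x) = sum_i f_i(x_i, x_{i+1}),
    # here the extended Rosenbrock function.  Element contributions are
    # computed independently and scattered into the gradient.
    import numpy as np

    def f_and_grad(x):
        d = x[1:] - x[:-1] ** 2                          # element residuals
        f = np.sum(100.0 * d ** 2 + (1.0 - x[:-1]) ** 2)
        g = np.zeros_like(x)
        g[:-1] += -400.0 * x[:-1] * d - 2.0 * (1.0 - x[:-1])
        g[1:] += 200.0 * d
        return f, g

    x = np.full(1000, 0.5)
    f, g = f_and_grad(x)
    print(f, np.linalg.norm(g))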

hpc-netlib@netlib.org