You are previewing Introduction to Parallel Computing, Second Edition.
O'Reilly logo
Introduction to Parallel Computing, Second Edition

Book Description

Introducation to Parallel Computing is a complete end-to-end source of information on almost all aspects of parallel computing from introduction to architectures to programming paradigms to algorithms to programming standards. It is the only book to have complete coverage of traditional Computer Science algorithms (sorting, graph and matrix algorithms), scientific computing algorithms (FFT, sparse matrix computations, N-body methods), and data intensive algorithms (search, dynamic programming, data-mining).

Table of Contents

  1. Copyright
    1. Dedication
  2. Pearson Education
  3. Preface
  4. Acknowledgments
  5. 1. Introduction to Parallel Computing
    1. 1.1. Motivating Parallelism
      1. 1.1.1. The Computational Power Argument – from Transistors to FLOPS
      2. 1.1.2. The Memory/Disk Speed Argument
      3. 1.1.3. The Data Communication Argument
    2. 1.2. Scope of Parallel Computing
      1. 1.2.1. Applications in Engineering and Design
      2. 1.2.2. Scientific Applications
      3. 1.2.3. Commercial Applications
      4. 1.2.4. Applications in Computer Systems
    3. 1.3. Organization and Contents of the Text
    4. 1.4. Bibliographic Remarks
    5. Problems
  6. 2. Parallel Programming Platforms
    1. 2.1. Implicit Parallelism: Trends in Microprocessor Architectures*
      1. 2.1.1. Pipelining and Superscalar Execution
      2. 2.1.2. Very Long Instruction Word Processors
    2. 2.2. Limitations of Memory System Performance*
      1. 2.2.1. Improving Effective Memory Latency Using Caches
      2. 2.2.2. Impact of Memory Bandwidth
      3. 2.2.3. Alternate Approaches for Hiding Memory Latency
        1. Multithreading for Latency Hiding
        2. Prefetching for Latency Hiding
      4. 2.2.4. Tradeoffs of Multithreading and Prefetching
    3. 2.3. Dichotomy of Parallel Computing Platforms
      1. 2.3.1. Control Structure of Parallel Platforms
      2. 2.3.2. Communication Model of Parallel Platforms
        1. Shared-Address-Space Platforms
        2. Message-Passing Platforms
    4. 2.4. Physical Organization of Parallel Platforms
      1. 2.4.1. Architecture of an Ideal Parallel Computer
        1. Architectural Complexity of the Ideal Model
      2. 2.4.2. Interconnection Networks for Parallel Computers
      3. 2.4.3. Network Topologies
        1. Bus-Based Networks
        2. Crossbar Networks
        3. Multistage Networks
        4. Completely-Connected Network
        5. Star-Connected Network
        6. Linear Arrays, Meshes, and k-d Meshes
        7. Tree-Based Networks
      4. 2.4.4. Evaluating Static Interconnection Networks
      5. 2.4.5. Evaluating Dynamic Interconnection Networks
      6. 2.4.6. Cache Coherence in Multiprocessor Systems
        1. Snoopy Cache Systems
        2. Directory Based Systems
    5. 2.5. Communication Costs in Parallel Machines
      1. 2.5.1. Message Passing Costs in Parallel Computers
        1. Store-and-Forward Routing
        2. Packet Routing
        3. Cut-Through Routing
        4. A Simplified Cost Model for Communicating Messages
      2. 2.5.2. Communication Costs in Shared-Address-Space Machines
    6. 2.6. Routing Mechanisms for Interconnection Networks
    7. 2.7. Impact of Process-Processor Mapping and Mapping Techniques
      1. 2.7.1. Mapping Techniques for Graphs
        1. Embedding a Linear Array into a Hypercube
        2. Embedding a Mesh into a Hypercube
        3. Embedding a Mesh into a Linear Array
        4. Embedding a Hypercube into a 2-D Mesh
        5. Process-Processor Mapping and Design of Interconnection Networks
      2. 2.7.2. Cost-Performance Tradeoffs
    8. 2.8. Bibliographic Remarks
    9. Problems
  7. 3. Principles of Parallel Algorithm Design
    1. 3.1. Preliminaries
      1. 3.1.1. Decomposition, Tasks, and Dependency Graphs
      2. 3.1.2. Granularity, Concurrency, and Task-Interaction
      3. 3.1.3. Processes and Mapping
      4. 3.1.4. Processes versus Processors
    2. 3.2. Decomposition Techniques
      1. 3.2.1. Recursive Decomposition
      2. 3.2.2. Data Decomposition
      3. 3.2.3. Exploratory Decomposition
      4. 3.2.4. Speculative Decomposition
      5. 3.2.5. Hybrid Decompositions
    3. 3.3. Characteristics of Tasks and Interactions
      1. 3.3.1. Characteristics of Tasks
      2. 3.3.2. Characteristics of Inter-Task Interactions
    4. 3.4. Mapping Techniques for Load Balancing
      1. 3.4.1. Schemes for Static Mapping
        1. Mappings Based on Data Partitioning
          1. Block Distributions
          2. Cyclic and Block-Cyclic Distributions
          3. Randomized Block Distributions
        2. Mappings Based on Task Partitioning
        3. Hierarchical Mappings
      2. 3.4.2. Schemes for Dynamic Mapping
        1. Centralized Schemes
        2. Distributed Schemes
        3. Suitability to Parallel Architectures
    5. 3.5. Methods for Containing Interaction Overheads
      1. 3.5.1. Maximizing Data Locality
      2. 3.5.2. Minimizing Contention and Hot Spots
      3. 3.5.3. Overlapping Computations with Interactions
      4. 3.5.4. Replicating Data or Computations
      5. 3.5.5. Using Optimized Collective Interaction Operations
      6. 3.5.6. Overlapping Interactions with Other Interactions
    6. 3.6. Parallel Algorithm Models
      1. 3.6.1. The Data-Parallel Model
      2. 3.6.2. The Task Graph Model
      3. 3.6.3. The Work Pool Model
      4. 3.6.4. The Master-Slave Model
      5. 3.6.5. The Pipeline or Producer-Consumer Model
      6. 3.6.6. Hybrid Models
    7. 3.7. Bibliographic Remarks
    8. Problems
  8. 4. Basic Communication Operations
    1. 4.1. One-to-All Broadcast and All-to-One Reduction
      1. 4.1.1. Ring or Linear Array
      2. 4.1.2. Mesh
      3. 4.1.3. Hypercube
      4. 4.1.4. Balanced Binary Tree
      5. 4.1.5. Detailed Algorithms
      6. 4.1.6. Cost Analysis
    2. 4.2. All-to-All Broadcast and Reduction
      1. 4.2.1. Linear Array and Ring
      2. 4.2.2. Mesh
      3. 4.2.3. Hypercube
      4. 4.2.4. Cost Analysis
    3. 4.3. All-Reduce and Prefix-Sum Operations
    4. 4.4. Scatter and Gather
    5. 4.5. All-to-All Personalized Communication
      1. 4.5.1. Ring
      2. 4.5.2. Mesh
      3. 4.5.3. Hypercube
        1. An Optimal Algorithm
    6. 4.6. Circular Shift
      1. 4.6.1. Mesh
      2. 4.6.2. Hypercube
    7. 4.7. Improving the Speed of Some Communication Operations
      1. 4.7.1. Splitting and Routing Messages in Parts
        1. One-to-All Broadcast
        2. All-to-One Reduction
        3. All-Reduce
      2. 4.7.2. All-Port Communication
    8. 4.8. Summary
    9. 4.9. Bibliographic Remarks
    10. Problems
  9. 5. Analytical Modeling of Parallel Programs
    1. 5.1. Sources of Overhead in Parallel Programs
    2. 5.2. Performance Metrics for Parallel Systems
      1. 5.2.1. Execution Time
      2. 5.2.2. Total Parallel Overhead
      3. 5.2.3. Speedup
      4. 5.2.4. Efficiency
      5. 5.2.5. Cost
    3. 5.3. The Effect of Granularity on Performance
    4. 5.4. Scalability of Parallel Systems
      1. 5.4.1. Scaling Characteristics of Parallel Programs
      2. 5.4.2. The Isoefficiency Metric of Scalability
        1. The Isoefficiency Function
      3. 5.4.3. Cost-Optimality and the Isoefficiency Function
      4. 5.4.4. A Lower Bound on the Isoefficiency Function
      5. 5.4.5. The Degree of Concurrency and the Isoefficiency Function
    5. 5.5. Minimum Execution Time and Minimum Cost-Optimal Execution Time
    6. 5.6. Asymptotic Analysis of Parallel Programs
    7. 5.7. Other Scalability Metrics
    8. 5.8. Bibliographic Remarks
    9. Problems
  10. 6. Programming Using the Message-Passing Paradigm
    1. 6.1. Principles of Message-Passing Programming
    2. 6.2. The Building Blocks: Send and Receive Operations
      1. 6.2.1. Blocking Message Passing Operations
        1. Blocking Non-Buffered Send/Receive
        2. Blocking Buffered Send/Receive
      2. 6.2.2. Non-Blocking Message Passing Operations
    3. 6.3. MPI: the Message Passing Interface
      1. 6.3.1. Starting and Terminating the MPI Library
      2. 6.3.2. Communicators
      3. 6.3.3. Getting Information
      4. 6.3.4. Sending and Receiving Messages
      5. 6.3.5. Example: Odd-Even Sort
    4. 6.4. Topologies and Embedding
      1. 6.4.1. Creating and Using Cartesian Topologies
      2. 6.4.2. Example: Cannon’s Matrix-Matrix Multiplication
    5. 6.5. Overlapping Communication with Computation
      1. 6.5.1. Non-Blocking Communication Operations
        1. Example: Cannon’s Matrix-Matrix Multiplication (Using Non-Blocking Operations)
    6. 6.6. Collective Communication and Computation Operations
      1. 6.6.1. Barrier
      2. 6.6.2. Broadcast
      3. 6.6.3. Reduction
      4. 6.6.4. Prefix
      5. 6.6.5. Gather
      6. 6.6.6. Scatter
      7. 6.6.7 All-to-All
      8. 6.6.8. Example: One-Dimensional Matrix-Vector Multiplication
      9. 6.6.9. Example: Single-Source Shortest-Path
      10. 6.6.10. Example: Sample Sort
    7. 6.7. Groups and Communicators
      1. 6.7.1. Example: Two-Dimensional Matrix-Vector Multiplication
    8. 6.8. Bibliographic Remarks
    9. Problems
  11. 7. Programming Shared Address Space Platforms
    1. 7.1. Thread Basics
    2. 7.2. Why Threads?
    3. 7.3. The POSIX Thread API
    4. 7.4. Thread Basics: Creation and Termination
    5. 7.5. Synchronization Primitives in Pthreads
      1. 7.5.1. Mutual Exclusion for Shared Variables
        1. Overheads of Locking
        2. Alleviating Locking Overheads
      2. 7.5.2. Condition Variables for Synchronization
    6. 7.6. Controlling Thread and Synchronization Attributes
      1. 7.6.1. Attributes Objects for Threads
      2. 7.6.2. Attributes Objects for Mutexes
    7. 7.7. Thread Cancellation
    8. 7.8. Composite Synchronization Constructs
      1. 7.8.1. Read-Write Locks
      2. 7.8.2. Barriers
    9. 7.9. Tips for Designing Asynchronous Programs
    10. 7.10. OpenMP: a Standard for Directive Based Parallel Programming
      1. 7.10.1. The OpenMP Programming Model
      2. 7.10.2. Specifying Concurrent Tasks in OpenMP
        1. The for Directive
        2. Assigning Iterations to Threads
        3. Synchronization Across Multiple for Directives
        4. The sections Directive
        5. Merging Directives
        6. Nesting parallel Directives
      3. 7.10.3. Synchronization Constructs in OpenMP
        1. Synchronization Point: The barrier Directive
        2. Single Thread Executions: The single and master Directives
        3. Critical Sections: The critical and atomic Directives
        4. In-Order Execution: The ordered Directive
        5. Memory Consistency: The flush Directive
      4. 7.10.4. Data Handling in OpenMP
      5. 7.10.5. OpenMP Library Functions
        1. Controlling Number of Threads and Processors
        2. Controlling and Monitoring Thread Creation
        3. Mutual Exclusion
      6. 7.10.6. Environment Variables in OpenMP
      7. 7.10.7. Explicit Threads versus OpenMP Based Programming
    11. 7.11. Bibliographic Remarks
    12. Problems
  12. 8. Dense Matrix Algorithms
    1. 8.1. Matrix-Vector Multiplication
      1. 8.1.1. Rowwise 1-D Partitioning
        1. One Row Per Process
        2. Using Fewer than n Processes
      2. 8.1.2. 2-D Partitioning
        1. One Element Per Process
        2. Using Fewer than n2 Processes
        3. Comparison of 1-D and 2-D Partitionings
    2. 8.2. Matrix-Matrix Multiplication
      1. 8.2.1. A Simple Parallel Algorithm
      2. 8.2.2. Cannon’s Algorithm
      3. 8.2.3. The DNS Algorithm
        1. DNS Algorithm with Fewer than n3 Processes
    3. 8.3. Solving a System of Linear Equations
      1. 8.3.1. A Simple Gaussian Elimination Algorithm
        1. Parallel Implementation with 1-D Partitioning
        2. Parallel Implementation with 2-D Partitioning
      2. 8.3.2. Gaussian Elimination with Partial Pivoting
        1. 1-D Partitioning
        2. 2-D Partitioning
      3. 8.3.3. Solving a Triangular System: Back-Substitution
      4. 8.3.4. Numerical Considerations in Solving Systems of Linear Equations
    4. 8.4. Bibliographic Remarks
    5. Problems
  13. 9. Sorting
    1. 9.1. Issues in Sorting on Parallel Computers
      1. 9.1.1. Where the Input and Output Sequences are Stored
      2. 9.1.2. How Comparisons are Performed
        1. One Element Per Process
        2. More than One Element Per Process
    2. 9.2. Sorting Networks
      1. 9.2.1. Bitonic Sort
      2. 9.2.2. Mapping Bitonic Sort to a Hypercube and a Mesh
        1. One Element Per Process
        2. A Block of Elements Per Process
    3. 9.3. Bubble Sort and its Variants
      1. 9.3.1. Odd-Even Transposition
        1. Parallel Formulation
      2. 9.3.2. Shellsort
    4. 9.4. Quicksort
      1. 9.4.1. Parallelizing Quicksort
      2. 9.4.2. Parallel Formulation for a CRCW PRAM
      3. 9.4.3. Parallel Formulation for Practical Architectures
        1. Shared-Address-Space Parallel Formulation
        2. Message-Passing Parallel Formulation
      4. 9.4.4. Pivot Selection
    5. 9.5. Bucket and Sample Sort
    6. 9.6. Other Sorting Algorithms
      1. 9.6.1. Enumeration Sort
      2. 9.6.2. Radix Sort
    7. 9.7. Bibliographic Remarks
    8. Problems
  14. 10. Graph Algorithms
    1. 10.1. Definitions and Representation
    2. 10.2. Minimum Spanning Tree: Prim’s Algorithm
      1. Parallel Formulation
    3. 10.3. Single-Source Shortest Paths: Dijkstra’s Algorithm
      1. Parallel Formulation
    4. 10.4. All-Pairs Shortest Paths
      1. 10.4.1. Dijkstra’s Algorithm
        1. Parallel Formulations
      2. 10.4.2. Floyd’s Algorithm
        1. Parallel Formulation
      3. 10.4.3. Performance Comparisons
    5. 10.5. Transitive Closure
    6. 10.6. Connected Components
      1. 10.6.1. A Depth-First Search Based Algorithm
        1. Parallel Formulation
    7. 10.7. Algorithms for Sparse Graphs
      1. 10.7.1. Finding a Maximal Independent Set
        1. Shared-Address-Space Parallel Formulation
      2. 10.7.2. Single-Source Shortest Paths
        1. Parallelization Strategy
        2. Distributed Memory Formulation
    8. 10.8. Bibliographic Remarks
    9. Problems
  15. 11. Search Algorithms for Discrete Optimization Problems
    1. 11.1. Definitions and Examples
    2. 11.2. Sequential Search Algorithms
      1. 11.2.1. Depth-First Search Algorithms
        1. Simple Backtracking
        2. Depth-First Branch-and-Bound
        3. Iterative Deepening A*
      2. 11.2.2. Best-First Search Algorithms
    3. 11.3. Search Overhead Factor
    4. 11.4. Parallel Depth-First Search
      1. 11.4.1. Important Parameters of Parallel DFS
        1. Work-Splitting Strategies
        2. Load-Balancing Schemes
      2. 11.4.2. A General Framework for Analysis of Parallel DFS
        1. Computation of V(p) for Various Load-Balancing Schemes
      3. 11.4.3. Analysis of Load-Balancing Schemes
      4. 11.4.4. Termination Detection
        1. Dijkstra’s Token Termination Detection Algorithm
        2. Tree-Based Termination Detection
      5. 11.4.5. Experimental Results
      6. 11.4.6. Parallel Formulations of Depth-First Branch-and-Bound Search
      7. 11.4.7. Parallel Formulations of IDA*
    5. 11.5. Parallel Best-First Search
      1. Communication Strategies for Parallel Best-First Tree Search
      2. Communication Strategies for Parallel Best-First Graph Search
    6. 11.6. Speedup Anomalies in Parallel Search Algorithms
      1. 11.6.1. Analysis of Average Speedup in Parallel DFS
        1. Assumptions
        2. Analysis of the Search Overhead Factor
    7. 11.7. Bibliographic Remarks
      1. Parallel Depth-First Search Algorithms
      2. Parallel Formulations of Alpha-Beta Search
      3. Parallel Best-First Search
      4. Speedup Anomalies in Parallel Formulations of Search Algorithms
      5. Role of Heuristics
    8. Problems
  16. 12. Dynamic Programming
    1. 12.1. Overview of Dynamic Programming
    2. 12.2. Serial Monadic DP Formulations
      1. 12.2.1. The Shortest-Path Problem
      2. 12.2.2. The 0/1 Knapsack Problem
    3. 12.3. Nonserial Monadic DP Formulations
      1. 12.3.1. The Longest-Common-Subsequence Problem
    4. 12.4. Serial Polyadic DP Formulations
      1. 12.4.1. Floyd’s All-Pairs Shortest-Paths Algorithm
    5. 12.5. Nonserial Polyadic DP Formulations
      1. 12.5.1. The Optimal Matrix-Parenthesization Problem
    6. 12.6. Summary and Discussion
    7. 12.7. Bibliographic Remarks
    8. Problems
  17. 13. Fast Fourier Transform
    1. 13.1. The Serial Algorithm
    2. 13.2. The Binary-Exchange Algorithm
      1. 13.2.1. A Full Bandwidth Network
        1. One Task Per Process
        2. Multiple Tasks Per Process
        3. Scalability Analysis
      2. 13.2.2. Limited Bandwidth Network
      3. 13.2.3. Extra Computations in Parallel FFT
    3. 13.3. The Transpose Algorithm
      1. 13.3.1. Two-Dimensional Transpose Algorithm
        1. Comparison with the Binary-Exchange Algorithm
      2. 13.3.2. The Generalized Transpose Algorithm
    4. 13.4. Bibliographic Remarks
    5. Problems
  18. A. Complexity of Functions and Order Analysis
    1. A.1. Complexity of Functions
    2. A.2. Order Analysis of Functions
      1. Properties of Functions Expressed in Order Notation
  19. Bibliography