You are previewing Advanced Backend Optimization.
O'Reilly logo
Advanced Backend Optimization

Book Description

This book is a summary of more than a decade of research in the area of backend optimization. It contains the latest fundamental research results in this field. While existing books are often more oriented toward Masters students, this book is aimed more towards professors and researchers as it contains more advanced subjects.

It is unique in the sense that it contains information that has not previously been covered by other books in the field, with chapters on phase ordering in optimizing compilation; register saturation in instruction level parallelism; code size reduction for software pipelining; memory hierarchy effects and instruction level parallelism.

Other chapters provide the latest research results in well-known topics such as register need, and software pipelining and periodic register allocation.

Table of Contents

  1. Cover
  2. Contents
  3. Title Page
  4. Copyright
  5. Introduction
    1. I.1. Inside this book
    2. I.2. Other contributors
    3. I.3. Basics on instruction-level parallelism processor architectures
  6. Part 1: Prolog: Optimizing Compilation
    1. 1 On the Decidability of Phase Ordering in Optimizing Compilation
      1. 1.1. Introduction to the Phase Ordering Problem
      2. 1.2. Background on Phase Ordering
      3. 1.3. Toward a Theoretical Model for the Phase Ordering Problem
      4. 1.4. Examples of Decidable Simplified Cases
      5. 1.5. Compiler Optimization Parameter Space Exploration
      6. 1.6. Conclusion on Phase Ordering in Optimizing Compilation
  7. Part 2: Instruction Scheduling
    1. 2 Instruction Scheduling Problems and Overview
      1. 2.1. Vliw Instruction Scheduling Problems
      2. 2.2. Software Pipelining
      3. 2.3. Instruction Scheduling and Register Allocation
    2. 3 Applications of Machine Scheduling to Instruction Scheduling
      1. 3.1. Advances in Machine Scheduling
      2. 3.2. List Scheduling Algorithms
      3. 3.3. Time-Indexed Scheduling Problem Formulations
    3. 4 Instruction Scheduling Before Register Allocation
      1. 4.1. Instruction Scheduling for an Ilp Processor: Case of a Vliw Architecture
      2. 4.2. Large Neighborhood Search for the Resource-Constrained Modulo Scheduling Problem
      3. 4.3. Resource-Constrained Modulo Scheduling Problem
      4. 4.4. Time-Indexed Integer Programming Formulations
      5. 4.5. Large Neighborhood Search Heuristic
      6. 4.6. Summary and Conclusions
    4. 5 Instruction Scheduling After Register Allocation
      1. 5.1. Introduction
      2. 5.2. Local Instruction Scheduling
      3. 5.3. Global Instruction Scheduling
      4. 5.4. Experimental Results
      5. 5.5. Conclusions
    5. 6 Dealing in Practice With Memory Hierarchy Effects and Instruction Level Parallelism
      1. 6.1. The Problem of Hardware Memory Disambiguation at Runtime
      2. 6.2. Data Preloading and Prefetching
  8. Part 3: Register Optimization
    1. 7 The Register Need of a Fixed Instruction Schedule
      1. 7.1. Data Dependence Graph and Processor Model for Register Optimization
      2. 7.2. The Acyclic Register Need
      3. 7.3. The Periodic Register Need
      4. 7.4. Computing the Periodic Register Need
      5. 7.5. Some Theoretical Results on the Periodic Register Need
      6. 7.6. Conclusion on the Register Requirement
    2. 8 The Register Saturation
      1. 8.1. Motivations on the Register Saturation Concept
      2. 8.2. Computing the Acyclic Register Saturation
      3. 8.3. Computing the Periodic Register Saturation
      4. 8.4. Conclusion on the Register Saturation
    3. 9 Spill Code Reduction
      1. 9.1. Introduction to Register Constraints in Software Pipelining
      2. 9.2. Related Work in Periodic Register Allocation
      3. 9.3. Sira: Schedule Independant Register Allocation
      4. 9.4. Siralina: An Efficient Polynomial Heuristic for Sira
      5. 9.5. Experimental Results With Sira
      6. 9.6. Conclusion on Spill Code Reduction
    4. 10 Exploiting the Register Access Delays Before Instruction Scheduling
      1. 10.1. Introduction
      2. 10.2. Problem Description of Ddg Circuits With Non-Positive Distances
      3. 10.3. Necessary and Sufficient Condition to Avoid Non-Positive Circuits
      4. 10.4. Application to the Sira Framework
      5. 10.5. Experimental Results on Eliminating Non-Positive Circuits
      6. 10.6. Conclusion on Non-Positive Circuit Elimination
    5. 11 Loop Unrolling Degree Minimization for Periodic Register Allocation
      1. 11.1. Introduction
      2. 11.2. Background
      3. 11.3. Problem Description of Unroll Factor Minimization for Unscheduled Loops
      4. 11.4. Algorithmic Solution for Unroll Factor Minimization: Single Register Type
      5. 11.5. Unroll Factor Minimization in the Presence of Multiple Register Types
      6. 11.6. Unroll Factor Reduction for Already Scheduled Loops
      7. 11.7. Experimental Results
      8. 11.8. Related Work
      9. 11.9. Conclusion on Loop Unroll Degree Minimization
  9. Part 4: Epilog: Performance, Open Problems
    1. 12 Statistical Performance Analysis: The Speedup-Test Protocol
      1. 12.1. Code Performance Variation
      2. 12.2. Background and Notations
      3. 12.3. Analyzing the Statistical Significance of the Observed Speedups
      4. 12.4. The Speedup-Test Software
      5. 12.5. Evaluating the Proportion of Accelerated Benchmarks by a Confidence Interval
      6. 12.6. Experiments and Applications
      7. 12.7. Related Work
      8. 12.8. Discussion and Conclusion on the Speedup-Test Protocol
  10. Conclusion
  11. Appendix 1: Presentation of the Benchmarks Used in Our Experiments
    1. A1.1. Qualitative benchmarks presentation
    2. A1.2. Quantitative benchmarks presentation
    3. A1.3. Changing the architectural configuration of the processor
  12. Appendix 2: Register Saturation Computation on Stand-Alone Ddg
    1. A2.1. The cyclic register saturation
    2. A2.2. The periodic register saturation
  13. Appendix 3: Efficiency Of Sira on the Benchmarks
    1. A3.1. Efficiency of SIRALINA on stand-alone DDG
    2. A3.2. Efficiency of SIRALINA plugged with an industrial compiler
  14. Appendix 4: Efficiency of Non-Positive Circuit Elimination in the Sira Framework
    1. A4.1. Experimental setup
    2. A4.2. Comparison between the heuristics execution times
    3. A4.3. Convergence of the proactive heuristic (iterative SIRALINA)
    4. A4.4. Qualitative analysis of the heuristics
    5. A4.5. Conclusion on non-positive circuit elimination strategy
  15. Appendix 5: Loop Unroll Degree Minimization: Experimental Results
    1. A5.1. Stand-alone experiments with single register types
    2. A5.2. Experiments with multiple register types
  16. Appendix 6: Experimental Efficiency of Software Data Preloading and Prefetching for Embedded Vliw
  17. Appendix 7: Appendix of the Speedup-Test Protocol
    1. A7.1. Why is the observed minimal execution time not necessarily a good statistical estimation of program performances?
    2. A7.2. Hypothesis testing in statistical and probability theory
    3. A7.3. What is a reasonable large sample? Observing the central limit theorem in practice
  18. Bibliography
  19. Lists of Figures, Tables and Algorithms
    1. LIST OF FIGURES
    2. LIST OF TABLES
    3. LIST OF ALGORITHMS
  20. Index