Intel Xeon Phi Coprocessor High Performance Programming

Book Description

Authors Jim Jeffers and James Reinders spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel Xeon Phi coprocessor. They have distilled their own experiences, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative first book on the essentials of programming for this new architecture and these new products.

This book is useful even before you ever touch a system with an Intel Xeon Phi coprocessor. To ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.

    • A practical guide to the essentials of the Intel Xeon Phi coprocessor
    • Presents best practices for portable, high-performance computing and a familiar and proven threaded, scalar-vector programming model
    • Includes simple but informative code examples that explain the unique aspects of this new, highly parallel, high-performance computational product
    • Covers wide vectors, many cores, many threads, and high-bandwidth cache/memory architecture

Table of Contents

  1. Cover image
  2. Title page
  3. Table of Contents
  4. Copyright
  5. Foreword
  6. Preface
    1. Organization
  7. Acknowledgements
  8. Chapter 1. Introduction
    1. Trend: more parallelism
    2. Why Intel® Xeon Phi™ coprocessors are needed
    3. Platforms with coprocessors
    4. The first Intel® Xeon Phi™ coprocessor
    5. Keeping the “Ninja Gap” under control
    6. Transforming-and-tuning double advantage
    7. When to use an Intel® Xeon Phi™ coprocessor
    8. Maximizing performance on processors first
    9. Why scaling past one hundred threads is so important
    10. Maximizing parallel program performance
    11. Measuring readiness for highly parallel execution
    12. What about GPUs?
    13. Beyond the ease of porting to increased performance
    14. Transformation for performance
    15. Hyper-threading versus multithreading
    16. Coprocessor major usage model: MPI versus offload
    17. Compiler and programming models
    18. Cache optimizations
    19. Examples, then details
    20. For more information
  9. Chapter 2. High Performance Closed Track Test Drive!
    1. Looking under the hood: coprocessor specifications
    2. Starting the car: communicating with the coprocessor
    3. Taking it out easy: running our first code
    4. Starting to accelerate: running more than one thread
    5. Pedal to the metal: hitting full speed using all cores
    6. Easing in to the first curve: accessing memory bandwidth
    7. High speed banked curve: maximizing memory bandwidth
    8. Back to the pit: a summary
  10. Chapter 3. A Friendly Country Road Race
    1. Preparing for our country road trip: chapter focus
    2. Getting a feel for the road: the 9-point stencil algorithm
    3. At the starting line: the baseline 9-point stencil implementation
    4. Rough road ahead: running the baseline stencil code
    5. Cobblestone street ride: vectors but not yet scaling
    6. Open road all-out race: vectors plus scaling
    7. Some grease and wrenches!: a bit of tuning
    8. Summary
    9. For more information
  11. Chapter 4. Driving Around Town: Optimizing A Real-World Code Example
    1. Choosing the direction: the basic diffusion calculation
    2. Turn ahead: accounting for boundary effects
    3. Finding a wide boulevard: scaling the code
    4. Thunder road: ensuring vectorization
    5. Peeling out: peeling code from the inner loop
    6. Trying higher octane fuel: improving speed using data locality and tiling
    7. High speed driver certificate: summary of our high speed tour
  12. Chapter 5. Lots of Data (Vectors)
    1. Why vectorize?
    2. How to vectorize
    3. Five approaches to achieving vectorization
    4. Six step vectorization methodology
    5. Streaming through caches: data layout, alignment, prefetching, and so on
    6. Compiler tips
    7. Compiler options
    8. Compiler directives
    9. Use array sections to encourage vectorization
    10. Look at what the compiler created: assembly code inspection
    11. Numerical result variations with vectorization
    12. Summary
    13. For more information
  13. Chapter 6. Lots of Tasks (not Threads)
    1. OpenMP, Fortran 2008, Intel® TBB, Intel® Cilk™ Plus, Intel® MKL
    2. OpenMP
    3. Fortran 2008
    4. Intel® TBB
    5. Cilk Plus
    6. Summary
    7. For more information
  14. Chapter 7. Offload
    1. Two offload models
    2. Choosing offload vs. native execution
    3. Language extensions for offload
    4. Using pragma/directive offload
    5. Using offload with shared virtual memory
    6. About asynchronous computation
    7. About asynchronous data transfer
    8. Applying the target attribute to multiple declarations
    9. Performing file I/O on the coprocessor
    10. Logging stdout and stderr from offloaded code
    11. Summary
    12. For more information
  15. Chapter 8. Coprocessor Architecture
    1. The Intel® Xeon Phi™ coprocessor family
    2. Coprocessor card design
    3. Intel® Xeon Phi™ coprocessor silicon overview
    4. Individual coprocessor core architecture
    5. Instruction and multithread processing
    6. Cache organization and memory access considerations
    7. Prefetching
    8. Vector processing unit architecture
    9. Coprocessor PCIe system interface and DMA
    10. Coprocessor power management capabilities
    11. Reliability, availability, and serviceability (RAS)
    12. Coprocessor system management controller (SMC)
    13. Benchmarks
    14. Summary
    15. For more information
  16. Chapter 9. Coprocessor System Software
    1. Coprocessor software architecture overview
    2. Coprocessor programming models and options
    3. Coprocessor software architecture components
    4. Intel® manycore platform software stack
    5. Linux support for Intel® Xeon Phi™ coprocessors
    6. Tuning memory allocation performance
    7. Summary
    8. For more information
  17. Chapter 10. Linux on the Coprocessor
    1. Coprocessor Linux baseline
    2. Introduction to coprocessor Linux bootstrap and configuration
    3. Default coprocessor Linux configuration
    4. Changing coprocessor configuration
    5. The micctrl utility
    6. Adding software
    7. Coprocessor Linux boot process
    8. Coprocessors in a Linux cluster
    9. Summary
    10. For more information
  18. Chapter 11. Math Library
    1. Intel Math Kernel Library overview
    2. Intel MKL and Intel compiler
    3. Coprocessor support overview
    4. Using the coprocessor in native mode
    5. Using automatic offload mode
    6. Using compiler-assisted offload
    7. Precision choices and variations
    8. Summary
    9. For more information
  19. Chapter 12. MPI
    1. MPI overview
    2. Using MPI on Intel® Xeon Phi™ coprocessors
    3. Prerequisites (batteries not included)
    4. Offload from an MPI rank
    5. Using MPI natively on the coprocessor
    6. Summary
    7. For more information
  20. Chapter 13. Profiling and Timing
    1. Event monitoring registers on the coprocessor
    2. Efficiency metrics
    3. Potential performance issues
    4. Intel® VTune™ Amplifier XE product
    5. Performance application programming interface
    6. MPI analysis: Intel Trace Analyzer and Collector
    7. Timing
    8. Summary
    9. For more information
  21. Chapter 14. Summary
    1. Advice
    2. Additional resources
    3. Another book coming?
    4. Feedback appreciated
  22. Glossary
  23. Index