Book description

Inside the Machine explains how microprocessors operate: what they do, and how they do it. Written by the co-founder of the highly respected Ars Technica site, the book begins with the fundamentals of computing, defining what a computer is and using analogies, numerous 4-color diagrams, and clear explanations to communicate the concepts that form the basis of modern computing. After discussing computers in the abstract, the book goes on to cover specific microprocessors, explaining in detail how they work and how they differ.

Table of Contents

  1. Inside the Machine
    1. Preface
    2. Acknowledgments
    3. Introduction
    4. 1. Basic Computing Concepts
      1. The Calculator Model of Computing
      2. The File-Clerk Model of Computing
        1. The Stored-Program Computer
        2. Refining the File-Clerk Model
      3. The Register File
      4. RAM: When Registers Alone Won't Cut It
        1. The File-Clerk Model Revisited and Expanded
        2. An Example: Adding Two Numbers
      5. A Closer Look at the Code Stream: The Program
        1. General Instruction Types
        2. The DLW-1's Basic Architecture and Arithmetic Instruction Format
          1. The DLW-1's Arithmetic Instruction Format
          2. The DLW-1's Memory Instruction Format
          3. An Example DLW-1 Program
      6. A Closer Look at Memory Accesses: Register vs. Immediate
        1. Immediate Values
        2. Register-Relative Addressing
    5. 2. The Mechanics of Program Execution
      1. Opcodes and Machine Language
        1. Machine Language on the DLW-1
        2. Binary Encoding of Arithmetic Instructions
        3. Binary Encoding of Memory Access Instructions
          1. The load Instruction
          2. The store Instruction
        4. Translating an Example Program into Machine Language
      2. The Programming Model and the ISA
        1. The Programming Model
        2. The Instruction Register and Program Counter
        3. The Instruction Fetch: Loading the Instruction Register
        4. Running a Simple Program: The Fetch-Execute Loop
      3. The Clock
      4. Branch Instructions
        1. Unconditional Branch
        2. Conditional Branch
          1. Branch Instructions and the Fetch-Execute Loop
          2. The Branch Instruction as a Special Type of Load
          3. Branch Instructions and Labels
      5. Excursus: Booting Up
    6. 3. Pipelined Execution
      1. The Lifecycle of an Instruction
      2. Basic Instruction Flow
      3. Pipelining Explained
      4. Applying the Analogy
        1. A Non-Pipelined Processor
        2. A Pipelined Processor
          1. Shrinking the Clock
          2. Shrinking Program Execution Time
        3. The Speedup from Pipelining
        4. Program Execution Time and Completion Rate
        5. The Relationship Between Completion Rate and Program Execution Time
        6. Instruction Throughput and Pipeline Stalls
          1. Instruction Throughput
          2. Pipeline Stalls
        7. Instruction Latency and Pipeline Stalls
        8. Limits to Pipelining
          1. Clock Period and Completion Rate
          2. The Cost of Pipelining
    7. 4. Superscalar Execution
      1. Superscalar Computing and IPC
      2. Expanding Superscalar Processing with Execution Units
        1. Basic Number Formats and Computer Arithmetic
        2. Arithmetic Logic Units
        3. Memory-Access Units
      3. Microarchitecture and the ISA
        1. A Brief History of the ISA
        2. Moving Complexity from Hardware to Software
      4. Challenges to Pipelining and Superscalar Design
        1. Data Hazards
        2. Structural Hazards
        3. The Register File
        4. Control Hazards
    8. 5. The Intel Pentium and Pentium Pro
      1. The Original Pentium
        1. Caches
        2. The Pentium's Pipeline
        3. The Branch Unit and Branch Prediction
        4. The Pentium's Back End
          1. The Integer ALUs
          2. The Floating-Point ALU
        5. x86 Overhead on the Pentium
        6. Summary: The Pentium in Historical Context
      2. The Intel P6 Microarchitecture: The Pentium Pro
        1. Decoupling the Front End from the Back End
          1. The Issue Phase
          2. The Completion Phase
          3. The P6's Issue Phase: The Reservation Station
          4. The P6's Completion Phase: The Reorder Buffer
          5. The Instruction Window
        2. The P6 Pipeline
        3. Branch Prediction on the P6
        4. The P6 Back End
        5. CISC, RISC, and Instruction Set Translation
        6. The P6 Microarchitecture's Instruction Decoding Unit
        7. The Cost of x86 Legacy Support on the P6
        8. Summary: The P6 Microarchitecture in Historical Context
          1. The Pentium Pro
          2. The Pentium II
          3. The Pentium III
      3. Conclusion
    9. 6. PowerPC Processors: 600 Series, 700 Series, and 7400
      1. A Brief History of PowerPC
      2. The PowerPC 601
        1. The 601's Pipeline and Front End
          1. The PowerPC Instruction Queue
          2. Instruction Scheduling on the 601
        2. The 601's Back End
          1. The Integer Unit
          2. The Floating-Point Unit
          3. The Branch Execution Unit
          4. The Sequencer Unit
        3. Latency and Throughput Revisited
        4. Summary: The 601 in Historical Context
      3. The PowerPC 603 and 603e
        1. The 603e's Back End
        2. The 603e's Front End, Instruction Window, and Branch Prediction
        3. Summary: The 603 and 603e in Historical Context
      4. The PowerPC 604
        1. The 604's Pipeline and Back End
        2. The 604's Front End and Instruction Window
          1. The Issue Phase: The 604's Reservation Stations
          2. The Four Rules of Instruction Dispatch
          3. The Completion Phase: The 604's Reorder Buffer
        3. Summary: The 604 in Historical Context
      5. The PowerPC 604e
      6. The PowerPC 750 (aka the G3)
        1. The 750's Front End, Instruction Window, and Branch Instruction
        2. Summary: The PowerPC 750 in Historical Context
      7. The PowerPC 7400 (aka the G4)
        1. The G4's Vector Unit
        2. Summary: The PowerPC G4 in Historical Context
      8. Conclusion
    10. 7. Intel's Pentium 4 vs. Motorola's G4e: Approaches and Design Philosophies
      1. The Pentium 4's Speed Addiction
      2. The General Approaches and Design Philosophies of the Pentium 4 and G4e
      3. An Overview of the G4e's Architecture and Pipeline
        1. Stages 1 and 2: Instruction Fetch
        2. Stage 3: Decode/Dispatch
        3. Stage 4: Issue
        4. Stage 5: Execute
        5. Stages 6 and 7: Complete and Write-Back
      4. Branch Prediction on the G4e and Pentium 4
      5. An Overview of the Pentium 4's Architecture
        1. Expanding the Instruction Window
        2. The Trace Cache
          1. Shortening Instruction Execution Time
          2. The Trace Cache's Operation
      6. An Overview of the Pentium 4's Pipeline
        1. Stages 1 and 2: Trace Cache Next Instruction Pointer
        2. Stages 3 and 4: Trace Cache Fetch
        3. Stage 5: Drive
        4. Stages 6 Through 8: Allocate and Rename (ROB)
        5. Stage 9: Queue
        6. Stages 10 Through 12: Schedule
        7. Stages 13 and 14: Issue
        8. Stages 15 and 16: Register Files
        9. Stage 17: Execute
        10. Stage 18: Flags
        11. Stage 19: Branch Check
        12. Stage 20: Drive
        13. Stages 21 and Onward: Complete and Commit
      7. The Pentium 4's Instruction Window
    11. 8. Intel's Pentium 4 vs. Motorola's G4e: The Back End
      1. Some Remarks About Operand Formats
      2. The Integer Execution Units
        1. The G4e's IUs: Making the Common Case Fast
        2. The Pentium 4's IUs: Make the Common Case Twice as Fast
      3. The Floating-Point Units (FPUs)
        1. The G4e's FPU
        2. The Pentium 4's FPU
        3. Concluding Remarks on the G4e's and Pentium 4's FPUs
      4. The Vector Execution Units
        1. A Brief Overview of Vector Computing
        2. Vectors Revisited: The AltiVec Instruction Set
        3. AltiVec Vector Operations
          1. Intra-Element Arithmetic and Non-Arithmetic Instructions
          2. Inter-Element Arithmetic and Non-Arithmetic Instructions
        4. The G4e's VU: SIMD Done Right
        5. Intel's MMX
        6. SSE and SSE2
        7. The Pentium 4's Vector Unit: Alphabet Soup Done Quickly
        8. Increasing Floating-Point Performance with SSE2
      5. Conclusions
    12. 9. 64-Bit Computing and x86-64
      1. Intel's IA-64 and AMD's x86-64
      2. Why 64 Bits?
      3. What Is 64-Bit Computing?
      4. Current 64-Bit Applications
        1. Dynamic Range
        2. The Benefits of Increased Dynamic Range, or, How the Existing 64-Bit Computing Market Uses 64-Bit Integers
        3. Virtual Address Space vs. Physical Address Space
        4. The Benefits of a 64-Bit Address
      5. The 64-Bit Alternative: x86-64
        1. Extended Registers
        2. More Registers
        3. Switching Modes
        4. Out with the Old
      6. Conclusion
    13. 10. The G5: IBM's PowerPC 970
      1. Overview: Design Philosophy
      2. Caches and Front End
      3. Branch Prediction
      4. The Trade-Off: Decode, Cracking, and Group Formation
        1. Dispatching and Issuing Instructions on the PowerPC 970
        2. The 970's Dispatch Rules
        3. Predecoding and Group Dispatch
        4. Some Preliminary Conclusions on the 970's Group Dispatch Scheme
      5. The PowerPC 970's Back End
        1. Integer Unit, Condition Register Unit, and Branch Unit
        2. The Integer Units Are Not Fully Symmetric
        3. Integer Unit Latencies and Throughput
        4. The CRU
          1. The PowerPC Condition Register
        5. Preliminary Conclusions About the 970's Integer Performance
      6. Load-Store Units
      7. Front-Side Bus
      8. The Floating-Point Units
      9. Vector Computing on the PowerPC 970
      10. Floating-Point Issue Queues
        1. Integer and Load-Store Issue Queues
        2. BU and CRU Issue Queues
        3. Vector Issue Queues
      11. The Performance Implications of the 970's Group Dispatch Scheme
      12. Conclusions
    14. 11. Understanding Caching and Performance
      1. Caching Basics
        1. The Level 1 Cache
        2. The Level 2 Cache
        3. Example: A Byte's Brief Journey Through the Memory Hierarchy
        4. Cache Misses
      2. Locality of Reference
        1. Spatial Locality of Data
        2. Spatial Locality of Code
        3. Temporal Locality of Code and Data
        4. Locality: Conclusions
      3. Cache Organization: Blocks and Block Frames
      4. Tag RAM
      5. Fully Associative Mapping
      6. Direct Mapping
      7. N-Way Set Associative Mapping
        1. Four-Way Set Associative Mapping
        2. Two-Way Set Associative Mapping
        3. Two-Way vs. Direct-Mapped
        4. Two-Way vs. Four-Way
        5. Associativity: Conclusions
      8. Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes
        1. Types of Replacement/Eviction Policies
        2. Block Sizes
      9. Write Policies: Write-Through vs. Write-Back
      10. Conclusions
    15. 12. Intel's Pentium M, Core Duo, and Core 2 Duo
      1. Code Names and Brand Names
      2. The Rise of Power-Efficient Computing
      3. Power Density
        1. Dynamic Power Density
        2. Static Power Density
      4. The Pentium M
        1. The Fetch Phase
          1. The Hardware Loop Buffer
        2. The Decode Phase: Micro-ops Fusion
          1. Fused Stores
          2. Fused Loads
          3. The Impact of Micro-ops Fusion
        3. Branch Prediction
          1. The Loop Detector
          2. The Indirect Predictor
        4. The Stack Execution Unit
        5. Pipeline and Back End
        6. Summary: The Pentium M in Historical Context
      5. Core Duo/Solo
        1. Intel's Line Goes Multi-Core
          1. Processor Organization and Core Microarchitecture
          2. Multiprocessing and Chip Multiprocessing
        2. Core Duo's Improvements
          1. Micro-ops Fusion of SSE and SSE2 store and load-op Instructions
          2. Micro-ops Fusion and Lamination of SSE and SSE2 Arithmetic Instructions
          3. Micro-ops Fusion of Miscellaneous Non-SSE Instructions
          4. Improved Loop Detector
          5. SSE3
          6. Floating-Point Improvement
          7. Integer Divide Improvement
          8. Virtualization Technology
        3. Summary: Core Duo in Historical Context
      6. Core 2 Duo
        1. The Fetch Phase
          1. Macro-Fusion
        2. The Decode Phase
        3. Core's Pipeline
      7. Core's Back End
        1. Integer Units
        2. Floating-Point Units
        3. Vector Processing Improvements
          1. 128-bit Vector Execution on the P6 Through Core Duo
          2. 128-bit Vector Execution on Core
        4. Memory Disambiguation: The Results Stream Version of Speculative Execution
          1. The Lifecycle of a Memory Access Instruction
          2. The Memory Reorder Buffer
          3. Memory Aliasing
          4. Memory Reordering Rules
          5. False Aliasing
          6. Memory Disambiguation
        5. Summary: Core 2 Duo in Historical Context
    16. A. Bibliography and Suggested Reading
      1. Online Resources
    17. Index
    18. About the Author
    19. Colophon
    20. B. Updates