The Datacenter as a Computer, 2nd Edition

Book Description

As computation continues to move into the cloud, the computing platform of interest no longer resembles a pizza box or a refrigerator, but a warehouse full of computers. These new large datacenters are quite different from traditional hosting facilities of earlier times and cannot be viewed simply as a collection of co-located servers. Large portions of the hardware and software resources in these facilities must work in concert to efficiently deliver good levels of Internet service performance, something that can only be achieved by a holistic approach to their design and deployment. In other words, we must treat the datacenter itself as one massive warehouse-scale computer (WSC). We describe the architecture of WSCs, the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. We hope this book will be useful to architects and programmers of today’s WSCs, as well as those of future many-core platforms that may one day implement the equivalent of today’s WSCs on a single board.

Notes for the Second Edition

After nearly four years of substantial academic and industrial developments in warehouse-scale computing, we are delighted to present our first major update to this lecture. The increased popularity of public clouds has made WSC software techniques relevant to a larger pool of programmers since our first edition. Therefore, we expanded Chapter 2 to reflect our better understanding of WSC software systems and the toolbox of software techniques for WSC programming. In Chapter 3, we added to our coverage of the evolving landscape of wimpy vs. brawny server trade-offs, and we now present an overview of WSC interconnects and storage systems that was promised but lacking in the original edition. Thanks largely to the help of our new co-author, Google Distinguished Engineer Jimmy Clidaras, the material on facility mechanical and power distribution design has been updated and greatly extended (see Chapters 4 and 5). Chapters 6 and 7 have also been revamped significantly. We hope this revised edition continues to meet the needs of educators and professionals in this area.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Contents
  5. Acknowledgments
  6. Note to the Reader
  7. 1 Introduction
    1. 1.1 Warehouse-Scale Computers
    2. 1.2 Cost Efficiency at Scale
    3. 1.3 Not Just a Collection of Servers
    4. 1.4 One Datacenter vs. Several Datacenters
    5. 1.5 Why WSCs Might Matter to You
    6. 1.6 Architectural Overview of WSCs
      1. 1.6.1 Storage
      2. 1.6.2 Networking Fabric
      3. 1.6.3 Storage Hierarchy
      4. 1.6.4 Quantifying Latency, Bandwidth, and Capacity
      5. 1.6.5 Power Usage
      6. 1.6.6 Handling Failures
  8. 2 Workloads and Software Infrastructure
    1. 2.1 Datacenter vs. Desktop
    2. 2.2 Performance and Availability Toolbox
    3. 2.3 Platform-Level Software
    4. 2.4 Cluster-Level Infrastructure Software
      1. 2.4.1 Resource Management
      2. 2.4.2 Hardware Abstraction and Other Basic Services
      3. 2.4.3 Deployment and Maintenance
      4. 2.4.4 Programming Frameworks
    5. 2.5 Application-Level Software
      1. 2.5.1 Workload Examples
      2. 2.5.2 Online: Web Search
      3. 2.5.3 Offline: Scholar Article Similarity
    6. 2.6 A Monitoring Infrastructure
      1. 2.6.1 Service-Level Dashboards
      2. 2.6.2 Performance Debugging Tools
      3. 2.6.3 Platform-Level Health Monitoring
    7. 2.7 Buy vs. Build
    8. 2.8 Tail-Tolerance
    9. 2.9 Further Reading
  9. 3 Hardware Building Blocks
    1. 3.1 Cost-Efficient Server Hardware
      1. 3.1.1 The Impact of Large SMP Communication Efficiency
      2. 3.1.2 Brawny vs. Wimpy Servers
      3. 3.1.3 Balanced Designs
    2. 3.2 WSC Storage
      1. 3.2.1 Unstructured WSC Storage
      2. 3.2.2 Structured WSC Storage
      3. 3.2.3 Interplay of Storage and Networking Technology
    3. 3.3 WSC Networking
    4. 3.4 Further Reading
  10. 4 Datacenter Basics
    1. 4.1 Datacenter Tier Classifications and Specifications
    2. 4.2 Datacenter Power Systems
      1. 4.2.1 Uninterruptible Power Systems
      2. 4.2.2 Power Distribution Units
      3. 4.2.3 Alternative: DC Distribution
    3. 4.3 Datacenter Cooling Systems
      1. 4.3.1 CRACs, Chillers, and Cooling Towers
      2. 4.3.2 CRACs
      3. 4.3.3 Chillers
      4. 4.3.4 Cooling Towers
      5. 4.3.5 Free Cooling
      6. 4.3.6 Air Flow Considerations
      7. 4.3.7 In-Rack, In-Row Cooling, and Cold Plates
      8. 4.3.8 Case Study: Google’s In-row Cooling
      9. 4.3.9 Container-Based Datacenters
    4. 4.4 Summary
  11. 5 Energy and Power Efficiency
    1. 5.1 Datacenter Energy Efficiency
      1. 5.1.1 The PUE Metric
      2. 5.1.2 Issues with the PUE Metric
      3. 5.1.3 Sources of Efficiency Losses in Datacenters
      4. 5.1.4 Improving the Energy Efficiency of Datacenters
      5. 5.1.5 Beyond the Facility
    2. 5.2 The Energy Efficiency of Computing
      1. 5.2.1 Measuring Energy Efficiency
      2. 5.2.2 Server Energy Efficiency
      3. 5.2.3 Usage Profile of Warehouse-Scale Computers
    3. 5.3 Energy-Proportional Computing
      1. 5.3.1 Causes of Poor Energy Proportionality
      2. 5.3.2 Improving Energy Proportionality
      3. 5.3.3 Energy Proportionality—The Rest of the System
    4. 5.4 Relative Effectiveness of Low-Power Modes
    5. 5.5 The Role of Software in Energy Proportionality
    6. 5.6 Datacenter Power Provisioning
      1. 5.6.1 Deploying the Right Amount of Equipment
      2. 5.6.2 Oversubscribing Facility Power
    7. 5.7 Trends in Server Energy Usage
      1. 5.7.1 Using Energy Storage for Power Management
    8. 5.8 Conclusions
      1. 5.8.1 Further Reading
  12. 6 Modeling Costs
    1. 6.1 Capital Costs
    2. 6.2 Operational Costs
    3. 6.3 Case Studies
      1. 6.3.1 Real-World Datacenter Costs
      2. 6.3.2 Modeling a Partially Filled Datacenter
      3. 6.3.3 The Cost of Public Clouds
  13. 7 Dealing with Failures and Repairs
    1. 7.1 Implications of Software-Based Fault Tolerance
    2. 7.2 Categorizing Faults
    3. 7.3 Machine-Level Failures
    4. 7.4 Repairs
    5. 7.5 Tolerating Faults, Not Hiding Them
  14. 8 Closing Remarks
    1. 8.1 Hardware
    2. 8.2 Software
    3. 8.3 Economics
    4. 8.4 Key Challenges
      1. 8.4.1 Rapidly Changing Workloads
      2. 8.4.2 Building Responsive Large Scale Systems
      3. 8.4.3 Energy Proportionality of Non-CPU Components
      4. 8.4.4 Overcoming the End of Dennard Scaling
      5. 8.4.5 Amdahl’s Cruel Law
    5. 8.5 Conclusions
  15. Bibliography
  16. Author Biographies