Planning for Big Data

By Edd Dumbill. Published by O'Reilly Media, Inc.

Chapter 6. Big Data in the Cloud

By Edd Dumbill

Big data and cloud technology go hand in hand. Big data needs clusters of servers for processing, which clouds can readily provide. So goes the marketing message, but what does that look like in reality? Both “cloud” and “big data” have broad definitions, obscured by considerable hype. This article breaks down the landscape as simply as possible, highlighting what’s practical and what’s to come.

IaaS and Private Clouds

What is often called “cloud” amounts to virtualized servers: computing resources that present themselves as regular servers and are rented on a pay-per-use basis. This is generally called infrastructure as a service (IaaS), and is offered by platforms such as Rackspace Cloud or Amazon EC2. You buy time on these services, then install and configure your own software, such as a Hadoop cluster or NoSQL database. Most of the solutions I described in my Big Data Market Survey can be deployed on IaaS.

Using IaaS clouds doesn’t mean you must handle all deployment manually, which is good news given the clusters of machines big data requires. You can use orchestration frameworks, which handle the management of resources, and automated infrastructure tools, which handle server installation and configuration. RightScale offers a commercial multi-cloud management platform that mitigates some of the problems of managing servers in the cloud.
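
To make the division of labor concrete, here is a minimal Python sketch of the idea behind automated infrastructure tools: you declare the state every server should be in, and a convergence step applies only the changes that are missing. It is a toy model under stated assumptions, not the API of any real product. The names Node, provision_cluster, and converge are hypothetical, and the "servers" are plain in-memory objects rather than cloud instances.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one rented virtual server."""
    name: str
    packages: set = field(default_factory=set)
    config: dict = field(default_factory=dict)


def provision_cluster(count, prefix="worker"):
    """Orchestration step (simulated): allocate `count` blank nodes."""
    return [Node(f"{prefix}-{i}") for i in range(count)]


def converge(node, desired_packages, desired_config):
    """Configuration step (simulated): bring one node to the desired state.

    Only missing packages and differing settings are changed, so running
    this a second time performs no actions (it is idempotent).
    Returns the list of actions taken.
    """
    actions = []
    for pkg in sorted(desired_packages - node.packages):
        node.packages.add(pkg)
        actions.append(f"install {pkg}")
    for key, value in sorted(desired_config.items()):
        if node.config.get(key) != value:
            node.config[key] = value
            actions.append(f"set {key}={value}")
    return actions


# Illustrative desired state; the package and setting names are assumptions.
packages = {"java", "hadoop"}
config = {"replication": 3}

cluster = provision_cluster(3)
first = [converge(node, packages, config) for node in cluster]
second = [converge(node, packages, config) for node in cluster]

for node, actions in zip(cluster, first):
    print(node.name, actions)
print("second run changed anything:", any(second))
```

The second pass reports no changes because each node already matches the declared state. That property is what makes it practical to manage many machines from a single description instead of configuring each one by hand.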

Frameworks such as OpenStack and Eucalyptus aim to present a uniform interface to both private data centers and ...
