Planning for Big Data

By Edd Dumbill. Published by O'Reilly Media, Inc.
  1. Planning for Big Data
  2. Introduction
  3. 1. The Feedback Economy
    1. Data-Obese, Digital-Fast
    2. The Big Data Supply Chain
      1. Data collection
      2. Ingesting and cleaning
      3. Hardware
      4. Platforms
      5. Machine learning
      6. Human exploration
      7. Storage
      8. Sharing and acting
      9. Measuring and collecting feedback
    3. Replacing Everything with Data
    4. A Feedback Economy
  4. 2. What Is Big Data?
    1. What Does Big Data Look Like?
      1. Volume
      2. Velocity
      3. Variety
    2. In Practice
      1. Cloud or in-house?
      2. Big data is big
      3. Big data is messy
      4. Culture
      5. Know where you want to go
  5. 3. Apache Hadoop
    1. The Core of Hadoop: MapReduce
    2. Hadoop’s Lower Levels: HDFS and MapReduce
    3. Improving Programmability: Pig and Hive
    4. Improving Data Access: HBase, Sqoop, and Flume
      1. Getting data in and out
    5. Coordination and Workflow: Zookeeper and Oozie
    6. Management and Deployment: Ambari and Whirr
    7. Machine Learning: Mahout
    8. Using Hadoop
  6. 4. Big Data Market Survey
    1. Just Hadoop?
    2. Integrated Hadoop Systems
      1. EMC Greenplum
      2. IBM
      3. Microsoft
      4. Oracle
      5. Availability
    3. Analytical Databases with Hadoop Connectivity
      1. Quick facts
    4. Hadoop-Centered Companies
      1. Cloudera
      2. Hortonworks
      3. An overview of Hadoop distributions (part 1)
      4. An overview of Hadoop distributions (part 2)
    5. Notes
  7. 5. Microsoft’s Plan for Big Data
    1. Microsoft’s Hadoop Distribution
    2. Developers, Developers, Developers
    3. Streaming Data and NoSQL
    4. Toward an Integrated Environment
    5. The Data Marketplace
    6. Summary
  8. 6. Big Data in the Cloud
    1. IaaS and Private Clouds
    2. Platform solutions
      1. Amazon Web Services
      2. Google
      3. Microsoft
    3. Big data cloud platforms compared
    4. Conclusion
    5. Notes
  9. 7. Data Marketplaces
    1. What Do Marketplaces Do?
    2. Infochimps
    3. Factual
    4. Windows Azure Data Marketplace
    5. DataMarket
    6. Data Markets Compared
    7. Other Data Suppliers
  10. 8. The NoSQL Movement
    1. Size, Response, Availability
    2. Changing Data and Cheap Lunches
    3. The Sacred Cows
    4. Other features
    5. In the End
  11. 9. Why Visualization Matters
    1. A Picture Is Worth 1000 Rows
    2. Types of Visualization
      1. Explaining and exploring
    3. Your Customers Make Decisions, Too
    4. Do Yourself a Favor and Hire a Designer
  12. 10. The Future of Big Data
    1. More Powerful and Expressive Tools for Analysis
    2. Streaming Data Processing
    3. Rise of Data Marketplaces
    4. Development of Data Science Workflows and Tools
    5. Increased Understanding of and Demand for Visualization
  13. About the Author
  14. Copyright

Chapter 10. The Future of Big Data

By Edd Dumbill

2011 was the “coming out” year for data science and big data. As the field matures in 2012, what can we expect over the course of the year?

More Powerful and Expressive Tools for Analysis

This year has seen consolidation and engineering effort around improving the basic storage and data processing engines of NoSQL and Hadoop. That will doubtless continue, as the unruly menagerie of the Hadoop universe is increasingly packaged into distributions, appliances, and on-demand cloud services. Hopefully it won’t be long before all of that is dull but necessary infrastructure.

Looking up the stack, there’s already an early cohort of tools aimed at programmers and data scientists (Karmasphere, Datameer), as well as Hadoop connectors for established analytical tools such as Tableau and R. But there’s still a way to go in making big data more powerful: namely, decreasing the cost of creating experiments.
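
To make “cheap experiments” concrete, here is a minimal sketch of the kind of quick, throwaway analysis a connector-friendly stack should encourage: it streams tab-separated MapReduce output off HDFS with the standard hadoop fs -cat command and summarizes it in a few lines of Python. The HDFS path and the key/count column layout are assumptions made for this example, not part of any tool named above.

    # A throwaway experiment on existing Hadoop output: no new job required.
    # Assumes a job has already written tab-separated "key<TAB>count" lines
    # to the (hypothetical) HDFS path below; adjust both to your cluster.
    import subprocess
    from collections import Counter

    HDFS_OUTPUT = "/user/analyst/wordcount/part-*"  # hypothetical output path

    def read_hdfs_lines(path):
        """Stream lines from HDFS via the standard 'hadoop fs -cat' command."""
        proc = subprocess.Popen(["hadoop", "fs", "-cat", path],
                                stdout=subprocess.PIPE)
        for raw in proc.stdout:
            yield raw.decode("utf-8").rstrip("\r\n")
        proc.wait()

    def top_keys(lines, n=10):
        """Tally counts per key and return the n largest totals."""
        totals = Counter()
        for line in lines:
            key, _, count = line.partition("\t")
            if count.strip().isdigit():
                totals[key] += int(count)
        return totals.most_common(n)

    if __name__ == "__main__":
        for key, total in top_keys(read_hdfs_lines(HDFS_OUTPUT)):
            print("%10d  %s" % (total, key))

The specifics matter less than the turnaround time: an analyst can go from question to answer in minutes, without writing, packaging, and submitting another MapReduce job.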

Here are two ways in which big data can be made more powerful.

  1. Better programming language support. As we consider data, rather than business logic, as the primary entity in a program, we must create or rediscover idioms that let us focus on the data, rather than abstractions leaking up from the underlying Hadoop machinery. In other words: write shorter programs that make it clear what we’re doing with the data (see the sketch after this list). These abstractions will in turn lend themselves ...
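
As a small illustration of the brevity that point argues for, here is a sketch of a word count written as a plain Python data pipeline, where each step names an operation on the data and no framework machinery appears at all. It is purely illustrative: the input file name is a placeholder, and the program stands in for the pages of mapper, reducer, and driver boilerplate that the equivalent raw MapReduce job would require.

    # A data-centric word count: each step says what happens to the data.
    import re
    from collections import Counter

    def words(lines):
        """Split each line into lowercase word tokens."""
        for line in lines:
            for word in re.findall(r"[a-z']+", line.lower()):
                yield word

    def word_count(path):
        """Count word frequencies in a text file."""
        with open(path, encoding="utf-8") as f:
            return Counter(words(f))

    if __name__ == "__main__":
        # "corpus.txt" is a placeholder input file for the example.
        for word, count in word_count("corpus.txt").most_common(20):
            print("%8d  %s" % (count, word))

The aim of better language support is to preserve this kind of clarity even when the data no longer fits on a single machine.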
