Planning for Big Data

By Edd Dumbill. Published by O'Reilly Media, Inc.

Chapter 4. Big Data Market Survey

By Edd Dumbill

The big data ecosystem can be confusing. The popularity of “big data” as an industry buzzword has created a broad category. As Hadoop steamrolls through the industry, solutions from the business intelligence and data warehousing fields are also attracting the big data label. To confuse matters further, Hadoop-based solutions such as Hive are at the same time evolving toward being competitive data warehousing solutions.

Understanding the nature of your big data problem is a helpful first step in evaluating potential solutions. Let’s remind ourselves of the definition of big data:

“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”

Big data problems vary in how heavily they weigh on the axes of volume, velocity and variety. Predominantly structured yet large data, for example, may be best suited to an analytical database approach.

This survey assumes that a data warehousing solution alone is not the answer to your problems, and concentrates on analyzing the commercial Hadoop ecosystem. We’ll focus on solutions that incorporate storage and data processing, excluding products that only sit above those layers, such as visualization or analytical workbench software.

Getting started with Hadoop doesn’t require a large investment as the software is open source, and is also available instantly through the Amazon Web Services cloud. But for production environments, support, professional services and training are often required.

Just Hadoop?

Apache Hadoop is unquestionably the center of the latest iteration of big data solutions. At its heart, Hadoop is a system for distributing computation among commodity servers. It is often used with the Hadoop Hive project, which layers data warehouse technology on top of Hadoop, enabling ad-hoc analytical queries.
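Hadoop’s division of work into map and reduce phases can be sketched without a cluster. The Python word count below mirrors the shape of a Hadoop Streaming job, in which the mapper and reducer are ordinarily separate scripts exchanging tab-separated key/value lines across processes; here the map, shuffle and reduce phases are wired together in one process purely to illustrate the model, not as a Hadoop deployment.

```python
# Minimal local simulation of the MapReduce model (word count).
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum the counts emitted for a single word.
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle phase: sort intermediate pairs so equal keys are adjacent,
    # which is what Hadoop guarantees between its map and reduce stages.
    intermediate = sorted(pair for line in lines for pair in mapper(line))
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(intermediate, key=itemgetter(0))]

if __name__ == "__main__":
    docs = ["big data is big", "hadoop processes big data"]
    print(map_reduce(docs))
```

In a real cluster the sort happens across machines and each reducer sees only its partition of the keys, but the contract is the same: reducers receive all values for a given key together.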

Big data platforms divide along the lines of their approach to Hadoop. The big data offerings from familiar enterprise vendors incorporate a Hadoop distribution, while other platforms offer Hadoop connectors to their existing analytical database systems. This latter category tends to comprise massively parallel processing (MPP) databases that made their name in big data before Hadoop matured: Vertica and Aster Data. Hadoop’s strength in these cases is in processing unstructured data in tandem with the analytical capabilities of the existing database on structured or semi-structured data.

Practical big data implementations don’t in general fall neatly into either structured or unstructured data categories. You will invariably find Hadoop working as part of a system with a relational or MPP database.

Much as with Linux before it, no Hadoop solution ships the raw Apache Hadoop code alone. Instead, it’s packaged into distributions. At a minimum, these distributions have been through a testing process, and often include additional components such as management and monitoring tools. The most widely used distributions now come from Cloudera, Hortonworks and MapR. Not every distribution is commercial, however: the BigTop project aims to create a Hadoop distribution under the Apache umbrella.

Integrated Hadoop Systems

The leading Hadoop enterprise software vendors have aligned their Hadoop products with the rest of their database and analytical offerings. These vendors don’t require you to source Hadoop from another party, and offer it as a core part of their big data solutions. Their offerings integrate Hadoop into a broader enterprise setting, augmented by analytical and workflow tools.

EMC Greenplum

Acquired by EMC, and rapidly taken to the heart of the company’s strategy, Greenplum is a relative newcomer to the enterprise, compared to other companies in this section. They have turned that to their advantage in creating an analytic platform, positioned as taking analytics “beyond BI” with agile data science teams.

Greenplum’s Unified Analytics Platform (UAP) comprises three elements: the Greenplum MPP database, for structured data; a Hadoop distribution, Greenplum HD; and Chorus, a productivity and groupware layer for data science teams.

The HD Hadoop layer builds on MapR’s Hadoop compatible distribution, which replaces the file system with a faster implementation and provides other features for robustness. Interoperability between HD and Greenplum Database means that a single query can access both database and Hadoop data.

Chorus is a unique feature, and is indicative of Greenplum’s commitment to the idea of data science and the importance of agile teams in effectively exploiting big data. It supports organizational roles from analysts, data scientists and DBAs through to executive business stakeholders.

As befits EMC’s role in the data center market, Greenplum’s UAP is available in a modular appliance configuration.

IBM

IBM’s InfoSphere BigInsights is their Hadoop distribution, and part of a suite of products offered under the “InfoSphere” information management brand. Everything big data at IBM is helpfully labeled Big, appropriately enough for a company affectionately known as “Big Blue.”

BigInsights augments Hadoop with a variety of features, including management and administration tools. It also offers textual analysis tools that aid with entity resolution — identifying people, addresses, phone numbers and so on.

IBM’s Jaql query language provides a point of integration between Hadoop and other IBM products, such as relational databases or Netezza data warehouses.

InfoSphere BigInsights is interoperable with IBM’s other database and warehouse products, including DB2, Netezza and its InfoSphere warehouse and analytics lines. To aid analytical exploration, BigInsights ships with BigSheets, a spreadsheet interface onto big data.

IBM addresses streaming big data separately through its InfoSphere Streams product. BigInsights is not currently offered in an appliance form, but can be used in the cloud via Rightscale, Amazon, Rackspace, and IBM Smart Enterprise Cloud.

Microsoft

Microsoft have adopted Hadoop as the center of their big data offering, and are pursuing an integrated approach aimed at making big data available through their analytical tool suite, including the familiar Excel and PowerPivot.

Microsoft’s Big Data Solution brings Hadoop to the Windows Server platform, and in elastic form to their cloud platform Windows Azure. Microsoft have packaged their own distribution of Hadoop, integrated with Windows Systems Center and Active Directory. They intend to contribute back changes to Apache Hadoop to ensure that an open source version of Hadoop will run on Windows.

On the server side, Microsoft offer integrations to their SQL Server database and their data warehouse product. Use of their warehouse solutions isn’t mandated, however. The Hadoop Hive data warehouse is part of the Big Data Solution, including connectors from Hive to ODBC and Excel.

Microsoft’s focus on the developer is evident in their creation of a JavaScript API for Hadoop. Using JavaScript, developers can create Hadoop jobs for MapReduce, Pig or Hive, even from a browser-based environment. Visual Studio and .NET integration with Hadoop is also provided.

Deployment is possible either on the server or in the cloud, or as a hybrid combination. Jobs written against the Apache Hadoop distribution should migrate with minimal changes to Microsoft’s environment.

Oracle

Announcing their entry into the big data market at the end of 2011, Oracle is taking an appliance-based approach. Their Big Data Appliance integrates Hadoop, R for analytics, a new Oracle NoSQL database, and connectors to Oracle’s database and Exadata data warehousing product line.

Oracle’s approach caters to the high-end enterprise market, and particularly leans to the rapid-deployment, high-performance end of the spectrum. It is the only vendor to include the popular R analytical language integrated with Hadoop, and to ship a NoSQL database of their own design as opposed to Hadoop HBase.

Rather than developing their own Hadoop distribution, Oracle have partnered with Cloudera for Hadoop support, which brings them a mature and established Hadoop solution. Database connectors again promote the integration of structured Oracle data with the unstructured data stored in Hadoop HDFS.

Oracle’s NoSQL Database is a scalable key-value database, built on the Berkeley DB technology. In that, Oracle owes double gratitude to Cloudera CEO Mike Olson, as he was previously the CEO of Sleepycat, the creators of Berkeley DB. Oracle are positioning their NoSQL database as a means of acquiring big data prior to analysis.

The Oracle R Enterprise product offers direct integration into the Oracle database, as well as Hadoop, enabling R scripts to run on data without having to round-trip it out of the data stores.

Availability

While IBM and Greenplum’s offerings are available at the time of writing, the Microsoft and Oracle solutions are expected to be fully available early in 2012.

Analytical Databases with Hadoop Connectivity

MPP (massively parallel processing) databases are specialized for processing structured big data, as distinct from the unstructured data that is Hadoop’s specialty. Along with Greenplum, Aster Data and Vertica were pioneers of big data products before the mainstream emergence of Hadoop.

These MPP solutions are databases specialized for analytical workloads and data integration, and provide connectors to Hadoop and data warehouses. A recent spate of acquisitions has seen these products become the analytical play of data warehouse and storage vendors: Teradata acquired Aster Data, EMC acquired Greenplum, and HP acquired Vertica.

Quick facts

| | Aster Data | ParAccel | Vertica |
| --- | --- | --- | --- |
| Database | MPP analytical database | MPP analytical database | MPP analytical database |
| Deployment options | | | |
| Hadoop | Hadoop connector available | Hadoop integration available | Hadoop and Pig connectors available |
| Links | | | |

Hadoop-Centered Companies

Directly employing Hadoop is another route to creating a big data solution, especially where your infrastructure doesn’t fall neatly into the product line of major vendors. Practically every database now features Hadoop connectivity, and there are multiple Hadoop distributions to choose from.

Reflecting the developer-driven ethos of the big data world, Hadoop distributions are frequently offered in a community edition. Such editions lack enterprise management features, but contain all the functionality needed for evaluation and development.

The first iterations of Hadoop distributions, from Cloudera and IBM, focused on usability and administration. We are now seeing the addition of performance-oriented improvements to Hadoop, such as those from MapR and Platform Computing. While maintaining API compatibility, these vendors replace slow or fragile parts of the Apache distribution with better performing or more robust components.

Cloudera

The longest-established provider of Hadoop distributions, Cloudera provides an enterprise Hadoop solution, alongside services, training and support options. Along with Yahoo, Cloudera have made deep open source contributions to Hadoop, and through hosting industry conferences have done much to establish Hadoop in its current position.

Hortonworks

Though a recent entrant to the market, Hortonworks have a long history with Hadoop. Spun off from Yahoo, where Hadoop originated, Hortonworks aims to stick close to and promote the core Apache Hadoop technology. Hortonworks also have a partnership with Microsoft to assist and accelerate their Hadoop integration.

Hortonworks Data Platform is currently in a limited preview phase, with a public preview expected in early 2012. The company also provides support and training.

An overview of Hadoop distributions (part 1)

| | Cloudera | EMC Greenplum | Hortonworks | IBM |
| --- | --- | --- | --- | --- |
| Product Name | Cloudera’s Distribution including Apache Hadoop | Greenplum HD | Hortonworks Data Platform | InfoSphere BigInsights |
| Free Edition | CDH: integrated, tested distribution of Apache Hadoop | Community Edition: 100% open source certified and supported version of the Apache Hadoop stack | | Basic Edition: an integrated Hadoop distribution |
| Enterprise Edition | Cloudera Enterprise: adds a management software layer over CDH | Enterprise Edition: integrates MapR’s M5 Hadoop-compatible distribution, replacing HDFS with MapR’s C++-based file system; includes MapR management tools | | Enterprise Edition: Hadoop distribution plus BigSheets spreadsheet interface, scheduler, text analytics, indexer, JDBC connector and security support |
| Hadoop Components | Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Sqoop, Mahout, Whirr | Hive, Pig, Zookeeper, HBase | Hive, Pig, Zookeeper, HBase, Ambari | Hive, Oozie, Pig, Zookeeper, Avro, Flume, HBase, Lucene |
| Security | Cloudera Manager: Kerberos, role-based administration and audit trails | | | LDAP authentication, role-based authorization, reverse proxy |
| Admin Interface | Cloudera Manager: centralized management and alerting | MapR Heatmap cluster administrative tools | Apache Ambari: monitoring, administration and lifecycle management for Hadoop clusters | HDFS and MapReduce administration, cluster and server management, HDFS file content viewing |
| Job Management | Cloudera Manager: job analytics, monitoring and log search | JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents | Apache Ambari: monitoring, administration and lifecycle management for Hadoop clusters | Job creation, submission, cancellation, status and logging |
| Database connectors | | Greenplum Database | | DB2, Netezza, InfoSphere Warehouse |
| Interop features | | | | |
| HDFS Access | Fuse-DFS: mount HDFS as a traditional filesystem | NFS: access HDFS as a conventional network file system | WebHDFS: REST API to HDFS | |
| Installation | Cloudera Manager: wizard-based deployment | | | Quick installation: GUI-driven installation tool |
| Additional APIs | | | | Jaql: a functional, declarative query language designed to process large data sets |
| Volume Management | | | | |

An overview of Hadoop distributions (part 2)

| | MapR | Microsoft | Platform Computing |
| --- | --- | --- | --- |
| Product Name | MapR | Big Data Solution | Platform MapReduce |
| Free Edition | MapR M3 Edition: free community edition incorporating MapR’s performance increases | | Platform MapReduce Developer Edition: evaluation edition, excludes the resource management features of the regular edition |
| Enterprise Edition | MapR M5 Edition: augments M3 Edition with high availability and data protection features | Big Data Solution: Windows Hadoop distribution, integrated with Microsoft’s database and analytical products | Platform MapReduce: enhanced runtime for Hadoop MapReduce, API-compatible with Apache Hadoop |
| Hadoop Components | Hive, Pig, Flume, HBase, Sqoop, Mahout, Oozie | Hive, Pig | |
| Security | | Active Directory integration | |
| Admin Interface | MapR Heatmap cluster administrative tools | System Center integration | Platform MapReduce Workload Manager |
| Job Management | JobTracker HA and Distributed NameNode HA prevent lost jobs, restarts and failover incidents | | |
| Database connectors | | SQL Server, SQL Server Parallel Data Warehouse | |
| Interop features | | Hive ODBC Driver, Excel Hive Add-in | |
| HDFS Access | NFS: access HDFS as a conventional network file system | | |
| Installation | | | |
| Additional APIs | REST API | JavaScript API: JavaScript Map/Reduce jobs, Pig-Latin and Hive queries | R, C/C++, C#, Java, Python |
| Volume Management | Mirroring, snapshots | | |

Notes

  • Pure cloud solutions: Both Amazon Web Services and Google offer cloud-based big data solutions. These will be reviewed separately.

  • HPCC: Though dominant, Hadoop is not the only big data solution. LexisNexis’ HPCC offers an alternative approach.

  • Hadapt: not yet featured in this survey. Taking a different approach from both Hadoop-centered and MPP solutions, Hadapt integrates unstructured and structured data into one product: wrapping rather than exposing Hadoop. It is currently in “early access” stage.

  • NoSQL: Solutions built on databases such as Cassandra, MongoDB and Couchbase are not in the scope of this survey, though these databases do offer Hadoop integration.

  • Errors and omissions: given the fast-evolving nature of the market and variable quality of public information, any feedback about errors and omissions from this survey is most welcome. Please send it to .
