Appendix B. Cloudera’s Distribution for Hadoop

Cloudera’s Distribution for Hadoop (hereafter CDH) is based on the most recent stable version of Apache Hadoop, with numerous patches, backports, and updates applied. Cloudera makes the distribution available in a number of different formats: source and binary tar files, RPMs, Debian packages, VMware images, and scripts for running CDH in the cloud. CDH is free, released under the Apache 2.0 license, and available at http://www.cloudera.com/hadoop/.

To simplify deployment, Cloudera hosts packages on public yum and apt repositories. CDH enables you to install and configure Hadoop on each machine using a single command. Kickstart users can commission entire Hadoop clusters without manual intervention.
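
As a rough illustration of the single-command install, the following Python sketch builds the command line for whichever package manager is present on the machine. It is a sketch under stated assumptions, not Cloudera's own tooling: the package name hadoop-0.20 is an assumption used for illustration, the helper install_command is hypothetical, and the Cloudera repository is assumed to be configured on the machine already. The function only returns the command; it does not run it.

```python
import shutil

# Assumption: "hadoop-0.20" is an illustrative package name. Check the
# repository you have configured for the names it actually publishes.
PACKAGE = "hadoop-0.20"


def install_command(package=PACKAGE, which=shutil.which):
    """Return the install command for the first package manager found.

    `which` defaults to shutil.which and can be replaced to try the
    function on a machine that has neither yum nor apt-get.
    """
    if which("yum"):
        # RPM-based systems
        return ["sudo", "yum", "install", "-y", package]
    if which("apt-get"):
        # Debian-based systems
        return ["sudo", "apt-get", "install", "-y", package]
    raise RuntimeError("neither yum nor apt-get found on PATH")
```

Calling install_command() on a machine with yum or apt-get returns a list that could be passed to subprocess.run; keeping the function free of side effects makes it easy to inspect the command before running anything as root.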

CDH manages cross-component versions and provides a stable platform with a compatible set of packages that work together. As of CDH3, the following packages are included, many of which are covered elsewhere in this book:

  • HDFS – Self-healing distributed file system

  • MapReduce – Powerful, parallel data processing framework

  • Hadoop Common – A set of utilities that support the Hadoop subprojects

  • HBase – Hadoop database for random read/write access

  • Hive – SQL-like queries and tables on large datasets

  • Pig – Dataflow language and compiler

  • Oozie – Workflow for interdependent Hadoop jobs

  • Sqoop – Integrate databases and data warehouses with Hadoop

  • Flume – Highly reliable, configurable streaming data collection

  • ZooKeeper – Coordination service for distributed applications ...
