HDFS, Hive, HBase, and HCatalog

WHAT YOU WILL LEARN IN THIS CHAPTER:

  • Exploring HDFS
  • Working with Hive
  • Understanding HBase and HCatalog

One of the key pieces of the Hadoop big data platform is the file system. Functioning as the platform's backbone, it stores and later retrieves your data, making it available to consumers for a multitude of tasks, including data processing.
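To make the store-and-retrieve cycle concrete, the following is a minimal sketch using Hadoop's Java FileSystem API. It is an illustration rather than a pattern taken from this chapter: the path /tmp/hello.txt is hypothetical, and the sketch assumes a running cluster whose NameNode address is picked up from the standard core-site.xml configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            // Reads fs.defaultFS from core-site.xml to locate the cluster.
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/tmp/hello.txt"); // hypothetical example path

                // Store: create (or overwrite) the file and write a few bytes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
                }

                // Retrieve: open the file and read the line back.
                try (FSDataInputStream in = fs.open(file);
                     BufferedReader reader = new BufferedReader(
                             new InputStreamReader(in, StandardCharsets.UTF_8))) {
                    System.out.println(reader.readLine()); // prints "Hello, HDFS!"
                }
            }
        }
    }

The same round trip is available from the shell via hdfs dfs -put and hdfs dfs -cat; the Java API simply exposes those operations programmatically.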

Unlike the file system found on your desktop computer or laptop, where drives are typically measured in gigabytes, the Hadoop Distributed File System (HDFS) must be capable of storing individual files that are gigabytes or even terabytes in size. This presents a series of unique challenges that must be overcome.

This chapter discusses HDFS, its architecture, and how it overcomes many of these hurdles, such as storing your big data reliably, providing efficient access to it, and replicating data throughout your cluster. We will also look at Hive, HBase, and HCatalog, all platforms or tools available within the Hadoop ecosystem that help simplify the management and retrieval of data stored in HDFS.

Exploring the Hadoop Distributed File System

Originally created as part of a web search engine project called Apache Nutch, HDFS is a distributed file system designed to run on a cluster of cost-effective commodity hardware. Although there are a number of distributed file systems in the marketplace, several notable ...
