Chapter 1. Apache Hadoop and Apache HBase: An Introduction

Apache Hadoop is a highly scalable, fault-tolerant distributed system meant to store large amounts of data and process it in place. Hadoop is designed to run large-scale processing systems on the same cluster that stores the data. The philosophy of Hadoop is to store all the data in one place and process the data in the same place—that is, move the processing to the data store and not move the data to the processing system. Apache HBase is a database system that is built on top of Hadoop to provide a key-value store that benefits from the distributed framework that Hadoop provides.

Data, once written to the Hadoop Distributed File System (HDFS), is immutable. Each file on HDFS is append-only. Once a file is created and written to, the file can either be appended to or deleted. It is not possible to change the data in the file. Though HBase runs on top of HDFS, HBase supports updating any data written to it, just like a normal database system.

This chapter will provide a brief introduction to Apache Hadoop and Apache HBase, though we will not go into too much detail.

HDFS

At the core of Hadoop is a distributed file system referred to as HDFS. HDFS is a highly distributed, fault-tolerant file system that is specifically built to run on commodity hardware and to scale as more data is added by simply adding more hardware. HDFS can be configured to replicate data several times on different machines to ensure that there is no data loss, even if a machine holding the data fails. Replicating data also allows the system to be highly available even if machines holding a copy of the data are disconnected from the network or go down. This section will briefly cover the design of HDFS and the various processing systems that run on top of HDFS.

HDFS was originally designed on the basis of the Google File System [gfs]. HDFS is a distributed system that can store data on thousands of off-the-shelf servers, with no special requirements for hardware configuration. This means HDFS does not require the use of storage area networks (SANs), expensive network configuration, or any special disks. HDFS can be run on any run-of-the-mill data center setup. HDFS replicates all data written to it, based on the replication factor configured by the user. The default replication factor is 3, which ensures that any data written to HDFS is replicated on three different servers within the cluster. This greatly reduces the possibility that any data written to HDFS will be lost.

HDFS, like any other file system, writes data to individual blocks. Each HDFS file consists of one or more blocks, depending on the size of the file. HDFS is designed to hold very large files, so HDFS block sizes are usually much larger than those of other file systems. Block sizes are configurable, and in most cases range from 128 MB to 512 MB. HDFS tries to ensure that each block is replicated according to the replication factor, thus ensuring that the file itself is replicated as many times as the replication factor. HDFS is also rack-aware, and the default block placement policy spreads the replicas of each block across racks so that the loss of an entire rack does not cause data loss.
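
As a minimal sketch of how these settings surface in the Java client API, the block size and replication factor can be specified per file when the file is created. The path, buffer size, and values below are only illustrative:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; replication factor of 3 and a 128 MB block size.
        Path path = new Path("/Data/example.txt");
        int bufferSize = 4096;
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;

        try (FSDataOutputStream out =
                 fs.create(path, true, bufferSize, replication, blockSize)) {
          out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }
      }
    }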

HDFS consists of two types of servers: name nodes and data nodes. A Hadoop cluster generally has two name nodes and several data nodes. Data nodes are the nodes on which the data is stored. At any point in time, there is one active name node and an optional standby name node. The active name node serves requests from clients and data nodes, while the standby name node is a hot backup that takes over if the active name node goes down or becomes unreachable. Name nodes are responsible for storing metadata about the files and blocks on the file system. The name node maps every file to the list of blocks the file consists of, and also holds each block’s location: which data nodes the block is stored on and where on each data node it resides.

Each client write is initially written to a temporary local file on the client machine, until the client flushes or closes the file, or the temporary file grows past a block boundary. At that point the file is created on HDFS (or a new block is added, if a block boundary is crossed or an existing file is reopened for append) and the name node assigns blocks to it. The data is then written to each block, which is replicated to multiple data nodes, one after another. The operation succeeds only if all of the data nodes successfully write their replicas of the block.

HDFS files cannot be edited and are append-only. Each file, once closed, can be opened only to append data to it. HDFS also does not guarantee that writes to a file are visible to other clients until the client writing the data flushes the data to data node memory, or closes the file. Each time a new block is required, the name node allocates a new block to the file and keeps track of it. For each read, the client gets the locations of the blocks that represent the file from the name node and directly reads the data from the data node. From the user’s point of view, HDFS is a single storage system and the fact that each file is replicated and stored on multiple systems is completely transparent to the user. So, user code need not worry about any of the failure tolerance or replication aspects of HDFS. Even the client API writing to a file on the local machine before a flush or close call is transparent to the user code.
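
To make this concrete, here is a small sketch using the standard org.apache.hadoop.fs.FileSystem API, with a hypothetical path, that asks the name node for a file’s block locations and then reads the file as an ordinary stream:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/Data/example.txt");  // hypothetical path

        // The name node supplies the block-to-data-node mapping; the client
        // then streams the actual bytes directly from the data nodes.
        FileStatus status = fs.getFileStatus(path);
        for (BlockLocation block :
             fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("Block at offset " + block.getOffset()
              + " hosted on " + String.join(",", block.getHosts()));
        }

        // Reading looks like ordinary stream I/O; replication and block
        // placement are invisible to the client code.
        try (FSDataInputStream in = fs.open(path)) {
          byte[] buffer = new byte[4096];
          int read = in.read(buffer);
          System.out.println("Read " + read + " bytes");
        }
      }
    }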

The client API is one way of interacting with HDFS. HDFS also provides a set of shell commands that can be used to perform many common file operations. HDFS commands are of the form:

hdfs dfs -<command> <options>

For example, to get a listing of files in the /Data/ directory on HDFS, the following command can be used:

hdfs dfs -ls /Data

The list of supported commands can be found in the Hadoop documentation [commands]. Running these commands requires that HDFS be configured correctly on the system, with HADOOP_HOME or HADOOP_PREFIX set and the Hadoop configuration files on HADOOP_CLASSPATH. For more details on HDFS architecture and configuration, refer to Hadoop: The Definitive Guide [hdfs-architecture].

HDFS Data Formats

In general, data formats in HDFS are classified into splittable and unsplittable formats. A splittable format is one in which a file can be reliably split at record boundaries into multiple pieces, called splits. With a splittable format, a reader can seek from any point in the file to the start of the next record. Splittable file formats are MapReduce-friendly, since MapReduce splits files so that different mappers can read and process parts of the same file in parallel.

It is always better to use binary formats rather than text to write to HDFS, because most binary formats have some way of indicating corruption or incompleteness in a record. Failures can cause incomplete records to be written to files: if the HDFS cluster runs out of space or has connectivity issues, for example, a block allocation failure can leave the file with incomplete or corrupt records. Binary formats help ensure that such bad records are detected and ignored. An example of a binary format commonly used in Hadoop is the Avro container file format. This format is splittable and can detect corrupt or incomplete records in a file. MapReduce, Hive, Pig, etc., support Avro as an input format. Avro also supports compression using the Snappy and Deflate/bz2 compression codecs.
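
As a rough sketch, writing an Avro container file with Snappy compression looks like the following. The record schema and file name are made up for illustration:

    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroWriteExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical schema with a single string field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.snappyCodec());  // block-level compression
        writer.create(schema, new File("events.avro"));

        GenericRecord record = new GenericData.Record(schema);
        record.put("body", "hello, avro");
        writer.append(record);

        // Each data block ends with a sync marker, which is what makes the
        // container format splittable.
        writer.close();
      }
    }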

There are several data formats that are typically used on HDFS. One of the most common is the sequence file, a splittable format that is typically used with MapReduce jobs. A sequence file is represented as a list of keys and values, each of which is an instance of a Writable, Hadoop’s serialization interface.
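
A minimal sketch of writing a sequence file with the Hadoop API might look like this, assuming Text keys, IntWritable values, and a hypothetical output path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/Data/events.seq");  // hypothetical path

        // Keys and values are Writables; here Text keys and IntWritable values.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
            SequenceFile.Writer.file(path),
            SequenceFile.Writer.keyClass(Text.class),
            SequenceFile.Writer.valueClass(IntWritable.class))) {
          writer.append(new Text("event-count"), new IntWritable(42));
        }
      }
    }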

Some compression formats, such as bzip2 and preprocessed (indexed) LZO, are themselves splittable. More details on file formats in the Hadoop ecosystem can be found in Hadoop: The Definitive Guide [serialization].

Flume supports writing several of the built-in formats out of the box, and also allows users to plug in their own serializers that can write data in any format of their own choosing to HDFS. We will discuss this in Chapter 5, in “Controlling the Data Format Using Serializers”.

Processing Data on HDFS

As we discussed, Hadoop brings the processing systems to the data store. As a result, the same nodes that host the data also run systems that can process the data stored on HDFS. MapReduce has long been the classical system that processes data on HDFS.

MapReduce is a distributed processing framework that allows the user to write Java code that reads data from HDFS and processes it. Each MapReduce program runs on multiple nodes, each processing a part of the input data. MapReduce programs have two phases: the Map phase and the Reduce phase. Each phase runs a piece of Java code on multiple nodes simultaneously, thus processing huge amounts of data in parallel. Each mapper reads an input split (a fixed subset of the inputs) from a specific directory on HDFS and processes the inputs as keys and corresponding values.

How the data in the files in the directory is mapped to keys and values depends on the format being used and the input format that processes it [input-format]. The Map phase processes the inputs and produces intermediate key-value pairs. All key-value pairs with the same intermediate key are then processed by the same reducer, which eventually writes out the final output as key-value pairs to a configured output directory. You can read more about MapReduce in Hadoop: The Definitive Guide [mr].
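
The canonical example is word count. The sketch below, written against the org.apache.hadoop.mapreduce API, shows the Map and Reduce phases and the driver that wires them to input and output directories:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer tokenizer = new StringTokenizer(value.toString());
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: all counts for the same word arrive at one reducer.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum += value.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

To run it, the class is packaged into a JAR and submitted with the hadoop jar command, passing the input and output directories as arguments.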

Apache Hive and Cloudera Impala provide SQL interfaces (really subsets of SQL) to process data on HDFS. Hive parses the SQL query to generate a MapReduce job that processes the data, while Impala has its own processing engine that reads the data and applies transformations based on the query to process the data. These systems map flat files on HDFS to tables on which the queries are run. Such systems provide an easy migration path for users who have been using SQL-based database systems to process and store their data. There are several other systems, like Apache Pig, Apache Spark, etc., that can be used to process data stored on HDFS.

Apache HBase

Apache HBase is the Hadoop ecosystem’s key-value store. HBase is often used to write and update data in real time. It is also used to serve data in real time, in places where a traditional database could be used. HBase is built on top of HDFS and relies on HDFS for replication. Logically, the HBase data model is similar to a database, with data being written to tables that have several rows and columns. The columns, however, are not fixed in a schema and can be created dynamically by a client: each row can have a different set of columns, and there is no fixed schema representing a fixed set of columns.

Each row is accessed with a key known as the row key, which is very similar to the primary key in a standard database system. There can be as many columns for a row key as required, but there is at most one current value per column for a given row (though HBase can keep multiple “versions” of a column: the last n values written for that row). HBase groups columns into column families, which are stored together on HDFS. Therefore, it is usually a good idea to place columns whose data is written and accessed in similar patterns in the same column family.

The HBase client API allows Java programs to interact with an HBase cluster. Writes to HBase are in the form of Puts, which represent writes to a single row. A single Put is a single remote procedure call (RPC) that can write to multiple columns within the same row. HBase also supports Increments, which can be used to increment values in columns used as counters. Just like Puts, Increments can update multiple columns in the same row in a single RPC call.
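
As a minimal sketch using the Java client API (the Connection/Table style found in recent HBase versions), a Put and an Increment look like the following. The table name, column families, and values are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

          // One Put = one RPC that can touch several columns of one row.
          Put put = new Put(Bytes.toBytes("row-1"));
          put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("source"),
              Bytes.toBytes("webserver-42"));
          put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("body"),
              Bytes.toBytes("hello, hbase"));
          table.put(put);

          // An Increment bumps counter columns, again in a single RPC.
          Increment increment = new Increment(Bytes.toBytes("row-1"));
          increment.addColumn(Bytes.toBytes("counters"), Bytes.toBytes("hits"), 1L);
          table.increment(increment);
        }
      }
    }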

In the context of Flume’s HBase interaction, we are only concerned with Puts and Increments, though HBase provides RPC calls to update or delete data. More details on HBase operations can be found in HBase: The Definitive Guide [hbase-client]. To interact with HBase from languages other than Java, HBase provides a Thrift API, which you can read about on the Apache HBase wiki [hbase-thrift].

In addition to the client API, HBase provides a shell to interact with the HBase cluster. The HBase shell has commands to do Puts, Gets, Increments, Deletes, Scans, and so on, and also to create, disable, truncate, and delete tables [hbase-shell]. To start an HBase shell, use the following command:

hbase shell

HBase provides row-level atomicity. If a writer writes to multiple columns within the same row in a single Put, a reader is guaranteed to see either the old values of all of those columns or the new values of all of them, never a mix of old and new. HBase, though, provides no multirow transactions or full ACID (Atomicity, Consistency, Isolation, Durability) compliance. Since there are no transactions spanning multiple rows, there are no consistency guarantees for clients reading multiple rows.

As mentioned earlier, HBase is built on top of HDFS, so data stored in HBase is automatically replicated. HBase divides the rows of each table into Regions. A region is simply the set of rows whose row keys fall between two fixed boundary keys. HBase partitions the entire dataset into multiple regions, each of which is hosted by a server known as a Region Server. At any point in time, exactly one region server hosts a particular region, though a single server can host more than one region. Every read or write to a row belonging to a region goes through the region server hosting that region. The HBase Master decides which region server hosts which region; the Master is HBase’s version of the HDFS name node. The Master also decides when a region has become too big and has to be split, and so on.

Flume allows users to Put data or Increment counters on HBase. The user can plug in custom pieces of code to do the translation from Flume events to HBase Puts or Increments. We will cover this in “Translating Flume Events to HBase Puts and Increments Using Serializers”.

Summary

In this chapter, we discussed the basics of HDFS and HBase. Though Flume supports other systems, these are the most important and commonly used systems. In Chapter 5, we will discuss how to write data to these systems in a scalable way using Flume.

References
