Chapter 1. Introduction

Back in 2003, Google published a paper describing a scale-out architecture for storing massive amounts of data across clusters of servers, which it called the Google File System (GFS). A year later, Google published another paper describing a programming model called MapReduce, which took advantage of GFS to process data in a parallel fashion, bringing the program to where the data resides. Around the same time, Doug Cutting and others were building an open source web crawler now called Apache Nutch. The Nutch developers realized that the MapReduce programming model and GFS were the perfect building blocks for a distributed web crawler, and they began implementing their own versions of both projects. These components would later split from Nutch and form the Apache Hadoop project. The ecosystem[1] of projects built around Hadoop’s scale-out architecture brought about a different way of approaching problems by allowing the storage and processing of all data important to a business.

While these new and exciting ways to process and store data have brought the Hadoop ecosystem into use cases across many different verticals, it has become apparent that managing petabytes of data in a single centralized cluster can be dangerous. Hundreds if not thousands of servers linked together in a common application stack raise many questions about how to protect such a valuable asset. While other books focus on such things as writing MapReduce code, designing optimal ingest frameworks, or architecting complex low-latency processing systems on top of the Hadoop ecosystem, this one focuses on how to ensure that all of these things can be protected using the numerous security features available across the stack as part of a cohesive Hadoop security architecture.

Security Overview

Before this book can begin covering Hadoop-specific content, it is useful to understand some key theory and terminology related to information security. At the heart of information security theory is a model known as CIA, which stands for confidentiality, integrity, and availability. These three components of the model are high-level concepts that can be applied to a wide range of information systems, computing platforms, and—more specifically to this book—Hadoop. We also take a closer look at authentication, authorization, and accounting, which are critical components of secure computing that will be discussed in detail throughout the book.

Warning

While the CIA model helps to organize some information security principles, it is important to point out that this model is not a strict set of standards to follow. Security features in the Hadoop platform may span more than one of the CIA components, or possibly none at all.

Confidentiality

Confidentiality is a security principle focusing on the notion that information is seen only by the intended recipients. For example, if Alice sends a letter in the mail to Bob, it would only be deemed confidential if Bob were the only person able to read it. While this might seem straightforward enough, several important security concepts are necessary to ensure that confidentiality actually holds. For instance, how does Alice know that the letter she is sending is actually being read by the right Bob? If the correct Bob reads the letter, how does he know that the letter actually came from the right Alice? In order for both Alice and Bob to take part in this confidential information passing, they need to have an identity that uniquely distinguishes them from any other person. Additionally, both Alice and Bob need to prove their identities via a process known as authentication. Identity and authentication are key components of Hadoop security and are covered at length in Chapter 5.

Another important concept of confidentiality is encryption. Encryption applies a mathematical algorithm to a piece of information so that the output cannot be read by unintended recipients; only the intended recipients are able to decrypt the encrypted message back to the original unencrypted message. Encryption of data can be applied both at rest and in flight. At-rest encryption means that data resides in an encrypted format when not being accessed; a file that is encrypted and located on a hard drive is an example of at-rest encryption. In-flight encryption, also known as over-the-wire encryption, applies to data sent from one place to another over a network. Both modes of encryption can be used independently or together. At-rest encryption for Hadoop is covered in Chapter 9, and in-flight encryption is covered in Chapters 10 and 11.
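To make at-rest encryption concrete outside of Hadoop, the following is a minimal sketch using the standard openssl command-line tool; the filename is hypothetical and the passphrase handling is purely illustrative, not a recommendation for Hadoop itself:

$ openssl enc -aes-256-cbc -salt -in report.csv -out report.csv.enc        # encrypt (prompts for a passphrase)
$ openssl enc -d -aes-256-cbc -in report.csv.enc -out report.decrypted.csv # decrypt with the same passphrase

Anyone who obtains report.csv.enc without the passphrase sees only unreadable ciphertext, which is exactly the property at-rest encryption provides for data sitting on disk.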

Integrity

Integrity is an important part of information security. In the previous example where Alice sends a letter to Bob, what happens if Charles intercepts the letter in transit and makes changes to it unbeknownst to Alice and Bob? How can Bob ensure that the letter he receives is exactly the message that Alice sent? This is the concept of data integrity, a critical component of information security, especially in industries with highly sensitive data. Imagine if a bank had no mechanism to prove the integrity of customer account balances, a hospital no way to verify the integrity of patient records, or a government no assurance of the integrity of its intelligence secrets. Even if confidentiality is guaranteed, data that lacks integrity guarantees is at risk of substantial damage. Integrity is covered in Chapters 9 and 10.
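A simple building block for verifying integrity is a cryptographic checksum. As a minimal sketch outside of Hadoop (the filename is hypothetical), a digest recorded while the data is known to be good can later reveal whether even a single byte has changed:

$ sha256sum statement.txt > statement.txt.sha256   # record a digest of the trusted copy
$ sha256sum -c statement.txt.sha256                # later: verify the file still matches
statement.txt: OK

If the file had been tampered with, the check would report FAILED instead of OK.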

Availability

Availability is a different type of principle than the previous two. While confidentiality and integrity map closely to well-known security concepts, availability is largely a matter of operational preparedness. For example, if Alice tries to send her letter to Bob, but the post office is closed, the letter cannot be sent, thus making it unavailable to him. The availability of data or services can be impacted by routine events such as scheduled downtime for upgrades or applying security patches, but it can also be impacted by security events such as distributed denial-of-service (DDoS) attacks. The handling of high-availability configurations is covered in Hadoop Operations and Hadoop: The Definitive Guide, but the concepts will be covered from a security perspective in Chapters 3 and 10.

Authentication, Authorization, and Accounting

Authentication, authorization, and accounting (often abbreviated as AAA) refer to an architectural pattern in computer security in which users of a service prove their identity, are granted access based on rules, and have a record of their actions maintained for auditing purposes. Closely tied to AAA is the concept of identity. Identity refers to how a system distinguishes between different entities, users, and services, and is typically represented by an arbitrary string, such as a username, or a unique number, such as a user ID (UID).

Before diving into how Hadoop supports identity, authentication, authorization, and accounting, consider how these concepts are used in the much simpler case of using the sudo command on a single Linux server. Let’s take a look at the terminal session for two different users, Alice and Bob. On this server, Alice is given the username alice and Bob is given the username bob. Alice logs in first, as shown in Example 1-1.

Example 1-1. Authentication and authorization
$ ssh alice@hadoop01
alice@hadoop01's password:
Last login: Wed Feb 12 15:26:55 2014 from 172.18.12.166
[alice@hadoop01 ~]$ sudo service sshd status
openssh-daemon (pid  1260) is running...
[alice@hadoop01 ~]$

In Example 1-1, Alice logs in through SSH and is immediately prompted for her password. Her username/password pair is verified against the local user database (her account entry lives in /etc/passwd, while on modern Linux systems the password hash itself is typically stored in /etc/shadow). When this step is completed, Alice has been authenticated with the identity alice. The next thing Alice does is use the sudo command to get the status of the sshd service, which requires superuser privileges. The command succeeds, indicating that Alice was authorized to perform that command. In the case of sudo, the rules that govern who is authorized to execute commands as the superuser are stored in the /etc/sudoers file, shown in Example 1-2.

Example 1-2. /etc/sudoers
[root@hadoop01 ~]# cat /etc/sudoers
root ALL = (ALL) ALL
%wheel ALL = (ALL) NOPASSWD:ALL
[root@hadoop01 ~]#

In Example 1-2, we see that the root user is granted permission to execute any command with sudo and that members of the wheel group are granted permission to execute any command with sudo while not being prompted for a password. In this case, the system is relying on the authentication that was performed during login rather than issuing a new authentication challenge. The final question is, how does the system know that Alice is a member of the wheel group? In Unix and Linux systems, this is typically controlled by the /etc/group file.

In this way, we can see that two files control Alice’s identity: the /etc/passwd file (see Example 1-4) assigns her username a unique UID as well as details such as her home directory, while the /etc/group file (see Example 1-3) further provides information about the identity of groups on the system and which users belong to which groups. These sources of identity information are then used by the sudo command, along with authorization rules found in the /etc/sudoers file, to verify that Alice is authorized to execute the requested command.

Example 1-3. /etc/group
[root@hadoop01 ~]# grep wheel /etc/group
wheel:x:10:alice
[root@hadoop01 ~]#
Example 1-4. /etc/passwd
[root@hadoop01 ~]# grep alice /etc/passwd
alice:x:1000:1000:Alice:/home/alice:/bin/bash
[root@hadoop01 ~]#
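The same identity information can be summarized in a single step with the id command; exact output formatting varies slightly between systems, but for the entries shown in Examples 1-3 and 1-4 it looks like this:

[root@hadoop01 ~]# id alice
uid=1000(alice) gid=1000(alice) groups=1000(alice),10(wheel)
[root@hadoop01 ~]#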

Now let’s see how Bob’s session turns out in Example 1-5.

Example 1-5. Authorization failure
$ ssh bob@hadoop01
bob@hadoop01's password:
Last login: Wed Feb 12 15:30:54 2014 from 172.18.12.166
[bob@hadoop01 ~]$ sudo service sshd status

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for bob:
bob is not in the sudoers file.  This incident will be reported.
[bob@hadoop01 ~]$

In this example, Bob is able to authenticate in much the same way that Alice does, but when he attempts to use sudo he sees very different behavior. First, he is again prompted for his password and after successfully supplying it, he is denied permission to run the service command with superuser privileges. This happens because, unlike Alice, Bob is not a member of the wheel group and is therefore not authorized to use the sudo command.

That covers identity, authentication, and authorization, but what about accounting? For actions that interact with secure services such as SSH and sudo, Linux maintains an audit log; on Red Hat-based systems this is /var/log/secure (Debian-based systems use /var/log/auth.log). This file records an account of certain actions, including both successes and failures. If we take a look at this log after Alice and Bob have performed the preceding actions, we see the output in Example 1-6 (formatted for readability).

Example 1-6. /var/log/secure
[root@hadoop01 ~]# tail -n 6 /var/log/secure
Feb 12 20:32:04 ip-172-25-3-79 sshd[3774]: Accepted password for
  alice from 172.18.12.166 port 65012 ssh2
Feb 12 20:32:04 ip-172-25-3-79 sshd[3774]: pam_unix(sshd:session):
  session opened for user alice by (uid=0)
Feb 12 20:32:33 ip-172-25-3-79 sudo:    alice : TTY=pts/0 ;
  PWD=/home/alice ; USER=root ; COMMAND=/sbin/service sshd status
Feb 12 20:33:15 ip-172-25-3-79 sshd[3799]: Accepted password for
  bob from 172.18.12.166 port 65017 ssh2
Feb 12 20:33:15 ip-172-25-3-79 sshd[3799]: pam_unix(sshd:session):
  session opened for user bob by (uid=0)
Feb 12 20:33:39 ip-172-25-3-79 sudo:      bob : user NOT in sudoers;
  TTY=pts/2 ; PWD=/home/bob ; USER=root ; COMMAND=/sbin/service sshd status
[root@hadoop01 ~]#

For both users, the fact that they successfully logged in using SSH is recorded, as are their attempts to use sudo. In Alice’s case, the system records that she successfully used sudo to execute the /sbin/service sshd status command as the user root. For Bob, on the other hand, the system records that he attempted to execute the /sbin/service sshd status command as the user root and was denied permission because he is not in /etc/sudoers.

This example shows how the concepts of identity, authentication, authorization, and accounting are used to maintain a secure system in the relatively simple example of a single Linux server. These concepts are covered in detail in a Hadoop context in Part II.

Hadoop Security: A Brief History

At its heart, Hadoop is about storing and processing large amounts of data efficiently and, as it turns out, cheaply (in monetary terms) when compared to other platforms. The focus early on in the project was on the actual technology to make this happen: much of the code dealt with the complexities inherent in distributed systems, such as handling failures and coordinating work. Due to this focus, the early Hadoop project established a security stance that the entire cluster of machines, and all of the users accessing it, are part of a trusted network. What this effectively means is that Hadoop did not have strong security measures in place to enforce, well, much of anything.

As the project evolved, it became apparent that at a minimum there should be a mechanism for users to strongly authenticate to prove their identities. The mechanism chosen for the project was Kerberos, a well-established protocol that today is common in enterprise systems such as Microsoft Active Directory. After strong authentication came strong authorization. Strong authorization defined what an individual user could do after they had been authenticated. Initially, authorization was implemented on a per-component basis, meaning that administrators needed to define authorization controls in multiple places. Eventually this became easier with Apache Sentry (Incubating), but even today there is not a holistic view of authorization across the ecosystem, as we will see in Chapters 6 and 7.

Another aspect of Hadoop security that is still evolving is the protection of data through encryption and other confidentiality mechanisms. In the trusted network, it was assumed that data was inherently protected from unauthorized users because only authorized users were on the network. Since then, Hadoop has added encryption for data transmitted between nodes, as well as data stored on disk. We will see how this security evolution comes into play as we proceed, but first we will take a look at the Hadoop ecosystem to get our bearings.

Hadoop Components and Ecosystem

In this section, we provide a 50,000-foot view of the Hadoop ecosystem components that are covered throughout the book. This introduces the components before we discuss their security in later chapters. Readers who are well versed in the components listed can safely skip to the next section. Unless otherwise noted, security features described throughout this book apply to the versions of the associated project listed in Table 1-1.

Table 1-1. Project versions[a]

Project                         Version
Apache HDFS                     2.3.0
Apache MapReduce (for MR1)      1.2.1
Apache YARN (for MR2)           2.3.0
Apache Hive                     0.12.0
Cloudera Impala                 2.0.0
Apache HBase                    0.98.0
Apache Accumulo                 1.6.0
Apache Solr                     4.4.0
Apache Oozie                    4.0.0
Cloudera Hue                    3.5.0
Apache ZooKeeper                3.4.5
Apache Flume                    1.5.0
Apache Sqoop                    1.4.4
Apache Sentry (Incubating)      1.4.0-incubating

[a] An astute reader will notice some omissions in the list of projects covered. In particular, there is no mention of Apache Spark, Apache Ranger, or Apache Knox. These projects were omitted due to time constraints and given their status as relatively new additions to the Hadoop ecosystem.

Apache HDFS

The Hadoop Distributed File System, or HDFS, is often considered the foundation component for the rest of the Hadoop ecosystem. HDFS is the storage layer for Hadoop and provides the ability to store massive amounts of data while growing storage capacity and aggregate bandwidth in a linear fashion. HDFS is a logical filesystem that spans many servers, each with multiple hard drives. This is important to understand from a security perspective because a given file in HDFS can span many or all servers in the Hadoop cluster, which means that client interactions with a given file might require communication with every node in the cluster. This is made possible by a key implementation feature of HDFS that breaks up files into blocks. Each block of data for a given file can be stored on any physical drive on any node in the cluster. Because this is a complex topic that we cannot cover in depth here, we are omitting the details of how that works and recommend Hadoop: The Definitive Guide, 3rd Edition by Tom White (O’Reilly). The important security takeaway is that all files in HDFS are broken up into blocks, and clients using HDFS will communicate over the network with all of the servers in the Hadoop cluster when reading and writing files.
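As a brief illustration of this, assuming a running cluster and a hypothetical file named weblogs.txt, the standard HDFS command-line tools show how a single file is split into blocks and where the replicas of those blocks live:

[alice@hadoop01 ~]$ hdfs dfs -put weblogs.txt /user/alice/weblogs.txt
[alice@hadoop01 ~]$ hdfs fsck /user/alice/weblogs.txt -files -blocks -locations

The fsck report lists each block of the file along with the DataNodes holding its replicas, making it clear that reading the file back involves network connections to multiple servers.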

HDFS is built on a head/worker architecture and is comprised of two primary components: NameNode (head) and DataNode (worker). Additional components include JournalNode, HttpFS, and NFS Gateway:

NameNode

The NameNode is responsible for keeping track of all the metadata related to the files in HDFS, such as filenames, block locations, file permissions, and replication. From a security perspective, it is important to know that clients of HDFS, such as those reading or writing files, always communicate with the NameNode. Additionally, the NameNode provides several important security functions for the entire Hadoop ecosystem, which are described later.

DataNode

The DataNode is responsible for the actual storage and retrieval of data blocks in HDFS. Clients of HDFS reading a given file are told by the NameNode which DataNode in the cluster has the block of data requested. When writing data to HDFS, clients write a block of data to a DataNode determined by the NameNode. From there, that DataNode sets up a write pipeline to other DataNodes to complete the write based on the desired replication factor.

JournalNode

The JournalNode is a special type of component for HDFS. When HDFS is configured for high availability (HA), the active NameNode writes its metadata changes, known as the edit log, to a quorum of JournalNodes rather than to a single location. Clusters typically run an odd number of JournalNodes (usually three or five) so that a majority can always be reached. For example, if a new file is written to HDFS, the metadata about the change is sent to every JournalNode; once a majority of the JournalNodes have successfully written this information, the change is considered durable. HDFS clients and DataNodes do not interact with JournalNodes directly.

HttpFS

HttpFS is a component of HDFS that provides a proxy for clients to the NameNode and DataNodes. This proxy exposes a REST API, allowing clients to use HDFS through the proxy without having direct connectivity to any of the other components in HDFS. HttpFS will be a key component in certain cluster architectures, as we will see later in the book; a brief sketch of its REST interface appears after this list.

NFS Gateway

The NFS gateway, as the name implies, allows for clients to use HDFS like an NFS-mounted filesystem. The NFS gateway is an actual daemon process that facilitates the NFS protocol communication between clients and the underlying HDFS cluster. Much like HttpFS, the NFS gateway sits between HDFS and clients and therefore affords a security boundary that can be useful in certain cluster architectures.

KMS

The Hadoop Key Management Server, or KMS, plays an important role in HDFS transparent encryption at rest. Its purpose is to act as the intermediary between HDFS clients, the NameNode, and a key server, handling encryption operations such as decrypting data encryption keys and managing encryption zone keys. This is covered in detail in Chapter 9.
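To make the last two of these components more concrete, the following sketches use hypothetical hostnames, key names, and paths. The first request lists a home directory through HttpFS's WebHDFS-compatible REST interface (14000 is the default HttpFS port; simple authentication is assumed here). The second set of commands creates an encryption key and then an encryption zone backed by the KMS; the createZone step requires administrative privileges:

$ curl "http://httpfs01.example.com:14000/webhdfs/v1/user/alice?op=LISTSTATUS&user.name=alice"

$ hadoop key create saleskey                                   # register a key with the KMS
$ hdfs dfs -mkdir /data/sales                                  # encryption zones start as empty directories
$ hdfs crypto -createZone -keyName saleskey -path /data/sales

Files subsequently written under /data/sales are transparently encrypted and decrypted on behalf of authorized clients, as described in Chapter 9.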

Apache YARN

As Hadoop evolved, it became apparent that the MapReduce processing framework, while incredibly powerful, did not address the needs of additional use cases. Many problems are not easily solved, if at all, using the MapReduce programming paradigm. What was needed was a more generic framework that could better fit additional processing models. Apache YARN provides this capability. Other processing frameworks and applications, such as Impala and Spark, use YARN as the resource management framework. While YARN provides a more general resource management framework, MapReduce is still the canonical application that runs on it. MapReduce that runs on YARN is considered version 2, or MR2 for short. The YARN architecture consists of the following components:

ResourceManager

The ResourceManager daemon is responsible for application submission requests, assigning ApplicationMaster tasks, and enforcing resource management policies.

JobHistory Server

The JobHistory Server, as the name implies, keeps track of the history of all jobs that have run on the YARN framework. This includes job metrics like running time, number of tasks run, amount of data written to HDFS, and so on.

NodeManager

The NodeManager daemon is responsible for launching individual tasks for jobs within YARN containers, which consist of virtual cores (CPU resources) and RAM resources. Individual tasks can request some number of virtual cores and some amount of memory depending on their needs; the minimum, maximum, and increment ranges are defined by the ResourceManager. Tasks execute as separate processes, each with its own JVM. One important role of the NodeManager is to launch a special task called the ApplicationMaster. This task is responsible for managing the status of all tasks for the given application. YARN separates resource management from task management to better scale YARN applications in large clusters, as each job executes its own ApplicationMaster.

Apache MapReduce

MapReduce is the processing counterpart to HDFS and provides the most basic mechanism to batch process data. When MapReduce is executed on top of YARN, it is often called MapReduce2, or MR2. This distinguishes the YARN-based version of MapReduce from the standalone MapReduce framework, which has been retroactively named MR1. MapReduce jobs are submitted by clients to the MapReduce framework and operate over a subset of data in HDFS, usually a specified directory. MapReduce itself is a programming paradigm that allows chunks of data, or blocks in the case of HDFS, to be processed by multiple servers in parallel, independent of one another. While a Hadoop developer needs to know the intricacies of how MapReduce works, a security architect largely does not. What a security architect needs to know is that clients submit their jobs to the MapReduce framework and from that point on, the MapReduce framework handles the distribution and execution of the client code across the cluster. Clients do not interact with any of the nodes in the cluster to make their job run. Jobs themselves require some number of tasks to be run to complete the work. Each task is started on a given node by the MapReduce framework’s scheduling algorithm.
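For example, a client kicks off a job with nothing more than the hadoop jar command; the jar path below is typical of packaged distributions but varies by installation, and the input and output paths are hypothetical:

[alice@hadoop01 ~]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount /user/alice/weblogs.txt /user/alice/wordcounts

From a security standpoint, the interesting part is what Alice did not do: she never contacted an individual worker node to schedule or run her tasks; the framework did that for her.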

Note

Individual tasks started by the MapReduce framework on a given server are executed as different users depending on whether Kerberos is enabled. Without Kerberos enabled, individual tasks are run as the mapred system user. When Kerberos is enabled, the individual tasks are executed as the user that submitted the MapReduce job. However, even if Kerberos is enabled, it may not be immediately apparent which user is executing the underlying MapReduce tasks when another component or tool is submitting the MapReduce job. See “Impersonation” for a relevant detailed discussion regarding Hive impersonation.

Similar to HDFS, MapReduce is also a head/worker architecture and is comprised of two primary components:

JobTracker (head)

When clients submit jobs to the MapReduce framework, they are communicating with the JobTracker. The JobTracker handles the submission of jobs by clients and determines how jobs are to be run by deciding things like how many tasks the job requires and which TaskTrackers will handle a given task. The JobTracker also handles security and operational features such as job queues, scheduling pools, and access control lists to determine authorization. Lastly, the JobTracker handles job metrics and other information about the job, which are communicated to it from the various TaskTrackers throughout the execution of a given job. The JobTracker includes both resource management and task management, which were split in MR2 between the ResourceManager and ApplicationMaster.

TaskTracker (worker)

TaskTrackers are responsible for executing a given task that is part of a MapReduce job. TaskTrackers receive tasks to run from the JobTracker, and spawn off separate JVM processes for each task they run. TaskTrackers execute both map and reduce tasks, and the number of each that can be run concurrently is part of the MapReduce configuration. The important takeaway from a security standpoint is that the JobTracker decides which tasks are run and on which TaskTrackers. Clients do not have control over how tasks are assigned, nor do they communicate with TaskTrackers as part of normal job execution.

A key point about MapReduce is that several other Hadoop ecosystem components are built as frameworks and libraries on top of it: MapReduce handles the actual processing of data, while these frameworks and libraries abstract the MapReduce job execution away from clients. Hive, Pig, and Sqoop are examples of components that use MapReduce in this fashion.

Tip

Understanding how MapReduce jobs are submitted is an important part of user auditing in Hadoop, and is discussed in detail in “Block access tokens”. A user submitting her own Java MapReduce code is a much different activity from a security point of view than a user using Sqoop to import data from a RDBMS or executing a SQL query in Hive, even though all three of these activities use MapReduce.

Apache Hive

The Apache Hive project was started by Facebook. The company saw the utility of MapReduce to process data but found limitations in adoption of the framework due to the lack of Java programming skills in its analyst communities. Most of Facebook’s analysts did have SQL skills, so the Hive project was started to serve as a SQL abstraction layer that uses MapReduce as the execution engine. The Hive architecture consists of the following components:

Metastore database

The metastore database is a relational database that contains all the Hive metadata, such as information about databases, tables, columns, and data types. This information is used to apply structure to the underlying data in HDFS at the time of access, also known as schema on read.

Metastore server

The Hive Metastore Server is a daemon that sits between Hive clients and the metastore database. This affords a layer of security by not allowing clients to have the database credentials to the Hive metastore.

HiveServer2

HiveServer2 is the main access point for clients using Hive. HiveServer2 accepts JDBC and ODBC clients, and for this reason is leveraged by a variety of client tools and other third-party applications.

HCatalog

HCatalog is a series of libraries that allow non-Hive frameworks to have access to Hive metadata. For example, users of Pig can use HCatalog to read schema information about a given directory of files in HDFS. The WebHCat server is a daemon process that exposes a REST interface to clients, which in turn access HCatalog APIs.

For more thorough coverage of Hive, have a look at Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen (O’Reilly).
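As a quick sketch of how clients typically reach HiveServer2, the beeline JDBC client connects with a URL and a username; the hostname and table are hypothetical, 10000 is the default HiveServer2 port, and a Kerberos-enabled cluster would add a principal to the connection URL, as discussed in later chapters:

$ beeline -u "jdbc:hive2://hiveserver01.example.com:10000/default" -n alice
0: jdbc:hive2://hiveserver01.example.com:10000/default> SELECT COUNT(*) FROM weblogs;

Behind the scenes, HiveServer2 compiles the query into one or more MapReduce jobs and submits them on the user's behalf.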

Cloudera Impala

Cloudera Impala is a massively parallel processing (MPP) framework that is purpose-built for analytic SQL. Impala reads data from HDFS and utilizes the Hive metastore for interpreting data structures and formats. The Impala architecture consists of the following components:

Impala daemon (impalad)

The Impala daemon does all of the heavy lifting of data processing. These daemons are collocated with HDFS DataNodes to optimize for local reads.

StateStore

The StateStore daemon process maintains state information about all of the Impala daemons running. It monitors whether Impala daemons are up or down, and broadcasts status to all of the daemons. The StateStore is not a required component in the Impala architecture, but it does provide for faster failure tolerance in the case where one or more daemons have gone down.

Catalog server

The Catalog server is Impala’s gateway into the Hive metastore. This process is responsible for pulling metadata from the Hive metastore and synchronizing metadata changes that have occurred by way of Impala clients. Having a separate Catalog server helps to reduce the load the Hive metastore server encounters, as well as to provide additional optimizations for Impala for speed.

Tip

New users to the Hadoop ecosystem often ask what the difference is between Hive and Impala because they both offer SQL access to data in HDFS. Hive was created to allow users that are familiar with SQL to process data in HDFS without needing to know anything about MapReduce. It was designed to abstract the innards of MapReduce to make the data in HDFS more accessible. Hive is largely used for batch access and ETL work. Impala, on the other hand, was designed from the ground up to be a fast analytic processing engine to support ad hoc queries and business intelligence (BI) tools. There is utility in both Hive and Impala, and they should be treated as complementary components.

For more thorough coverage of all things Impala, check out Getting Started with Impala (O’Reilly).
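Clients usually reach the Impala daemons through the impala-shell utility or the same JDBC/ODBC drivers used for Hive. A minimal sketch, with a hypothetical hostname and table:

$ impala-shell -i impalad01.example.com -q "SELECT COUNT(*) FROM weblogs"

Unlike a Hive query, the statement is executed entirely by the Impala daemons rather than being translated into MapReduce jobs.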

Apache Sentry (Incubating)

Sentry is the component that provides fine-grained role-based access controls (RBAC) to several of the other ecosystem components, such as Hive and Impala. While individual components may have their own authorization mechanism, Sentry provides a unified authorization that allows centralized policy enforcement across components. It is a critical component of Hadoop security, which is why we have dedicated an entire chapter to the topic (Chapter 7). Sentry consists of the following components:

Sentry server

The Sentry server is a daemon process that facilitates policy lookups made by other Hadoop ecosystem components. Client components of Sentry are configured to delegate authorization decisions based on the policies put in place by Sentry.

Policy database

The Sentry policy database is the location where all authorization policies are stored. The Sentry server uses the policy database to determine if a user is allowed to perform a given action. Specifically, the Sentry server looks for a matching policy that grants access to a resource for the user. In earlier versions of Sentry, the policy database was a text file that contained all of the policies. The evolution of Sentry and the policy database is discussed in detail in Chapter 7.
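As a preview of what these policies look like, recent Sentry releases accept SQL-style statements issued through beeline (HiveServer2) or impala-shell; the role, database, and group names below are hypothetical, and Chapter 7 covers the details:

CREATE ROLE analyst;                            -- define a role
GRANT SELECT ON DATABASE sales TO ROLE analyst; -- attach a privilege to the role
GRANT ROLE analyst TO GROUP analysts;           -- map the role to a group of users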

Apache HBase

Apache HBase is a distributed key/value store inspired by Google’s BigTable paper, “BigTable: A Distributed Storage System for Structured Data”. HBase typically utilizes HDFS as the underlying storage layer for data, and for the purposes of this book we will assume that is the case. HBase tables are broken up into regions. These regions are partitioned by row key, which is the index portion of a given key. Row IDs are sorted, thus a given region has a range of sorted row keys. Regions are hosted by a RegionServer, where clients request data by a key. The key is comprised of several components: the row key, the column family, the column qualifier, and the timestamp. These components together uniquely identify a value stored in the table.

Clients accessing HBase first look up the RegionServers that are responsible for hosting a particular range of row keys. This lookup is done by scanning the hbase:meta table. When the right RegionServer is located, the client will make read/write requests directly to that RegionServer rather than through the master. The client caches the mapping of regions to RegionServers to avoid going through the lookup process. The location of the server hosting the hbase:meta table is looked up in ZooKeeper. HBase consists of the following components:

Master

As stated, the HBase Master daemon is responsible for managing the regions that are hosted by which RegionServers. If a given RegionServer goes down, the HBase Master is responsible for reassigning the region to a different RegionServer. Multiple HBase Masters can be run simultaneously and the HBase Masters will use ZooKeeper to elect a single HBase Master to be active at any one time.

RegionServer

RegionServers are responsible for serving regions of a given HBase table. Regions are sorted ranges of keys; they can either be defined manually using the HBase shell or automatically defined by HBase over time based upon the keys that are ingested into the table. One of HBase’s goals is to evenly distribute the key-space, giving each RegionServer an equal responsibility in serving data. Each RegionServer typically hosts multiple regions.

REST server

The HBase REST server provides a REST API to perform HBase operations. The native HBase API is a Java API, just like those of many of the other Hadoop ecosystem projects. The REST API is commonly used as a language-agnostic interface that allows clients to use any programming language they wish.

Thrift server

In addition to the REST server, HBase also has a Thrift server. This serves as yet another useful API interface for clients to leverage.

For more information on the architecture of HBase and the use cases it is best suited for, we recommend HBase: The Definitive Guide by Lars George (O’Reilly).
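As a quick illustration of the key structure described above, the hbase shell can create a table with a column family and then write and read an individual cell; the table, column, and value names are hypothetical:

$ hbase shell
hbase(main):001:0> create 'weblogs', 'stats'
hbase(main):002:0> put 'weblogs', 'row1', 'stats:hits', '42'
hbase(main):003:0> get 'weblogs', 'row1'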

Apache Accumulo

Apache Accumulo is a sorted and distributed key/value store designed to be a robust, scalable, high-performance storage and retrieval system. Like HBase, Accumulo was originally based on the Google BigTable design, but was built on top of the Apache Hadoop ecosystem of projects (in particular, HDFS, ZooKeeper, and Apache Thrift). Accumulo uses roughly the same data model as HBase. Each Accumulo table is split into one or more tablets that contain a roughly equal number of records distributed by the record’s row ID. Each record also has a multipart column key that includes a column family, column qualifier, and visibility label. The visibility label was one of Accumulo’s first major departures from the original BigTable design. Visibility labels added the ability to implement cell-level security (we’ll discuss them in more detail in Chapter 6). Finally, each record also contains a timestamp that allows users to store multiple versions of records that otherwise share the same record key. Collectively, the row ID, column, and timestamp make up a record’s key, which is associated with a particular value.

The tablets are distributed by splitting up the set of row IDs. The split points are calculated automatically as data is inserted into a table. Each tablet is hosted by a single TabletServer that is responsible for serving reads and writes to data in the given tablet. Each TabletServer can host multiple tablets from the same tables and/or different tables. This makes the tablet the unit of distribution in the system.

When clients first access Accumulo, they look up the location of the TabletServer hosting the accumulo.root table. The accumulo.root table stores the information for how the accumulo.meta table is split into tablets. The client communicates directly with the TabletServer hosting accumulo.root, and then with the TabletServers hosting the tablets of the accumulo.meta table. Because the data in these tables (especially accumulo.root) changes relatively infrequently compared to other data, the client maintains a cache of tablet locations read from these tables to avoid bottlenecks in the read/write pipeline. Once the client has the location of the tablets for the row IDs that it is reading/writing, it communicates directly with the required TabletServers. At no point does the client have to interact with the Master, and this greatly aids scalability. Overall, Accumulo consists of the following components:

Master

The Accumulo Master is responsible for coordinating the assignment of tablets to TabletServers. It ensures that each tablet is hosted by exactly one TabletServer and responds to events such as a TabletServer failing. It also handles administrative changes to a table and coordinates startup, shutdown, and write-ahead log recovery. Multiple Masters can be run simultaneously and they will elect a leader so that only one Master is active at a time.

TabletServer

The TabletServer handles all read/write requests for a subset of the tablets in the Accumulo cluster. For writes, it handles writing the records to the write-ahead log and flushing the in-memory records to disk periodically. During recovery, the TabletServer replays the records from the write-ahead log into the tablet being recovered.

GarbageCollector

The GarbageCollector periodically deletes files that are no longer needed by any Accumulo process. Multiple GarbageCollectors can be run simultaneously and they will elect a leader so that only one GarbageCollector is active at a time.

Tracer

The Tracer monitors the rest of the cluster using Accumulo’s distributed timing API and writes the data into an Accumulo table for future reference. Multiple Tracers can be run simultaneously and they will distribute the load evenly among them.

Monitor

The Monitor is a web application for monitoring the state of the Accumulo cluster. It displays key metrics such as record count, cache hit/miss rates, and table information such as scan rate. The Monitor also acts as an endpoint for log forwarding so that errors and warnings can be diagnosed from a single interface.
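As a rough sketch of the data model and visibility labels described earlier, the Accumulo shell can insert a record with a visibility label and then scan it back only when matching authorizations are presented. The instance, user, table, and label names are hypothetical, and exact shell options vary somewhat between releases:

$ accumulo shell -u alice
alice@accumulo> createtable weblogs
alice@accumulo weblogs> insert row1 stats hits 42 -l analyst
alice@accumulo weblogs> scan -s analyst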

Apache Solr

The Apache Solr project, and specifically SolrCloud, enables the search and retrieval of documents that are part of a larger collection that has been sharded across multiple physical servers. Search is one of the canonical use cases for big data and is one of the most common utilities used by anyone accessing the Internet. Solr is built on top of the Apache Lucene project, which actually handles the bulk of the indexing and search capabilities. Solr expands on these capabilities by providing enterprise search features such as faceted navigation, caching, hit highlighting, and an administration interface.

Solr has a single component, the server. There can be many Solr servers in a single deployment, which scale out linearly through the sharding provided by SolrCloud. SolrCloud also provides replication features to accommodate failures in a distributed environment.
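Clients typically query a Solr collection over HTTP. A minimal sketch against a hypothetical collection named weblogs, using the default Solr port of 8983:

$ curl "http://solr01.example.com:8983/solr/weblogs/select?q=status_code:404&wt=json"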

Apache Oozie

Apache Oozie is a workflow management and orchestration system for Hadoop. It allows for setting up workflows that contain various actions, each of which can utilize a different component in the Hadoop ecosystem. For example, an Oozie workflow could start by executing a Sqoop import to move data into HDFS, then a Pig script to transform the data, followed by a Hive script to set up metadata structures. Oozie allows for more complex workflows, such as forks and joins that allow multiple steps to be executed in parallel, and other steps that rely on multiple steps to be completed before continuing. Oozie workflows can run on a repeatable schedule based on different types of input conditions such as running at a certain time or waiting until a certain path exists in HDFS.

Oozie consists of just a single server component, and this server is responsible for handling client workflow submissions, managing the execution of workflows, and reporting status.
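Clients interact with the Oozie server through its REST API or the oozie command-line tool. A minimal sketch of submitting and starting a workflow, with a hypothetical hostname and properties file (11000 is the default Oozie port):

$ oozie job -oozie http://oozie01.example.com:11000/oozie -config job.properties -run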

Apache ZooKeeper

Apache ZooKeeper is a distributed coordination service that allows for distributed systems to store and read small amounts of data in a synchronized way. It is often used for storing common configuration information. Additionally, ZooKeeper is heavily used in the Hadoop ecosystem for synchronizing high availability (HA) services, such as NameNode HA and ResourceManager HA.

ZooKeeper itself is a distributed system that relies on an odd number of servers called a ZooKeeper ensemble to reach a quorum, or majority, to acknowledge a given transaction. ZooKeeper has only one component, the ZooKeeper server.
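For example, the zkCli.sh utility that ships with ZooKeeper connects to an ensemble and can list the znodes that services such as NameNode HA and HBase register; the hostname is hypothetical and 2181 is the default client port:

$ zkCli.sh -server zk01.example.com:2181
[zk: zk01.example.com:2181(CONNECTED) 0] ls /

The znodes returned will vary from cluster to cluster depending on which services store their coordination state in ZooKeeper.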

Apache Flume

Apache Flume is an event-based ingestion tool that is used primarily for ingestion into Hadoop, but it can also be used completely independently of it. Flume, as the name would imply, was initially created for the purpose of ingesting log events into HDFS. The Flume architecture consists of three main pieces: sources, sinks, and channels.

A Flume source defines how data is to be read from the upstream provider. This would include things like a syslog server, a JMS queue, or even polling a Linux directory. A Flume sink defines how data should be written downstream. Common Flume sinks include an HDFS sink and an HBase sink. Lastly, a Flume channel defines how data is stored between the source and sink. The two primary Flume channels are the memory channel and file channel. The memory channel affords speed at the cost of reliability, and the file channel provides reliability at the cost of speed.

Flume consists of a single component, a Flume agent. Agents contain the code for sources, sinks, and channels. An important part of the Flume architecture is that Flume agents can be connected to each other, where the sink of one agent connects to the source of another. A common interface in this case is using an Avro source and sink. Flume ingestion and security is covered in Chapter 10 and in Using Flume.
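A minimal sketch of an agent definition ties the three pieces together; the agent name, listening port, and HDFS path are hypothetical, and a production configuration would add capacity, batching, and security settings:

# agent1 reads syslog events over TCP and lands them in HDFS via a memory channel
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = snk1

agent1.sources.src1.type = syslogtcp
agent1.sources.src1.host = 0.0.0.0
agent1.sources.src1.port = 5140
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.snk1.type = hdfs
agent1.sinks.snk1.hdfs.path = /data/syslog
agent1.sinks.snk1.channel = ch1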

Apache Sqoop

Apache Sqoop provides the ability to do batch imports and exports of data to and from a traditional RDBMS, as well as other data sources such as FTP servers. Sqoop itself submits map-only MapReduce jobs that launch tasks to interact with the RDBMS in a parallel fashion. Sqoop is used both as an easy mechanism to initially seed a Hadoop cluster with data, as well as a tool used for regular ingestion and extraction routines. There are currently two different versions of Sqoop: Sqoop1 and Sqoop2. In this book, the focus is on Sqoop1. Sqoop2 is still not feature complete at the time of this writing, and is missing some fundamental security features, such as Kerberos authentication.

Sqoop1 is a set of client libraries that are invoked from the command line using the sqoop binary. These client libraries are responsible for the actual submission of the MapReduce job to the proper framework (e.g., traditional MapReduce or MapReduce2 on YARN). Sqoop is discussed in more detail in Chapter 10 and in Apache Sqoop Cookbook.
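A typical Sqoop1 invocation looks like the following; the connection string, credentials, table, and target directory are hypothetical. Behind the scenes, Sqoop launches a map-only job whose tasks each pull a slice of the table in parallel:

$ sqoop import \
    --connect jdbc:mysql://dbserver01.example.com/sales \
    --username etl_user -P \
    --table orders \
    --target-dir /data/sales/orders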

Cloudera Hue

Cloudera Hue is a web application that exposes many of the Hadoop ecosystem components in a user-friendly way. Hue allows for easy access into the Hadoop cluster without requiring users to be familiar with Linux or the various command-line interfaces the components have. Hue has several different security controls available, which we’ll look at in Chapter 12. Hue is comprised of the following components:

Hue server

This is the main component of Hue. It is effectively a web server that serves web content to users. Users are authenticated at first logon and from there, actions performed by the end user are actually done by Hue itself on behalf of the user. This concept is known as impersonation (covered in Chapter 5).

Kerberos Ticket Renewer

As the name implies, this component is responsible for periodically renewing the Kerberos ticket-granting ticket (TGT), which Hue uses to interact with the Hadoop cluster when the cluster has Kerberos enabled (Kerberos is discussed at length in Chapter 4).

Summary

This chapter introduced some common security terminology that builds the foundation for the topics covered throughout the rest of the book. A key takeaway from this chapter is to become comfortable with the fact that security for Hadoop is not a completely foreign discussion. Tried-and-true security principles such as CIA and AAA resonate in the Hadoop context and will be discussed at length in the chapters to come. Lastly, we took a look at many of the Hadoop ecosystem projects (and their individual components) to understand their purpose in the stack and to get a sense of how security will apply.

In the next chapter, we will dive right into securing distributed systems. You will find that many of the security threats and mitigations that apply to Hadoop are generally applicable to distributed systems.

[1] Apache Hadoop itself consists of four subprojects: HDFS, YARN, MapReduce, and Hadoop Common. However, the Hadoop ecosystem, meaning Hadoop and the related projects that build on or integrate with Hadoop, is often shortened to just Hadoop. We attempt to make it clear when we’re referring to Hadoop the project versus Hadoop the ecosystem.
