
Apache Hadoop is an open source software framework for writing and running distributed applications that process large amounts of data. Hadoop implements a computational paradigm known as MapReduce, which runs over Hadoop’s distributed filesystem (HDFS); together they let Hadoop process very large datasets in parallel across a cluster. Refer to Introduction to the Hadoop Framework for an overview of the Hadoop ecosystem.

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS keeps track of where data resides in the network by associating each piece of data with the name of the rack (or network switch) it lives on. This allows Hadoop to schedule tasks on the nodes that hold the data, or on those nearest to it, optimizing bandwidth utilization.

In this article we will briefly cover the basics of HDFS, the features that make it stand out, and how to interact with HDFS from the command line interface.

Why a Distributed Filesystem?

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage storage across a network of machines are called distributed filesystems. You can store a big dataset of (say) 10 TB as a single file in a distributed filesystem, something that would overwhelm any regular disk filesystem.

HDFS Features

HDFS is a filesystem designed for large-scale distributed data processing under frameworks such as MapReduce. Its key features are the ability to store very large amounts of data with streaming access patterns on a cluster of commodity hardware. You can store a big dataset of (say) 100 TB as a single file in HDFS without any specialized hardware, making HDFS a cost-efficient solution.

Interacting with HDFS

Hadoop is written in Java, and all Hadoop filesystem interactions are conducted through the Java API. Here is a list of the common Hadoop interfaces in use today. Detailed descriptions of each interface can be read in Hadoop: The Definitive Guide, 3rd Edition by Tom White.

  • HTTP Interface
  • C Interface (libhdfs)
  • FUSE (Hadoop Fuse-DFS)
  • The Command Line Interface

Using the Command Line Interface

Hadoop provides a set of command line utilities that work similarly to the Linux file commands and serve as your primary interface with HDFS. We’re going to explore HDFS by interacting with it from the command line, walking through the most common file management tasks in Hadoop:

  • Adding files and directories to HDFS
  • Retrieving files from HDFS to the local filesystem
  • Deleting files from HDFS

Hadoop file commands are issued through the fs option of the hadoop script (assumed here to be on your PATH), and take the following form:
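    hadoop fs -cmd <args>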

Where cmd is the specific file command and <args> is a variable number of arguments. The command cmd is usually named after the corresponding Unix equivalent. For example, the command for listing files is ls as in Unix.
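To list the contents of the HDFS root directory, for example, you would run:

    hadoop fs -ls /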

Adding Files and Directories to HDFS

Before you can run Hadoop programs on data stored in HDFS, you’ll need to put the data into HDFS first. Let’s create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn’t automatically created for you, though, so let’s create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
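For user chuck, the command to create this directory is:

    hadoop fs -mkdir /user/chuck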

Hadoop’s mkdir command automatically creates parent directories if they don’t already exist, much like the Unix mkdir -p. Now that we have a working directory, we can put a file into it. Create a text file called example.txt on your local filesystem. The Hadoop command put is used to copy files from the local filesystem into HDFS.
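Run it from the local directory containing example.txt:

    hadoop fs -put example.txt .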

Note the period (.) as the last argument in the command above. It means that we’re putting the file into the default working directory. The command above is equivalent to:
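    hadoop fs -put example.txt /user/chuck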

Retrieving Files from HDFS

The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve example.txt, we can run the following command:
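    hadoop fs -get example.txt .

This copies example.txt out of HDFS and into the current directory on the local filesystem.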

Another way to access the data is to display it. The Hadoop cat command allows us to do that:
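    hadoop fs -cat example.txt

Since cat writes to standard output, you can also pipe the data into other Unix commands, for example hadoop fs -cat example.txt | head.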

Deleting Files from HDFS

You shouldn’t be too surprised by now that the Hadoop command for removing files is rm:
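    hadoop fs -rm example.txt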

The rm command can also be used to delete empty directories.

Interacting with HDFS Programmatically

We covered how to perform basic file operations in HDFS using the command line interface. Although the command line utilities are sufficient for most of your interaction with HDFS, they’re not exhaustive, and there will be situations where you may want deeper access through the HDFS API. This can be achieved using the Java interface, which is covered in Hadoop: The Definitive Guide, available on Safari Books Online.
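To give a flavor of the Java interface, here is a minimal sketch (not taken from the article) that copies a local file into HDFS through the org.apache.hadoop.fs.FileSystem class; the class name PutExample and the file paths are illustrative only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch of the HDFS Java API; names and paths are not from the article.
    public class PutExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml etc. from the classpath
            FileSystem fs = FileSystem.get(conf);       // connects to the configured default filesystem (HDFS)
            fs.copyFromLocalFile(new Path("example.txt"),              // local source
                                 new Path("/user/chuck/example.txt")); // HDFS destination
            fs.close();
        }
    }

The FileSystem class is the same abstraction the command line utilities are built on, so everything shown above with hadoop fs has a programmatic equivalent.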

Safari Books Online has the content you need

Check out these Hadoop books available from Safari Books Online:

Ready to unlock the power of your data? With Hadoop: The Definitive Guide, 3rd Edition, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You’ll also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
Hadoop in Action teaches readers how to use Hadoop and write MapReduce programs. This book will lead the reader from obtaining a copy of Hadoop to setting it up in a cluster and writing data analytic programs. This book also takes you beyond the mechanics of running Hadoop, teaching you to write meaningful programs in a MapReduce framework.
In Pro Hadoop, you will learn the ins and outs of MapReduce: how to structure a cluster, design and implement the Hadoop file system, and structure your first cloud-computing tasks using Hadoop. You will also learn how to let Hadoop take care of distributing and parallelizing your software; you just focus on the code, and Hadoop takes care of the rest.
If you’ve been tasked with the job of maintaining large and complex Hadoop clusters, or are about to be, Hadoop Operations is a must. You’ll learn the particulars of Hadoop operations, from planning, installing, and configuring the system to providing ongoing maintenance.

About the author

Aamir Majeed is Senior Solutions Engineer at TunaCode, Inc. He holds a degree in Avionics Engineering. His interest areas are anything and everything GPUs, from writing highly optimized, performance-oriented GPU code to experimenting with the latest tools and solutions. When not working, Aamir spends his time trekking snow-capped mountains. He can be reached at

Tags: Hadoop, Hadoop Distributed File System, HDFS, Java
