Appendix A. Creating a Hadoop Pseudo-Distributed Development Environment

To run the code in this book, you’ll need to set up a development environment. Hadoop developers usually test their scripts and code in a pseudo-distributed environment (also known as a single-node setup): a single machine, often a virtual machine, that runs all of the Hadoop daemons simultaneously.

These instructions will help you install a pseudo-distributed environment with Hadoop 2.5.0 on Ubuntu 14.04.
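
Once an environment like this is running, it can be useful to confirm that every daemon is actually up. The following is a minimal sketch, not part of the book's setup steps. It assumes the JDK's `jps` tool is on the PATH and that the daemons appear under the names a typical Hadoop 2.x single-node setup uses (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager). The function names are hypothetical, chosen for illustration.

```python
import subprocess

# Assumption: process names reported by `jps` on a typical Hadoop 2.x
# single-node setup. Adjust the set if your distribution differs.
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def running_daemons(jps_output):
    """Return the process names found in `jps` output ('<pid> <name>' per line)."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return a sorted list of expected daemons absent from the `jps` output."""
    return sorted(set(expected) - running_daemons(jps_output))


def main():
    try:
        result = subprocess.run(
            ["jps"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print("Could not run jps:", err)
        return
    missing = missing_daemons(result.stdout)
    if missing:
        print("Missing daemons: " + ", ".join(missing))
    else:
        print("All expected Hadoop daemons are running.")


if __name__ == "__main__":
    main()
```

Run the script as the same user that started Hadoop, because `jps` generally lists only the Java processes that user is allowed to see.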

Quick Start

There are a couple of options if you are not familiar with systems administration on Linux, or do not wish to work through the process of installing Hadoop yourself. We have provided a VMDK for you to use in the virtualization software of your choice (e.g., VirtualBox or VMware Fusion). Alternatively, both Hortonworks and Cloudera supply virtual machines that you can download.

To get up and running quickly, simply download the VM and run it in your favorite virtualization software. Be aware that if you use the Cloudera or Hortonworks distributions, the environment may be subtly different from the one we use. To get everything set up, either download the preconfigured machine or follow the steps described here.

If you are using the VMDK we supply, log in to the machine with the following username and password:

username: student
password: password

If you’re brave enough to set up the environment yourself, go ahead and move to the next section!

Setting Up Linux

Before you ...
