Chapter 8. Cloud Computing and Virtualization

Most Hadoop clusters today run on “real iron”—that is, on small, Intel-based computers running some variant of the Linux operating system with directly attached storage. However, you might want to try this in a cloud or virtual environment. While virtualization usually comes with some degree of performance degradation, you may find it minimal for your task set or that it’s a worthwhile trade-off for the benefits of cloud computing; these benefits include low up-front costs and the ability to scale up (and down sometimes) as your dataset and analytic needs change.

By cloud computing, we’ll follow guidelines established by the National Institute of Standards and Technology (NIST), whose definition of cloud computing you’ll find here. A Hadoop cluster in the cloud will have:

  • On-demand self-service

  • Network access

  • Resource sharing

  • Rapid elasticity

  • Measured resource service

While these resource need not exist virtually, in practice, they usually do.

Virtualization means creating virtual, as opposed to real, computing entities. Frequently, the virtualized object is an operating system on which software or applications are overlaid, but storage and networks can also be virtualized. Lest you think that virtualization is a relatively new computing technology, in 1972 IBM released VM/370, in which the 370 mainframe could be divided into many small, single-user virtual machines. Currently, Amazon Web Services is likely ...

Get Field Guide to Hadoop now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.