Foreword
Apache Hadoop as software is a simple framework that allows for distributed processing of data across many machines. As a technology, Hadoop and the surrounding ecosystem have changed the way we think about data processing at scale. No longer does our data need to fit in the memory of a single machine, nor are we limited by the I/O of a single machine’s disks. These are powerful tenets.
So too has cloud computing changed our way of thinking. While the notion of colocating machines in a faraway data center isn’t new, allowing users to provision machines on-demand is, and it’s changed everything. No longer are developers or architects limited by the processing power installed in on-premise data centers, nor do we need to host small web farms under our desks or in that old storage closet. The pay-as-you-go model has been a boon for ad hoc testing and proof-of-concept efforts, eliminating time spent in purchasing, installation, and setup.
Both Hadoop and cloud computing represent major paradigm shifts, not just in enterprise computing, but affecting many other industries. Much has been written about how these technologies have been used to make advances in retail, public sector, manufacturing, energy, and healthcare, just to name a few. Entire businesses have sprung up as a result, dedicated to the care, feeding, integration, and optimization of these new systems.
It was inevitable that Hadoop workloads would be run on cloud computing providers’ infrastructure. The cloud offers incredible flexibility to users, often complementing on-premise solutions, enabling them to use Hadoop in ways simply not possible previously.
Ever the conscientious software engineer, author Bill Havanki has a strong penchant for documenting. He’s able to break down complex concepts and explain them in simple terms, without making you feel foolish. Bill writes the kind of documentation that you actually enjoy, the kind you find yourself reading long after you’ve discovered the solution to your original problem.
Hadoop and cloud computing are powerful and valuable tools, but aren’t simple technologies by any means. This stuff is hard. Both have a multitude of configuration options and it’s very easy to become overwhelmed. All major cloud providers offer similar services like virtual machines, network attached storage, relational databases, and object storage (all of which can be utilized by Hadoop), but each provider uses different naming conventions and has different capabilities and limitations. For example, some providers require that resource provisioning occurs in a specific order. Some providers create isolated virtual networks for your machines automatically while others require manual creation and assignment. It can be confusing. Whether you’re working with Hadoop for the first time or a veteran installing on a cloud provider you’ve never used before, knowing about the specifics of each environment will save you a lot of time and pain.
Cloud computing appeals to a dizzying array of users running a wide variety of workloads. Most cloud providers’ official documentation isn’t specific to any particular application (such as Hadoop). Using Hadoop on cloud infrastructure introduces additional architectural issues that need to be considered and addressed. It helps to have a guide to demystify the options specific to Hadoop deployments and to ease you through the setup process on a variety of cloud providers, step by step, providing tips and best practices along the way. This book does precisely that, in a way that I wish had been available when I started working in the cloud computing world.
Whether code or expository prose, Bill’s creations are approachable, sensible, and easy to consume. With this book and its author, you’re in capable hands for your first foray into moving Hadoop to the Cloud.