Chapter 11. Relational Data with Apache Hive

So far, the clusters established in the cloud using the instructions in this book have only been capable of running classic MapReduce jobs. Of course, the Hadoop ecosystem offers many other ways to work with large amounts of data, and one of the most attractive is viewing it as relational data that can be queried using Structured Query Language (SQL). For decades before the advent of Hadoop and similar cluster architectures, data analysts worked with large data sets in relational databases, and for many use cases that is still appropriate today. Hadoop components such as Apache Hive allow those with experience in relational databases to transition their skills over to the big data world.

As you might expect, a Hadoop cluster running on a cloud provider can support these components. What’s more, cloud providers offer features that the components can take advantage of implicitly, and some components have ways to use cloud provider features explicitly to enhance their capabilities.

This chapter starts with installing Hive on a cloud cluster. The instructions assume that you have a cluster set up in the configuration developed in Chapter 9 but, as usual, you should be able to adapt the instructions to your specific situation.

Planning for Hive in the Cloud

The most important pieces of Hive to consider are the Hive server (HiveServer2), a server process that accepts requests from other Hive clients, and the Hive ...
