
Apache Hadoop is an open source framework for the distributed processing of large amounts of data across a cluster. It relies on the MapReduce paradigm, which breaks complex tasks into smaller tasks that can be executed concurrently across multiple machines. However, writing MapReduce jobs on top of Hadoop is not for everyone, since it requires learning a new framework and a new programming paradigm altogether. What is needed is an easy-to-use abstraction on top of Hadoop that lets people unfamiliar with it tap into its capabilities just as easily.

Apache Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on top of Hadoop. Hive achieves this by converting queries written in HiveQL into MapReduce jobs that are then run across the Hadoop cluster to fetch the desired results. As such, Hive is best suited for batch processing of large amounts of data (such as in data warehousing), but it is not well suited to routine transactional workloads because of its slow response times (it needs to fetch data from across a cluster). Learn more about HiveQL in Chapter 4, HiveQL: Data Definition, of Programming Hive.

A common task for which Hive is used is processing web server logs. These logs have a regular structure and can therefore be readily converted into a format that Hive can understand and process. Assuming that Hadoop and Hive are properly configured (the instructions for doing so are beyond the scope of this article), we need to perform three tasks: create a schema, load the data, and execute a query.

To create a schema in Hive for processing the logs, we need to create a table with fields for the date, log level and message. To do this, open the Hive console by typing:
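Assuming the hive executable is on your PATH, a typical invocation is simply:

    hive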

Once the Hive console is open, you need to run the query that creates the table:
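A minimal sketch of such a table follows; the table and column names (logs, logdate, loglevel, message) are illustrative assumptions rather than anything mandated by Hive:

    -- one row per log entry: date, log level, and message text
    CREATE TABLE logs (
      logdate  STRING,
      loglevel STRING,
      message  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' ';  -- a single space separates the fields in the log file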

The ‘ROW FORMAT…’ part of the query tells Hive that the fields in the log file are separated by whitespace.

Next, you need to load the log file into the newly created table. To do this, use the LOAD query:
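Assuming the file lives on the local filesystem of the machine running the Hive console, and using the logs table assumed above, the statement looks like this:

    -- copy the local file into the table's storage in HDFS
    LOAD DATA LOCAL INPATH 'sample.log' INTO TABLE logs;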

Note that “sample.log” is the name of the log file. Once the data is loaded into the table, you can run queries against it. For example, let’s extract all error messages from the logs (the messages of rows whose log level is “error”):
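With the schema assumed above, a sketch of that query is:

    -- return the message text of every row logged at the error level
    SELECT message FROM logs WHERE loglevel = 'error';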

This will output the desired error messages. If the underlying Hadoop cluster comprises multiple machines, several MapReduce tasks may run concurrently to execute this query. This is exactly how Hive processes large amounts of data spread across multiple nodes.

Similarly, to find out how many log entries there are for each log level, you can write a query such as:
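Again using the assumed schema, the aggregation might look like:

    -- count the log entries per log level
    SELECT loglevel, COUNT(*) FROM logs GROUP BY loglevel;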

This will produce the desired information. As shown in this post, Hive enables you to use a familiar SQL-like syntax to process large amounts of data that may be distributed across multiple nodes. This power and flexibility allow Hive to be used in a variety of ways to process humongous amounts of information, limited only by our imagination and creativity.

You can find out more on using Hive in the books listed below.

Safari Books Online has the content you need

Check out these Hive books available from Safari Books Online:

Programming Hive introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem.
Hadoop Real-World Solutions Cookbook helps developers become more comfortable and proficient with solving problems in the Hadoop space. You will become more familiar with a wide variety of Hadoop-related tools and best practices for implementation. The book teaches you how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia. It provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, then solve, technical challenges, and can be completed in any order.

About the author

Shaneeb Kamran is a Computer Engineer from one of the leading universities of Pakistan. His programming journey started at the age of 12, and ever since he has dabbled in every new and shiny software technology he could get his hands on. He is currently involved in a startup that is working on cloud computing products.

Tags: Apache Hadoop, Apache Hive, HiveQL, MapReduce
