
The example in our initial blog post, Introduction to Hive, ran a simple query over our dataset. However, the query read the entire dataset even though we had a where clause filter on the age column. In most MapReduce jobs over large amounts of data, the bottleneck is I/O on the data. Consequently, if we can reduce the I/O needed by our MapReduce job, our queries will become faster. A very common way to do so is data partitioning. In Hive's implementation of partitioning, the data within a table is split across multiple partitions. Each partition corresponds to particular values of one or more partition columns and is stored as a sub-directory within the table's directory on HDFS. When the table is queried, where applicable, only the required partitions are read, thereby reducing the I/O required by the query.

For your reference, the query we ran in the first blog post was:
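The original listing was not preserved here; a sketch of that kind of query might look like the following (the table name user_info and the column name are illustrative assumptions — only the age and entry_date predicates are confirmed by the discussion below):

```sql
-- Filters on both age and entry_date; on a non-partitioned table,
-- both predicates are evaluated in the mappers over the full dataset.
SELECT name
FROM user_info
WHERE age > 30
  AND entry_date = '2012-11-12';
```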

With the above query, if the table had a million records, all of them would be fed to the MapReduce job.

The create table statement for the above non-partitioned table looked like:
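The original statement did not survive extraction; a plausible sketch, with illustrative table, column, and format choices, might be:

```sql
-- Non-partitioned table: entry_date is an ordinary data column,
-- so every query scans all of the table's files on HDFS.
CREATE TABLE user_info (
  name       STRING,
  age        INT,
  entry_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
```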

Now, let’s convert the same example to use a partitioned table, partitioned by the entry_date column. The create table statement in this case would look like:
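The original listing is missing; under the same illustrative names as above, the partitioned version might look like:

```sql
-- entry_date moves out of the column list and into PARTITIONED BY.
-- Each distinct entry_date value becomes its own HDFS sub-directory.
CREATE TABLE user_info (
  name STRING,
  age  INT
)
PARTITIONED BY (entry_date STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
```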

Note that we removed the entry_date column from the columns list but added it in the PARTITIONED BY clause, thereby making it a virtual column.

Let's add partitions for entry_date='2012-11-11' and entry_date='2012-11-12':
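The original statements are not shown; assuming the illustrative user_info table above, the partitions could be added like so:

```sql
-- Each ADD PARTITION creates a sub-directory such as
-- .../user_info/entry_date=2012-11-11 under the table's HDFS directory.
ALTER TABLE user_info ADD PARTITION (entry_date = '2012-11-11');
ALTER TABLE user_info ADD PARTITION (entry_date = '2012-11-12');
```

Data can then be loaded into a specific partition, for example with LOAD DATA ... INTO TABLE user_info PARTITION (entry_date = '2012-11-12').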

Now, let’s run the same query as before on the partitioned table:
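The listing is missing; the query text itself is unchanged from the non-partitioned case (illustrative names as before):

```sql
-- Same query as before. Hive notices that entry_date is a partition
-- column and prunes to the matching partition before the job runs.
SELECT name
FROM user_info
WHERE age > 30
  AND entry_date = '2012-11-12';
```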

The result is the same as before:

This time, however, it only queried one partition of the table (corresponding to entry_date='2012-11-12') instead of the entire table. We can confirm that by looking at the explain plans of the queries.
Here is the explain plan for the query using the non-partitioned table.
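The plan output itself was lost in extraction, but it can be reproduced by prefixing the query with EXPLAIN (illustrative names as before):

```sql
-- On the non-partitioned table, the plan shows predicates on both
-- age and entry_date being applied in the mappers.
EXPLAIN
SELECT name
FROM user_info
WHERE age > 30
  AND entry_date = '2012-11-12';
```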

In the above explain plan, two predicates are being applied on the data in the mappers of the MapReduce job – one on age and another on entry_date.

Here is the explain plan for the query on the partitioned table:

In the above explain plan, only one predicate is being applied in the mappers of the MapReduce job – the one on age. This is because the predicate on entry_date has been used to prune the partitions, so the mapper only needs to apply the remaining predicate on age.

Having looked at how we can partition our data in Hive, in the next blog post (Using Joins in Hive), we will look at how we can join data from various Hive tables and understand how Hive joins work.

Safari Books Online has the content you need

Below are some Hive books to help you develop applications, or you can check out all of the Hive books and training videos available from Safari Books Online. You can browse the content in preview mode or you can gain access to more information with a free trial or subscription to Safari Books Online.

Programming Hive introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem.
If your organization is looking for a storage solution to accommodate a virtually endless amount of data, this book will show you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. HBase: The Definitive Guide provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
Ready to unlock the power of your data? With Hadoop: The Definitive Guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You will also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.

Start your FREE 10-day trial to Safari Books Online

About this author

Mark Grover is a contributor to the Apache Hive project and an active respondent on Hive's mailing list and IRC channel. He is a section author of O'Reilly's book on Hive, Programming Hive. He works as a Software Developer at Cloudera and is also a contributor to the Apache Bigtop project.

Tags: Big Data, databases, HBase, Hive, MapReduce, partitioning data, scalable databases
