
It all started back in 2004 when Google published a paper titled MapReduce: Simplified Data Processing on Large Clusters. This inspired Doug Cutting to develop Hadoop, an open-source implementation of MapReduce along with its underlying file system, the Hadoop Distributed File System (HDFS). Hadoop allowed users to perform batch processing of data on commodity hardware, quickly gained adoption, and enabled users to process massive amounts of data by simply adding more nodes to the cluster.

However, analyzing data in Hadoop meant writing MapReduce jobs. Not only did this mean hiring developers to write those workloads, it also shut out the large community of SQL-savvy analysts from the data stored in Hadoop clusters. The resulting desire for a SQL-like interface for querying and analyzing data stored in Hadoop clusters became the motivation for Apache Hive. Apache Hive is a data warehousing system that allows users to query data stored in Hadoop-compatible file systems using SQL-like syntax. Hive queries are written in a SQL-like language called Hive Query Language (HQL), which is compiled into MapReduce jobs that run on the Hadoop cluster.

Hive natively supports reading and writing data from and to HDFS, and it supports various compression methods like gzip and snappy for data stored there. For accessing data not stored in HDFS, Hive provides a pluggable StorageHandler interface. One of the most commonly used storage handlers is the HBase storage handler, which lets you query data stored in HBase from within Hive. Storage handlers have also been written for data stored in Hypertable, Cassandra, JDBC sources, MongoDB, and Google Spreadsheets. Some of these storage handlers are in different stages of development, so please check the status of an individual storage handler before using it.
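To give a flavor of the storage handler interface, here is a sketch of how a Hive table backed by HBase is typically declared; the table and column names below are made up, so consult the Hive HBase integration wiki page for the details:

    CREATE TABLE hbase_table (key INT, value STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    TBLPROPERTIES ("hbase.table.name" = "my_hbase_table");

The hbase.columns.mapping property ties each Hive column to the HBase row key or to a column family:qualifier pair.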

Hive supports not only primitive data types like int, float, double, and string, but also complex types like arrays, maps, structs, and unions. Users can connect to Hive from various programming languages via its JDBC driver.
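For instance, a table mixing primitive and complex types can be declared like this (a made-up schema, purely to show the syntax):

    CREATE TABLE complex_example (
      id INT,
      scores ARRAY<FLOAT>,
      properties MAP<STRING, STRING>,
      address STRUCT<street:STRING, city:STRING>
    );

Individual elements are then addressed in queries as scores[0], properties['key'], and address.city.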

Now, let’s take a look at our first Hive query and run some Hive queries on a small dataset. The following discussion assumes that you have a working Hadoop cluster set up with Hive installed on it. To learn more about how to install Hadoop and Hive, follow the directions on the Apache Hadoop and Hive wiki pages. It is also recommended that you set up MySQL as the Hive metastore (instead of the default embedded Derby database). Instructions on how to configure Hive to use a MySQL metastore are available on the Hive wiki.
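As a rough sketch, pointing Hive at a MySQL metastore mostly comes down to setting the JDO connection properties in hive-site.xml; the connection URL, user name, and password below are placeholders:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>

You will also need the MySQL JDBC driver jar on Hive’s classpath, typically by dropping it into Hive’s lib directory.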

Let’s create a file called data.csv and place it in your home directory. This will be our dataset for this example.
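The file might contain a handful of name and age pairs like these; the values are purely illustrative, but the age column is what the queries below will aggregate over:

    Alice,23
    Bob,31
    Charlie,35
    David,19
    Eve,25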

Now create a table over this dataset. Start the Hive command line by typing hive on your bash command line:
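Once you are at the hive> prompt, a table definition along these lines does the job; the name column matches the illustrative dataset above, and age is the column the later queries aggregate over:

    CREATE TABLE table1 (
      name STRING,
      age INT
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';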

This command creates a table called table1 with the specified columns. The rest of the create statement specifies that the fields in the underlying file are separated by commas. If this is not specified, Hive uses ^A as the field delimiter by default.

This table is presently empty, but you can verify that it was created by doing the following:
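For example, list the tables and inspect the new table’s schema:

    SHOW TABLES;
    DESCRIBE table1;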

Now let’s load some data into this table:
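Assuming data.csv sits in your home directory, a statement like the following loads it; adjust the path for your user name:

    LOAD DATA LOCAL INPATH '/home/your_user/data.csv' OVERWRITE INTO TABLE table1;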

This command copies the file data.csv to an appropriate location in HDFS so that Hive can read it. If, while running the above command, you see a warning like rmr: DEPRECATED: Please use ‘rm -r’ instead., you can ignore it. It appears because we are overwriting the table, and Hive asks Hadoop to delete the table’s existing data from HDFS using a deprecated command. The delete and overwrite still complete successfully, albeit with the deprecation warning.

You can verify that the contents were loaded by running the following:
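A plain select over the table is enough:

    SELECT * FROM table1;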

The above command does a simple read from HDFS and lists out the contents of the table. No MapReduce job is run in this case.

Now, let’s do something more interesting with our dataset:
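For example, the following computes aggregate statistics over the age column:

    SELECT avg(age), stddev_pop(age)
    FROM table1
    WHERE age > 21;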

The above query computes the average age and its standard deviation by calling the avg and stddev_pop User-Defined Functions (UDFs) on the age column, considering only people whose age is greater than 21.
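With the illustrative dataset above, four rows (ages 23, 31, 35, and 25) satisfy the predicate, so the result looks like this:

    28.5    4.769696007

The exact numbers will, of course, depend on your data.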

For a list of all UDFs currently available in Hive, please refer to the UDF wiki page. Hive provides a pluggable interface for adding UDFs, making it easy for users to write their own. We will talk more about this in another blog post in the coming days.

The above example showed how you can run a simple query on Hive. In the next blog post (Tip: Partitioning Data in Hive), we will learn about partitioning – one of the best practices for storing data in HDFS for use by Hive. Then, in the subsequent blog post (Tip: Using Joins in Hive), we will learn how to perform joins in Hive, so stay tuned!

Safari Books Online has the content you need

Below are some Hive books to help you develop applications, or you can check out all of the Hive books and training videos available from Safari Books Online. You can browse the content in preview mode or you can gain access to more information with a free trial or subscription to Safari Books Online.

Programming Hive introduces you to Apache Hive, Hadoop’s data warehouse infrastructure. You’ll quickly learn how to use Hive’s SQL dialect—HiveQL—to summarize, query, and analyze large datasets stored in Hadoop’s distributed filesystem. This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem.
If your organization is looking for a storage solution to accommodate a virtually endless amount of data, this book will show you how Apache HBase can fulfill your needs. As the open source implementation of Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. HBase: The Definitive Guide provides the details you require to evaluate this high-performance, non-relational database, or put it into practice right away.
Ready to unlock the power of your data? With Hadoop: The Definitive Guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. You will also find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.


About this author

Mark Grover is a contributor to the Apache Hive project and an active respondent on Hive’s mailing list and IRC channel. He is a section author of O’Reilly’s book on Hive, Programming Hive. He works as a Software Developer at Cloudera and is also a contributor to the Apache Bigtop project.

Tags: Big Data, Hadoop, HDFS, Hive, HQL, MapReduce, scalable databases

