Coming to Impala from an Apache Hadoop Background

If you are already experienced with the Apache Hadoop software stack and are adding Impala as another arrow in your quiver, you will find it interoperable on several levels.

Apache Hive

Apache Hive is the first generation of SQL-on-Hadoop technology, focused on batch processing with long-running jobs. Impala tables and Hive tables are highly interoperable, allowing you to switch into Hive to do a batch operation such as a data import, then switch back to Impala and do an interactive query on the same table. You might see HDFS paths such as /user/hive/warehouse in Impala examples, because for simplicity we sometimes use this historical default path for both Impala and Hive databases.
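As a sketch of that workflow (the table name, columns, and HDFS path here are illustrative, not from the text): you might run the batch steps in the Hive shell, then issue an `INVALIDATE METADATA` statement in impala-shell so that Impala picks up objects created outside of it. Depending on your Impala version, a bare `INVALIDATE METADATA;` with no table name may be required instead.

```sql
-- In the Hive shell: create a table and run a batch data import.
-- Table name, columns, and path are hypothetical.
CREATE TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP);
LOAD DATA INPATH '/incoming/web_logs' INTO TABLE web_logs;

-- In impala-shell: pick up metadata for the table created through Hive,
-- then query the same data interactively.
INVALIDATE METADATA web_logs;
SELECT COUNT(*) FROM web_logs;
```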

For users who already use Hive to run SQL batch jobs on Hadoop, the Impala SQL dialect is highly compatible with HiveQL. The main limitations involve nested data types, user-defined functions (UDFs), and custom file formats. These are not permanent limitations; they are being addressed, in priority order, on the Impala roadmap.

If you are an experienced Hive user, one thing to unlearn is the notion of a SQL query as a long-running, heavyweight job. With Impala, you typically issue the query and see the results in the same interactive session of the Impala shell or a business intelligence tool. For example, when you run even a simple query such as `SELECT COUNT(*)` in the Hive shell, it prints many lines of status output showing mapper and reducer processes, and even ...
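By contrast, the same query in Impala comes back directly, with no job-launch ceremony. For a quick check you can even run it non-interactively from the operating system prompt with impala-shell's `-q` option, which executes a single query and exits (reusing the hypothetical `web_logs` table from the earlier sketch):

```
$ impala-shell -q 'SELECT COUNT(*) FROM web_logs'
```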
