Learning Cloudera Impala

Book Description

Everything you need to know about Cloudera Impala is here – from installation onwards. Your raw data processing in Hadoop takes on new dimensions of speed and volume with this hands-on tutorial.

  • Step-by-step guidance to get you started with Impala on your Hadoop cluster

  • Manipulate your data rapidly by writing proper SQL statements

  • Explore the concepts of Impala security, administration, and troubleshooting in detail to maintain your Impala cluster

In Detail

    If you have always wanted to crunch billions of rows of raw data on Hadoop in a couple of seconds, then Cloudera Impala is the number one choice for you. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. Besides sharing the same unified storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for both batch-oriented and real-time queries.

    In this practical, example-oriented book, you will learn everything you need to know about Cloudera Impala so that you can get started on your very own project. The book covers Cloudera Impala from installation, administration, and query processing through to connectivity with third-party applications. With this book in hand, you will find yourself empowered to play with your data in Hadoop.

    As a reader of this book, you will learn about the origin of Impala and the technology behind it that allows it to run on thousands of machines. You will learn how to install, run, manage, and troubleshoot Impala in your own Hadoop cluster using the step-by-step guidance provided in the book. The book covers tenets of data processing such as loading data stored in Hadoop into Impala tables and querying data using Impala SQL statements, all with various code illustrations and a real-world example.

    The book is written to get you started with Impala by providing rich information so you can understand what Impala is, what it can do for you, and finally how you can use it to achieve your objective.

    Table of Contents

    1. Learning Cloudera Impala
      1. Table of Contents
      2. Learning Cloudera Impala
      3. Credits
      4. About the Author
      5. About the Reviewer
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Errata
          2. Piracy
          3. Questions
      8. 1. Getting Started with Impala
        1. Impala requirements
          1. Dependency on Hive for Impala
          2. Dependency on Java for Impala
          3. Hardware dependency
          4. Networking requirements
          5. User account requirements
        2. Installing Impala
          1. Installing Impala with Cloudera Manager
          2. Installing Impala without Cloudera Manager
        3. Configuring Impala after installation
        4. Starting Impala
        5. Stopping Impala
        6. Restarting Impala
        7. Upgrading Impala
          1. Upgrading Impala using parcels with Cloudera Manager
          2. Upgrading Impala using packages with Cloudera Manager
          3. Upgrading Impala without Cloudera Manager
        8. Impala core components
          1. Impala daemon
          2. Impala statestore
          3. Impala metadata and metastore
          4. The Impala programming interface
        9. The Impala execution architecture
          1. Working with Apache Hive
          2. Working with HDFS
          3. Working with HBase
        10. Impala security
          1. Authorization
            1. The SELECT privilege
            2. The INSERT privilege
            3. The ALL privilege
          2. Authentication through Kerberos
          3. Auditing
        11. Impala security guidelines for a higher level of protection
        12. Summary
      9. 2. The Impala Shell Commands and Interface
        1. Using Cloudera Manager for Impala
        2. Launching Impala shell
        3. Connecting impala-shell to the remotely located impalad daemon
        4. Impala-shell command-line options with brief explanations
          1. General command-line options
          2. Connection-specific options
          3. Query-specific options
          4. Secure connectivity-specific options
        5. Impala-shell command reference
          1. General commands
          2. Query-specific commands
          3. Table- and database-specific commands
        6. Summary
      10. 3. The Impala Query Language and Built-in Functions
        1. Impala SQL language statements
          1. Database-specific statements
            1. The CREATE DATABASE statement
            2. The DROP DATABASE statement
            3. The SHOW DATABASES statement
            4. Using database-specific query sentence in an example
          2. Table-specific statements
            1. The CREATE TABLE statement
            2. The CREATE EXTERNAL TABLE statement
            3. The ALTER TABLE statement
            4. The DROP TABLE statement
            5. The SHOW TABLES statement
            6. The DESCRIBE statement
            7. The INSERT statement
            8. The SELECT statement
            9. Internal and external tables
        2. Data types
        3. Operators
        4. Functions
        5. Clauses
        6. Query-specific SQL statements in Impala
        7. Defining VIEWS in Impala
        8. Loading data from HDFS using the LOAD DATA statement
        9. Comments in Impala SQL statements
        10. Built-in function support in Impala
          1. The type conversion function
        11. Unsupported SQL statements in Impala
        12. Summary
      11. 4. Impala Walkthrough with an Example
        1. Creating an example scenario
          1. Example dataset one – automobiles (automobiles.txt)
          2. Example dataset two – motorcycles (motorcycles.txt)
          3. Data and schema considerations
        2. Commands for loading data into Impala tables
          1. HDFS specific commands
          2. Loading data into the Impala table from HDFS
        3. Launching the Impala shell
          1. Database and table specific commands
        4. SQL queries against the example database
        5. SQL join operation with the example database
          1. Using various types of SQL statements
        6. Summary
      12. 5. Impala Administration and Performance Improvements
        1. Impala administration
          1. Administration with Cloudera Manager
          2. The Impala statestore UI
        2. Impala High Availability
        3. Single point of failure in Impala
        4. Improving performance
          1. Enabling block location tracking
          2. Enabling native checksumming
          3. Enabling Impala to perform short-circuit read on DataNode
          4. Adding more Impala nodes to achieve higher performance
          5. Optimizing memory usage during query execution
          6. Query execution dependency on memory
          7. Using resource isolation
        5. Testing query performance
          1. Benchmarking queries
          2. Verifying data locality
        6. Choosing an appropriate file format and compression type for better performance
        7. Fine-tuning Impala performance
          1. Partitioning
          2. Join queries
          3. Table and column statistics
        8. Summary
      13. 6. Troubleshooting Impala
        1. Troubleshooting various problems
          1. Impala configuration-related issues
            1. The block locality issue
            2. Native checksumming issues
          2. Various connectivity issues
            1. Connectivity between Impala shell and Impala daemon
            2. ODBC/JDBC-specific connectivity issues
          3. Query-specific issues
          4. Issues specific to User Access Control (UAC)
          5. Platform-specific issues
            1. Impala port mapping issues
            2. HDFS-specific problems
          6. Input file format-specific issues
        2. Using Cloudera Manager to troubleshoot problems
          1. Impala log analysis using Cloudera Manager
          2. Using the Impala web interface for monitoring and troubleshooting
          3. Using the Impala statestore web interface
          4. Using the Impala Maintenance Mode
          5. Checking Impala events
        3. Summary
      14. 7. Advanced Impala Concepts
        1. Impala and MapReduce
        2. Impala and Hive
          1. Key differences between Impala and Hive
        3. Impala and Extract, Transform, Load (ETL)
        4. Why Impala is faster than Hive in query processing
        5. Impala processing strategy
        6. Impala and HBase
          1. Using Impala to query HBase tables
        7. File formats and compression types supported in Impala
        8. Processing different file and compression types in Impala
          1. The regular text file format with Impala tables
          2. The Avro file format with Impala tables
          3. The RCFile file format with Impala tables
          4. The SequenceFile file format with Impala tables
          5. The Parquet file format with Impala tables
        9. The unsupported features in Impala
        10. Impala resources
        11. Summary
      15. A. Technology Behind Impala and Integration with Third-party Applications
        1. Technology behind Impala
        2. Data visualization using Impala
          1. Tableau and Impala
          2. Microsoft Excel and Impala
          3. Microstrategy and Impala
          4. Zoomdata and Impala
        3. Real-time query with Impala on Hadoop
          1. Real-time query subscriptions with Impala
        4. What is new in Impala 1.2.0 (Beta)
      16. Index