Getting Started with Impala

Book Description

Learn how to write, tune, and port SQL queries and other statements for a Big Data environment using Impala, the massively parallel processing SQL query engine for Apache Hadoop. The best practices in this practical guide help you design database schemas that interoperate with other Hadoop components, are convenient for administrators to manage and monitor, and accommodate future expansion in data size and evolution of software capabilities. Ideal for database developers and business analysts, the latest revision covers analytic functions, complex types, incremental statistics, subqueries, and Impala's submission to the Apache Incubator.

Table of Contents

  1. Introduction
    1. Who Is This Book For?
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  2. 1. Why Impala?
    1. Impala’s Place in the Big Data Ecosystem
    2. Flexibility for Your Big Data Workflow
    3. High-Performance Analytics
    4. Exploratory Business Intelligence
  3. 2. Getting Up and Running with Impala
    1. Installation
    2. Connecting to Impala
    3. Your First Impala Queries
  4. 3. Impala for the Database Developer
    1. The SQL Language
      1. Standard SQL
      2. Limited DML
      3. No Transactions
      4. Numbers
      5. Recent Additions
    2. Big Data Considerations
      1. Billions and Billions of Rows
      2. HDFS Block Size
      3. Parquet Files: The Biggest Blocks of All
    3. How Impala Is Like a Data Warehouse
    4. Physical and Logical Data Layouts
      1. The HDFS Storage Model
    5. Distributed Queries
    6. Normalized and Denormalized Data
    7. File Formats
      1. Text File Format
      2. Parquet File Format
      3. Getting File Format Information
      4. Switching File Formats
    8. Aggregation
  5. 4. Common Developer Tasks for Impala
    1. Getting Data into an Impala Table
      1. INSERT Statement
      2. LOAD DATA Statement
      3. External Tables
      4. Figuring Out Where Impala Data Resides
      5. Manually Loading Data Files into HDFS
      6. Hive
      7. Sqoop
      8. Kite
    2. Porting SQL Code to Impala
    3. Using Impala from a JDBC or ODBC Application
      1. JDBC
      2. ODBC
    4. Using Impala with a Scripting Language
      1. Running Impala SQL Statements from Scripts
      2. Variable Substitution
      3. Saving Query Results
      4. The impyla Package for Python Scripting
    5. Optimizing Impala Performance
      1. Optimizing Query Performance
      2. Optimizing Memory Usage
      3. Working with Partitioned Tables
      4. Finding the Ideal Granularity
      5. Inserting into Partitioned Tables
      6. Adding and Loading New Partitions
    6. Writing User-Defined Functions
    7. Collaborating with Your Administrators
      1. Designing for Security
      2. Understanding Resource Management
      3. Helping to Plan for Performance (Stats, HDFS Caching)
      4. Understanding Cluster Topology
      5. Always Close Your Queries
  6. 5. Tutorials and Deep Dives
    1. Tutorial: From Unix Data File to Impala Table
    2. Tutorial: Queries Without a Table
    3. Tutorial: The Journey of a Billion Rows
      1. Generating a Billion Rows of CSV Data
      2. Normalizing the Original Data
      3. Converting to Parquet Format
      4. Making a Partitioned Table
      5. Next Steps
    4. Deep Dive: Joins and the Role of Statistics
      1. Creating a Million-Row Table to Join With
      2. Loading Data and Computing Stats
      3. Reviewing the EXPLAIN Plan
      4. Trying a Real Query
      5. The Story So Far
      6. Final Join Query with 1B x 1M Rows
    5. Anti-Pattern: A Million Little Pieces
    6. Tutorial: Across the Fourth Dimension
      1. TIMESTAMP Data Type
      2. Format Strings for Dates and Times
      3. Working with Individual Date and Time Fields
      4. Date and Time Arithmetic
      5. Let’s Solve the Y2K Problem
      6. More Fun with Dates
    7. Tutorial: Verbose and Quiet impala-shell Output
    8. Tutorial: When Schemas Evolve
      1. Numbers Versus Strings
      2. Dealing with Out-of-Range Integers
    9. Tutorial: Levels of Abstraction
      1. String Formatting
      2. Temperature Conversion
  7. Colophon
  8. Copyright