Programming Pig, 2nd Edition

Book Description

For many organizations, Hadoop is the first step for dealing with massive amounts of data. The next step? Processing and analyzing datasets with the Apache Pig scripting platform. With Pig, you can batch-process data without having to create a full-fledged application, making it easy to experiment with new datasets. Updated with use cases and programming examples, this second edition of Programming Pig is the ideal learning tool for new and experienced users alike.

Table of Contents

  1. Preface
    1. Who Should Read This Book
    2. What’s New in This Edition
    3. Conventions Used in This Book
    4. Code Examples in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowledgments from the First Edition (Alan Gates)
    9. Second Edition Acknowledgments (Alan Gates and Daniel Dai)
  2. 1. What Is Pig?
    1. Pig Latin, a Parallel Data Flow Language
      1. Comparing Query and Data Flow Languages
    2. Pig on Hadoop
      1. MapReduce’s “Hello World”
      2. How Pig Differs from MapReduce
    3. What Is Pig Useful For?
    4. The Pig Philosophy
    5. Pig’s History
  3. 2. Installing and Running Pig
    1. Downloading and Installing Pig
      1. Downloading the Pig Package from Apache
      2. Installation and Setup
      3. Downloading Pig Artifacts from Maven
      4. Downloading the Source
      5. Downloading Pig from Distributions
    2. Running Pig
      1. Running Pig Locally on Your Machine
      2. Running Pig on Your Hadoop Cluster
      3. Running Pig in the Cloud
      4. Command-Line and Configuration Options
      5. Return Codes
    3. Grunt
      1. Entering Pig Latin Scripts in Grunt
      2. HDFS Commands in Grunt
      3. Controlling Pig from Grunt
      4. Running External Commands
      5. Others
  4. 3. Pig’s Data Model
    1. Types
      1. Scalar Types
      2. Complex Types
      3. Nulls
    2. Schemas
      1. Casts
  5. 4. Introduction to Pig Latin
    1. Preliminary Matters
      1. Case Sensitivity
      2. Comments
    2. Input and Output
      1. load
      2. store
      3. dump
    3. Relational Operations
      1. foreach
      2. filter
      3. group
      4. order by
      5. distinct
      6. join
      7. limit
      8. sample
      9. parallel
    4. User-Defined Functions
      1. Registering Java UDFs
      2. Registering UDFs in Scripting Languages
      3. define and UDFs
      4. Calling Static Java Functions
      5. Calling Hive UDFs
  6. 5. Advanced Pig Latin
    1. Advanced Relational Operations
      1. Advanced Features of foreach
      2. Casting a Relation to a Scalar
      3. Using Different Join Implementations
      4. cogroup
      5. union
      6. cross
      7. More on Nested foreach
      8. rank
      9. cube
      10. assert
    2. Integrating Pig with Executables and Native Jobs
      1. stream
      2. native
    3. split and Nonlinear Data Flows
    4. Controlling Execution
      1. set
      2. Setting the Partitioner
    5. Pig Latin Preprocessor
      1. Parameter Substitution
      2. Macros
      3. Including Other Pig Latin Scripts
  7. 6. Developing and Testing Pig Latin Scripts
    1. Development Tools
      1. Syntax Highlighting and Checking
      2. describe
      3. explain
      4. illustrate
      5. Pig Statistics
      6. Job Status
      7. Debugging Tips
    2. Testing Your Scripts with PigUnit
  8. 7. Making Pig Fly
    1. Writing Your Scripts to Perform Well
      1. Filter Early and Often
      2. Project Early and Often
      3. Set Up Your Joins Properly
      4. Use Multiquery When Possible
      5. Choose the Right Data Type
      6. Select the Right Level of Parallelism
    2. Writing Your UDFs to Perform
    3. Tuning Pig and Hadoop for Your Job
    4. Using Compression in Intermediate Results
    5. Data Layout Optimization
    6. Map-Side Aggregation
    7. The JAR Cache
    8. Processing Small Jobs Locally
    9. Bloom Filters
    10. Schema Tuple Optimization
    11. Dealing with Failures
  9. 8. Embedding Pig
    1. Embedding Pig Latin in Scripting Languages
      1. Compiling
      2. Binding
      3. Running
      4. Utility Methods
    2. Using the Pig Java APIs
      1. PigServer
      2. PigRunner
  10. 9. Writing Evaluation and Filter Functions
    1. Writing an Evaluation Function in Java
      1. Where Your UDF Will Run
      2. Evaluation Function Basics
      3. Input and Output Schemas
      4. Error Handling and Progress Reporting
      5. Constructors and Passing Data from Frontend to Backend
      6. Overloading UDFs
      7. Variable-Length Input Schema
      8. Memory Issues in Eval Funcs
      9. Compile-Time Evaluation
      10. Shipping JARs Automatically
    2. The Algebraic Interface
    3. The Accumulator Interface
    4. Writing Filter Functions
    5. Writing Evaluation Functions in Scripting Languages
      1. Jython UDFs
      2. JavaScript UDFs
      3. JRuby UDFs
      4. Groovy UDFs
      5. Streaming Python UDFs
      6. Comparing Scripting Language UDF Features
  11. 10. Writing Load and Store Functions
    1. Load Functions
      1. Frontend Planning Functions
      2. Passing Information from the Frontend to the Backend
      3. Backend Data Reading
      4. Additional Load Function Interfaces
    2. Store Functions
      1. Store Function Frontend Planning
      2. Store Functions and UDFContext
      3. Writing Data
      4. Failure Cleanup
      5. Storing Metadata
    3. Shipping JARs Automatically
    4. Handling Bad Records
  12. 11. Pig on Tez
    1. What Is Tez?
    2. Running Pig on Tez
    3. Potential Differences When Running on Tez
      1. UDFs
      2. Using PigRunner
      3. Testing and Debugging
    4. Pig on Tez Internals
      1. Multiple Backends in Pig
      2. The Tez Optimizer
      3. Operators and Implementation
      4. Automatic Parallelism
  13. 12. Pig and Other Members of the Hadoop Community
    1. Pig and Hive
      1. HCatalog
      2. WebHCat
    2. Cascading
    3. Spark
    4. NoSQL Databases
      1. HBase
      2. Accumulo
      3. Cassandra
    5. DataFu
    6. Oozie
  14. 13. Use Cases and Programming Examples
    1. Sparse Tuples
    2. k-Means
    3. intersect and except
    4. Pig at Yahoo!
      1. Apache Pig Use Cases at Yahoo!
      2. Large-Scale ETL with Apache Pig
      3. Features That Make Pig Attractive
      4. Pig on Tez
      5. Moving Forward
    5. Pig at Particle News
      1. Compute Arrival Rate and Conversion Rate
      2. Compute Sessions Triggered by a Push
  15. A. Built-in User Defined Functions and PiggyBank
    1. Built-in UDFs
      1. Built-in Load and Store Functions
      2. Built-in Evaluation and Filter Functions
    2. PiggyBank
  16. Index