Pig Design Patterns

Book Description

Simplify Hadoop programming to create complex end-to-end Enterprise Big Data solutions with Pig

In Detail

Pig Design Patterns is a comprehensive guide that enables readers to use design patterns that simplify the creation of complex data pipelines at various stages of data management. This book focuses on using Pig in an enterprise context, bridging the gap between theoretical understanding and practical implementation. Each chapter contains a set of design patterns that pose and then solve technical challenges relevant to enterprise use cases.

The book covers the journey of Big Data from the time it enters the enterprise to its eventual use in analytics, in the form of a report or a predictive model. By the end of the book, readers will appreciate Pig's real power in addressing the problems encountered when creating an analytics-based data product. Each design pattern comes with a suggested solution, an analysis of the trade-offs of implementing the solution in a different way, an explanation of how the code works, and the results.

What You Will Learn

  • Understand Pig's relevance in an enterprise context
  • Use Pig in design patterns that enable data movement across platforms during and after analytical processing
  • See how Pig can co-exist with other components of the Hadoop ecosystem to create Big Data solutions using design patterns
  • Simplify the process of creating complex data pipelines using transformations, aggregations, enrichment, cleansing, filtering, reformatting, lookups, and data type conversions
  • Apply knowledge of Pig in design patterns that deal with integration of Hadoop with other systems to enable multi-platform analytics
  • Comprehend design patterns and use Pig in cases related to complex analysis of purely structured data

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Pig Design Patterns
      1. Table of Contents
      2. Pig Design Patterns
      3. Credits
      4. Foreword
      5. About the Author
      6. Acknowledgments
      7. About the Reviewers
      8. www.PacktPub.com
        1. Support files, eBooks, discount offers and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      9. Preface
        1. What this book covers
          1. Motivation for this book
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
            1. Third-party libraries
            2. Datasets
          2. Errata
          3. Piracy
          4. Questions
      10. 1. Setting the Context for Design Patterns in Pig
        1. Understanding design patterns
        2. The scope of design patterns in Pig
        3. Hadoop demystified – a quick reckoner
          1. The enterprise context
          2. Common challenges of distributed systems
          3. The advent of Hadoop
          4. Hadoop under the covers
          5. Understanding the Hadoop Distributed File System
            1. HDFS design goals
            2. Working of HDFS
          6. Understanding MapReduce
            1. Understanding how MapReduce works
            2. The MapReduce internals
        4. Pig – a quick intro
          1. Understanding the rationale of Pig
          2. Understanding the relevance of Pig in the enterprise
          3. Working of Pig – an overview
            1. Firing up Pig
            2. The use case
            3. Code listing
            4. The dataset
        5. Understanding Pig through the code
          1. Pig's extensibility
          2. Operators used in code
          3. The EXPLAIN operator
          4. Understanding Pig's data model
            1. Primitive types
            2. Complex types
              1. The relevance of schemas
        6. Summary
      11. 2. Data Ingest and Egress Patterns
        1. The context of data ingest and egress
        2. Types of data in the enterprise
        3. Ingest and egress patterns for multistructured data
          1. Considerations for log ingestion
            1. The Apache log ingestion pattern
            2. Background
            3. Motivation
            4. Use cases
            5. Pattern implementation
            6. Code snippets
              1. Code for the CommonLogLoader class
              2. Code for the CombinedLogLoader class
            7. Results
            8. Additional information
          2. The custom log ingestion pattern
            1. Background
            2. Motivation
            3. Use cases
            4. Pattern implementation
            5. Code snippets
            6. Results
            7. Additional information
          3. The image ingress and egress pattern
            1. Background
            2. Motivation
            3. Use cases
            4. Pattern implementation
              1. The image ingress implementation
              2. The image egress implementation
            5. Code snippets
              1. The image ingress
                1. Pig script
                2. Image to a sequence UDF snippet
              2. The image egress
                1. Pig script
                2. Sequence to an image UDF
            6. Results
            7. Additional information
        4. The ingress and egress patterns for the NoSQL data
          1. MongoDB ingress and egress patterns
            1. Background
            2. Motivation
            3. Use cases
            4. Pattern implementation
              1. The ingress implementation
              2. The egress implementation
            5. Code snippets
              1. The ingress code
              2. The egress code
            6. Results
            7. Additional information
          2. The HBase ingress and egress pattern
            1. Background
            2. Motivation
            3. Use cases
            4. Pattern implementation
              1. The ingress implementation
              2. The egress implementation
            5. Code snippets
              1. The ingress code
              2. The egress code
            6. Results
            7. Additional information
        5. The ingress and egress patterns for structured data
          1. The Hive ingress and egress patterns
            1. Background
            2. Motivation
            3. Use cases
            4. Pattern implementation
              1. The ingress implementation
              2. The egress implementation
            5. Code snippets
              1. The ingress code
                1. Importing data using RCFile
                2. Importing data using HCatalog
              2. The egress code
            6. Results
            7. Additional information
        6. The ingress and egress patterns for semi-structured data
          1. The mainframe ingestion pattern
            1. Background
            2. Motivation
            3. Use cases
            4. Pattern implementation
            5. Code snippets
            6. Results
            7. Additional information
          2. XML ingest and egress patterns
            1. Background
            2. Motivation
              1. Motivation for ingesting raw XML
              2. Motivation for ingesting binary XML
              3. Motivation for egression of XML
            3. Use cases
            4. Pattern implementation
              1. The implementation of the XML raw ingestion
              2. The implementation of the XML binary ingestion
          3. Code snippets
            1. The XML raw ingestion code
            2. The XML binary ingestion code
            3. The XML egress code
              1. Pig script
              2. The XML storage
            4. Results
            5. Additional information
        7. JSON ingress and egress patterns
          1. Background
            1. Motivation
            2. Use cases
            3. Pattern implementation
              1. The ingress implementation
              2. The egress implementation
            4. Code snippets
              1. The ingress code
                1. The code for simple JSON
                2. The code for nested JSON
              2. The egress code
            5. Results
            6. Additional information
        8. Summary
      12. 3. Data Profiling Patterns
        1. Data profiling for Big Data
          1. Big Data profiling dimensions
          2. Sampling considerations for profiling Big Data
            1. Sampling support in Pig
        2. Rationale for using Pig in data profiling
        3. The data type inference pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
            1. Pig script
            2. Java UDF
          6. Results
          7. Additional information
        4. The basic statistical profiling pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
            1. Pig script
            2. Macro
          6. Results
          7. Additional information
        5. The pattern-matching pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
            1. Pig script
            2. Macro
          6. Results
          7. Additional information
        6. The string profiling pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
            1. Pig script
            2. Macro
          6. Results
          7. Additional information
        7. The unstructured text profiling pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
            1. Pig script
            2. Java UDF for stemming
            3. Java UDF for generating TF-IDF
          6. Results
          7. Additional information
        8. Summary
      13. 4. Data Validation and Cleansing Patterns
        1. Data validation and cleansing for Big Data
        2. Choosing Pig for validation and cleansing
        3. The constraint validation and cleansing design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        4. The regex validation and cleansing design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        5. The corrupt data validation and cleansing design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        6. The unstructured text data validation and cleansing design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        7. Summary
      14. 5. Data Transformation Patterns
        1. Data transformation processes
        2. The structured-to-hierarchical transformation pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        3. The data normalization pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        4. The data integration pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        5. The aggregation pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        6. The data generalization pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        7. Summary
      15. 6. Understanding Data Reduction Patterns
        1. Data reduction – a quick introduction
        2. Data reduction considerations for Big Data
        3. Dimensionality reduction – the Principal Component Analysis design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
            1. Limitations of PCA implementation
          5. Code snippets
          6. Results
          7. Additional information
        4. Numerosity reduction – the histogram design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        5. Numerosity reduction – sampling design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        6. Numerosity reduction – clustering design pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        7. Summary
      16. 7. Advanced Patterns and Future Work
        1. The clustering pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        2. The topic discovery pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        3. The natural language processing pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        4. The classification pattern
          1. Background
          2. Motivation
          3. Use cases
          4. Pattern implementation
          5. Code snippets
          6. Results
          7. Additional information
        5. Future trends
          1. Emergence of data-driven patterns
          2. The emergence of solution-driven patterns
          3. Patterns addressing programmability constraints
        6. Summary
      17. Index