Apache Oozie

Book Description

Get a solid grounding in Apache Oozie, the workflow scheduler system for managing Hadoop jobs. With this hands-on guide, two experienced Hadoop practitioners walk you through the intricacies of this powerful and flexible platform, with numerous examples and real-world use cases.

Once you set up your Oozie server, you’ll dive into techniques for writing and coordinating workflows, and learn how to write complex data pipelines. Advanced topics show you how to handle shared libraries in Oozie, as well as how to implement and manage Oozie’s security capabilities.

Table of Contents

  1. Foreword
  2. Preface
    1. Contents of This Book
    2. Conventions Used in This Book
    3. Using Code Examples
    4. Safari® Books Online
    5. How to Contact Us
    6. Acknowledgments
  3. 1. Introduction to Oozie
    1. Big Data Processing
      1. A Recurrent Problem
      2. A Common Solution: Oozie
      3. A Simple Oozie Job
      4. Oozie Releases
      5. Some Oozie Usage Numbers
  4. 2. Oozie Concepts
    1. Oozie Applications
      1. Oozie Workflows
      2. Oozie Coordinators
      3. Oozie Bundles
    2. Parameters, Variables, and Functions
    3. Application Deployment Model
    4. Oozie Architecture
  5. 3. Setting Up Oozie
    1. Oozie Deployment
    2. Basic Installations
      1. Requirements
      2. Build Oozie
      3. Install Oozie Server
      4. Hadoop Cluster
      5. Start and Verify the Oozie Server
    3. Advanced Oozie Installations
      1. Configuring Kerberos Security
      2. DB Setup
      3. Shared Library Installation
      4. Oozie Client Installations
  6. 4. Oozie Workflow Actions
    1. Workflow
    2. Actions
      1. Action Execution Model
      2. Action Definition
    3. Action Types
      1. MapReduce Action
      2. Java Action
      3. Pig Action
      4. FS Action
      5. Sub-Workflow Action
      6. Hive Action
      7. DistCp Action
      8. Email Action
      9. Shell Action
      10. SSH Action
      11. Sqoop Action
    4. Synchronous Versus Asynchronous Actions
  7. 5. Workflow Applications
    1. Outline of a Basic Workflow
    2. Control Nodes
      1. <start> and <end>
      2. <fork> and <join>
      3. <decision>
      4. <kill>
      5. <OK> and <ERROR>
    3. Job Configuration
      1. Global Configuration
      2. Job XML
      3. Inline Configuration
      4. Launcher Configuration
    4. Parameterization
      1. EL Variables
      2. EL Functions
      3. EL Expressions
    5. The job.properties File
      1. Command-Line Option
      2. The config-default.xml File
      3. The <parameters> Section
    6. Configuration and Parameterization Examples
    7. Lifecycle of a Workflow
      1. Action States
  8. 6. Oozie Coordinator
    1. Coordinator Concept
    2. Triggering Mechanism
      1. Time Trigger
      2. Data Availability Trigger
    3. Coordinator Application and Job
      1. Coordinator Action
      2. Our First Coordinator Job
      3. Coordinator Submission
      4. Oozie Web Interface for Coordinator Jobs
    4. Coordinator Job Lifecycle
    5. Coordinator Action Lifecycle
    6. Parameterization of the Coordinator
      1. EL Functions for Frequency
      2. Day-Based Frequency
      3. Month-Based Frequency
    7. Execution Controls
    8. An Improved Coordinator
  9. 7. Data Trigger Coordinator
    1. Expressing Data Dependency
      1. Dataset
    2. Example: Rollup
    3. Parameterization of Dataset Instances
      1. current(n)
      2. latest(n)
    4. Parameter Passing to Workflow
      1. dataIn(eventName)
      2. dataOut(eventName)
      3. nominalTime()
      4. actualTime()
      5. dateOffset(baseTimeStamp, skipInstance, timeUnit)
      6. formatTime(timeStamp, formatString)
    5. A Complete Coordinator Application
  10. 8. Oozie Bundles
    1. Bundle Basics
      1. Bundle Definition
      2. Why Do We Need Bundles?
    2. Bundle Specification
      1. Execution Controls
    3. Bundle State Transitions
  11. 9. Advanced Topics
    1. Managing Libraries in Oozie
      1. Origin of JARs in Oozie
      2. Design Challenges
      3. Managing Action JARs
      4. Supporting the User’s JAR
      5. JAR Precedence in classpath
    2. Oozie Security
      1. Oozie Security Overview
      2. Oozie to Hadoop
      3. Oozie Client to Server
      4. Supporting Custom Credentials
    3. Supporting New API in MapReduce Action
    4. Supporting Uber JAR
    5. Cron Scheduling
      1. A Simple Cron-Based Coordinator
      2. Oozie Cron Specification
    6. Emulate Asynchronous Data Processing
    7. HCatalog-Based Data Dependency
  12. 10. Developer Topics
    1. Developing Custom EL Functions
      1. Requirements for a New EL Function
      2. Implementing a New EL Function
    2. Supporting Custom Action Types
      1. Creating a Custom Synchronous Action
    3. Overriding an Asynchronous Action Type
      1. Implementing the New ActionMain Class
      2. Testing the New Main Class
    4. Creating a New Asynchronous Action
      1. Writing an Asynchronous Action Executor
      2. Writing the ActionMain Class
      3. Writing Action’s Schema
      4. Deploying the New Action Type
      5. Using the New Action Type
  13. 11. Oozie Operations
    1. Oozie CLI Tool
      1. CLI Subcommands
      2. Useful CLI Commands
    2. Oozie REST API
    3. Oozie Java Client
    4. The oozie-site.xml File
    5. The Oozie Purge Service
    6. Job Monitoring
      1. JMS-Based Monitoring
    7. Oozie Instrumentation and Metrics
    8. Reprocessing
      1. Workflow Reprocessing
      2. Coordinator Reprocessing
      3. Bundle Reprocessing
    9. Server Tuning
      1. JVM Tuning
      2. Service Settings
    10. Oozie High Availability
    11. Debugging in Oozie
      1. Oozie Logs
      2. Developing and Testing Oozie Applications
      3. Application Deployment Tips
      4. Common Errors and Debugging
    12. MiniOozie and LocalOozie
    13. The Competition
  14. Index