Apache Oozie Essentials

Book Description

Unleash the power of Apache Oozie to create and manage your big data and machine learning pipelines in one go

About This Book

  • Teaches you everything you need to know to get started with Apache Oozie from scratch and manage your data pipelines effortlessly

  • Learn to write data ingestion workflows with the help of real-life examples from the author’s own personal experience

  • Embed Spark jobs to run your machine learning models on top of Hadoop

Who This Book Is For

    If you are an expert Hadoop user who wants to use Apache Oozie to handle workflows efficiently, this book is for you. This book will be handy to anyone who is familiar with the basics of Hadoop and wants to automate data and machine learning pipelines.

    What You Will Learn

  • Install and configure Oozie from source code on your Hadoop cluster

  • Dive into the world of Oozie with Java MapReduce jobs

  • Schedule Hive ETL and data ingestion jobs

  • Import data from a database into HDFS through Sqoop jobs

  • Create and process data pipelines with Pig and Hive scripts as per business requirements

  • Run machine learning Spark jobs on Hadoop

  • Create quick Oozie jobs using Hue

  • Make the most of Oozie's security capabilities by configuring it correctly

In Detail

    As more and more organizations discover the value of big data analytics, interest in platforms that provide storage, computation, and analytic capabilities is growing rapidly. Managing all that data calls for reliable tooling: Hadoop provides the storage and computation, and Oozie fills the remaining gap by acting as a cron-like scheduler for Hadoop jobs, so that data analysis can run reliably and on time.

    Apache Oozie Essentials starts off with the basics, from installing and configuring Oozie from source code on your Hadoop cluster to managing complex clusters. You will learn how to create data ingestion and machine learning workflows.

    This book is sprinkled with examples and exercises to help you take your big data learning to the next level. You will discover how to write workflows to run your MapReduce, Pig, Hive, and Sqoop scripts, and how to schedule them with a coordinator, either at a specific time or in response to a specific business requirement. This book has engaging real-life exercises and examples to get you in the thick of things. Lastly, you'll get a grip on how to embed Spark jobs, which can be used to run your machine learning models on Hadoop.
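    To give a flavor of what such a workflow looks like, here is a minimal, illustrative workflow.xml sketch for a single Pig action; the workflow, action, and script names are placeholders of our own, not examples from the book:

    ```xml
    <workflow-app name="sales-ingest-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="ingest"/>
        <!-- One Pig action: run a cleanup script, then finish or fail -->
        <action name="ingest">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clean_sales.pig</script>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig job failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
        </kill>
        <end name="end"/>
    </workflow-app>
    ```

    A coordinator definition then wraps a workflow like this one to run it on a schedule or when its input datasets become available.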

    By the end of the book, you will have a good knowledge of Apache Oozie. You will be capable of using Oozie to handle large Hadoop workflows and even improve the availability of your Hadoop environment.
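    For orientation, submitting any Oozie job follows the same pattern the book walks through: a job.properties file pointing at the application path in HDFS, handed to the oozie command line. A hedged sketch, with hostnames, ports, and paths as illustrative placeholders:

    ```properties
    # job.properties -- all values below are illustrative placeholders
    nameNode=hdfs://namenode:8020
    jobTracker=resourcemanager:8032
    # HDFS directory containing workflow.xml
    oozie.wf.application.path=${nameNode}/user/hadoop/apps/sales-ingest-wf
    ```

    The job is then submitted and started with `oozie job -oozie http://localhost:11000/oozie -config job.properties -run`.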

    Style and approach

    This book is a hands-on guide that explains Oozie using real-world examples. Each chapter blends fundamental concepts with case study solutions and is topped off with self-learning exercises.

    Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Apache Oozie Essentials
      1. Table of Contents
      2. Apache Oozie Essentials
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Setting up Oozie
        1. Configuring Oozie in Hortonworks distribution
        2. Installing Oozie using tar ball
          1. Creating a test virtual machine
          2. Building Oozie source code
            1. Summary of the build script
            2. Codehaus Maven move
            3. Download dependency jars
            4. Preparing to create a WAR file
            5. Create a WAR file
          3. Configure Oozie MySQL database
          4. Configure the shared library
          5. Start server testing and verification
        3. Summary
      9. 2. My First Oozie Job
        1. Installing and configuring Hue
        2. Oozie concepts
          1. Workflows
          2. Coordinator
          3. Bundles
        3. Book case study
        4. Running our first Oozie job
        5. Types of nodes
          1. Control flow nodes
          2. Action nodes
        6. Oozie web console
        7. The Oozie command line
        8. Summary
      10. 3. Oozie Fundamentals
        1. Chapter case study
          1. The Decision node
          2. The Email action
          3. Expression Language functions
            1. Basic EL constants
            2. Basic EL functions
            3. Workflow EL functions
            4. Hadoop EL constants
            5. HDFS EL functions
          4. Email action configuration
          5. Job property file
          6. Submission from the command line
          7. Workflow states
        2. Summary
      11. 4. Running MapReduce Jobs
        1. Chapter case study
        2. Running MapReduce jobs from Oozie
          1. The job.properties file
          2. Running the job
        3. Running Oozie MapReduce job
        4. Coordinators
          1. Datasets
            1. Frequency and time
            2. Cron syntax for frequency
            3. Timezone
            4. The <done-flag> tag
            5. Initial instance
        5. My first Coordinator
          1. Coordinator v1 definition
            1. job.properties v1 definition
          2. Coordinator v2 definition
            1. job.properties v2 definition
            2. Checking the job log
        6. Running a MapReduce streaming job
        7. Summary
      12. 5. Running Pig Jobs
        1. Chapter case study
        2. The Pig command line
        3. The config-default.xml file
        4. Pig action
        5. Pig Coordinator job v2
        6. Parameters in the Dataset's input and output events
          1. current(int n)
          2. hoursInDay(int n)
          3. daysInMonth(int n)
          4. latest(int n)
        7. Coordinator controls
        8. Pig Coordinator job v3
        9. Summary
      13. 6. Running Hive Jobs
        1. Chapter case study
        2. Running a Hive job from the command line
        3. Hive action
          1. Validating Oozie Workflow
        4. Hive 2 action
        5. Parameterization of Coordinator jobs
          1. dateOffset(String baseDate, int instance, String timeUnit)
          2. dateTzOffset(String baseDate, String timezone)
          3. formatTime(String timeStamp, String format)
        6. Summary
      14. 7. Running Sqoop Jobs
        1. Chapter case study
        2. Running Sqoop command line
        3. Sqoop action
        4. HCatalog
          1. HCatalog datasets
          2. HCatalog EL functions
          3. HCatalog Coordinator functions
          4. Pig script
          5. The job.properties file
          6. The Sqoop action Coordinator
            1. Running the job
            2. Checking data in the Hive table
        5. Summary
      15. 8. Running Spark Jobs
        1. Spark action
        2. Bundles
        3. Data pipelines
        4. Summary
      16. 9. Running Oozie in Production
        1. Packaging and continuous delivery
        2. Oozie in secured cluster
        3. Rerun
          1. Rerun Workflow
          2. Rerun Coordinator
          3. Rerun Bundle
        4. Summary
      17. Index