Foreword

First developed when I was at Yahoo! in 2008, Apache Oozie remains the most sophisticated and powerful workflow scheduler for managing Apache Hadoop jobs. Although simpler open source alternatives have been introduced, Oozie is still my recommended workflow scheduler due to its ability to handle complexity, ease of integration with established and emerging Hadoop components (like Spark), and the growing ecosystem of projects, such as Apache Falcon, that rely on its workflow engine.

That said, Oozie also remains one of the more challenging schedulers to learn and master. If ever a system required a comprehensive user’s manual, Oozie is it. To take advantage of the full power that Oozie has to offer, developers need the guidance and advice of expert users. That is why I am delighted to see this book get published.

When Oozie was first developed, I was Chief Architect of Yahoo!’s Search and Advertising Technology Group. At the time, our group was starting to migrate the event-processing pipelines of our advertising products from a proprietary technology stack to Apache Hadoop.

The advertising pipelines at Yahoo! were extremely complex. Data was processed in batches that ranged from 5 minutes to 30 days in length, with aggregates “graduating” in complex ways from one time scale to another. In addition, these pipelines needed to detect and gracefully handle late data, missing data, software bugs tickled by “black swan” event data, and software bugs introduced by recent software pushes. On top of all of that, billions of dollars of revenue—and a good deal of the company’s growth prospects—depended on these pipelines, raising the stakes for data quality, security, and compliance. We had about a half-dozen workflow systems in use back then, and there was a lot of internal competition to be selected as the standard for Hadoop. Ultimately, the design for Oozie came from ideas from two systems: PacMan, a system already integrated with Hadoop, and Lexus, a system already in place for the advertising pipelines.

Oozie’s origins as a second-generation system designed to meet the needs of extremely complicated applications are both a strength and a weakness. On the positive side, there is no use case or scenario that Oozie can’t handle and, if you know what you’re doing, handle well. On the negative side, Oozie suffers from the over-engineering that you’d expect from the second-system effect. It has complex features that are great for handling complicated applications, but can be very nonintuitive for inexperienced users. To these newer users, I want to say that Oozie is worth the investment of your time. While the newer, simpler workflow schedulers are much easier to use for simple pipelines, it is in the nature of data pipelines to grow more sophisticated over time. The simpler tools will ultimately limit the solutions that you can create. Don’t limit yourself.

As guides to Oozie, there can be no better experts than Aravind Srinivasan and Mohammad Kamrul Islam. Aravind represents the “voice of the user,” as he was one of the engineers who moved Yahoo!’s advertising pipelines over to Oozie, bringing the lessons of Lexus to the Oozie developers. Subsequently, he has worked on many other Oozie applications, both inside and outside of Yahoo!. Mohammad represents the “voice of the developer,” as a core contributor to Oozie since its 1.x days. Mohammad is currently Vice President of the Oozie project at the Apache Software Foundation, and he also makes significant contributions to other Hadoop-related projects such as YARN and Tez.

In this book, the authors have striven for practicality, focusing on the concepts, principles, tips, and tricks necessary for developers to get the most out of Oozie. A volume such as this is long overdue. Developers will get a lot more out of the Hadoop ecosystem by reading it.
