Hadoop Blueprints

Book Description

Use Hadoop to solve business problems by learning from a rich set of real-life case studies

About This Book

  • Solve real-world business problems using Hadoop and other Big Data technologies

  • Build efficient data lakes in Hadoop, and develop systems for various business cases like improving marketing campaigns, fraud detection, and more

  • Packed with six case studies to get you going with Hadoop for Business Intelligence

    Who This Book Is For

    If you are interested in building efficient business solutions using Hadoop, this is the book for you. This book assumes that you have basic knowledge of Hadoop, Java, and any scripting language.

    What You Will Learn

  • Learn about the evolution of Hadoop as the big data platform

  • Understand the basics of Hadoop architecture

  • Build a 360-degree view of your customer using Sqoop and Hive

  • Build and run classification models on Hadoop using BigML

  • Use Spark and Hadoop to build a fraud detection system

  • Develop a churn detection system using Java and MapReduce

  • Build an IoT-based data collection and visualization system

  • Get to grips with building a Hadoop-based Data Lake for large enterprises

  • Learn about the coexistence of NoSQL and In-Memory databases in the Hadoop ecosystem

    In Detail

    If you have a basic understanding of Hadoop and want to put your knowledge to use to build fantastic Big Data solutions for business, then this book is for you. Build six real-life, end-to-end solutions using the tools in the Hadoop ecosystem, and take your knowledge of Hadoop to the next level.

    Start off by understanding the various business problems that can be solved using Hadoop, and get acquainted with the common architectural patterns used to build Hadoop-based solutions. Build a 360-degree view of the customer by working with different types of data, and build an efficient fraud detection system for a financial institution. You will also develop a system in Hadoop to improve the effectiveness of marketing campaigns. Build a churn detection system for a telecom company, develop an Internet of Things (IoT) system to monitor the environment in a factory, and build a data lake, all using the concepts and techniques covered in this book.

    The book covers other technologies and frameworks like Apache Spark, Hive, Sqoop, and more, and how they can be used in conjunction with Hadoop. You will be able to try out the solutions explained in the book and use the knowledge gained to extend them further in your own problem space.

    Style and approach

    This is an example-driven book in which each chapter covers a single business problem and describes its solution by explaining the structure of the dataset and the tools required to process it. Every project is demonstrated step by step and explained in an easy-to-understand manner.

    Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account. If you purchased this book elsewhere, you can register on the Packt website to have the code files sent to you.

    Table of Contents

    1. Hadoop Blueprints
      1. Hadoop Blueprints
      2. Credits
      3. About the Authors
      4. About the Reviewers
        1. Why subscribe?
      6. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      7. 1. Hadoop and Big Data
        1. The beginning of the big data problem
          1. Limitations of RDBMS systems
          2. Scaling out a database on Google
          3. Parallel processing of large datasets
        2. Building open source Hadoop
        3. Enterprise Hadoop
          1. Social media and mobile channels
          2. Data storage cost reduction
          3. Enterprise software vendors
          4. Pure Play Hadoop vendors
          5. Cloud Hadoop vendors
        4. The design of the Hadoop system
          1. The Hadoop Distributed File System (HDFS)
            1. Data organization in HDFS
            2. HDFS file management commands
            3. NameNode and DataNodes
            4. Metadata store in NameNode
            5. Preventing a single point of failure with Hadoop HA
            6. Checkpointing process
            7. Data Store on a DataNode
            8. Handshakes and heartbeats
        5. MapReduce
          1. The execution model of MapReduce Version 1
          2. Apache YARN
        6. Building a MapReduce Version 2 program
          1. Problem statement
          2. Solution workflow
            1. Getting the dataset
            2. Studying the dataset
            3. Cleaning the dataset
            4. Loading the dataset on the HDFS
            5. Starting with a MapReduce program
              1. Installing Eclipse
            6. Creating a project in Eclipse
            7. Coding and building a MapReduce program
            8. Run the MapReduce program locally
            9. Examine the result
            10. Run the MapReduce program on Hadoop
              1. Further processing of results
        7. Hadoop platform tools
          1. Data ingestion tools
          2. Data access tools
          3. Monitoring tools
          4. Data governance tools
        8. Big data use cases
          1. Creating a 360-degree view of a customer
          2. Fraud detection systems for banks
          3. Marketing campaign planning
          4. Churn detection in telecom
          5. Analyzing sensor data
          6. Building a data lake
        9. The architecture of Hadoop-based systems
          1. Lambda architecture
        10. Summary
      8. 2. A 360-Degree View of the Customer
        1. Capturing business information
          1. Collecting data from data sources
          2. Creating a data processing approach
          3. Presenting the results
        2. Setting up the technology stack
          1. Tools used
          2. Installing Hortonworks Sandbox
          3. Creating user accounts
          4. Exploring HUE
          5. Exploring MySQL and the Hive command line
          6. Exploring Sqoop at the command line
        3. Test driving Hive and Sqoop
          1. Querying data using Hive
          2. Importing data in Hive using Sqoop
        4. Engineering the solution
          1. Datasets
            1. Loading customer master data into Hadoop
            2. Loading web logs into Hadoop
            3. Loading tweets into Hadoop
          2. Creating the 360-degree view
          3. Exporting data from Hadoop
        5. Presenting the view
          1. Building a web application
          2. Installing Node.js
          3. Coding the web application in Node.js
        6. Summary
      9. 3. Building a Fraud Detection System
        1. Understanding the business problem
        2. Selecting and cleansing the dataset
          1. Finding relevant fields
        3. Machine learning for fraud detection
          1. Clustering as an unsupervised machine learning method
        4. Designing the high-level architecture
          1. Introducing Apache Spark
            1. Apache Spark architecture
            2. Resilient Distributed Datasets
              1. Transformation functions
              2. Actions
            3. Test driving Apache Spark
            4. Calculating the yearly average stock prices using Spark
          2. Apache Spark 2.X
          3. Understanding MLlib
          4. Test driving K-means using MLlib
        5. Creating our fraud detection model
          1. Building our K-means clustering model
            1. Processing the data
        6. Putting the fraud detection model to use
          1. Generating a data stream
          2. Processing the data stream using Spark streaming
          3. Putting the model to use
          4. Scaling the solution
          5. Summary
      10. 4. Marketing Campaign Planning
        1. Creating the solution outline
        2. Supervised learning
        3. Tree-structure models for classification
        4. Finding the right dataset
        5. Setting up the solution architecture
          1. Coupon scan at POS
          2. Join and transform
          3. Train the classification model
          4. Scoring
          5. Mail merge
        6. Building the machine learning model
          1. Introducing BigML
          2. Model building steps
          3. Sign up as a user on BigML site
          4. Upload the data file
          5. Creating the dataset
          6. Building the classification model
          7. Downloading the classification model
        7. Running the Model on Hadoop
        8. Creating the target list
        9. Post campaign activities
        10. Summary
      11. 5. Churn Detection
        1. A business case for churn detection
        2. Creating the solution outline
          1. Building a predictive model using Hadoop
          2. Bayes' Theorem
          3. Playing with the Bayesian predictor
          4. Running a Node.js-based Bayesian predictor
          5. Understanding the predictor code
          6. Limitations of our solution
        3. Building a churn predictor using Hadoop
          1. Synthetic data generation tools
          2. Preparing a synthetic historical churn dataset
          3. The processing approach
          4. Running the MapReduce program
          5. Understanding the frequency counter code
          6. Putting the model to use
          7. Integrating the churn predictor
        4. Summary
      12. 6. Analyze Sensor Data Using Hadoop
        1. A business case for sensor data analytics
        2. Creating the solution outline
        3. Technology stack
          1. Kafka
          2. Flume
          3. HDFS
            1. Hive
            2. OpenTSDB
          4. HBase
          5. Grafana
        4. Batch data analytics
          1. Loading streams of sensor data from Kafka topics to HDFS
          2. Using Hive to perform analytics on inserted data
          3. Data visualization in MS Excel
        5. Stream data analytics
          1. Loading streams of sensor data
          2. Data visualization using Grafana
        6. Summary
      13. 7. Building a Data Lake
        1. Data lake building blocks
          1. Ingestion tier
          2. Storage tier
          3. Insights tier
          4. Ops facilities
          5. Limitation of open source Hadoop ecosystem tools
        2. Hadoop security
          1. HDFS permissions model
            1. Fine-grained permissions with HDFS ACLs
        3. Apache Ranger
          1. Installing Apache Ranger
          2. Test driving Apache Ranger
          3. Define services and access policies
          4. Examine the audit logs
          5. Viewing users and groups in Ranger
          6. Data Lake security with Apache Ranger
        4. Apache Flume
          1. Understanding the Design of Flume
          2. Installing Apache Flume
          3. Running Apache Flume
        5. Apache Zeppelin
          1. Installation of Apache Zeppelin
          2. Test driving Zeppelin
          3. Exploring data visualization features of Zeppelin
            1. Define the gold price movement table in Hive
            2. Load gold price history in the Table
            3. Run a select query
            4. Plot price change per month
            5. Running the paragraph
            6. Zeppelin in Data Lake
        6. Technology stack for Data Lake
        7. Data Lake business requirements
          1. Understanding the business requirements
          2. Understanding the IT systems and security
          3. Designing the data pipeline
          4. Building the data pipeline
          5. Setting up the access control
            1. Synchronizing the users and groups in Ranger
            2. Setting up data access policies in Ranger
            3. Restricting the access in Zeppelin
          6. Testing our data pipeline
          7. Scheduling the data loading
          8. Refining the business requirements
          9. Implementing the new requirements
            1. Loading the stock holding data in Data Lake
            2. Restricting the access to stock holding data in Data Lake
            3. Testing the Loaded Data with Zeppelin
          10. Adding stock feed in the Data Lake
            1. Fetching data from Yahoo Service
            2. Configuring Flume
            3. Running Flume as Stock Feeder to Data Lake
            4. Transforming the data in Data Lake
          11. Growing Data Lake
        8. Summary
      14. 8. Future Directions
        1. Hadoop solutions team
          1. The role of the data engineer
          2. Data science for non-experts
          3. From the data science model to business value
        2. Hadoop on Cloud
          1. Deploying Hadoop on cloud servers
            1. Using Hadoop as a service
        3. NoSQL databases
          1. Types of NoSQL databases
          2. Common observations about NoSQL databases
          3. In-memory databases
          4. Apache Ignite as an in-memory database
          5. Apache Ignite as a Hadoop accelerator
          6. Apache Spark versus Apache Ignite
        4. Summary