Real-Time Big Data Analytics

Book Description

Design, process, and analyze large sets of complex data in real time

About This Book

  • Get acquainted with transformations and database-level interactions, and ensure the reliability of messages processed using Storm

  • Implement strategies to solve the challenges of real-time data processing

  • Load datasets, build queries, and make recommendations using Spark SQL

    Who This Book Is For

    If you are a Big Data architect, developer, or a programmer who wants to develop applications/frameworks to implement real-time analytics using open source technologies, then this book is for you.

    What You Will Learn

  • Explore big data technologies and frameworks

  • Work through practical challenges and use cases of real-time analytics versus batch analytics

  • Develop real-world use cases for processing and analyzing data in real time using the programming paradigm of Apache Storm

  • Optimize and tune Apache Storm for varied workloads and production deployments

  • Process and stream data with Amazon Kinesis and Elastic MapReduce

  • Perform interactive and exploratory data analytics using Spark SQL

  • Develop common enterprise architectures/applications for real-time and batch analytics

    In Detail

    Enterprises have been striving to deal with the challenges of data arriving in real time or near real time.

    Although there are technologies such as Storm and Spark (and many more) that solve the challenges of real-time data, using the appropriate technology/framework for the right business use case is the key to success. This book provides you with the skills required to quickly design, implement, and deploy your real-time analytics using real-world examples of big data use cases.

    Moving on, we’ll familiarize you with Amazon Kinesis for real-time data processing in the cloud. We will further develop your understanding of real-time analytics through a comprehensive review of Apache Spark, along with its high-level architecture and the building blocks of a Spark program.

    You will learn how to transform your data, get output from your transformations, and persist your results using Spark RDDs, and how to work with Spark through an interface called Spark SQL.
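    The RDD workflow mentioned above — lazy transformations chained together, then an action that forces evaluation — is easiest to see in miniature. The following sketch uses plain Python generators as a stand-in for RDDs, so it runs without a Spark installation; the function names (`transform_map`, `action_collect`) are illustrative, not Spark APIs.

    ```python
    # Conceptual sketch of the RDD pattern, with Python generators standing in
    # for RDDs: transformations only describe the computation and compute
    # nothing; an "action" (collect) forces evaluation and returns results.

    def transform_map(data, fn):
        # Lazy "transformation": returns a generator, nothing runs yet.
        return (fn(x) for x in data)

    def transform_filter(data, pred):
        # Another lazy "transformation", chained onto the previous one.
        return (x for x in data if pred(x))

    def action_collect(data):
        # "Action": materializes the lazy pipeline into a concrete list.
        return list(data)

    raw = [1, 2, 3, 4, 5, 6]
    pipeline = transform_filter(transform_map(raw, lambda x: x * 10),
                                lambda x: x > 20)
    result = action_collect(pipeline)  # evaluation happens only here
    print(result)  # [30, 40, 50, 60]
    ```

    In actual Spark code the same shape appears as `rdd.map(...).filter(...).collect()`, with the difference that Spark distributes the work across a cluster and can persist intermediate RDDs for reuse.
    
    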

    Style and approach

    This book is an easy-to-follow, step-by-step tutorial filled with practical examples of basic and advanced features.

    Each topic is explained sequentially and supported by real-world examples and executable code snippets.

    Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

    Table of Contents

    1. Real-Time Big Data Analytics
      1. Table of Contents
      2. Real-Time Big Data Analytics
      3. Credits
      4. About the Authors
      5. About the Reviewer
      6. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Introducing the Big Data Technology Landscape and Analytics Platform
        1. Big Data – a phenomenon
        2. The Big Data dimensional paradigm
        3. The Big Data ecosystem
        4. The Big Data infrastructure
        5. Components of the Big Data ecosystem
          1. The Big Data analytics architecture
            1. Building business solutions
            2. Dataset processing
            3. Solution implementation
            4. Presentation
        6. Distributed batch processing
          1. Batch processing in distributed mode
            1. Push code to data
        7. Distributed databases (NoSQL)
          1. Advantages of NoSQL databases
          2. Choosing a NoSQL database
        8. Real-time processing
          1. The telecoms or cellular arena
          2. Transportation and logistics
          3. The connected vehicle
          4. The financial sector
        9. Summary
      9. 2. Getting Acquainted with Storm
        1. An overview of Storm
          1. The journey of Storm
          2. Storm abstractions
            1. Streams
            2. Topology
            3. Spouts
            4. Bolts
              1. Tasks
              2. Workers
        2. Storm architecture and its components
          1. A Zookeeper cluster
          2. A Storm cluster
        3. How and when to use Storm
        4. Storm internals
          1. Storm parallelism
          2. Storm internal message processing
        5. Summary
      10. 3. Processing Data with Storm
        1. Storm input sources
          1. Meet Kafka
            1. Getting to know more about Kafka
        2. Other sources for input to Storm
          1. A file as an input source
          2. A socket as an input source
          3. Kafka as an input source
        3. Reliability of data processing
          1. The concept of anchoring and reliability
          2. The Storm acking framework
        4. Storm simple patterns
          1. Joins
          2. Batching
        5. Storm persistence
          1. Storm's JDBC persistence framework
        6. Summary
      11. 4. Introduction to Trident and Optimizing Storm Performance
        1. Working with Trident
          1. Transactions
          2. Trident topology
            1. Trident tuples
            2. Trident spout
          3. Trident operations
            1. Merging and joining
            2. Filter
            3. Function
            4. Aggregation
            5. Grouping
            6. State maintenance
        2. Understanding LMAX
          1. Memory and cache
          2. Ring buffer – the heart of the disruptor
            1. Producers
            2. Consumers
        3. Storm internode communication
          1. ZeroMQ
            1. Storm ZeroMQ configurations
          2. Netty
        4. Understanding the Storm UI
          1. Storm UI landing page
          2. Topology home page
        5. Optimizing Storm performance
        6. Summary
      12. 5. Getting Acquainted with Kinesis
        1. Architectural overview of Kinesis
          1. Benefits and use cases of Amazon Kinesis
          2. High-level architecture
          3. Components of Kinesis
        2. Creating a Kinesis streaming service
          1. Access to AWS Kinesis
          2. Configuring the development environment
          3. Creating Kinesis streams
          4. Creating Kinesis stream producers
          5. Creating Kinesis stream consumers
          6. Generating and consuming crime alerts
        3. Summary
      13. 6. Getting Acquainted with Spark
        1. An overview of Spark
          1. Batch data processing
          2. Real-time data processing
          3. Apache Spark – a one-stop solution
          4. When to use Spark – practical use cases
        2. The architecture of Spark
          1. High-level architecture
          2. Spark extensions/libraries
          3. Spark packaging structure and core APIs
          4. The Spark execution model – master-worker view
        3. Resilient distributed datasets (RDD)
          1. RDD – by definition
            1. Fault tolerance
            2. Storage
            3. Persistence
            4. Shuffling
        4. Writing and executing our first Spark program
          1. Hardware requirements
          2. Installation of the basic software
            1. Spark
            2. Java
            3. Scala
            4. Eclipse
          3. Configuring the Spark cluster
          4. Coding a Spark job in Scala
          5. Coding a Spark job in Java
          6. Troubleshooting – tips and tricks
            1. Port numbers used by Spark
            2. Classpath issues – class not found exception
            3. Other common exceptions
        5. Summary
      14. 7. Programming with RDDs
        1. Understanding Spark transformations and actions
          1. RDD APIs
          2. RDD transformation operations
          3. RDD action operations
        2. Programming Spark transformations and actions
        3. Handling persistence in Spark
        4. Summary
      15. 8. SQL Query Engine for Spark – Spark SQL
        1. The architecture of Spark SQL
          1. The emergence of Spark SQL
          2. The components of Spark SQL
            1. The DataFrame API
              1. DataFrames and RDD
              2. User-defined functions
              3. DataFrames and SQL
            2. The Catalyst optimizer
            3. SQL and Hive contexts
        2. Coding our first Spark SQL job
          1. Coding a Spark SQL job in Scala
          2. Coding a Spark SQL job in Java
        3. Converting RDDs to DataFrames
          1. Automated process
          2. The manual process
        4. Working with Parquet
          1. Persisting Parquet data in HDFS
          2. Partitioning and schema evolution or merging
            1. Partitioning
            2. Schema evolution/merging
        5. Working with Hive tables
        6. Performance tuning and best practices
          1. Partitioning and parallelism
          2. Serialization
          3. Caching
          4. Memory tuning
        7. Summary
      16. 9. Analysis of Streaming Data Using Spark Streaming
        1. High-level architecture
          1. The components of Spark Streaming
          2. The packaging structure of Spark Streaming
            1. Spark Streaming APIs
            2. Spark Streaming operations
        2. Coding our first Spark Streaming job
          1. Creating a stream producer
          2. Writing our Spark Streaming job in Scala
          3. Writing our Spark Streaming job in Java
          4. Executing our Spark Streaming job
        3. Querying streaming data in real time
          1. The high-level architecture of our job
          2. Coding the crime producer
          3. Coding the stream consumer and transformer
          4. Executing the SQL Streaming Crime Analyzer
        4. Deployment and monitoring
          1. Cluster managers for Spark Streaming
            1. Executing Spark Streaming applications on Yarn
            2. Executing Spark Streaming applications on Apache Mesos
          2. Monitoring Spark Streaming applications
        5. Summary
      17. 10. Introducing Lambda Architecture
        1. What is Lambda Architecture
          1. The need for Lambda Architecture
          2. Layers/components of Lambda Architecture
        2. The technology matrix for Lambda Architecture
        3. Realization of Lambda Architecture
          1. High-level architecture
          2. Configuring Apache Cassandra and Spark
          3. Coding the custom producer
          4. Coding the real-time layer
          5. Coding the batch layer
          6. Coding the serving layer
          7. Executing all the layers
        4. Summary
      18. Index