Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

Book Description

This book is a step-by-step guide for learning how to use Spark for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. It covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, MLlib, and Spark ML.

Big Data Analytics with Spark shows you how to use Spark's easy-to-use features to increase your productivity. You will learn to perform fast data analysis using its in-memory caching and advanced execution engine, to use its in-memory computing capabilities to build high-performance machine learning and low-latency interactive analytics applications, and much more. The book also shows you how to use Spark as a single integrated platform for a variety of data processing tasks, including ETL pipelines, BI, live data stream processing, graph analytics, and machine learning.

The book also includes a chapter on Scala, a widely used functional programming language and the language in which Spark itself is written. You’ll learn the basics of functional programming in Scala, so that you can write Spark applications in it.

Big Data Analytics with Spark also introduces other big data technologies that are commonly used along with Spark, such as HDFS, Avro, Parquet, Kafka, Cassandra, HBase, and Mesos, as well as machine learning and graph concepts. The book is therefore self-contained: it covers all the technologies that you need to know to use Spark. The only prerequisite is some programming knowledge in any language.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Dedication
  5. Contents at a Glance
  6. Contents
  7. About the Author
  8. About the Technical Reviewers
  9. Acknowledgments
  10. Introduction
  11. Chapter 1: Big Data Technology Landscape
    1. Hadoop
      1. HDFS (Hadoop Distributed File System)
      2. MapReduce
      3. Hive
    2. Data Serialization
      1. Avro
      2. Thrift
      3. Protocol Buffers
      4. SequenceFile
    3. Columnar Storage
      1. RCFile
      2. ORC
      3. Parquet
    4. Messaging Systems
      1. Kafka
      2. ZeroMQ
    5. NoSQL
      1. Cassandra
      2. HBase
    6. Distributed SQL Query Engine
      1. Impala
      2. Presto
      3. Apache Drill
    7. Summary
  12. Chapter 2: Programming in Scala
    1. Functional Programming (FP)
      1. Functions
      2. Immutable Data Structures
      3. Everything Is an Expression
    2. Scala Fundamentals
      1. Getting Started
      2. Basic Types
      3. Variables
      4. Functions
      5. Classes
      6. Singletons
      7. Case Classes
      8. Pattern Matching
      9. Operators
      10. Traits
      11. Tuples
      12. Option Type
      13. Collections
    3. A Standalone Scala Application
    4. Summary
  13. Chapter 3: Spark Core
    1. Overview
      1. Key Features
      2. Ideal Applications
    2. High-level Architecture
      1. Workers
      2. Cluster Managers
      3. Driver Programs
      4. Executors
      5. Tasks
    3. Application Execution
      1. Terminology
      2. How an Application Works
    4. Data Sources
    5. Application Programming Interface (API)
      1. SparkContext
      2. Resilient Distributed Datasets (RDD)
      3. Creating an RDD
      4. RDD Operations
      5. Saving an RDD
    6. Lazy Operations
      1. Action Triggers Computation
    7. Caching
      1. RDD Caching Methods
      2. RDD Caching Is Fault Tolerant
      3. Cache Memory Management
    8. Spark Jobs
    9. Shared Variables
      1. Broadcast Variables
      2. Accumulators
    10. Summary
  14. Chapter 4: Interactive Data Analysis with Spark Shell
    1. Getting Started
      1. Download
      2. Extract
      3. Run
    2. REPL Commands
    3. Using the Spark Shell as a Scala Shell
    4. Number Analysis
    5. Log Analysis
    6. Summary
  15. Chapter 5: Writing a Spark Application
    1. Hello World in Spark
    2. Compiling and Running the Application
      1. sbt (Simple Build Tool)
      2. Compiling the Code
      3. Running the Application
    3. Monitoring the Application
    4. Debugging the Application
    5. Summary
  16. Chapter 6: Spark Streaming
    1. Introducing Spark Streaming
      1. Spark Streaming Is a Spark Add-on
      2. High-Level Architecture
      3. Data Stream Sources
      4. Receiver
      5. Destinations
    2. Application Programming Interface (API)
      1. StreamingContext
      2. Basic Structure of a Spark Streaming Application
      3. Discretized Stream (DStream)
      4. Creating a DStream
      5. Processing a Data Stream
      6. Output Operations
      7. Window Operation
    3. A Complete Spark Streaming Application
    4. Summary
  17. Chapter 7: Spark SQL
    1. Introducing Spark SQL
      1. Integration with Other Spark Libraries
      2. Usability
      3. Data Sources
      4. Data Processing Interface
      5. Hive Interoperability
    2. Performance
      1. Reduced Disk I/O
      2. Partitioning
      3. Columnar Storage
      4. In-Memory Columnar Caching
      5. Skip Rows
      6. Predicate Pushdown
      7. Query Optimization
    3. Applications
      1. ETL (Extract Transform Load)
      2. Data Virtualization
      3. Distributed JDBC/ODBC SQL Query Engine
      4. Data Warehousing
    4. Application Programming Interface (API)
      1. Key Abstractions
      2. Creating DataFrames
      3. Processing Data Programmatically with SQL/HiveQL
      4. Processing Data with the DataFrame API
      5. Saving a DataFrame
    5. Built-in Functions
      1. Aggregate
      2. Collection
      3. Date/Time
      4. Math
      5. String
      6. Window
    6. UDFs and UDAFs
    7. Interactive Analysis Example
    8. Interactive Analysis with Spark SQL JDBC Server
    9. Summary
  18. Chapter 8: Machine Learning with Spark
    1. Introducing Machine Learning
      1. Features
      2. Labels
      3. Models
      4. Training Data
      5. Test Data
      6. Machine Learning Applications
      7. Machine Learning Algorithms
      8. Hyperparameter
      9. Model Evaluation
      10. Machine Learning High-level Steps
    2. Spark Machine Learning Libraries
    3. MLlib Overview
      1. Integration with Other Spark Libraries
      2. Statistical Utilities
      3. Machine Learning Algorithms
    4. The MLlib API
      1. Data Types
      2. Algorithms and Models
      3. Model Evaluation
    5. An Example MLlib Application
      1. Dataset
      2. Goal
      3. Code
    6. Spark ML
      1. ML Dataset
      2. Transformer
      3. Estimator
      4. Pipeline
      5. PipelineModel
      6. Evaluator
      7. Grid Search
      8. CrossValidator
    7. An Example Spark ML Application
      1. Dataset
      2. Goal
      3. Code
    8. Summary
  19. Chapter 9: Graph Processing with Spark
    1. Introducing Graphs
      1. Undirected Graphs
      2. Directed Graphs
      3. Directed Multigraphs
      4. Property Graphs
    2. Introducing GraphX
    3. GraphX API
      1. Data Abstractions
      2. Creating a Graph
      3. Graph Properties
      4. Graph Operators
    4. Summary
  20. Chapter 10: Cluster Managers
    1. Standalone Cluster Manager
      1. Architecture
      2. Setting Up a Standalone Cluster
      3. Running a Spark Application on a Standalone Cluster
    2. Apache Mesos
      1. Architecture
      2. Setting Up a Mesos Cluster
      3. Running a Spark Application on a Mesos Cluster
    3. YARN
      1. Architecture
      2. Running a Spark Application on a YARN Cluster
    4. Summary
  21. Chapter 11: Monitoring
    1. Monitoring a Standalone Cluster
      1. Monitoring a Spark Master
      2. Monitoring a Spark Worker
    2. Monitoring a Spark Application
      1. Monitoring Jobs Launched by an Application
      2. Monitoring Stages in a Job
      3. Monitoring Tasks in a Stage
      4. Monitoring RDD Storage
      5. Monitoring Environment
      6. Monitoring Executors
      7. Monitoring a Spark Streaming Application
      8. Monitoring Spark SQL Queries
      9. Monitoring Spark SQL JDBC/ODBC Server
    3. Summary
  22. Bibliography
  23. Index