Sams Teach Yourself Apache Spark™ in 24 Hours

Book Description

Apache Spark is a fast, scalable, and flexible open source distributed processing engine for big data systems and is one of the most active open source big data projects to date. In just 24 lessons of one hour or less, Sams Teach Yourself Apache Spark in 24 Hours helps you build practical Big Data solutions that leverage Spark’s amazing speed, scalability, simplicity, and versatility.

This book’s straightforward, step-by-step approach shows you how to deploy, program, optimize, manage, integrate, and extend Spark, now and for years to come. You’ll discover how to create powerful solutions encompassing cloud computing, real-time stream processing, machine learning, and more. Every lesson builds on what you’ve already learned, giving you a rock-solid foundation for real-world success.

Whether you are a data analyst, data engineer, data scientist, or data steward, learning Spark will help you to advance your career or embark on a new career in the booming area of Big Data.

Learn how to
• Discover what Apache Spark does and how it fits into the Big Data landscape
• Deploy and run Spark locally or in the cloud
• Interact with Spark from the shell
• Make the most of the Spark Cluster Architecture
• Develop Spark applications with Scala and functional Python
• Program with the Spark API, including transformations and actions
• Apply practical data engineering/analysis approaches designed for Spark
• Use Resilient Distributed Datasets (RDDs) for caching, persistence, and output
• Optimize Spark solution performance
• Use Spark with SQL (via Spark SQL) and with NoSQL (via Cassandra)
• Leverage cutting-edge functional programming techniques
• Extend Spark with streaming, R, and Sparkling Water
• Start building Spark-based machine learning and graph-processing applications
• Explore advanced messaging technologies, including Kafka
• Preview and prepare for Spark’s next generation of innovations

Instructions walk you through common questions, issues, and tasks; Q-and-As, Quizzes, and Exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls. By the time you're finished, you'll be comfortable using Apache Spark to solve a wide spectrum of Big Data problems.

Table of Contents

  1. About This E-Book
  2. Title Page
  3. Copyright Page
  4. Contents at a Glance
  5. Table of Contents
  6. Preface
    1. Why Should I Learn Spark?
    2. How This Book Is Organized
    3. Data Used in the Exercises
    4. Conventions Used in This Book
  7. About the Author
  8. Dedication
  9. Acknowledgments
  10. We Want to Hear from You
  11. Reader Services
  12. Part I: Getting Started with Apache Spark
    1. Hour 1. Introducing Apache Spark
      1. What Is Spark?
        1. Spark and Hadoop
        2. Spark as an Abstraction
        3. Spark Is Fast, Efficient, and Scalable
      2. What Sort of Applications Use Spark?
      3. Programming Interfaces to Spark
      4. Ways to Use Spark
        1. Interactive Use
        2. Non-interactive Use
        3. Input/Output Types
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    2. Hour 2. Understanding Hadoop
      1. Hadoop and a Brief History of Big Data
      2. Hadoop Explained
      3. Introducing HDFS
        1. HDFS Overview
        2. HDFS Architecture
      4. Introducing YARN
        1. What Is YARN?
        2. Running an Application on YARN
        3. Other Resource Managers
      5. Anatomy of a Hadoop Cluster
      6. How Spark Works with Hadoop
        1. HDFS as a Data Source for Spark
        2. YARN as a Resource Scheduler for Spark
      7. Summary
      8. Q&A
      9. Workshop
        1. Quiz
        2. Answers
    3. Hour 3. Installing Spark
      1. Spark Deployment Modes
      2. Preparing to Install Spark
      3. Installing Spark in Standalone Mode
        1. Getting Spark
        2. Installing a Multi-node Spark Standalone Cluster
      4. Exploring the Spark Install
      5. Deploying Spark on Hadoop
        1. Using a Management Console or Interface
        2. Installing Manually
      6. Summary
      7. Q&A
      8. Workshop
        1. Quiz
        2. Answers
      9. Exercises
    4. Hour 4. Understanding the Spark Application Architecture
      1. Anatomy of a Spark Application
      2. Spark Driver
        1. The Spark Context
        2. Application Planning
        3. Application Scheduling
        4. Other Driver Functions
      3. Spark Executors and Workers
      4. Spark Master and Cluster Manager
        1. Spark Master
        2. Cluster Manager
      5. Spark Applications Running on YARN
        1. ResourceManager as the Cluster Manager
        2. ApplicationMaster as the Spark Master
        3. yarn-cluster Mode
        4. yarn-client Mode
        5. Log File Management with Spark on YARN
      6. Local Mode
      7. Summary
      8. Q&A
      9. Workshop
        1. Quiz
        2. Answers
    5. Hour 5. Deploying Spark in the Cloud
      1. Amazon Web Services Primer
        1. Elastic Compute Cloud (EC2)
        2. Simple Storage Service (S3)
        3. Elastic MapReduce (EMR)
        4. AWS Pricing and Getting Started
      2. Spark on EC2
      3. Spark on EMR
      4. Hosted Spark with Databricks
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
  13. Part II: Programming with Apache Spark
    1. Hour 6. Learning the Basics of Spark Programming with RDDs
      1. Introduction to RDDs
      2. Loading Data into RDDs
        1. Creating an RDD from a File or Files
        2. Creating an RDD from a Datasource
        3. Creating an RDD Programmatically
      3. Operations on RDDs
        1. Coarse-Grained versus Fine-Grained Transformations
        2. Transformations, Actions, and Lazy Evaluation
        3. RDD Persistence and Re-use
        4. RDD Lineage
        5. Fault Tolerance with RDDs
      4. Types of RDDs
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    2. Hour 7. Understanding MapReduce Concepts
      1. MapReduce History and Background
        1. The Motivation for MapReduce
        2. The Design Goals for MapReduce
      2. Records and Key Value Pairs
        1. Key Value Pairs and Records
      3. MapReduce Explained
        1. Map Phase
        2. Partitioning Function
        3. Shuffle
        4. Reduce Phase
        5. Fault Tolerance
        6. Combiner Functions
        7. Asymmetry and Speculative Execution
        8. Map-only MapReduce Applications
        9. An Election Analogy for MapReduce
      4. Word Count: The “Hello, World” of MapReduce
        1. Why Count Words?
        2. How It Works
        3. Map and Reduce Functions in Spark
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    3. Hour 8. Getting Started with Scala
      1. Scala History and Background
        1. Scala Beginnings
      2. Scala Basics
        1. Scala’s Compile Time and Run Time Architecture
        2. Variables and Primitives in Scala
        3. Data Structures in Scala
        4. Control Structures in Scala
      3. Object-Oriented Programming in Scala
        1. Classes and Inheritance
        2. Mixin Composition
        3. Singleton Objects
        4. Polymorphism
      4. Functional Programming in Scala
        1. First-class Functions
        2. Anonymous Functions
        3. Higher-order Functions
        4. Closures
        5. Currying
        6. Lazy Evaluation
        7. Immutable Data Structures
      5. Spark Programming in Scala
      6. Summary
      7. Q&A
      8. Workshop
        1. Quiz
        2. Answers
    4. Hour 9. Functional Programming with Python
      1. Python Overview
        1. Python Background
        2. Python Runtime Architecture
      2. Data Structures and Serialization in Python
        1. Lists
        2. Sets
        3. Tuples
        4. Dictionaries
        5. Python Object Serialization
      3. Python Functional Programming Basics
        1. Anonymous Functions and lambda
        2. Higher-order Functions
        3. Tail Calls
        4. Short-circuiting
        5. Parallelization
        6. Closures in Python
      4. Interactive Programming Using IPython
        1. IPython History and Background
        2. Using IPython with Spark
        3. Jupyter, the IPython Notebook
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    5. Hour 10. Working with the Spark API (Transformations and Actions)
      1. RDDs and Data Sampling
        1. RDD Refresher
        2. Data Sampling with Spark
      2. Spark Transformations
        1. Functional Transformations
        2. Grouping, Sorting, and Distinct Functions
        3. Set Operations
      3. Spark Actions
        1. The count Action
        2. The collect, take, top, and first Actions
        3. The reduce and fold Actions
        4. The foreach Action
      4. Key Value Pair Operations
        1. Key Value Pair RDD Dictionary Functions
        2. Functional Key Value Pair RDD Transformations
        3. Grouping, Aggregation, Sorting, and Set Operations
      5. Join Functions
        1. Join Types
        2. Join Transformations
      6. Numerical RDD Operations
        1. min()
        2. max()
        3. mean()
        4. sum()
        5. stdev()
        6. variance()
        7. stats()
      7. Summary
      8. Q&A
      9. Workshop
        1. Quiz
        2. Answers
    6. Hour 11. Using RDDs: Caching, Persistence, and Output
      1. RDD Storage Levels
        1. RDD Lineage Revisited
        2. RDD Storage Levels
      2. Caching, Persistence, and Checkpointing
        1. Caching RDDs
        2. Persisting RDDs
        3. Choosing When to Persist or Cache RDDs
        4. Checkpointing RDDs
      3. Saving RDD Output
        1. External Storage Systems
        2. Storage Formats
      4. Introduction to Alluxio (Tachyon)
        1. Alluxio Background
        2. Alluxio Architecture
        3. Alluxio as a Filesystem
        4. Alluxio for Off Heap RDD Persistence
        5. Other Alluxio Features and Usages
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    7. Hour 12. Advanced Spark Programming
      1. Broadcast Variables
        1. Broadcast Variable Creation and Usage
        2. Advantages of Broadcast Variables
      2. Accumulators
        1. Using Accumulators
        2. Custom Accumulators
        3. Uses for Accumulators
      3. Partitioning and Repartitioning
        1. Partitioning Overview
        2. Controlling Partitions
        3. Repartitioning Functions
        4. Partition-specific API Methods
      4. Processing RDDs with External Programs
        1. pipe()
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
  14. Part III: Extensions to Spark
    1. Hour 13. Using SQL with Spark
      1. Introduction to Spark SQL
        1. Background
        2. Hive Overview
        3. SQL on Hadoop
        4. Spark SQL Architecture
        5. HiveContext and SQLContext
      2. Getting Started with Spark SQL DataFrames
        1. Creating a DataFrame from an Existing RDD
        2. Creating a DataFrame from a Hive Table
        3. Creating a DataFrame from JSON Objects
        4. Creating DataFrames from Files Using the DataFrameReader
        5. Converting DataFrames to RDDs
        6. DataFrame Data Model
        7. DataFrame Schemas
      3. Using Spark SQL DataFrames
        1. DataFrame Metadata Operations
        2. Basic DataFrame Operations
        3. DataFrame Built-in Functions and UDFs
        4. DataFrame Set Operations
        5. Caching, Persisting, and Repartitioning DataFrames
        6. Saving DataFrame Output Using the DataFrameWriter
      4. Accessing Spark SQL
        1. Accessing Spark SQL Using the spark-sql Shell
        2. Running the Thrift JDBC/ODBC Server
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    2. Hour 14. Stream Processing with Spark
      1. Introduction to Spark Streaming
        1. Streaming, Spark Style
        2. Spark Streaming Architecture
        3. The StreamingContext
      2. Using DStreams
        1. DStream Sources
        2. DStream Transformations
        3. DStream Output Operations
      3. State Operations
        1. updateStateByKey()
      4. Sliding Window Operations
        1. window()
        2. reduceByKeyAndWindow()
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    3. Hour 15. Getting Started with Spark and R
      1. Introduction to R
        1. Getting Started with the R Language
      2. Introducing SparkR
        1. The SparkR Shell
        2. Creating Data Frames in SparkR
      3. Using SparkR
        1. Building Predictive Models with SparkR
      4. Using SparkR with RStudio
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    4. Hour 16. Machine Learning with Spark
      1. Introduction to Machine Learning and MLlib
        1. Machine Learning Primer
        2. Machine Learning with Spark
      2. Classification Using Spark MLlib
        1. Decision Trees
        2. Naive Bayes
      3. Collaborative Filtering Using Spark MLlib
      4. Clustering Using Spark MLlib
        1. k-means Clustering
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    5. Hour 17. Introducing Sparkling Water (H2O and Spark)
      1. Introduction to H2O
        1. H2O Deep Learning
        2. H2O Flow
        3. H2O Architecture
        4. Running H2O on Hadoop
      2. Sparkling Water—H2O on Spark
        1. Sparkling Water Architecture
      3. Summary
      4. Q&A
      5. Workshop
        1. Quiz
        2. Answers
    6. Hour 18. Graph Processing with Spark
      1. Introduction to Graphs
      2. Graph Processing in Spark
        1. Google, Pregel, and PageRank
        2. GraphX: Spark’s Graph Processing System
      3. Introduction to GraphFrames
        1. Accessing the GraphFrames Library
        2. Creating a GraphFrame
        3. GraphFrame Operations
        4. Using Graphing Algorithms with GraphFrames
      4. Summary
      5. Q&A
      6. Workshop
        1. Quiz
        2. Answers
    7. Hour 19. Using Spark with NoSQL Systems
      1. Introduction to NoSQL
        1. Bigtable: The Beginnings of the NoSQL Movement
        2. NoSQL System Characteristics
        3. Types of NoSQL Systems
      2. Using Spark with HBase
        1. HBase Data Model and Shell
        2. Data Distribution in HBase
        3. HBase and Spark
      3. Using Spark with Cassandra
        1. Cassandra Data Model
        2. Cassandra Query Language (CQL)
        3. Accessing Cassandra Using Spark
      4. Using Spark with DynamoDB and More
        1. Amazon DynamoDB
        2. Other NoSQL Implementations
        3. The Future for NoSQL
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    8. Hour 20. Using Spark with Messaging Systems
      1. Overview of Messaging Systems
        1. Pub-Sub Messaging Exchange Pattern
      2. Using Spark with Apache Kafka
        1. Kafka Overview
        2. Spark and Kafka
      3. Spark, MQTT, and the Internet of Things
        1. MQTT Overview
        2. Using Spark with MQTT
      4. Using Spark with Amazon Kinesis
        1. Kinesis Streams
        2. Using Spark with Kinesis
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
  15. Part IV: Managing Spark
    1. Hour 21. Administering Spark
      1. Spark Configuration
        1. Spark Environment Variables
        2. Spark Configuration
      2. Administering Spark Standalone
        1. Spark Standalone Revisited
        2. Deploying Spark Standalone Clusters
        3. Scheduling with Spark Standalone
      3. Administering Spark on YARN
        1. Spark on YARN Revisited
        2. Deploying Spark on YARN
        3. Managing Spark Applications Running on YARN
        4. YARN Scheduling
      4. Summary
      5. Q&A
      6. Workshop
        1. Quiz
        2. Answers
    2. Hour 22. Monitoring Spark
      1. Exploring the Spark Application UI
        1. Jobs
        2. Stages
        3. Storage
        4. Environment
        5. Executors
        6. Viewing the Status of All Running Applications
      2. Spark History Server
        1. Deploying the Spark History Server
        2. Exploring the Spark History Server UI
        3. Spark History Server API Access
      3. Spark Metrics
      4. Logging in Spark
        1. Log4j
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
    3. Hour 23. Extending and Securing Spark
      1. Isolating Spark
        1. Perimeter Security
        2. Gateway Services
        3. Authentication and Authorization
      2. Securing Spark Communication
        1. Spark Authentication Using a Shared Secret
        2. Encrypting Spark Communication
        3. Securing the Spark Web UI
      3. Securing Spark with Kerberos
        1. Kerberos Overview
        2. Kerberos with Hadoop
        3. Kerberos Configuration with Spark
      4. Summary
      5. Q&A
      6. Workshop
        1. Quiz
        2. Answers
    4. Hour 24. Improving Spark Performance
      1. Benchmarking Spark
        1. Benchmarks
        2. Canary Queries
        3. Performance Monitoring Solutions
      2. Application Development Best Practices
        1. Application Development Optimizations
        2. System, Configuration, or Job Submission Optimizations
      3. Optimizing Partitions
        1. Inefficient Partitioning
      4. Diagnosing Application Performance Issues
        1. Using the Application UI to Diagnose Performance Issues
        2. Using the Spark History UI to Diagnose Performance Issues
      5. Summary
      6. Q&A
      7. Workshop
        1. Quiz
        2. Answers
  16. Index
  17. Code Snippets