Advanced Analytics with Spark, 2nd Edition

Book Description

With Early Release ebooks, you get books in their earliest form (the authors' raw and unedited content as they write), so you can take advantage of these technologies long before the official release of these titles. You'll also receive updates when significant changes are made, new chapters become available, and the final ebook bundle is released.

In the second edition of this practical book, five Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world datasets together to teach you how to approach analytics problems by example. Updated for Spark 2.1, this edition acts as an introduction to these techniques and other best practices in Spark programming.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance. New chapters cover PySpark and MLlib, and Embarrassingly Parallel Python.

If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find the book’s patterns useful for working on your own data applications.

With this book, you will:

  • Familiarize yourself with the Spark programming model
  • Become comfortable within the Spark ecosystem
  • Learn general approaches in data science
  • Examine complete implementations that analyze large public datasets
  • Discover which machine learning tools make sense for particular problems
  • Acquire code that can be adapted to many uses

Table of Contents

  1. Foreword
  2. Preface
    1. What’s in This Book
    2. The Second Edition
    3. Using Code Examples
    4. O’Reilly Safari
    5. How to Contact Us
    6. Acknowledgments
  3. Chapter 1: Analyzing Big Data
    1. The Challenges of Data Science
    2. Introducing Apache Spark
    3. About This Book
    4. The Second Edition
  4. Chapter 2: Introduction to Data Analysis with Scala and Spark
    1. Scala for Data Scientists
    2. The Spark Programming Model
    3. Record Linkage
    4. Getting Started: The Spark Shell and SparkContext
    5. Bringing Data from the Cluster to the Client
    6. Shipping Code from the Client to the Cluster
    7. From RDDs to DataFrames
    8. Analyzing Data with the DataFrame API
    9. Fast Summary Statistics for DataFrames
    10. Pivoting and Reshaping DataFrames
    11. Joining DataFrames and Selecting Features
    12. Preparing Models for Production Environments
    13. Model Evaluation
    14. Where to Go from Here
  5. Chapter 3: Recommending Music and the Audioscrobbler Data Set
    1. Data Set
    2. The Alternating Least Squares Recommender Algorithm
    3. Preparing the Data
    4. Building a First Model
    5. Spot Checking Recommendations
    6. Evaluating Recommendation Quality
    7. Computing AUC
    8. Hyperparameter Selection
    9. Making Recommendations
    10. Where to Go from Here
  6. Chapter 4: Predicting Forest Cover with Decision Trees
    1. Fast Forward to Regression
    2. Vectors and Features
    3. Training Examples
    4. Decision Trees and Forests
    5. Covtype Data Set
    6. Preparing the Data
    7. A First Decision Tree
    8. Decision Tree Hyperparameters
    9. Tuning Decision Trees
    10. Categorical Features Revisited
    11. Random Decision Forests
    12. Making Predictions
    13. Where to Go from Here
  7. Chapter 5: Anomaly Detection in Network Traffic with K-means Clustering
    1. Anomaly Detection
    2. K-means Clustering
    3. Network Intrusion
    4. KDD Cup 1999 Data Set
    5. A First Take on Clustering
    6. Choosing k
    7. Visualization with SparkR
    8. Feature Normalization
    9. Categorical Variables
    10. Using Labels with Entropy
    11. Clustering in Action
    12. Where to Go from Here
  8. Chapter 6: Understanding Wikipedia with Latent Semantic Analysis
    1. The Document-Term Matrix
    2. Getting the Data
    3. Parsing and Preparing the Data
    4. Lemmatization
    5. Computing the TF-IDFs
    6. Singular Value Decomposition
    7. Finding Important Concepts
    8. Querying and Scoring with the Low-Dimensional Representation
    9. Term-Term Relevance
    10. Document-Document Relevance
    11. Document-Term Relevance
    12. Multiple-Term Queries
    13. Where to Go from Here
  9. Chapter 7: Analyzing Co-occurrence Networks with GraphX
    1. The MEDLINE Citation Index: A Network Analysis
    2. Getting the Data
    3. Parsing XML Documents with Scala’s XML Library
    4. Analyzing the MeSH Major Topics and Their Co-occurrences
    5. Constructing a Co-occurrence Network with GraphX
    6. Understanding the Structure of Networks
      1. Connected Components
      2. Degree Distribution
    7. Filtering Out Noisy Edges
      1. Processing EdgeTriplets
      2. Analyzing the Filtered Graph
    8. Small-World Networks
      1. Cliques and Clustering Coefficients
      2. Computing Average Path Length with Pregel
    9. Where to Go from Here
  10. Chapter 8: Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
    1. Getting the Data
    2. Working with Third-Party Libraries in Spark
    3. Geospatial Data with the Esri Geometry API and Spray
      1. Exploring the Esri Geometry API
      2. Intro to GeoJSON
    4. Preparing the New York City Taxi Trip Data
      1. Handling Invalid Records at Scale
      2. Geospatial Analysis
    5. Sessionization in Spark
      1. Building Sessions: Secondary Sorts in Spark
    6. Where to Go from Here
  11. Chapter 9: Estimating Financial Risk through Monte Carlo Simulation
    1. Terminology
    2. Methods for Calculating VaR
      1. Variance-Covariance
      2. Historical Simulation
      3. Monte Carlo Simulation
    3. Our Model
    4. Getting the Data
    5. Preprocessing
    6. Determining the Factor Weights
    7. Sampling
      1. The Multivariate Normal Distribution
    8. Running the Trials
    9. Visualizing the Distribution of Returns
    10. Evaluating Our Results
    11. Where to Go from Here
  12. Chapter 10: Analyzing Genomics Data and the BDG Project
    1. Decoupling Storage from Modeling
    2. Ingesting Genomics Data with the ADAM CLI
      1. Parquet Format and Columnar Storage
    3. Predicting Transcription Factor Binding Sites from ENCODE Data
    4. Querying Genotypes from the 1000 Genomes Project
    5. Where to Go from Here
  13. Chapter 11: Analyzing Neuroimaging Data with PySpark and Thunder
    1. Overview of PySpark
      1. PySpark Internals
    2. Overview and Installation of the Thunder Library
    3. Loading Data with Thunder
      1. Thunder Core Data Types
    4. Categorizing Neuron Types with Thunder
    5. Where to Go from Here
  14. Index