Advanced Analytics with Spark

Book Description

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Table of Contents

  1. Foreword
  2. Preface
    1. What’s in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgments
  3. 1. Analyzing Big Data
    1. The Challenges of Data Science
    2. Introducing Apache Spark
    3. About This Book
  4. 2. Introduction to Data Analysis with Scala and Spark
    1. Scala for Data Scientists
    2. The Spark Programming Model
    3. Record Linkage
    4. Getting Started: The Spark Shell and SparkContext
    5. Bringing Data from the Cluster to the Client
    6. Shipping Code from the Client to the Cluster
    7. Structuring Data with Tuples and Case Classes
    8. Aggregations
    9. Creating Histograms
    10. Summary Statistics for Continuous Variables
    11. Creating Reusable Code for Computing Summary Statistics
    12. Simple Variable Selection and Scoring
    13. Where to Go from Here
  5. 3. Recommending Music and the Audioscrobbler Data Set
    1. Data Set
    2. The Alternating Least Squares Recommender Algorithm
    3. Preparing the Data
    4. Building a First Model
    5. Spot Checking Recommendations
    6. Evaluating Recommendation Quality
    7. Computing AUC
    8. Hyperparameter Selection
    9. Making Recommendations
    10. Where to Go from Here
  6. 4. Predicting Forest Cover with Decision Trees
    1. Fast Forward to Regression
    2. Vectors and Features
    3. Training Examples
    4. Decision Trees and Forests
    5. Covtype Data Set
    6. Preparing the Data
    7. A First Decision Tree
    8. Decision Tree Hyperparameters
    9. Tuning Decision Trees
    10. Categorical Features Revisited
    11. Random Decision Forests
    12. Making Predictions
    13. Where to Go from Here
  7. 5. Anomaly Detection in Network Traffic with K-means Clustering
    1. Anomaly Detection
    2. K-means Clustering
    3. Network Intrusion
    4. KDD Cup 1999 Data Set
    5. A First Take on Clustering
    6. Choosing k
    7. Visualization in R
    8. Feature Normalization
    9. Categorical Variables
    10. Using Labels with Entropy
    11. Clustering in Action
    12. Where to Go from Here
  8. 6. Understanding Wikipedia with Latent Semantic Analysis
    1. The Term-Document Matrix
    2. Getting the Data
    3. Parsing and Preparing the Data
    4. Lemmatization
    5. Computing the TF-IDFs
    6. Singular Value Decomposition
    7. Finding Important Concepts
    8. Querying and Scoring with the Low-Dimensional Representation
    9. Term-Term Relevance
    10. Document-Document Relevance
    11. Term-Document Relevance
    12. Multiple-Term Queries
    13. Where to Go from Here
  9. 7. Analyzing Co-occurrence Networks with GraphX
    1. The MEDLINE Citation Index: A Network Analysis
    2. Getting the Data
    3. Parsing XML Documents with Scala’s XML Library
    4. Analyzing the MeSH Major Topics and Their Co-occurrences
    5. Constructing a Co-occurrence Network with GraphX
    6. Understanding the Structure of Networks
      1. Connected Components
      2. Degree Distribution
    7. Filtering Out Noisy Edges
      1. Processing EdgeTriplets
      2. Analyzing the Filtered Graph
    8. Small-World Networks
      1. Cliques and Clustering Coefficients
      2. Computing Average Path Length with Pregel
    9. Where to Go from Here
  10. 8. Geospatial and Temporal Data Analysis on the New York City Taxi Trip Data
    1. Getting the Data
    2. Working with Temporal and Geospatial Data in Spark
    3. Temporal Data with JodaTime and NScalaTime
    4. Geospatial Data with the Esri Geometry API and Spray
      1. Exploring the Esri Geometry API
      2. Intro to GeoJSON
    5. Preparing the New York City Taxi Trip Data
      1. Handling Invalid Records at Scale
      2. Geospatial Analysis
    6. Sessionization in Spark
      1. Building Sessions: Secondary Sorts in Spark
    7. Where to Go from Here
  11. 9. Estimating Financial Risk through Monte Carlo Simulation
    1. Terminology
    2. Methods for Calculating VaR
      1. Variance-Covariance
      2. Historical Simulation
      3. Monte Carlo Simulation
    3. Our Model
    4. Getting the Data
    5. Preprocessing
    6. Determining the Factor Weights
    7. Sampling
      1. The Multivariate Normal Distribution
    8. Running the Trials
    9. Visualizing the Distribution of Returns
    10. Evaluating Our Results
    11. Where to Go from Here
  12. 10. Analyzing Genomics Data and the BDG Project
    1. Decoupling Storage from Modeling
    2. Ingesting Genomics Data with the ADAM CLI
      1. Parquet Format and Columnar Storage
    3. Predicting Transcription Factor Binding Sites from ENCODE Data
    4. Querying Genotypes from the 1000 Genomes Project
    5. Where to Go from Here
  13. 11. Analyzing Neuroimaging Data with PySpark and Thunder
    1. Overview of PySpark
      1. PySpark Internals
    2. Overview and Installation of the Thunder Library
    3. Loading Data with Thunder
      1. Thunder Core Data Types
    4. Categorizing Neuron Types with Thunder
    5. Where to Go from Here
  14. A. Deeper into Spark
    1. Serialization
    2. Accumulators
    3. Spark and the Data Scientist’s Workflow
    4. File Formats
    5. Spark Subprojects
      1. MLlib
      2. Spark Streaming
      3. Spark SQL
      4. GraphX
  15. B. Upcoming MLlib Pipelines API
    1. Beyond Mere Modeling
    2. The Pipelines API
    3. Text Classification Example Walkthrough
  16. Index