Machine Learning with Spark

Book Description

Create scalable machine learning applications to power a modern data-driven business using Spark

In Detail

Apache Spark is a framework for distributed computing, designed from the ground up to be optimized for low-latency tasks and in-memory data storage. It is one of the few parallel computing frameworks that combine speed, scalability, in-memory processing, and fault tolerance with ease of programming and a flexible, expressive, and powerful API design.

This book guides you through the basics of Spark's API for loading and processing data and preparing it for use as input to various machine learning models. Detailed examples and real-world use cases let you explore common machine learning models, including recommender systems, classification, regression, clustering, and dimensionality reduction. You will also cover advanced topics such as working with large-scale text data, and methods for online machine learning and model evaluation using Spark Streaming.
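The RDD operations the book introduces (such as map and reduceByKey) follow familiar functional patterns. As a rough sketch of those semantics, here is the classic word-count pipeline written in plain Python over a small made-up dataset, so no Spark installation is needed; in Spark, the same steps would run in parallel across a cluster:

```python
from collections import defaultdict

# A tiny in-memory "dataset" standing in for an RDD of text lines.
lines = ["the quick brown fox", "the lazy dog"]

# flatMap + map in Spark terms: emit a (word, 1) pair for every word.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey in Spark terms: sum the counts for each distinct word.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

The same shape (transform records into key-value pairs, then aggregate by key) underlies many of the data preparation steps covered in the early chapters.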

What You Will Learn

  • Create your first Spark program in Scala, Java, and Python

  • Set up and configure a development environment for Spark on your own computer, as well as on Amazon EC2

  • Access public machine learning datasets and use Spark to load, process, clean, and transform data

  • Use Spark's machine learning library to implement programs utilizing well-known machine learning models including collaborative filtering, classification, regression, clustering, and dimensionality reduction

  • Write Spark functions to evaluate the performance of your machine learning models

  • Deal with large-scale text data, including feature extraction and using text data as input to your machine learning models

  • Explore online learning methods and use Spark Streaming for online learning and model evaluation

Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.
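One of the learning goals above is writing functions to evaluate model performance. As a minimal sketch of the idea, here is Mean Squared Error (one of the metrics the book covers) computed in plain Python over made-up actual and predicted ratings; the book implements the equivalent as Spark functions over distributed data:

```python
# Hypothetical actual vs. predicted values for illustration only.
actual = [3.0, 4.0, 5.0, 2.0]
predicted = [2.5, 4.0, 4.5, 3.0]

# MSE: the mean of the squared differences between actual and predicted.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE is simply the square root of MSE, on the same scale as the targets.
rmse = mse ** 0.5

print(mse)
```

In Spark, the list comprehension becomes a `map` over an RDD of (actual, predicted) pairs and the sum becomes a distributed reduction.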

    Table of Contents

    1. Machine Learning with Spark
      1. Table of Contents
      2. Machine Learning with Spark
      3. Credits
      4. About the Author
      5. Acknowledgments
      6. About the Reviewers
      7. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      8. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      9. 1. Getting Up and Running with Spark
        1. Installing and setting up Spark locally
        2. Spark clusters
        3. The Spark programming model
          1. SparkContext and SparkConf
          2. The Spark shell
          3. Resilient Distributed Datasets
            1. Creating RDDs
            2. Spark operations
            3. Caching RDDs
          4. Broadcast variables and accumulators
        4. The first step to a Spark program in Scala
        5. The first step to a Spark program in Java
        6. The first step to a Spark program in Python
        7. Getting Spark running on Amazon EC2
          1. Launching an EC2 Spark cluster
        8. Summary
      10. 2. Designing a Machine Learning System
        1. Introducing MovieStream
        2. Business use cases for a machine learning system
          1. Personalization
          2. Targeted marketing and customer segmentation
          3. Predictive modeling and analytics
        3. Types of machine learning models
        4. The components of a data-driven machine learning system
          1. Data ingestion and storage
          2. Data cleansing and transformation
          3. Model training and testing loop
          4. Model deployment and integration
          5. Model monitoring and feedback
          6. Batch versus real time
        5. An architecture for a machine learning system
          1. Practical exercise
        6. Summary
      11. 3. Obtaining, Processing, and Preparing Data with Spark
        1. Accessing publicly available datasets
          1. The MovieLens 100k dataset
        2. Exploring and visualizing your data
          1. Exploring the user dataset
          2. Exploring the movie dataset
          3. Exploring the rating dataset
        3. Processing and transforming your data
          1. Filling in bad or missing data
        4. Extracting useful features from your data
          1. Numerical features
          2. Categorical features
          3. Derived features
            1. Transforming timestamps into categorical features
          4. Text features
            1. Simple text feature extraction
          5. Normalizing features
            1. Using MLlib for feature normalization
          6. Using packages for feature extraction
        5. Summary
      12. 4. Building a Recommendation Engine with Spark
        1. Types of recommendation models
          1. Content-based filtering
          2. Collaborative filtering
            1. Matrix factorization
              1. Explicit matrix factorization
              2. Implicit matrix factorization
              3. Alternating least squares
        2. Extracting the right features from your data
          1. Extracting features from the MovieLens 100k dataset
        3. Training the recommendation model
          1. Training a model on the MovieLens 100k dataset
            1. Training a model using implicit feedback data
        4. Using the recommendation model
          1. User recommendations
            1. Generating movie recommendations from the MovieLens 100k dataset
              1. Inspecting the recommendations
          2. Item recommendations
            1. Generating similar movies for the MovieLens 100k dataset
              1. Inspecting the similar items
        5. Evaluating the performance of recommendation models
          1. Mean Squared Error
          2. Mean average precision at K
          3. Using MLlib's built-in evaluation functions
            1. RMSE and MSE
            2. MAP
        6. Summary
      13. 5. Building a Classification Model with Spark
        1. Types of classification models
          1. Linear models
            1. Logistic regression
            2. Linear support vector machines
          2. The naïve Bayes model
          3. Decision trees
        2. Extracting the right features from your data
          1. Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
        3. Training classification models
          1. Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
        4. Using classification models
          1. Generating predictions for the Kaggle/StumbleUpon evergreen classification dataset
        5. Evaluating the performance of classification models
          1. Accuracy and prediction error
          2. Precision and recall
          3. ROC curve and AUC
        6. Improving model performance and tuning parameters
          1. Feature standardization
          2. Additional features
          3. Using the correct form of data
          4. Tuning model parameters
            1. Linear models
              1. Iterations
              2. Step size
              3. Regularization
            2. Decision trees
              1. Tuning tree depth and impurity
            3. The naïve Bayes model
          5. Cross-validation
        7. Summary
      14. 6. Building a Regression Model with Spark
        1. Types of regression models
          1. Least squares regression
          2. Decision trees for regression
        2. Extracting the right features from your data
          1. Extracting features from the bike sharing dataset
            1. Creating feature vectors for the linear model
            2. Creating feature vectors for the decision tree
        3. Training and using regression models
          1. Training a regression model on the bike sharing dataset
        4. Evaluating the performance of regression models
          1. Mean Squared Error and Root Mean Squared Error
          2. Mean Absolute Error
          3. Root Mean Squared Log Error
          4. The R-squared coefficient
          5. Computing performance metrics on the bike sharing dataset
            1. Linear model
            2. Decision tree
        5. Improving model performance and tuning parameters
          1. Transforming the target variable
            1. Impact of training on log-transformed targets
          2. Tuning model parameters
            1. Creating training and testing sets to evaluate parameters
            2. The impact of parameter settings for linear models
              1. Iterations
              2. Step size
              3. L2 regularization
              4. L1 regularization
              5. Intercept
            3. The impact of parameter settings for the decision tree
              1. Tree depth
              2. Maximum bins
        6. Summary
      15. 7. Building a Clustering Model with Spark
        1. Types of clustering models
          1. K-means clustering
            1. Initialization methods
            2. Variants
          2. Mixture models
          3. Hierarchical clustering
        2. Extracting the right features from your data
          1. Extracting features from the MovieLens dataset
            1. Extracting movie genre labels
            2. Training the recommendation model
            3. Normalization
        3. Training a clustering model
          1. Training a clustering model on the MovieLens dataset
        4. Making predictions using a clustering model
          1. Interpreting cluster predictions on the MovieLens dataset
            1. Interpreting the movie clusters
        5. Evaluating the performance of clustering models
          1. Internal evaluation metrics
          2. External evaluation metrics
          3. Computing performance metrics on the MovieLens dataset
        6. Tuning parameters for clustering models
          1. Selecting K through cross-validation
        7. Summary
      16. 8. Dimensionality Reduction with Spark
        1. Types of dimensionality reduction
          1. Principal Components Analysis
          2. Singular Value Decomposition
          3. Relationship with matrix factorization
          4. Clustering as dimensionality reduction
        2. Extracting the right features from your data
          1. Extracting features from the LFW dataset
            1. Exploring the face data
            2. Visualizing the face data
            3. Extracting facial images as vectors
              1. Loading images
              2. Converting to grayscale and resizing the images
              3. Extracting feature vectors
            4. Normalization
        3. Training a dimensionality reduction model
          1. Running PCA on the LFW dataset
            1. Visualizing the Eigenfaces
            2. Interpreting the Eigenfaces
        4. Using a dimensionality reduction model
          1. Projecting data using PCA on the LFW dataset
          2. The relationship between PCA and SVD
        5. Evaluating dimensionality reduction models
          1. Evaluating k for SVD on the LFW dataset
        6. Summary
      17. 9. Advanced Text Processing with Spark
        1. What's so special about text data?
        2. Extracting the right features from your data
          1. Term weighting schemes
          2. Feature hashing
          3. Extracting the TF-IDF features from the 20 Newsgroups dataset
            1. Exploring the 20 Newsgroups data
            2. Applying basic tokenization
            3. Improving our tokenization
            4. Removing stop words
            5. Excluding terms based on frequency
            6. A note about stemming
            7. Training a TF-IDF model
            8. Analyzing the TF-IDF weightings
        3. Using a TF-IDF model
          1. Document similarity with the 20 Newsgroups dataset and TF-IDF features
          2. Training a text classifier on the 20 Newsgroups dataset using TF-IDF
        4. Evaluating the impact of text processing
          1. Comparing raw features with processed TF-IDF features on the 20 Newsgroups dataset
        5. Word2Vec models
          1. Word2Vec on the 20 Newsgroups dataset
        6. Summary
      18. 10. Real-time Machine Learning with Spark Streaming
        1. Online learning
        2. Stream processing
          1. An introduction to Spark Streaming
            1. Input sources
            2. Transformations
              1. Keeping track of state
              2. General transformations
            3. Actions
            4. Window operators
          2. Caching and fault tolerance with Spark Streaming
        3. Creating a Spark Streaming application
          1. The producer application
          2. Creating a basic streaming application
          3. Streaming analytics
          4. Stateful streaming
        4. Online learning with Spark Streaming
          1. Streaming regression
          2. A simple streaming regression program
            1. Creating a streaming data producer
            2. Creating a streaming regression model
          3. Streaming K-means
        5. Online model evaluation
          1. Comparing model performance with Spark Streaming
        6. Summary
      19. Index