Scala Machine Learning Projects

Book Description

Build powerful smart applications using deep learning algorithms, and master numerical computing, deep learning, and functional programming.

About This Book

  • Explore machine learning techniques with prominent open source Scala libraries such as Spark ML, H2O, MXNet, Zeppelin, and DeepLearning4j
  • Solve real-world machine learning problems faster and at scale by delving into complex numerical computing with Scala functional programming
  • Cover all key aspects, such as collecting, storing, processing, analyzing, and evaluating data, required to build and deploy machine learning models on computing clusters using the Scala Play framework

Who This Book Is For

If you want to leverage the power of both Scala and Spark to make sense of Big Data, then this book is for you. If you are well versed in machine learning concepts and want to expand your knowledge by delving into practical implementation using the power of Scala, then this book is what you need! A strong understanding of the Scala programming language is recommended. Basic familiarity with machine learning techniques will also be helpful.

What You Will Learn

  • Apply advanced regression techniques to boost the performance of predictive models
  • Use different classification algorithms for business analytics
  • Generate trading strategies for Bitcoin and stock trading using ensemble techniques
  • Train Deep Neural Networks (DNN) using H2O and Spark ML
  • Utilize NLP to build scalable machine learning models
  • Learn how to apply reinforcement learning algorithms such as Q-learning to develop ML applications
  • Learn how to use autoencoders to develop a fraud detection application
  • Implement LSTM and CNN models using DeepLearning4j and MXNet

In Detail

Machine learning has had a huge impact on academia and industry by turning data into actionable information. Scala has seen a steady rise in adoption over the past few years, especially in the fields of data science and analytics. This book is for data scientists, data engineers, and deep learning enthusiasts who have a background in complex numerical computing and want to learn more about hands-on machine learning application development.

If you're well versed in machine learning concepts and want to expand your knowledge by delving into the practical implementation of these concepts using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will become acquainted with popular machine learning libraries such as Spark ML, H2O, DeepLearning4j, and MXNet.

By the end of the book, you will be able to use numerical computing and functional programming to carry out complex numerical tasks, and to develop, build, and deploy research or commercial projects in a production-ready environment.

Style and approach

Leverage the power of machine learning and deep learning in different domains. The book offers best practices and tips from real-world case studies, and helps you avoid pitfalls and fallacies when making decisions based on predictive analytics with ML models.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  2. Analyzing Insurance Severity Claims
    1. Machine learning and learning workflow
      1. Typical machine learning workflow
    2. Hyperparameter tuning and cross-validation
    3. Analyzing and predicting insurance severity claims
      1. Motivation
      2. Description of the dataset
      3. Exploratory analysis of the dataset
      4. Data preprocessing
    4. LR for predicting insurance severity claims
      1. Developing insurance severity claims predictive model using LR
    5. GBT regressor for predicting insurance severity claims
    6. Boosting the performance using random forest regressor
      1. Random Forest for classification and regression
    7. Comparative analysis and model deployment
      1. Spark-based model deployment for large-scale dataset
    8. Summary
  3. Analyzing and Predicting Telecommunication Churn
    1. Why do we perform churn analysis, and how do we do it?
    2. Developing a churn analytics pipeline
      1. Description of the dataset
      2. Exploratory analysis and feature engineering
    3. LR for churn prediction
    4. SVM for churn prediction
    5. DTs for churn prediction
    6. Random Forest for churn prediction
    7. Selecting the best model for deployment
    8. Summary
  4. High Frequency Bitcoin Price Prediction from Historical and Live Data
    1. Bitcoin, cryptocurrency, and online trading
      1. State-of-the-art automated trading of Bitcoin
        1. Training
        2. Prediction
    2. High-level data pipeline of the prototype
    3. Historical and live-price data collection
      1. Historical data collection
      2. Transformation of historical data into a time series
        1. Assumptions and design choices
        2. Data preprocessing
      3. Real-time data through the Cryptocompare API
    4. Model training for prediction
    5. Scala Play web service
      1. Concurrency through Akka actors
      2. Web service workflow
        1. JobModule
        2. Scheduler
        3. SchedulerActor
        4. PredictionActor and the prediction step
        5. TraderActor
    6. Predicting prices and evaluating the model
    7. Demo prediction using Scala Play framework
      1. Why RESTful architecture?
      2. Project structure
      3. Running the Scala Play web app
    8. Summary
  5. Population-Scale Clustering and Ethnicity Prediction
    1. Population scale clustering and geographic ethnicity
      1. Machine learning for genetic variants
    2. 1000 Genomes Projects dataset description
    3. Algorithms, tools, and techniques
      1. H2O and Sparkling water
      2. ADAM for large-scale genomics data processing
      3. Unsupervised machine learning
        1. Population genomics and clustering
      4. How does K-means work?
      5. DNNs for geographic ethnicity prediction
    4. Configuring programming environment
    5. Data pre-processing and feature engineering
      1. Model training and hyperparameter tuning
        1. Spark-based K-means for population-scale clustering
        2. Determining the number of optimal clusters
        3. Using H2O for ethnicity prediction
      2. Using random forest for ethnicity prediction
    6. Summary
  6. Topic Modeling - A Better Insight into Large-Scale Texts
    1. Topic modeling and text clustering
      1. How does LDA algorithm work?
    2. Topic modeling with Spark MLlib and Stanford NLP
      1. Implementation
        1. Step 1 - Creating a Spark session
        2. Step 2 - Creating vocabulary and tokens count to train the LDA after text pre-processing
        3. Step 3 - Instantiate the LDA model before training
        4. Step 4 - Set the NLP optimizer
        5. Step 5 - Training the LDA model
        6. Step 6 - Prepare the topics of interest
        7. Step 7 - Topic modelling 
        8. Step 8 - Measuring the likelihood of two documents
    3. Other topic models versus the scalability of LDA
    4. Deploying the trained LDA model
    5. Summary
  7. Developing Model-based Movie Recommendation Engines
    1. Recommendation system
      1. Collaborative filtering approaches
        1. Content-based filtering approaches
        2. Hybrid recommender systems
        3. Model-based collaborative filtering
      2. The utility matrix
    2. Spark-based movie recommendation systems
      1. Item-based collaborative filtering for movie similarity
        1. Step 1 - Importing necessary libraries and creating a Spark session
        2. Step 2 - Reading and parsing the dataset
        3. Step 3 - Computing similarity
        4. Step 4 - Testing the model
      2. Model-based recommendation with Spark
        1. Data exploration
        2. Movie recommendation using ALS
          1. Step 1 - Import packages, load, parse, and explore the movie and rating dataset
          2. Step 2 - Register both DataFrames as temp tables to make querying easier
          3. Step 3 - Explore and query for related statistics
          4. Step 4 - Prepare training and test rating data and check the counts
          5. Step 5 - Prepare the data for building the recommendation model using ALS
          6. Step 6 - Build an ALS user product matrix
          7. Step 7 - Making predictions
          8. Step 8 - Evaluating the model
    3. Selecting and deploying the best model 
    4. Summary
  8. Options Trading Using Q-learning and Scala Play Framework
    1. Reinforcement versus supervised and unsupervised learning
      1. Using RL
      2. Notation, policy, and utility in RL
        1. Policy
        2. Utility
    2. A simple Q-learning implementation
      1. Components of the Q-learning algorithm
        1. States and actions in QLearning
        2. The search space
        3. The policy and action-value
        4. QLearning model creation and training
      2. QLearning model validation
      3. Making predictions using the trained model
    3. Developing an options trading web app using Q-learning
      1. Problem description
      2. Implementing an options trading web application
        1. Creating an option property
        2. Creating an option model
        3. Putting it all together
      3. Evaluating the model
      4. Wrapping up the options trading app as a Scala web app
        1. The backend
        2. The frontend
      5. Running and Deployment Instructions
      6. Model deployment
    4. Summary
  9. Clients Subscription Assessment for Bank Telemarketing using Deep Neural Networks
    1. Client subscription assessment through telemarketing
      1. Dataset description
      2. Installing and getting started with Apache Zeppelin
        1. Building from the source
        2. Starting and stopping Apache Zeppelin
        3. Creating notebooks
      3. Exploratory analysis of the dataset
        1. Label distribution
        2. Job distribution
        3. Marital distribution
        4. Education distribution
        5. Default distribution
        6. Housing distribution
        7. Loan distribution
        8. Contact distribution
        9. Month distribution
        10. Day distribution
        11. Previous outcome distribution
        12. Age feature
        13. Duration distribution
        14. Campaign distribution
        15. Pdays distribution
        16. Previous distribution
        17. emp_var_rate distributions
        18. cons_price_idx features
        19. cons_conf_idx distribution
        20. Euribor3m distribution
        21. nr_employed distribution
      4. Statistics of numeric features
      5. Implementing a client subscription assessment model
      6. Hyperparameter tuning and feature selection
        1. Number of hidden layers
        2. Number of neurons per hidden layer
        3. Activation functions
        4. Weight and bias initialization
        5. Regularization
    2. Summary
  10. Fraud Analytics Using Autoencoders and Anomaly Detection
    1. Outlier and anomaly detection
    2. Autoencoders and unsupervised learning
      1. Working principles of an autoencoder
      2. Efficient data representation with autoencoders
    3. Developing a fraud analytics model
      1. Description of the dataset and using linear models
      2. Problem description
      3. Preparing programming environment
        1. Step 1 - Loading required packages and libraries
        2. Step 2 - Creating a Spark session and importing implicits
        3. Step 3 - Loading and parsing input data
        4. Step 4 - Exploratory analysis of the input data
        5. Step 5 - Preparing the H2O DataFrame
        6. Step 6 - Unsupervised pre-training using autoencoder
        7. Step 7 - Dimensionality reduction with hidden layers
        8. Step 8 - Anomaly detection
        9. Step 9 - Pre-trained supervised model
        10. Step 10 - Model evaluation on the highly-imbalanced data
        11. Step 11 - Stopping the Spark session and H2O context
      4. Auxiliary classes and methods
    4. Hyperparameter tuning and feature selection
    5. Summary
  11. Human Activity Recognition using Recurrent Neural Networks
    1. Working with RNNs
      1. Contextual information and the architecture of RNNs
      2. RNN and the long-term dependency problem
      3. LSTM networks
    2. Human activity recognition using the LSTM model
      1. Dataset description
      2. Setting and configuring MXNet for Scala
    3. Implementing an LSTM model for HAR
      1. Step 1 - Importing necessary libraries and packages
      2. Step 2 - Creating MXNet context
      3. Step 3 - Loading and parsing the training and test set
      4. Step 4 - Exploratory analysis of the dataset
      5. Step 5 - Defining internal RNN structure and LSTM hyperparameters
      6. Step 6 - LSTM network construction
      7. Step 7 - Setting up an optimizer
      8. Step 8 - Training the LSTM network
      9. Step 9 - Evaluating the model
    4. Tuning LSTM hyperparameters and GRU
    5. Summary
  12. Image Classification using Convolutional Neural Networks
    1. Image classification and drawbacks of DNNs
    2. CNN architecture
      1. Convolutional operations
      2. Pooling layer and padding operations
        1. Subsampling operations
      3. Convolutional and subsampling operations in DL4j
        1. Configuring DL4j, ND4s, and ND4j
        2. Convolutional and subsampling operations in DL4j
    3. Large-scale image classification using CNN
      1. Problem description
      2. Description of the image dataset
      3. Workflow of the overall project
      4. Implementing CNNs for image classification
        1. Image processing
        2. Extracting image metadata
        3. Image feature extraction
        4. Preparing the ND4j dataset
        5. Training the CNNs and saving the trained models
        6. Evaluating the model
        7. Wrapping up by executing the main() method
    4. Tuning and optimizing CNN hyperparameters
    5. Summary
  13. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think