Book description
Table of contents
- Title Page
- Copyright and Credits
- Packt Upsell
- Contributors
- Preface
- Predict the Class of a Flower from the Iris Dataset
- A multivariate classification problem
- Project overview – problem formulation
- Getting started with Spark
- Implementing the Iris pipeline
- Iris pipeline implementation objectives
- Step 1 – getting the Iris dataset from the UCI Machine Learning Repository
- Step 2 – preliminary EDA
- Step 3 – creating an SBT project
- Step 4 – creating Scala files in SBT project
- Step 5 – preprocessing, data transformation, and DataFrame creation
- Step 6 – creating, training, and testing data
- Step 7 – creating a Random Forest classifier
- Step 8 – training the Random Forest classifier
- Step 9 – applying the Random Forest classifier to test data
- Step 10 – evaluating the Random Forest classifier
- Step 11 – running the pipeline as an SBT application
- Step 12 – packaging the application
- Step 13 – submitting the pipeline application to Spark local
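The preprocessing in Step 5 can be sketched in plain Scala. This is an illustrative, standalone version: the names (`IrisRow`, `parseLine`) are mine, and the chapter itself builds these steps on Spark DataFrames rather than local collections. It assumes the standard UCI `iris.data` layout of four comma-separated numeric features followed by the species name.

```scala
// Minimal sketch of Step 5 (preprocessing): parse raw UCI iris.data lines
// into feature arrays plus a string label, skipping blank/malformed lines.
object IrisPreprocess {
  case class IrisRow(features: Array[Double], label: String)

  def parseLine(line: String): Option[IrisRow] = {
    val parts = line.trim.split(",")
    if (parts.length != 5) None // skip blank or malformed lines
    else Some(IrisRow(parts.take(4).map(_.toDouble), parts(4)))
  }

  def parse(lines: Seq[String]): Seq[IrisRow] = lines.flatMap(parseLine)
}
```

In the Spark version, the same per-line logic would feed a DataFrame whose feature columns are then assembled into a vector column for the Random Forest classifier.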
- Summary
- Questions
- Build a Breast Cancer Prognosis Pipeline with the Power of Spark and Scala
- Breast cancer classification problem
- Getting started
- Setting up prerequisite software
- Implementation objectives
- Implementation objective 1 – getting the breast cancer dataset
- Implementation objective 2 – deriving a DataFrame for EDA
- Step 1 – conducting preliminary EDA 
- Step 2 – loading data and converting it to an RDD[String]
- Step 3 – splitting the resilient distributed dataset and reorganizing individual rows into an array
- Step 4 – purging the dataset of rows containing question mark characters
- Step 5 – running a count after purging the dataset of rows with questionable characters
- Step 6 – getting rid of the header
- Step 7 – creating a two-column DataFrame
- Step 8 – creating the final DataFrame
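Steps 3 through 6 of this EDA section can be sketched in plain Scala. The helper names below are mine, and the chapter performs the same operations on a Spark RDD rather than a local collection; the sketch assumes the Wisconsin dataset's comma-separated layout with `?` as its missing-value marker.

```scala
// Sketch of Steps 3-6: split each raw row into an array of fields,
// purge rows carrying the '?' missing-value marker, and drop the header.
object WisconsinClean {
  // Step 3: reorganize each raw line into an array of fields
  def toFields(lines: Seq[String]): Seq[Array[String]] =
    lines.map(_.split(","))

  // Step 4: purge rows containing the '?' missing-value marker
  def purge(rows: Seq[Array[String]]): Seq[Array[String]] =
    rows.filterNot(_.contains("?"))

  // Step 6: drop the header row (the first line), if the file carries one
  def dropHeader(rows: Seq[Array[String]]): Seq[Array[String]] =
    rows.drop(1)
}
```

Step 5's count is then just the length of the purged collection, confirming how many complete rows survive the cleanup.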
- Random Forest breast cancer pipeline
- Step 1 – creating an RDD and preprocessing the data
- Step 2 – creating training and test data
- Step 3 – training the Random Forest classifier
- Step 4 – applying the classifier to the test data
- Step 5 – evaluating the classifier
- Step 6 – running the pipeline as an SBT application
- Step 7 – packaging the application
- Step 8 – deploying the pipeline app into Spark local
- LR breast cancer pipeline
- Implementation objectives
- Implementation objectives 1 and 2
- Implementation objective 3 – Spark ML workflow for the breast cancer classification task
- Implementation objective 4 – coding steps for building the indexer and logit machine learning model
- Extending our pipeline object with the WisconsinWrapper trait
- Importing the StringIndexer algorithm and using it
- Splitting the DataFrame into training and test datasets
- Creating a LogisticRegression classifier and setting hyperparameters on it
- Running the LR model on the test dataset
- Building a breast cancer pipeline with two stages
- Implementation objective 5 – evaluating the binary classifier's performance
- Summary
- Questions
- Stock Price Predictions
- Stock price binary classification problem
- Getting started
- Support for hardware virtualization
- Installing the supported virtualization application 
- Downloading the HDP Sandbox and importing it
- Turning on the virtual machine and powering up the Sandbox
- Setting up SSH access for data transfer between Sandbox and the host machine
- Updating the default Python required by Zeppelin
- Updating our Zeppelin instance
- Implementation objectives
- List of implementation goals
- Step 1 – creating a Scala representation of the path to the dataset file
- Step 2 – creating an RDD[String]
- Step 3 – splitting the RDD around the newline character in the dataset
- Step 4 – transforming the RDD[String] 
- Step 5 – carrying out preliminary data analysis
- Creating a DataFrame from the original dataset
- Dropping the Date and Label columns from the DataFrame
- Having Spark describe the DataFrame
- Adding a new column to the DataFrame and deriving a Vector out of it
- Removing stop words – a preprocessing step 
- Transforming the merged DataFrame
- Transforming a DataFrame into an array of NGrams
- Adding a new column to the DataFrame, devoid of stop words
- Constructing a vocabulary from our dataset corpus
- Training CountVectorizer
- Using StringIndexer to transform our input label column
- Dropping the input label column
- Adding a new column to our DataFrame 
- Dividing the dataset into training and test sets
- Creating labelIndexer to index the indexedLabel column
- Creating StringIndexer to index a column label
- Creating RandomForestClassifier
- Creating a new data pipeline with three stages
- Creating a new data pipeline with hyperparameters
- Training our new data pipeline
- Generating stock price predictions
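The n-gram step in the list above can be sketched in plain Scala. This mirrors what Spark ML's `NGram` transformer produces from tokenized headline text, but as a standalone function; the object name is mine.

```scala
// Sketch of the NGram step: build word-level n-grams from a token
// sequence by sliding a window of size n and joining each window.
object NGrams {
  def ngrams(tokens: Seq[String], n: Int): Seq[String] =
    tokens
      .sliding(n)
      .filter(_.length == n) // drop a final short window when tokens.length < n
      .map(_.mkString(" "))
      .toList
}
```

In the pipeline, the resulting n-gram strings become the vocabulary units that `CountVectorizer` later counts.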
- Summary
- Questions
- Building a Spam Classification Pipeline
- Spam classification problem
- Project overview – problem formulation
- Getting started
- Spam classification pipeline
- Implementation steps
- Step 1 – setting up your project folder
- Step 2 – upgrading your build.sbt file
- Step 3 – creating a trait called SpamWrapper
- Step 4 – describing the dataset
- Step 5 – creating a new spam classifier class
- Step 6 – listing the data preprocessing steps
- Step 7 – regex to remove punctuation marks and whitespaces
- Step 8 – creating a ham DataFrame with punctuation removed
- Step 9 – creating a spam DataFrame devoid of punctuation
- Step 10 – joining the spam and ham datasets
- Step 11 – tokenizing our features
- Step 12 – removing stop words
- Step 13 – feature extraction
- Step 14 – creating training and test datasets
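Steps 7, 11, and 12 can be sketched in plain Scala. The names below are mine, the stop-word list is a tiny illustrative subset rather than Spark's default English list, and the chapter performs the same transformations with Spark ML's `Tokenizer` and `StopWordsRemover`.

```scala
// Sketch of Steps 7, 11, 12: strip punctuation, collapse whitespace,
// tokenize on spaces, and drop stop words.
object SpamPreprocess {
  // Illustrative subset only; not Spark's full default stop-word list
  private val stopWords = Set("a", "an", "the", "is", "to", "you")

  // Step 7: regex removal of punctuation marks and extra whitespace
  def normalize(text: String): String =
    text.toLowerCase
      .replaceAll("""[\p{Punct}]""", " ")
      .replaceAll("""\s+""", " ")
      .trim

  // Steps 11-12: tokenize, then remove stop words
  def tokenize(text: String): Seq[String] =
    normalize(text).split(" ").toSeq.filterNot(stopWords.contains)
}
```

The surviving tokens are what Step 13's feature extraction turns into numeric vectors.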
- Summary
- Questions
- Further reading
- Build a Fraud Detection System
- Fraud detection problem
- Project overview – problem formulation
- Getting started
- Implementation steps
- Creating the FraudDetection trait
- Broadcasting mean and standard deviation vectors
- Calculating PDFs
- Calculating the best error term and best F1 score
- Generating predictions – outliers that represent fraud
- Generating the best error term and best F1 measure
- Preparing to compute precision and recall
- Function to calculate false positives
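The math behind these steps can be sketched in plain Scala: a Gaussian PDF, an F1 score built from true/false positives and false negatives, and a sweep over candidate error terms (epsilon values) to pick the one with the best F1. Function and variable names are illustrative, not the chapter's, and the chapter computes these quantities over Spark data structures.

```scala
// Sketch of the anomaly-detection math: PDF under a univariate Gaussian,
// F1 from a threshold on the PDF, and a sweep for the best error term.
object FraudMath {
  def gaussianPdf(x: Double, mu: Double, sigma2: Double): Double =
    math.exp(-math.pow(x - mu, 2) / (2 * sigma2)) / math.sqrt(2 * math.Pi * sigma2)

  // Predict fraud when the PDF falls below epsilon; score with F1
  def f1(pdfs: Seq[Double], labels: Seq[Boolean], epsilon: Double): Double = {
    val preds = pdfs.map(_ < epsilon)
    val tp = preds.zip(labels).count { case (p, y) => p && y }
    val fp = preds.zip(labels).count { case (p, y) => p && !y }  // false positives
    val fn = preds.zip(labels).count { case (p, y) => !p && y }  // false negatives
    if (tp == 0) 0.0
    else {
      val precision = tp.toDouble / (tp + fp)
      val recall    = tp.toDouble / (tp + fn)
      2 * precision * recall / (precision + recall)
    }
  }

  // Sweep candidate epsilons and keep the (epsilon, F1) pair with the best F1
  def bestEpsilon(pdfs: Seq[Double], labels: Seq[Boolean],
                  candidates: Seq[Double]): (Double, Double) =
    candidates.map(e => (e, f1(pdfs, labels, e))).maxBy(_._2)
}
```

Outliers, i.e. points whose PDF falls below the chosen epsilon, are the ones flagged as fraud.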
- Summary
- Questions
- Further reading
- Build Flights Performance Prediction Model
- Building a Recommendation Engine
- Problem overviews
- Detailed overview
- Implementation and deployment
- Implementation
- Step 1 – creating the Scala project
- Step 2 – creating the AirlineWrapper definition
- Step 3 – creating a weapon sales orders schema
- Step 4 – creating a weapon sales leads schema
- Step 5 – building a weapon sales order dataframe
- Step 6 – displaying the weapons sales dataframe
- Step 7 – displaying the customer-weapons-system dataframe
- Step 8 – generating predictions
- Step 9 – displaying predictions
- Compilation and deployment
- Summary
- Other Books You May Enjoy
Product information
- Title: Modern Scala Projects
- Publisher(s): Packt Publishing