Scala Data Analysis Cookbook

Book Description

Navigate the world of data analysis, visualization, and machine learning with over 100 hands-on Scala recipes

About This Book

  • Implement Scala in your data analysis using features from Spark, Breeze, and Zeppelin

  • Scale up your data analytics infrastructure with practical recipes for Scala machine learning

  • Recipes for every stage of the data analysis process, from reading and collecting data to distributed analytics

Who This Book Is For

This book shows data scientists and analysts how to leverage their existing knowledge of Scala for high-quality, scalable data analysis.

What You Will Learn

  • Get familiar with and set up the Breeze and Spark libraries, and use their data structures

  • Import data from a host of possible sources and create dataframes from CSV

  • Clean, validate and transform data using Scala to pre-process numerical and string data

  • Integrate quintessential machine learning algorithms using the Scala stack

  • Bundle and scale up Spark jobs by deploying them into a variety of cluster managers

  • Run streaming and graph analytics in Spark to visualize data, enabling exploratory analysis

In Detail

This book will introduce you to the most popular Scala tools, libraries, and frameworks through practical recipes around loading, manipulating, and preparing your data. It will also help you explore and make sense of your data using stunning and insightful visualizations and machine learning toolkits.

Starting with introductory recipes on utilizing the Breeze and Spark libraries, get to grips with how to import data from a host of possible sources and how to pre-process numerical, string, and date data. Next, you’ll get an understanding of concepts that will help you visualize data using the Apache Zeppelin and Bokeh bindings in Scala, enabling exploratory data analysis. Discover how to program quintessential machine learning algorithms using the Spark ML library. Work through steps to scale your machine learning models and deploy them into a standalone cluster, EC2, YARN, and Mesos. Finally, dip into the powerful options presented by Spark Streaming and machine learning for streaming data, as well as utilizing Spark GraphX.

Style and approach

This book contains a rich set of recipes that covers the full spectrum of interesting data analysis tasks and will help you revolutionize your data analysis skills using Scala and Spark.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Scala Data Analysis Cookbook
      1. Table of Contents
      2. Scala Data Analysis Cookbook
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why Subscribe?
          2. Free Access for Packt account holders
      7. Preface
        1. Apache Flink
        2. Scalding
        3. Saddle
        4. Spire
        5. Akka
        6. Accord
        7. What this book covers
        8. What you need for this book
        9. Who this book is for
        10. Sections
          1. Getting ready
          2. How to do it…
          3. How it works…
          4. There's more…
          5. See also
        11. Conventions
        12. Reader feedback
        13. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. Questions
      8. 1. Getting Started with Breeze
        1. Introduction
        2. Getting Breeze – the linear algebra library
          1. How to do it...
          2. There's more...
            1. The org.scalanlp.breeze dependency
            2. The org.scalanlp.breeze-natives package
        3. Working with vectors
          1. Getting ready
          2. How to do it...
            1. Creating vectors
            2. Constructing a vector from values
              1. Creating a zero vector
            3. Creating a vector out of a function
            4. Creating a vector of linearly spaced values
            5. Creating a vector with values in a specific range
            6. Creating an entire vector with a single value
            7. Slicing a sub-vector from a bigger vector
            8. Creating a Breeze Vector from a Scala Vector
            9. Vector arithmetic
            10. Scalar operations
            11. Calculating the dot product of two vectors
            12. Creating a new vector by adding two vectors together
            13. Appending vectors and converting a vector of one type to another
            14. Concatenating two vectors
              1. Converting a vector of Int to a vector of Double
              2. Computing basic statistics
              3. Mean and variance
            15. Standard deviation
            16. Find the largest value in a vector
            17. Finding the sum, square root and log of all the values in the vector
              1. The Sqrt function
              2. The Log function
        4. Working with matrices
          1. How to do it...
            1. Creating matrices
              1. Creating a matrix from values
              2. Creating a zero matrix
              3. Creating a matrix out of a function
              4. Creating an identity matrix
              5. Creating a matrix from random numbers
              6. Creating from a Scala collection
            2. Matrix arithmetic
              1. Addition
              2. Multiplication
            3. Appending and conversion
              1. Concatenating matrices – vertically
              2. Concatenating matrices – horizontally
              3. Converting a matrix of Int to a matrix of Double
            4. Data manipulation operations
              1. Getting column vectors out of the matrix
              2. Getting row vectors out of the matrix
              3. Getting values inside the matrix
              4. Getting the inverse and transpose of a matrix
            5. Computing basic statistics
              1. Mean and variance
              2. Standard deviation
              3. Finding the largest value in a matrix
              4. Finding the sum, square root and log of all the values in the matrix
              5. Sqrt
              6. Log
              7. Calculating the eigenvectors and eigenvalues of a matrix
          2. How it works...
        5. Vectors and matrices with randomly distributed values
          1. How it works...
            1. Creating vectors with uniformly distributed random values
            2. Creating vectors with normally distributed random values
            3. Creating vectors with random values that have a Poisson distribution
            4. Creating a matrix with uniformly random values
            5. Creating a matrix with normally distributed random values
            6. Creating a matrix with random values that has a Poisson distribution
        6. Reading and writing CSV files
          1. How it works...
      9. 2. Getting Started with Apache Spark DataFrames
        1. Introduction
        2. Getting Apache Spark
          1. How to do it...
        3. Creating a DataFrame from CSV
          1. How to do it...
          2. How it works...
          3. There's more…
        4. Manipulating DataFrames
          1. How to do it...
            1. Printing the schema of the DataFrame
            2. Sampling the data in the DataFrame
            3. Selecting DataFrame columns
            4. Filtering data by condition
            5. Sorting data in the frame
            6. Renaming columns
            7. Treating the DataFrame as a relational table
            8. Joining two DataFrames
              1. Inner join
              2. Right outer join
              3. Left outer join
            9. Saving the DataFrame as a file
        5. Creating a DataFrame from Scala case classes
          1. How to do it...
          2. How it works...
      10. 3. Loading and Preparing Data – DataFrame
        1. Introduction
        2. Loading more than 22 features into classes
          1. How to do it...
          2. How it works...
          3. There's more…
        3. Loading JSON into DataFrames
          1. How to do it…
            1. Reading a JSON file using SQLContext.jsonFile
            2. Reading a text file and converting it to JSON RDD
            3. Explicitly specifying your schema
          2. There's more…
        4. Storing data as Parquet files
          1. How to do it…
            1. Load a simple CSV file, convert it to case classes, and create a DataFrame from it
            2. Save it as a Parquet file
            3. Install Parquet tools
            4. Using the tools to inspect the Parquet file
            5. Enable compression for the Parquet file
        5. Using the Avro data model in Parquet
          1. How to do it…
            1. Creation of the Avro model
            2. Generation of Avro objects using the sbt-avro plugin
            3. Constructing an RDD of our generated object from Students.csv
            4. Saving RDD[StudentAvro] in a Parquet file
            5. Reading the file back for verification
            6. Using Parquet tools for verification
        6. Loading from RDBMS
          1. How to do it…
        7. Preparing data in Dataframes
          1. How to do it...
      11. 4. Data Visualization
        1. Introduction
        2. Visualizing using Zeppelin
          1. How to do it...
            1. Installing Zeppelin
            2. Customizing Zeppelin's server and websocket port
            3. Visualizing data on HDFS – parameterizing inputs
            4. Running custom functions
            5. Adding external dependencies to Zeppelin
            6. Pointing to an external Spark cluster
        3. Creating scatter plots with Bokeh-Scala
          1. How to do it...
            1. Preparing our data
            2. Creating Plot and Document objects
            3. Creating a marker object
            4. Setting the X and Y axes' data range for the plot
            5. Drawing the x and the y axes
            6. Viewing flower species with varying colors
            7. Adding grid lines
            8. Adding a legend to the plot
        4. Creating a time series MultiPlot with Bokeh-Scala
          1. How to do it...
            1. Preparing our data
            2. Creating a plot
            3. Creating a line that joins all the data points
            4. Setting the x and y axes' data range for the plot
            5. Drawing the axes and the grids
            6. Adding tools
            7. Adding a legend to the plot
            8. Multiple plots in the document
      12. 5. Learning from Data
        1. Introduction
        2. Supervised and unsupervised learning
        3. Gradient descent
        4. Predicting continuous values using linear regression
          1. How to do it...
            1. Importing the data
            2. Converting each instance into a LabeledPoint
            3. Preparing the training and test data
            4. Scaling the features
            5. Training the model
            6. Predicting against test data
            7. Evaluating the model
            8. Regularizing the parameters
            9. Mini batching
        5. Binary classification using LogisticRegression and SVM
          1. How to do it...
            1. Importing the data
            2. Tokenizing the data and converting it into LabeledPoints
            3. Factoring the inverse document frequency
            4. Prepare the training and test data
            5. Constructing the algorithm
            6. Training the model and predicting the test data
            7. Evaluating the model
        6. Binary classification using LogisticRegression with Pipeline API
          1. How to do it...
            1. Importing and splitting data as test and training sets
            2. Construct the participants of the Pipeline
            3. Preparing a pipeline and training a model
            4. Predicting against test data
            5. Evaluating a model without cross-validation
            6. Constructing parameters for cross-validation
            7. Constructing cross-validator and fit the best model
            8. Evaluating the model with cross-validation
        7. Clustering using K-means
          1. How to do it...
            1. KMeans.RANDOM
            2. KMeans.PARALLEL
              1. K-means++
              2. K-means||
            3. Max iterations
            4. Epsilon
            5. Importing the data and converting it into a vector
            6. Feature scaling the data
            7. Deriving the number of clusters
            8. Constructing the model
            9. Evaluating the model
        8. Feature reduction using principal component analysis
          1. How to do it...
            1. Dimensionality reduction of data for supervised learning
            2. Mean-normalizing the training data
            3. Extracting the principal components
            4. Preparing the labeled data
            5. Preparing the test data
            6. Classify and evaluate the metrics
            7. Dimensionality reduction of data for unsupervised learning
            8. Mean-normalizing the training data
            9. Extracting the principal components
            10. Arriving at the number of components
            11. Evaluating the metrics
      13. 6. Scaling Up
        1. Introduction
        2. Building the Uber JAR
          1. How to do it...
            1. Transitive dependency stated explicitly in the SBT dependency
              1. Two different libraries depend on the same external library
        3. Submitting jobs to the Spark cluster (local)
          1. How to do it...
            1. Downloading Spark
            2. Running HDFS on Pseudo-clustered mode
            3. Running the Spark master and slave locally
            4. Pushing data into HDFS
            5. Submitting the Spark application on the cluster
        4. Running the Spark Standalone cluster on EC2
          1. How to do it...
            1. Creating the AccessKey and pem file
            2. Setting the environment variables
            3. Running the launch script
            4. Verifying installation
            5. Making changes to the code
            6. Transferring the data and job files
            7. Loading the dataset into HDFS
            8. Running the job
            9. Destroying the cluster
        5. Running the Spark Job on Mesos (local)
          1. How to do it...
            1. Installing Mesos
            2. Starting the Mesos master and slave
            3. Uploading the Spark binary package and the dataset to HDFS
            4. Running the job
        6. Running the Spark Job on YARN (local)
          1. How to do it...
            1. Installing the Hadoop cluster
            2. Starting HDFS and YARN
            3. Pushing Spark assembly and dataset to HDFS
            4. Running a Spark job in yarn-client mode
            5. Running Spark job in yarn-cluster mode
      14. 7. Going Further
        1. Introduction
        2. Using Spark Streaming to subscribe to a Twitter stream
          1. How to do it...
        3. Using Spark as an ETL tool
          1. How to do it...
        4. Using StreamingLogisticRegression to classify a Twitter stream using Kafka as a training stream
          1. How to do it...
        5. Using GraphX to analyze Twitter data
          1. How to do it...
      15. Index