Mastering Scala Machine Learning

Book Description

Advance your skills in efficient data analysis and data processing using the powerful tools of Scala, Spark, and Hadoop

About This Book

  • This is a primer on functional-programming-style techniques to help you efficiently process and analyze all of your data

  • Get acquainted with the best and newest tools available such as Scala, Spark, Parquet and MLlib for machine learning

  • Learn the best practices to incorporate new Big Data machine learning in your data-driven enterprise to gain future scalability and maintainability

Who This Book Is For

    Mastering Scala Machine Learning is intended for enthusiasts who want to plunge into the new pool of emerging techniques for machine learning. Some familiarity with standard statistical techniques is required.

What You Will Learn

  • Sharpen your functional programming skills in Scala using REPL

  • Apply standard and advanced machine learning techniques using Scala

  • Get acquainted with Big Data technologies and grasp why we need a functional approach to Big Data

  • Discover new data structures, algorithms, approaches, and habits that will allow you to work effectively with large amounts of data

  • Understand the principles of supervised and unsupervised learning in machine learning

  • Work with unstructured data and serialize it using Kryo, Protobuf, Avro, and AvroParquet

  • Construct reliable and robust data pipelines and manage data in a data-driven enterprise

  • Implement scalable model monitoring and alerts with Scala

In Detail

    Since the advent of object-oriented programming, new technologies related to Big Data have been constantly appearing on the market. One such technology is Scala, which many consider to be the successor to Java in the area of Big Data, much as Java was to C/C++ in the area of distributed programming.

    This book aims to take your knowledge to the next level and help you apply that knowledge to build advanced applications such as social media mining, intelligent news portals, and more. After a quick refresher on functional programming concepts using REPL, you will see some practical examples of setting up the development environment and tinkering with data. We will then explore working with Spark and MLlib using k-means and decision trees.

    Most of the data that we produce today is unstructured and raw, and you will learn to tackle this type of data with advanced topics such as regression, classification, integration, and working with graph algorithms. Finally, you will discover how to use Scala to perform complex concept analysis, to monitor model performance, and to build a model repository. By the end of this book, you will have gained expertise in performing Scala machine learning and will be able to build complex machine learning projects using Scala.

Style and approach

    This hands-on guide dives straight into implementing Scala for machine learning without delving much into mathematical proofs or validations. There are ample code examples and tricks that will help you sail through using the standard techniques and libraries. This book provides practical examples from the field on how to correctly tackle data analysis problems, particularly for modern Big Data datasets.

    Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Mastering Scala Machine Learning
      1. Table of Contents
      2. Mastering Scala Machine Learning
      3. Credits
      4. About the Author
      5. Acknowledgement
      6. www.PacktPub.com
        1. eBooks, discount offers, and more
          1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Exploratory Data Analysis
        1. Getting started with Scala
        2. Distinct values of a categorical field
        3. Summarization of a numeric field
          1. Grepping across multiple fields
        4. Basic, stratified, and consistent sampling
        5. Working with Scala and Spark Notebooks
        6. Basic correlations
        7. Summary
      9. 2. Data Pipelines and Modeling
        1. Influence diagrams
        2. Sequential trials and dealing with risk
        3. Exploration and exploitation
        4. Unknown unknowns
        5. Basic components of a data-driven system
          1. Data ingest
          2. Data transformation layer
          3. Data analytics and machine learning
          4. UI component
          5. Actions engine
          6. Correlation engine
          7. Monitoring
        6. Optimization and interactivity
          1. Feedback loops
        7. Summary
      10. 3. Working with Spark and MLlib
        1. Setting up Spark
        2. Understanding Spark architecture
          1. Task scheduling
          2. Spark components
          3. MQTT, ZeroMQ, Flume, and Kafka
          4. HDFS, Cassandra, S3, and Tachyon
          5. Mesos, YARN, and Standalone
        3. Applications
          1. Word count
          2. Streaming word count
          3. Spark SQL and DataFrame
        4. ML libraries
          1. SparkR
          2. Graph algorithms – GraphX and GraphFrames
        5. Spark performance tuning
        6. Running Hadoop HDFS
        7. Summary
      11. 4. Supervised and Unsupervised Learning
        1. Records and supervised learning
          1. Iris dataset
          2. Labeled point
          3. SVMWithSGD
          4. Logistic regression
          5. Decision tree
          6. Bagging and boosting – ensemble learning methods
        2. Unsupervised learning
        3. Problem dimensionality
        4. Summary
      12. 5. Regression and Classification
        1. What regression stands for?
        2. Continuous space and metrics
        3. Linear regression
        4. Logistic regression
        5. Regularization
        6. Multivariate regression
        7. Heteroscedasticity
        8. Regression trees
        9. Classification metrics
        10. Multiclass problems
        11. Perceptron
        12. Generalization error and overfitting
        13. Summary
      13. 6. Working with Unstructured Data
        1. Nested data
        2. Other serialization formats
        3. Hive and Impala
        4. Sessionization
        5. Working with traits
        6. Working with pattern matching
        7. Other uses of unstructured data
        8. Probabilistic structures
        9. Projections
        10. Summary
      14. 7. Working with Graph Algorithms
        1. A quick introduction to graphs
        2. SBT
        3. Graph for Scala
          1. Adding nodes and edges
          2. Graph constraints
          3. JSON
        4. GraphX
          1. Who is getting e-mails?
          2. Connected components
          3. Triangle counting
          4. Strongly connected components
          5. PageRank
          6. SVD++
        5. Summary
      15. 8. Integrating Scala with R and Python
        1. Integrating with R
          1. Setting up R and SparkR
            1. Linux
            2. Mac OS
            3. Windows
            4. Running SparkR via scripts
            5. Running Spark via R's command line
          2. DataFrames
          3. Linear models
          4. Generalized linear model
          5. Reading JSON files in SparkR
          6. Writing Parquet files in SparkR
          7. Invoking Scala from R
            1. Using Rserve
        2. Integrating with Python
          1. Setting up Python
          2. PySpark
          3. Calling Python from Java/Scala
            1. Using sys.process._
            2. Spark pipe
            3. Jython and JSR 223
        3. Summary
      16. 9. NLP in Scala
        1. Text analysis pipeline
          1. Simple text analysis
        2. MLlib algorithms in Spark
          1. TF-IDF
          2. LDA
        3. Segmentation, annotation, and chunking
        4. POS tagging
        5. Using word2vec to find word relationships
          1. A Porter Stemmer implementation of the code
        6. Summary
      17. 10. Advanced Model Monitoring
        1. System monitoring
          1. Process monitoring
          2. Model monitoring
            1. Performance over time
            2. Criteria for model retiring
            3. A/B testing
        2. Summary
      18. Index