Spark for Data Science

Book Description

Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

About This Book

  • Perform data analysis and build predictive models on huge datasets that leverage Apache Spark

  • Learn to integrate data science algorithms and techniques with the fast and scalable computing features of Spark to address big data challenges

  • Work through practical examples on real-world problems with sample code snippets

    Who This Book Is For

    This book is for anyone who wants to leverage Apache Spark for data science and machine learning. Whether you are a technologist who wants to expand your knowledge to perform data science operations in Spark, a data scientist who wants to understand how algorithms are implemented in Spark, or a newcomer with minimal development experience who wants to learn about big data analytics, this book is for you!

    What You Will Learn

  • Consolidate, clean, and transform your data acquired from various data sources

  • Perform statistical analysis of data to find hidden insights

  • Explore graphical techniques to see what your data looks like

  • Use machine learning techniques to build predictive models

  • Build scalable data products and solutions

  • Start programming using the RDD, DataFrame and Dataset APIs

  • Become an expert by improving your data analytical skills

    In Detail

    This is the era of Big Data. The term 'Big Data' implies big innovation and enables a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale; accordingly, Spark is equipped with the necessary algorithms and supports multiple programming languages.

    Whether you are a technologist, a data scientist, or a beginner in Big Data analytics, this book will provide you with the skills necessary to perform statistical data analysis, visualize data, build predictive models, and create scalable data products or solutions using Python, Scala, and R.

    With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

    Style and approach

    This book takes a step-by-step approach to statistical analysis and machine learning, presented in a conversational and easy-to-follow style. Each topic is explained sequentially, covering the fundamentals as well as the advanced concepts of the algorithms and techniques. Real-world examples with sample code snippets are also included.

    Downloading the example code for this book: You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent to you.

    Table of Contents

    1. Spark for Data Science
      1. Spark for Data Science
      2. Credits
      3. Foreword
      4. About the Authors
      5. About the Reviewers
      6. www.PacktPub.com
        1. Why subscribe?
      7. Preface
        1. What this book covers
        2. What you need for this book
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Downloading the color images of this book
          3. Errata
          4. Piracy
          5. Questions
      8. 1. Big Data and Data Science – An Introduction
        1. Big data overview
        2. Challenges with big data analytics
          1. Computational challenges
          2. Analytical challenges
        3. Evolution of big data analytics
        4. Spark for data analytics
        5. The Spark stack
          1. Spark core
          2. Spark SQL
          3. Spark streaming
          4. MLlib
          5. GraphX
          6. SparkR
        6. Summary
        7. References
      9. 2. The Spark Programming Model
        1. The programming paradigm
          1. Supported programming languages
            1. Scala
            2. Java
            3. Python
            4. R
          2. Choosing the right language
        2. The Spark engine
          1. Driver program
          2. The Spark shell
          3. SparkContext
          4. Worker nodes
          5. Executors
          6. Shared variables
          7. Flow of execution
        3. The RDD API
          1. RDD basics
          2. Persistence
        4. RDD operations
          1. Creating RDDs
          2. Transformations on normal RDDs
            1. The filter operation
            2. The distinct operation
            3. The intersection operation
            4. The union operation
            5. The map operation
            6. The flatMap operation
            7. The keys operation
            8. The cartesian operation
          3. Transformations on pair RDDs
            1. The groupByKey operation
            2. The join operation
            3. The reduceByKey operation
            4. The aggregate operation
          4. Actions
            1. The collect() function
            2. The count() function
            3. The take(n) function
            4. The first() function
            5. The takeSample() function
            6. The countByKey() function
        5. Summary
        6. References
      10. 3. Introduction to DataFrames
        1. Why DataFrames?
        2. Spark SQL
          1. The Catalyst optimizer
        3. The DataFrame API
          1. DataFrame basics
          2. RDDs versus DataFrames
            1. Similarities
            2. Differences
        4. Creating DataFrames
          1. Creating DataFrames from RDDs
          2. Creating DataFrames from JSON
          3. Creating DataFrames from databases using JDBC
          4. Creating DataFrames from Apache Parquet
          5. Creating DataFrames from other data sources
        5. DataFrame operations
          1. Under the hood
        6. Summary
        7. References
      11. 4. Unified Data Access
        1. Data abstractions in Apache Spark
        2. Datasets
          1. Working with Datasets
            1. Creating Datasets from JSON
          2. Datasets API's limitations
        3. Spark SQL
          1. SQL operations
          2. Under the hood
        4. Structured Streaming
          1. The Spark streaming programming model
          2. Under the hood
          3. Comparison with other streaming engines
        5. Continuous applications
        6. Summary
        7. References
      12. 5. Data Analysis on Spark
        1. Data analytics life cycle
        2. Data acquisition
        3. Data preparation
          1. Data consolidation
          2. Data cleansing
            1. Missing value treatment
            2. Outlier treatment
            3. Duplicate values treatment
          3. Data transformation
        4. Basics of statistics
          1. Sampling
            1. Simple random sample
            2. Systematic sampling
            3. Stratified sampling
          2. Data distributions
            1. Frequency distributions
            2. Probability distributions
        5. Descriptive statistics
          1. Measures of location
            1. Mean
            2. Median
            3. Mode
          2. Measures of spread
            1. Range
            2. Variance
            3. Standard deviation
          3. Summary statistics
          4. Graphical techniques
        6. Inferential statistics
          1. Discrete probability distributions
            1. Bernoulli distribution
            2. Binomial distribution
              1. Sample problem
            3. Poisson distribution
              1. Sample problem
          2. Continuous probability distributions
            1. Normal distribution
            2. Standard normal distribution
            3. Chi-square distribution
              1. Sample problem
            4. Student's t-distribution
            5. F-distribution
          3. Standard error
          4. Confidence level
          5. Margin of error and confidence interval
          6. Variability in the population
          7. Estimating sample size
          8. Hypothesis testing
            1. Null and alternate hypotheses
            2. Chi-square test
            3. F-test
              1. Problem:
            4. Correlations
        7. Summary
        8. References
      13. 6. Machine Learning
        1. Introduction
          1. The evolution
          2. Supervised learning
          3. Unsupervised learning
        2. MLlib and the Pipeline API
          1. MLlib
          2. ML pipeline
            1. Transformer
            2. Estimator
        3. Introduction to machine learning
          1. Parametric methods
          2. Non-parametric methods
        4. Regression methods
          1. Linear regression
            1. Loss function
            2. Optimization
          2. Regularizations on regression
            1. Ridge regression
            2. Lasso regression
            3. Elastic net regression
        5. Classification methods
          1. Logistic regression
        6. Linear Support Vector Machines (SVM)
          1. Linear kernel
          2. Polynomial kernel
          3. Radial Basis Function kernel
          4. Sigmoid kernel
        7. Training an SVM
        8. Decision trees
          1. Impurity measures
            1. Gini Index
            2. Entropy
            3. Variance
          2. Stopping rule
          3. Split candidates
            1. Categorical features
            2. Continuous features
          4. Advantages of decision trees
          5. Disadvantages of decision trees
          6. Example
        9. Ensembles
          1. Random forests
            1. Advantages of random forests
          2. Gradient-Boosted Trees
        10. Multilayer perceptron classifier
        11. Clustering techniques
          1. K-means clustering
            1. Disadvantages of k-means
            2. Example
        12. Summary
        13. References
      14. 7. Extending Spark with SparkR
        1. SparkR basics
          1. Accessing SparkR from the R environment
          2. RDDs and DataFrames
          3. Getting started
        2. Advantages and limitations
        3. Programming with SparkR
          1. Function name masking
          2. Subsetting data
          3. Column functions
          4. Grouped data
        4. SparkR DataFrames
          1. SQL operations
          2. Set operations
          3. Merging DataFrames
        5. Machine learning
          1. The Naive Bayes model
          2. The Gaussian GLM model
        6. Summary
        7. References
      15. 8. Analyzing Unstructured Data
        1. Sources of unstructured data
        2. Processing unstructured data
          1. Count vectorizer
          2. TF-IDF
          3. Stop-word removal
          4. Normalization/scaling
          5. Word2Vec
          6. n-gram modelling
        3. Text classification
          1. Naive Bayes classifier
        4. Text clustering
          1. K-means
        5. Dimensionality reduction
        6. Singular Value Decomposition
          1. Principal Component Analysis
        7. Summary
        8. References
      16. 9. Visualizing Big Data
        1. Why visualize data?
          1. A data engineer's perspective
          2. A data scientist's perspective
          3. A business user's perspective
        2. Data visualization tools
          1. IPython notebook
          2. Apache Zeppelin
          3. Third-party tools
        3. Data visualization techniques
          1. Summarizing and visualizing
          2. Subsetting and visualizing
          3. Sampling and visualizing
          4. Modeling and visualizing
        4. Summary
        5. References
          1. Data source citations
      17. 10. Putting It All Together
        1. A quick recap
        2. Introducing a case study
        3. The business problem
        4. Data acquisition and data cleansing
        5. Developing the hypothesis
        6. Data exploration
        7. Data preparation
          1. Too many levels in a categorical variable
          2. Numerical variables with too much variation
            1. Missing data
            2. Continuous data
            3. Categorical data
            4. Preparing the data
        8. Model building
        9. Data visualization
        10. Communicating the results to business users
        11. Summary
        12. References
      18. 11. Building Data Science Applications
        1. Scope of development
          1. Expectations
          2. Presentation options
            1. Interactive notebooks
              1. References
            2. Web API
              1. References
            3. PMML and PFA
              1. References
          3. Development and testing
            1. References
          4. Data quality management
        2. The Scala advantage
        3. Spark development status
          1. Spark 2.0's features and enhancements
            1. Unifying Datasets and DataFrames
            2. Structured Streaming
            3. Project Tungsten phase 2
          2. What's in store?
        4. The big data trends
        5. Summary
        6. References