Scala for Data Science

Book Description

Leverage the power of Scala with different tools to build scalable, robust data science applications

About This Book

  • A complete guide for scalable data science solutions, from data ingestion to data visualization

  • Deploy horizontally scalable data processing pipelines and take advantage of web frameworks to build engaging visualizations

  • Build functional, type-safe routines to interact with relational and NoSQL databases with the help of tutorials and examples provided

Who This Book Is For

If you are a Scala developer or data scientist, or if you want to enter the field of data science, then this book will give you all the tools you need to implement data science solutions.

What You Will Learn

  • Transform and filter tabular data to extract features for machine learning

  • Implement your own algorithms or take advantage of MLlib’s extensive suite of models to build distributed machine learning pipelines

  • Read, transform, and write data to both SQL and NoSQL databases in a functional manner

  • Write robust routines to query web APIs

  • Read data from web APIs such as the GitHub or Twitter API

  • Use Scala to interact with MongoDB, which offers high performance and helps to store large data sets with uncertain query requirements

  • Create Scala web applications that couple with JavaScript libraries such as D3 to create compelling interactive visualizations

  • Deploy scalable parallel applications using Apache Spark, loading data from HDFS or Hive

In Detail

Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, and Java are more commonly used for data science, but Scala is particularly good at analyzing large data sets without a significant impact on performance, and it is therefore being adopted by many developers and data scientists. Data scientists may be aware that building truly scalable applications is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines.

This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala.

Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. You will learn about Scala’s emphasis on functional structures and immutability, and how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks.

This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

Style and approach

A tutorial with complete examples, this book will give you the tools to start building useful data engineering and data science solutions straight away.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to receive the code files.

Table of Contents

    1. Scala for Data Science
      1. Table of Contents
      2. Scala for Data Science
      3. Credits
      4. About the Author
      5. About the Reviewers
      6. www.PacktPub.com
        1. Support files, eBooks, discount offers, and more
          1. Why subscribe?
          2. Free access for Packt account holders
      7. Preface
        1. What this book covers
        2. What you need for this book
          1. Installing the JDK
          2. Installing and using SBT
        3. Who this book is for
        4. Conventions
        5. Reader feedback
        6. Customer support
          1. Downloading the example code
          2. Errata
          3. Piracy
          4. eBooks, discount offers, and more
          5. Questions
      8. 1. Scala and Data Science
        1. Data science
        2. Programming in data science
        3. Why Scala?
          1. Static typing and type inference
          2. Scala encourages immutability
          3. Scala and functional programs
          4. Null pointer uncertainty
          5. Easier parallelism
          6. Interoperability with Java
        4. When not to use Scala
        5. Summary
        6. References
      9. 2. Manipulating Data with Breeze
        1. Code examples
        2. Installing Breeze
        3. Getting help on Breeze
        4. Basic Breeze data types
          1. Vectors
          2. Dense and sparse vectors and the vector trait
          3. Matrices
          4. Building vectors and matrices
          5. Advanced indexing and slicing
          6. Mutating vectors and matrices
          7. Matrix multiplication, transposition, and the orientation of vectors
          8. Data preprocessing and feature engineering
          9. Breeze – function optimization
          10. Numerical derivatives
          11. Regularization
        5. An example – logistic regression
        6. Towards re-usable code
        7. Alternatives to Breeze
        8. Summary
        9. References
      10. 3. Plotting with breeze-viz
        1. Diving into Breeze
        2. Customizing plots
        3. Customizing the line type
        4. More advanced scatter plots
        5. Multi-plot example – scatterplot matrix plots
        6. Managing without documentation
        7. Breeze-viz reference
        8. Data visualization beyond breeze-viz
        9. Summary
      11. 4. Parallel Collections and Futures
        1. Parallel collections
          1. Limitations of parallel collections
          2. Error handling
          3. Setting the parallelism level
          4. An example – cross-validation with parallel collections
        2. Futures
          1. Future composition – using a future's result
          2. Blocking until completion
          3. Controlling parallel execution with execution contexts
          4. Futures example – stock price fetcher
        3. Summary
        4. References
      12. 5. Scala and SQL through JDBC
        1. Interacting with JDBC
        2. First steps with JDBC
          1. Connecting to a database server
          2. Creating tables
          3. Inserting data
          4. Reading data
        3. JDBC summary
        4. Functional wrappers for JDBC
        5. Safer JDBC connections with the loan pattern
        6. Enriching JDBC statements with the "pimp my library" pattern
        7. Wrapping result sets in a stream
        8. Looser coupling with type classes
          1. Type classes
          2. Coding against type classes
          3. When to use type classes
          4. Benefits of type classes
        9. Creating a data access layer
        10. Summary
        11. References
      13. 6. Slick – A Functional Interface for SQL
        1. FEC data
          1. Importing Slick
          2. Defining the schema
          3. Connecting to the database
          4. Creating tables
          5. Inserting data
          6. Querying data
        2. Invokers
        3. Operations on columns
        4. Aggregations with "Group by"
        5. Accessing database metadata
        6. Slick versus JDBC
        7. Summary
        8. References
      14. 7. Web APIs
        1. A whirlwind tour of JSON
        2. Querying web APIs
        3. JSON in Scala – an exercise in pattern matching
          1. JSON4S types
          2. Extracting fields using XPath
        4. Extraction using case classes
        5. Concurrency and exception handling with futures
        6. Authentication – adding HTTP headers
          1. HTTP – a whirlwind overview
          2. Adding headers to HTTP requests in Scala
        7. Summary
        8. References
      15. 8. Scala and MongoDB
        1. MongoDB
        2. Connecting to MongoDB with Casbah
          1. Connecting with authentication
        3. Inserting documents
        4. Extracting objects from the database
        5. Complex queries
        6. Casbah query DSL
        7. Custom type serialization
        8. Beyond Casbah
        9. Summary
        10. References
      16. 9. Concurrency with Akka
        1. GitHub follower graph
        2. Actors as people
        3. Hello world with Akka
        4. Case classes as messages
        5. Actor construction
        6. Anatomy of an actor
        7. Follower network crawler
        8. Fetcher actors
        9. Routing
        10. Message passing between actors
        11. Queue control and the pull pattern
        12. Accessing the sender of a message
        13. Stateful actors
        14. Follower network crawler
        15. Fault tolerance
        16. Custom supervisor strategies
        17. Life-cycle hooks
        18. What we have not talked about
        19. Summary
        20. References
      17. 10. Distributed Batch Processing with Spark
        1. Installing Spark
        2. Acquiring the example data
        3. Resilient distributed datasets
          1. RDDs are immutable
          2. RDDs are lazy
          3. RDDs know their lineage
          4. RDDs are resilient
          5. RDDs are distributed
          6. Transformations and actions on RDDs
          7. Persisting RDDs
          8. Key-value RDDs
          9. Double RDDs
        4. Building and running standalone programs
          1. Running Spark applications locally
          2. Reducing logging output and Spark configuration
          3. Running Spark applications on EC2
        5. Spam filtering
        6. Lifting the hood
        7. Data shuffling and partitions
        8. Summary
        9. Reference
      18. 11. Spark SQL and DataFrames
        1. DataFrames – a whirlwind introduction
        2. Aggregation operations
        3. Joining DataFrames together
        4. Custom functions on DataFrames
        5. DataFrame immutability and persistence
        6. SQL statements on DataFrames
        7. Complex data types – arrays, maps, and structs
          1. Structs
          2. Arrays
          3. Maps
        8. Interacting with data sources
          1. JSON files
          2. Parquet files
        9. Standalone programs
        10. Summary
        11. References
      19. 12. Distributed Machine Learning with MLlib
        1. Introducing MLlib – Spam classification
        2. Pipeline components
          1. Transformers
          2. Estimators
        3. Evaluation
        4. Regularization in logistic regression
        5. Cross-validation and model selection
        6. Beyond logistic regression
        7. Summary
        8. References
      20. 13. Web APIs with Play
        1. Client-server applications
        2. Introduction to web frameworks
        3. Model-View-Controller architecture
        4. Single page applications
        5. Building an application
        6. The Play framework
        7. Dynamic routing
        8. Actions
          1. Composing the response
          2. Understanding and parsing the request
        9. Interacting with JSON
        10. Querying external APIs and consuming JSON
          1. Calling external web services
          2. Parsing JSON
          3. Asynchronous actions
        11. Creating APIs with Play: a summary
        12. Rest APIs: best practice
        13. Summary
        14. References
      21. 14. Visualization with D3 and the Play Framework
        1. GitHub user data
        2. Do I need a backend?
        3. JavaScript dependencies through web-jars
        4. Towards a web application: HTML templates
        5. Modular JavaScript through RequireJS
        6. Bootstrapping the applications
        7. Client-side program architecture
          1. Designing the model
          2. The event bus
          3. AJAX calls through JQuery
          4. Response views
        8. Drawing plots with NVD3
        9. Summary
        10. References
      22. A. Pattern Matching and Extractors
        1. Pattern matching in for comprehensions
        2. Pattern matching internals
        3. Extracting sequences
        4. Summary
        5. Reference
      23. Index