Data Science with Java

Book Description

Data science is booming thanks to R and Python, but Java brings the robustness, convenience, and scalability that today’s data science applications require. With this practical book, Java software engineers looking to add data science skills will take a logical journey through the data science pipeline. Author Michael Brzustowicz explains the basic math theory behind each step of the data science process, as well as how to apply these concepts with Java.

You’ll learn the critical roles that data IO, linear algebra, statistics, data operations, learning and prediction, and Hadoop MapReduce play in the process. Throughout this book, you’ll find code examples you can use in your applications.

  • Examine methods for obtaining, cleaning, and arranging data into its purest form
  • Understand the matrix structure that your data should take
  • Learn basic concepts for testing the origin and validity of data
  • Transform your data into stable and usable numerical values
  • Understand supervised and unsupervised learning algorithms, and methods for evaluating their success
  • Get up and running with MapReduce, using customized components suitable for data science algorithms

Table of Contents

  1. Preface
    1. Who Should Read This Book
    2. Why I Wrote This Book
    3. A Word on Data Science Today
    4. Navigating This Book
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Safari
    8. How to Contact Us
    9. Acknowledgments
  2. Chapter 1. Data I/O
    1. What Is Data, Anyway?
    2. Data Models
      1. Univariate Arrays
      2. Multivariate Arrays
      3. Data Objects
      4. Matrices and Vectors
      5. JSON
    3. Dealing with Real Data
      1. Nulls
      2. Blank Spaces
      3. Parse Errors
      4. Outliers
    4. Managing Data Files
      1. Understanding File Contents First
      2. Reading from a Text File
      3. Reading from a JSON File
      4. Reading from an Image File
      5. Writing to a Text File
    5. Mastering Database Operations
      1. Command-Line Clients
      2. Structured Query Language
      3. Java Database Connectivity
    6. Visualizing Data with Plots
      1. Creating Simple Plots
      2. Plotting Mixed Chart Types
      3. Saving a Plot to a File
  3. Chapter 2. Linear Algebra
    1. Building Vectors and Matrices
      1. Array Storage
      2. Block Storage
      3. Map Storage
      4. Accessing Elements
      5. Working with Submatrices
      6. Randomization
    2. Operating on Vectors and Matrices
      1. Scaling
      2. Transposing
      3. Addition and Subtraction
      4. Length
      5. Distances
      6. Multiplication
      7. Inner Product
      8. Outer Product
      9. Entrywise Product
      10. Compound Operations
      11. Affine Transformation
      12. Mapping a Function
    3. Decomposing Matrices
      1. Cholesky Decomposition
      2. LU Decomposition
      3. QR Decomposition
      4. Singular Value Decomposition
      5. Eigen Decomposition
      6. Determinant
      7. Inverse
    4. Solving Linear Systems
  4. Chapter 3. Statistics
    1. The Probabilistic Origins of Data
      1. Probability Density
      2. Cumulative Probability
      3. Statistical Moments
      4. Entropy
      5. Continuous Distributions
      6. Discrete Distributions
    2. Characterizing Datasets
      1. Calculating Moments
      2. Descriptive Statistics
      3. Multivariate Statistics
      4. Covariance and Correlation
      5. Regression
    3. Working with Large Datasets
      1. Accumulating Statistics
      2. Merging Statistics
      3. Regression
    4. Using Built-in Database Functions
  5. Chapter 4. Data Operations
    1. Transforming Text Data
      1. Extracting Tokens from a Document
      2. Utilizing Dictionaries
      3. Vectorizing a Document
    2. Scaling and Regularizing Numeric Data
      1. Scaling Columns
      2. Scaling Rows
      3. Matrix Scaling Operator
    3. Reducing Data to Principal Components
      1. Covariance Method
      2. SVD Method
    4. Creating Training, Validation, and Test Sets
      1. Index-Based Resampling
      2. List-Based Resampling
      3. Mini-Batches
    5. Encoding Labels
      1. A Generic Encoder
      2. One-Hot Encoding
  6. Chapter 5. Learning and Prediction
    1. Learning Algorithms
      1. Iterative Learning Procedure
      2. Gradient Descent Optimizer
    2. Evaluating Learning Processes
      1. Minimizing a Loss Function
      2. Minimizing the Sum of Variances
      3. Silhouette Coefficient
      4. Log-Likelihood
      5. Classifier Accuracy
    3. Unsupervised Learning
      1. k-Means Clustering
      2. DBSCAN
      3. Gaussian Mixtures
    4. Supervised Learning
      1. Naive Bayes
      2. Linear Models
      3. Deep Networks
  7. Chapter 6. Hadoop MapReduce
    1. Hadoop Distributed File System
    2. MapReduce Architecture
    3. Writing MapReduce Applications
      1. Anatomy of a MapReduce Job
      2. Hadoop Data Types
      3. Mappers
      4. Reducers
      5. The Simplicity of a JSON String as Text
      6. Deployment Wizardry
    4. MapReduce Examples
      1. Word Count
      2. Custom Word Count
      3. Sparse Linear Algebra
  8. Appendix A. Datasets
    1. Anscombe’s Quartet
    2. Sentiment
    3. Gaussian Mixtures
    4. Iris
    5. MNIST
  9. Index