Data Science with Java

Book Description

A good data scientist knows how to do something really well, but a great data scientist can do "something of everything." From raw data all the way to shining in front of C-level executives, a great data scientist has the skills to architect data systems, build applications, perform modeling and machine learning, and wrap up the results in a clear (and quickly iterable) manner. From data models to ETL to databases to distributed algorithms and learning, this book has you covered.

Table of Contents

  1. Data IO
    1. What is data anyway?
    2. Data Models
      1. Univariate Arrays
      2. Multivariate Arrays
      3. Data Objects
      4. Matrices and Vectors
      5. JSON
    3. Dealing with Real Data
      1. Nulls
      2. Blank Spaces
      3. Parse Errors
      4. Outliers
    4. Managing Data Files
      1. Understanding the File Contents
      2. Reading From a Text File
      3. Reading a JSON File
      4. Reading From an Image File
      5. Writing to a File
    5. Mastering Database Operations
      1. Command Line Clients
      2. Structured Query Language (SQL)
      3. Java Database Connectivity (JDBC)
    6. Visualizing Data with Plots
      1. Creating Simple Plots
      2. Plotting Mixed Chart Types
      3. Saving a Plot to a File
  2. Linear Algebra
    1. Building Vectors and Matrices
      1. Real Vectors and Matrices
      2. Block Matrices
      3. Sparse Vectors and Matrices
      4. Accessing Vector and Matrix Elements
      5. Working with Sub-Matrices
      6. Randomized Matrices and Vectors
    2. Operating on Vectors and Matrices
      1. Scaling
      2. Transposing
      3. Addition and Subtraction
      4. Length
      5. Distances
      6. Multiplication
      7. Inner Product
      8. Outer Product
      9. Entrywise Product
      10. Compound Operations
      11. Affine Transformation
      12. Mapping a Function
    3. Decomposing Matrices
      1. Cholesky Decomposition
      2. LU Decomposition
      3. QR Decomposition
      4. Singular Value Decomposition (SVD)
      5. Eigen Decomposition
      6. Determinant
      7. Inverse
    4. Solving Linear Systems
  3. Statistics
    1. The Probabilistic Origins of Data
      1. Probability Density
      2. Cumulative Probability
      3. Statistical Moments
      4. Entropy
      5. Continuous Distributions
      6. Discrete Distributions
    2. Characterizing Datasets
      1. Calculating Moments
      2. Descriptive Statistics
      3. Multivariate Statistics
      4. Covariance and Correlation
      5. Regression
    3. Working with Large Datasets
      1. Accumulating Statistics
      2. Merging Statistics
      3. Regression
    4. Using Built-In Database Functions
  4. Data Operations
    1. Transforming Text Data
      1. Extracting Tokens from a Document
      2. Utilizing Dictionaries
      3. Vectorizing a Document
    2. Scaling and Regularizing Numeric Data
      1. Scaling Columns
      2. Scaling Rows
      3. Matrix Scaling Operator
    3. Reducing Data to Principal Components
      1. Covariance Method
      2. SVD Method
    4. Creating Training, Validation and Test Sets
      1. Index-Based Resampling
      2. List-Based Resampling
      3. Mini-Batches
    5. Encoding Labels
      1. A Generic Encoder
      2. One Hot Encoding
  5. Learning and Prediction
    1. Learning Algorithms
      1. Iterative Learning Procedure
      2. Gradient Descent Optimizer
    2. Evaluating Learning Processes
      1. Minimizing a Loss Function
      2. Minimizing the Sum of Variances
      3. Silhouette Coefficient
      4. Log-Likelihood
      5. Classifier Accuracy
    3. Unsupervised Learning
      1. K Means Clustering
      2. DBSCAN
      3. Gaussian Mixtures
    4. Supervised Learning
      1. Naive Bayes
      2. Linear Models
      3. Deep Networks
  6. Hadoop MapReduce
    1. Hadoop Distributed File System (HDFS)
    2. MapReduce Architecture
    3. Writing MapReduce Applications
      1. Anatomy of a MapReduce Job
      2. MapReduce IO
      3. Hadoop Data Types
      4. Mappers
      5. Reducers
      6. Comparators
      7. The Simplicity of a JSON String as Text
      8. Mixing-In External Data
      9. Deployment Wizardry
    4. MapReduce Examples
      1. Word Count
      2. In-Line Random Sampler
      3. PCA training and test set?
      4. Dictionary
      5. NGram Counter
      6. Sparse Linear Algebra