Thoughtful Machine Learning

Book Description

Learn how to apply test-driven development (TDD) to machine-learning algorithms and catch mistakes that could sink your analysis. In this practical guide, author Matthew Kirk takes you through the principles of TDD and machine learning, and shows you how to apply TDD to several machine-learning algorithms, including naive Bayesian classifiers and neural networks.

Table of Contents

  1. Preface
    1. What to Expect from This Book
    2. How to Read This Book
    3. Who This Book Is For
    4. How to Contact Me
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  2. 1. Test-Driven Machine Learning
    1. History of Test-Driven Development
    2. TDD and the Scientific Method
      1. TDD Makes a Logical Proposition of Validity
      2. TDD Involves Writing Your Assumptions Down on Paper or in Code
      3. TDD and Scientific Method Work in Feedback Loops
    3. Risks with Machine Learning
      1. Unstable Data
      2. Underfitting
      3. Overfitting
      4. Unpredictable Future
    4. What to Test for to Reduce Risks
      1. Mitigate Unstable Data with Seam Testing
      2. Check Fit by Cross-Validating
      3. Reduce Overfitting Risk by Testing the Speed of Training
      4. Monitor for Future Shifts with Precision and Recall
    5. Conclusion
  3. 2. A Quick Introduction to Machine Learning
    1. What Is Machine Learning?
      1. Supervised Learning
      2. Unsupervised Learning
      3. Reinforcement Learning
    2. What Can Machine Learning Accomplish?
    3. Mathematical Notation Used Throughout the Book
    4. Conclusion
  4. 3. K-Nearest Neighbors Classification
    1. History of K-Nearest Neighbors Classification
    2. House Happiness Based on a Neighborhood
    3. How Do You Pick K?
      1. Guessing K
      2. Heuristics for Picking K
      3. Algorithms for Picking K
    4. What Makes a Neighbor “Near”?
      1. Minkowski Distance
      2. Mahalanobis Distance
    5. Determining Classes
    6. Beard and Glasses Detection Using KNN and OpenCV
      1. The Class Diagram
      2. Raw Image to Avatar
      3. The Face Class
      4. The Neighborhood Class
    7. Conclusion
  5. 4. Naive Bayesian Classification
    1. Using Bayes’s Theorem to Find Fraudulent Orders
      1. Conditional Probabilities
      2. Inverse Conditional Probability (aka Bayes’s Theorem)
    2. Naive Bayesian Classifier
      1. The Chain Rule
      2. Naivety in Bayesian Reasoning
      3. Pseudocount
    3. Spam Filter
      1. The Class Diagram
      2. Data Source
      3. Email Class
      4. Tokenization and Context
      5. The SpamTrainer
      6. Error Minimization Through Cross-Validation
    4. Conclusion
  6. 5. Hidden Markov Models
    1. Tracking User Behavior Using State Machines
      1. Emissions/Observations of Underlying States
      2. Simplification through the Markov Assumption
      3. Using Markov Chains Instead of a Finite State Machine
      4. Hidden Markov Model
    2. Evaluation: Forward-Backward Algorithm
      1. Using User Behavior
    3. The Decoding Problem through the Viterbi Algorithm
    4. The Learning Problem
    5. Part-of-Speech Tagging with the Brown Corpus
      1. The Seam of Our Part-of-Speech Tagger: CorpusParser
      2. Writing the Part-of-Speech Tagger
      3. Cross-Validating to Get Confidence in the Model
      4. How to Make This Model Better
    6. Conclusion
  7. 6. Support Vector Machines
    1. Solving the Loyalty Mapping Problem
    2. Derivation of SVM
    3. Nonlinear Data
      1. The Kernel Trick
      2. Soft Margins
    4. Using SVM to Determine Sentiment
      1. The Class Diagram
      2. Corpus Class
      3. Return a Unique Set of Words from the Corpus
      4. The CorpusSet Class
      5. The SentimentClassifier Class
      6. Improving Results Over Time
    5. Conclusion
  8. 7. Neural Networks
    1. History of Neural Networks
    2. What Is an Artificial Neural Network?
      1. Input Layer
      2. Hidden Layers
      3. Neurons
      4. Output Layer
      5. Training Algorithms
    3. Building Neural Networks
      1. How Many Hidden Layers?
      2. How Many Neurons for Each Layer?
      3. Tolerance for Error and Max Epochs
    4. Using a Neural Network to Classify a Language
      1. Writing the Seam Test for Language
      2. Cross-Validating Our Way to a Network Class
      3. Tuning the Neural Network
      4. Convergence Testing
      5. Precision and Recall for Neural Networks
      6. Wrap-Up of Example
    5. Conclusion
  9. 8. Clustering
    1. User Cohorts
    2. K-Means Clustering
      1. The K-Means Algorithm
      2. The Downside of K-Means Clustering
    3. Expectation Maximization (EM) Clustering
    4. The Impossibility Theorem
    5. Categorizing Music
      1. Gathering the Data
      2. Analyzing the Data with K-Means
      3. EM Clustering
      4. EM Jazz Clustering Results
    6. Conclusion
  10. 9. Kernel Ridge Regression
    1. Collaborative Filtering
    2. Linear Regression Applied to Collaborative Filtering
    3. Introducing Regularization, or Ridge Regression
    4. Kernel Ridge Regression
    5. Wrap-Up of Theory
    6. Collaborative Filtering with Beer Styles
      1. Data Set
      2. The Tools We Will Need
      3. Reviewer
      4. Writing the Code to Figure Out Someone’s Preference
      5. Collaborative Filtering with User Preferences
    7. Conclusion
  11. 10. Improving Models and Data Extraction
    1. The Problem with the Curse of Dimensionality
    2. Feature Selection
    3. Feature Transformation
    4. Principal Component Analysis (PCA)
    5. Independent Component Analysis (ICA)
    6. Monitoring Machine Learning Algorithms
      1. Precision and Recall: Spam Filter
      2. The Confusion Matrix
    7. Mean Squared Error
    8. The Wilds of Production Environments
    9. Conclusion
  12. 11. Putting It All Together
    1. Machine Learning Algorithms Revisited
    2. How to Use This Information for Solving Problems
    3. What’s Next for You?
  13. Index