Mastering Feature Engineering

Book Description

Feature engineering is essential to applied machine learning, but using domain knowledge to strengthen your predictive models can be difficult and expensive. To help fill the information gap on feature engineering, this complete hands-on guide teaches beginning-to-intermediate data scientists how to work with this widely practiced but little-discussed topic.

Table of Contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. Safari® Books Online
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Introduction
    1. The Machine Learning Pipeline
      1. Data
      2. Tasks
      3. Models
      4. Features
  3. 2. Basic Feature Engineering for Text Data: Flatten and Filter
    1. Turning Natural Text into Flat Vectors
      1. Bag-of-words
      2. Implementing bag-of-words: parsing and tokenization
      3. Bag-of-N-Grams
      4. Collocation Extraction for Phrase Detection
      5. Quick summary
    2. Filtering for Cleaner Features
      1. Stopwords
      2. Frequency-based filtering
      3. Stemming
    3. Summary
  4. 3. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
    1. Tf-Idf: A Simple Twist on Bag-of-Words
    2. Feature Scaling
      1. Min-max scaling
      2. Standardization (variance scaling)
      3. L2 normalization
    3. Putting it to the Test
      1. Creating a classification dataset
      2. Implementing tf-idf and feature scaling
      3. First try: plain logistic regression
      4. Second try: logistic regression with regularization
      5. Discussion of results
    4. Deep Dive: What is Happening?
    5. Summary
  5. 4. Dimensionality Reduction: Squashing the Data Pancake with PCA
    1. Intuition
    2. Derivation
      1. Tips and notations
      2. Linear projection
      3. Variance and empirical variance
      4. Principal components: first formulation
      5. Principal components: matrix-vector formulation
      6. General solution of the principal components
      7. Transforming features
      8. Implementing PCA
    3. PCA in Action
    4. Whitening and ZCA
    5. Considerations and Limitations of PCA
    6. Use Cases
    7. Summary
  6. 5. Non-Linear Featurization and Model Stacking
    1. K-means Clustering
    2. Clustering as surface tiling
    3. K-means featurization for classification
      1. Alternative dense featurization
    4. Concluding Remarks
  7. Index
  8. A. Linear Modeling and Linear Algebra Basics
    1. Overview of Linear Classification
    2. The Anatomy of a Matrix
      1. From vectors to subspaces
      2. Singular value decomposition (SVD)
      3. The four fundamental subspaces of the data matrix
    3. Solving a Linear System