Feature Engineering for Machine Learning

Book Description

With Early Release ebooks, you get books in their earliest form—the author's raw and unedited content as he or she writes—so you can take advantage of these technologies long before the official release of these titles. You'll also receive updates when significant changes are made, new chapters are available, and the final ebook bundle is released.

Feature engineering is essential to applied machine learning, but using domain knowledge to strengthen your predictive models can be difficult and expensive. To help fill the information gap on feature engineering, this complete hands-on guide teaches beginning-to-intermediate data scientists how to work with this widely practiced but little-discussed topic.

Author Alice Zheng explains common practices and mathematical principles to help engineer features for new data and tasks. If you understand basic machine learning concepts like supervised and unsupervised learning, you’re ready to get started. Not only will you learn how to implement feature engineering in a systematic and principled way, you’ll also learn how to practice better data science.

  • Learn exactly what feature engineering is, why it’s important, and how to do it well
  • Explore various techniques such as feature scaling, bin-counting, and frequent sequence mining (a short feature scaling sketch follows this list)
  • Understand what unsupervised feature learning is and how it works in deep learning
  • See the methods in action for text mining, image tagging, churn prediction, and targeted advertising
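
The feature scaling techniques named above are treated in depth in Chapter 2. As a taste, the sketch below shows the chapter's three scaling methods (min-max scaling, standardization, and L2 normalization) using scikit-learn's preprocessing helpers; it is a minimal illustration under that assumption, not code taken from the book itself:

    import numpy as np
    from sklearn.preprocessing import minmax_scale, normalize, scale

    # A single toy feature column spanning several orders of magnitude
    x = np.array([[1.0], [10.0], [100.0], [1000.0]])

    # Min-max scaling: map the values into the [0, 1] interval
    print(minmax_scale(x))

    # Standardization (variance scaling): subtract the mean, divide by the standard deviation
    print(scale(x))

    # L2 normalization: divide the feature column by its Euclidean (L2) norm
    print(normalize(x, norm="l2", axis=0))

Each method rescales the same column into a different range, which matters for models that are sensitive to feature magnitudes, such as the regularized logistic regression examined in Chapter 4.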

Table of Contents

  1. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Safari
    4. How to Contact Us
    5. Acknowledgments
  2. 1. Introduction
    1. The Machine Learning Pipeline
      1. Data
      2. Tasks
      3. Models
      4. Features
      5. Model evaluation
  3. 2. Fancy Tricks with Simple Numbers
    1. Scalars, vectors, and spaces
    2. Dealing with Counts
      1. Binarization
      2. Quantization or binning
    3. Log transformation
      1. Log Transform in Action
      2. Power Transforms: Generalization of the Log Transform
    4. Feature Scaling or Normalization
      1. Min-max scaling
      2. Standardization (variance scaling)
      3. L2 normalization
    5. Interaction Features
    6. Feature Selection
    7. Summary
    8. Bibliography
  4. 3. Text Data: Flattening, Filtering, and Chunking
    1. Bag of X: Turning Natural Text into Flat Vectors
      1. Bag-of-words
      2. Bag-of-N-Grams
    2. Filtering for Cleaner Features
      1. Stopwords
      2. Frequency-based filtering
      3. Stemming
    3. Atoms of Meaning: From Words to N-Grams to Phrases
      1. Parsing and tokenization
      2. Collocation Extraction for Phrase Detection
    4. Summary
    5. Bibliography
  5. 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
    1. Tf-Idf: A Simple Twist on Bag-of-Words
    2. Putting it to the Test
      1. Creating a classification dataset
      2. Scaling bag-of-words with tf-idf transformation
      3. Classification with logistic regression
      4. Tuning logistic regression with regularization
    3. Deep Dive: What is Happening?
    4. Summary
    5. Bibliography
  6. 5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens
    1. Encoding Categorical Variables
      1. One-hot encoding
      2. Dummy coding
      3. Effect coding
      4. Pros and cons of categorical variable encodings
    2. Dealing with Large Categorical Variables
      1. Feature hashing
      2. Bin-counting
    3. Summary
  7. 6. Dimensionality Reduction: Squashing the Data Pancake with PCA
    1. Intuition
    2. Derivation
      1. Tips and notations
      2. Linear projection
      3. Variance and empirical variance
      4. Principal components: first formulation
      5. Principal components: matrix-vector formulation
      6. General solution of the principal components
      7. Transforming features
      8. Implementing PCA
    3. PCA in Action
    4. Whitening and ZCA
    5. Considerations and Limitations of PCA
    6. Use Cases
    7. Summary
    8. Bibliography
  8. 7. Non-Linear Featurization and Model Stacking
    1. K-means Clustering
    2. Clustering as Surface Tiling
    3. K-means Featurization for Classification
      1. Alternative dense featurization
    4. Summary
  9. 8. Automating the Featurizer: Image Feature Extraction and Deep Learning
    1. Simplest Image Features (and Why They Don’t Work)
    2. Manual Feature Extraction: SIFT and HOG
      1. Image gradient
      2. Gradient orientation histogram
      3. SIFT architecture
    3. Learning Image Features with Deep Neural Networks
      1. Fully connected layer
      2. Convolutional layer
      3. Rectified Linear Unit (ReLU) transformation
      4. Response normalization layer
      5. Pooling layer
      6. Structure of AlexNet
    4. Summary
  10. 9. Back to the Feature: Putting it All Together
    1. Academic Paper Recommender
      1. Item-Based Collaborative Filtering
      2. First Pass: Data Import, Cleaning, and Feature Parsing
      3. Second Pass: More Engineering and Smarter Model
      4. Third Pass: More Features = More Information
    2. Summary
  11. A. Linear Modeling and Linear Algebra Basics
    1. Overview of Linear Classification
    2. The Anatomy of a Matrix
      1. From vectors to subspaces
      2. Singular value decomposition (SVD)
      3. The four fundamental subspaces of the data matrix
    3. Solving a Linear System
      1. Overview of Classifiers
  12. Index