You are previewing Machine Learning for Hackers.

Machine Learning for Hackers

Cover of Machine Learning for Hackers by Drew Conway... Published by O'Reilly Media, Inc.
  1. Machine Learning for Hackers
  2. Preface
    1. Machine Learning for Hackers
    2. How This Book Is Organized
    3. Conventions Used in This Book
    4. Using Code Examples
    5. Safari® Books Online
    6. How to Contact Us
    7. Acknowledgements
  3. 1. Using R
    1. R for Machine Learning
      1. Downloading and Installing R
      2. IDEs and Text Editors
      3. Loading and Installing R Packages
      4. R Basics for Machine Learning
      5. Further Reading on R
  4. 2. Data Exploration
    1. Exploration versus Confirmation
    2. What Is Data?
    3. Inferring the Types of Columns in Your Data
    4. Inferring Meaning
    5. Numeric Summaries
    6. Means, Medians, and Modes
    7. Quantiles
    8. Standard Deviations and Variances
    9. Exploratory Data Visualization
    10. Visualizing the Relationships Between Columns
  5. 3. Classification: Spam Filtering
    1. This or That: Binary Classification
    2. Moving Gently into Conditional Probability
    3. Writing Our First Bayesian Spam Classifier
      1. Defining the Classifier and Testing It with Hard Ham
      2. Testing the Classifier Against All Email Types
      3. Improving the Results
  6. 4. Ranking: Priority Inbox
    1. How Do You Sort Something When You Don’t Know the Order?
    2. Ordering Email Messages by Priority
      1. Priority Features of Email
    3. Writing a Priority Inbox
      1. Functions for Extracting the Feature Set
      2. Creating a Weighting Scheme for Ranking
      3. Weighting from Email Thread Activity
      4. Training and Testing the Ranker
  7. 5. Regression: Predicting Page Views
    1. Introducing Regression
      1. The Baseline Model
      2. Regression Using Dummy Variables
      3. Linear Regression in a Nutshell
    2. Predicting Web Traffic
    3. Defining Correlation
  8. 6. Regularization: Text Regression
    1. Nonlinear Relationships Between Columns: Beyond Straight Lines
      1. Introducing Polynomial Regression
    2. Methods for Preventing Overfitting
      1. Preventing Overfitting with Regularization
    3. Text Regression
      1. Logistic Regression to the Rescue
  9. 7. Optimization: Breaking Codes
    1. Introduction to Optimization
    2. Ridge Regression
    3. Code Breaking as Optimization
  10. 8. PCA: Building a Market Index
    1. Unsupervised Learning
  11. 9. MDS: Visually Exploring US Senator Similarity
    1. Clustering Based on Similarity
      1. A Brief Introduction to Distance Metrics and Multidirectional Scaling
    2. How Do US Senators Cluster?
      1. Analyzing US Senator Roll Call Data (101st–111th Congresses)
  12. 10. kNN: Recommendation Systems
    1. The k-Nearest Neighbors Algorithm
    2. R Package Installation Data
  13. 11. Analyzing Social Graphs
    1. Social Network Analysis
      1. Thinking Graphically
    2. Hacking Twitter Social Graph Data
      1. Working with the Google SocialGraph API
    3. Analyzing Twitter Networks
      1. Local Community Structure
      2. Visualizing the Clustered Twitter Network with Gephi
      3. Building Your Own “Who to Follow” Engine
  14. 12. Model Comparison
    1. SVMs: The Support Vector Machine
    2. Comparing Algorithms
  15. Works Cited
    1. Books
    2. Articles
  16. Index
  17. About the Authors
  18. Colophon
  19. Copyright
O'Reilly logo

Chapter 8. PCA: Building a Market Index

Unsupervised Learning

So far, all of our work with data has been based on prediction tasks: we’ve tried to classify emails or web page views where we had a training set of examples for which we knew the correct answer. As we mentioned early on in this book, learning from data when we have a training sample with the correct answer is called supervised learning: we find structure in our data using a signal that tells us whether or not we’re doing a good job of discovering real patterns.

But often we want to find structure without having any answers available to us about how well we’re doing; we call this unsupervised learning. For example, we might want to perform dimensionality reduction, which happens when we shrink a table with a huge number of columns into a table with a small number of columns. If you have too many columns to deal with, this dimensionality reduction goes a long way toward making your data set comprehensible. Although you clearly lose information when you replace many columns with a single column, the gains in understanding are often valuable, especially when you’re exploring a new data set.

One place where this type of dimensionality reduction is particularly helpful is when dealing with stock market data. For example, we might have data that looks like the real historical prices shown in Table 8-1 for 25 stocks over the period from January 2, 2002 until May 25, 2011.

Table 8-1. Historical stock prices

DateADCAFL...UTR
2002-01-02 ...

The best content for your career. Discover unlimited learning on demand for around $1/day.