
Programming Collective Intelligence

By Toby Segaran. Published by O'Reilly Media, Inc.
  1. Programming Collective Intelligence
    1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
    2. A Note Regarding Supplemental Files
    3. Praise for Programming Collective Intelligence
    4. Preface
      1. Prerequisites
      2. Style of Examples
      3. Why Python?
      4. Open APIs
      5. Overview of the Chapters
      6. Conventions
      7. Using Code Examples
      8. How to Contact Us
      9. Safari® Books Online
      10. Acknowledgments
    5. 1. Introduction to Collective Intelligence
      1. What Is Collective Intelligence?
      2. What Is Machine Learning?
      3. Limits of Machine Learning
      4. Real-Life Examples
      5. Other Uses for Learning Algorithms
    6. 2. Making Recommendations
      1. Collaborative Filtering
      2. Collecting Preferences
      3. Finding Similar Users
      4. Recommending Items
      5. Matching Products
      6. Building a del.icio.us Link Recommender
      7. Item-Based Filtering
      8. Using the MovieLens Dataset
      9. User-Based or Item-Based Filtering?
      10. Exercises
    7. 3. Discovering Groups
      1. Supervised versus Unsupervised Learning
      2. Word Vectors
      3. Hierarchical Clustering
      4. Drawing the Dendrogram
      5. Column Clustering
      6. K-Means Clustering
      7. Clusters of Preferences
      8. Viewing Data in Two Dimensions
      9. Other Things to Cluster
      10. Exercises
    8. 4. Searching and Ranking
      1. What's in a Search Engine?
      2. A Simple Crawler
      3. Building the Index
      4. Querying
      5. Content-Based Ranking
      6. Using Inbound Links
      7. Learning from Clicks
      8. Exercises
    9. 5. Optimization
      1. Group Travel
      2. Representing Solutions
      3. The Cost Function
      4. Random Searching
      5. Hill Climbing
      6. Simulated Annealing
      7. Genetic Algorithms
      8. Real Flight Searches
      9. Optimizing for Preferences
      10. Network Visualization
      11. Other Possibilities
      12. Exercises
    10. 6. Document Filtering
      1. Filtering Spam
      2. Documents and Words
      3. Training the Classifier
      4. Calculating Probabilities
      5. A Naïve Classifier
      6. The Fisher Method
      7. Persisting the Trained Classifiers
      8. Filtering Blog Feeds
      9. Improving Feature Detection
      10. Using Akismet
      11. Alternative Methods
      12. Exercises
    11. 7. Modeling with Decision Trees
      1. Predicting Signups
      2. Introducing Decision Trees
      3. Training the Tree
      4. Choosing the Best Split
      5. Recursive Tree Building
      6. Displaying the Tree
      7. Classifying New Observations
      8. Pruning the Tree
      9. Dealing with Missing Data
      10. Dealing with Numerical Outcomes
      11. Modeling Home Prices
      12. Modeling "Hotness"
      13. When to Use Decision Trees
      14. Exercises
    12. 8. Building Price Models
      1. Building a Sample Dataset
      2. k-Nearest Neighbors
      3. Weighted Neighbors
      4. Cross-Validation
      5. Heterogeneous Variables
      6. Optimizing the Scale
      7. Uneven Distributions
      8. Using Real Data—the eBay API
      9. When to Use k-Nearest Neighbors
      10. Exercises
    13. 9. Advanced Classification: Kernel Methods and SVMs
      1. Matchmaker Dataset
      2. Difficulties with the Data
      3. Basic Linear Classification
      4. Categorical Features
      5. Scaling the Data
      6. Understanding Kernel Methods
      7. Support-Vector Machines
      8. Using LIBSVM
      9. Matching on Facebook
      10. Exercises
    14. 10. Finding Independent Features
      1. A Corpus of News
      2. Previous Approaches
      3. Non-Negative Matrix Factorization
      4. Displaying the Results
      5. Using Stock Market Data
      6. Exercises
    15. 11. Evolving Intelligence
      1. What Is Genetic Programming?
      2. Programs As Trees
      3. Creating the Initial Population
      4. Testing a Solution
      5. Mutating Programs
      6. Crossover
      7. Building the Environment
      8. A Simple Game
      9. Further Possibilities
      10. Exercises
    16. 12. Algorithm Summary
      1. Bayesian Classifier
      2. Decision Tree Classifier
      3. Neural Networks
      4. Support-Vector Machines
      5. k-Nearest Neighbors
      6. Clustering
      7. Multidimensional Scaling
      8. Non-Negative Matrix Factorization
      9. Optimization
    17. A. Third-Party Libraries
      1. Universal Feed Parser
      2. Python Imaging Library
      3. Beautiful Soup
      4. pysqlite
      5. NumPy
      6. matplotlib
      7. pydelicious
    18. B. Mathematical Formulas
      1. Euclidean Distance
      2. Pearson Correlation Coefficient
      3. Weighted Mean
      4. Tanimoto Coefficient
      5. Conditional Probability
      6. Gini Impurity
      7. Entropy
      8. Variance
      9. Gaussian Function
      10. Dot-Products
    19. Index
    20. About the Author
    21. Colophon

Uneven Distributions

So far we've been assuming that if you take an average or weighted average of the data, you'll get a pretty good estimate of the final price. In many cases this will be accurate, but in some situations there may be an unmeasured variable that can have a big effect on the outcome. Imagine that in the wine example there were buyers from two separate groups: people who bought from the liquor store, and people who bought from a discount store and received a 40 percent discount. Unfortunately, this information isn't tracked in the dataset.

The wineset3 function creates a dataset that simulates these properties. It leaves out the complicating variables added earlier and just focuses on the original ones. Add this function to numpredict.py:

def wineset3():
  rows=wineset1()
  for row in rows:
    if random()<0.5:
      # Half the wines were bought at a discount store
      # for 60 percent of the list price (a 40 percent discount)
      row['result']*=0.6
  return rows
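
To convince yourself that the dataset now contains two price populations, you can compare each simulated result against the true price from wineprice. This is a minimal check, not from the book's text; it assumes the wineprice and wineset1 functions defined earlier in numpredict.py, and the exact numbers will vary because the data is random:

data=wineset3()
for row in data[0:5]:
  rating,age=row['input']
  # Rows that hit the discount branch will have a result near 60 percent
  # of the true price (plus the random noise that wineset1 adds)
  print row['input'],row['result'],wineprice(rating,age)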

Consider what will happen if you ask for a price estimate on a new item using the kNN or weighted kNN algorithms. Since the dataset doesn't actually contain any information about whether the buyer went to the liquor store or the discount store, the algorithm can't take this into account, so it brings in the nearest neighbors regardless of where the purchase was made. The result is an average of items from both groups: because half the buyers paid full price and half paid 60 percent of it, the estimate will land around 80 percent of the true price, which looks like a 20 percent discount. You can verify this by trying it in your Python session:

>>> reload(numpredict)
<module 'numpredict' ...
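
A continuation along these lines shows the effect. This is a sketch rather than the book's exact session; wineprice, knnestimate, and weightedknn were built earlier in the chapter, the input values here are just illustrative, and outputs are omitted because the dataset is randomly generated:

>>> data=numpredict.wineset3()
>>> numpredict.wineprice(99.0,20.0)   # the true, undiscounted price
>>> numpredict.weightedknn(data,(99.0,20.0))
>>> numpredict.knnestimate(data,(99.0,20.0))

Because half the wines were discounted, both estimates should come out near 80 percent of the true price returned by wineprice, a price that is wrong for both groups of buyers.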
