You are previewing Mining of Massive Datasets.
O'Reilly logo
Mining of Massive Datasets

Book Description

The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. The PageRank idea and related tricks for organizing the Web are covered next. Other chapters cover the problems of finding frequent itemsets and clustering. The final chapters cover two applications: recommendation systems and Web advertising, each vital in e-commerce. Written by two authorities in database and Web technologies, this book is essential reading for students and practitioners alike.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Contents
  5. Preface
  6. 1 Data Mining
    1. 1.1 What is Data Mining?
    2. 1.2 Statistical Limits on Data Mining
    3. 1.3 Things Useful to Know
    4. 1.4 Outline of the Book
    5. 1.5 Summary of Chapter 1
    6. 1.6 References for Chapter 1
  7. 2 Large-Scale File Systems and Map-Reduce
    1. 2.1 Distributed File Systems
    2. 2.2 Map-Reduce
    3. 2.3 Algorithms Using Map-Reduce
    4. 2.4 Extensions to Map-Reduce
    5. 2.5 Efficiency of Cluster-Computing Algorithms
    6. 2.6 Summary of Chapter 2
    7. 2.7 References for Chapter 2
  8. 3 Finding Similar Items
    1. 3.1 Applications of Near-Neighbor Search
    2. 3.2 Shingling of Documents
    3. 3.3 Similarity-Preserving Summaries of Sets
    4. 3.4 Locality-Sensitive Hashing for Documents
    5. 3.5 Distance Measures
    6. 3.6 The Theory of Locality-Sensitive Functions
    7. 3.7 LSH Families for Other Distance Measures
    8. 3.8 Applications of Locality-Sensitive Hashing
    9. 3.9 Methods for High Degrees of Similarity
    10. 3.10 Summary of Chapter 3
    11. 3.11 References for Chapter 3
  9. 4 Mining Data Streams
    1. 4.1 The Stream Data Model
    2. 4.2 Sampling Data in a Stream
    3. 4.3 Filtering Streams
    4. 4.4 Counting Distinct Elements in a Stream
    5. 4.5 Estimating Moments
    6. 4.6 Counting Ones in a Window
    7. 4.7 Decaying Windows
    8. 4.8 Summary of Chapter 4
    9. 4.9 References for Chapter 4
  10. 5 Link Analysis
    1. 5.1 PageRank
    2. 5.2 Efficient Computation of PageRank
    3. 5.3 Topic-Sensitive PageRank
    4. 5.4 Link Spam
    5. 5.5 Hubs and Authorities
    6. 5.6 Summary of Chapter 5
    7. 5.7 References for Chapter 5
  11. 6 Frequent Itemsets
    1. 6.1 The Market-Basket Model
    2. 6.2 Market Baskets and the A-Priori Algorithm
    3. 6.3 Handling Larger Datasets in Main Memory
    4. 6.4 Limited-Pass Algorithms
    5. 6.5 Counting Frequent Items in a Stream
    6. 6.6 Summary of Chapter 6
    7. 6.7 References for Chapter 6
  12. 7 Clustering
    1. 7.1 Introduction to Clustering Techniques
    2. 7.2 Hierarchical Clustering
    3. 7.3 K-means Algorithms
    4. 7.4 The CURE Algorithm
    5. 7.5 Clustering in Non-Euclidean Spaces
    6. 7.6 Clustering for Streams and Parallelism
    7. 7.7 Summary of Chapter 7
    8. 7.8 References for Chapter 7
  13. 8 Advertising on the Web
    1. 8.1 Issues in On-Line Advertising
    2. 8.2 On-Line Algorithms
    3. 8.3 The Matching Problem
    4. 8.4 The Adwords Problem
    5. 8.5 Adwords Implementation
    6. 8.6 Summary of Chapter 8
    7. 8.7 References for Chapter 8
  14. 9 Recommendation Systems
    1. 9.1 A Model for Recommendation Systems
    2. 9.2 Content-Based Recommendations
    3. 9.3 Collaborative Filtering
    4. 9.4 Dimensionality Reduction
    5. 9.5 The NetFlix Challenge
    6. 9.6 Summary of Chapter 9
    7. 9.7 References for Chapter 9
  15. Index