You are previewing Mining of Massive Datasets, Second Edition.
O'Reilly logo
Mining of Massive Datasets, Second Edition

Book Description

Written by leading authorities in database and Web technologies, this book is essential reading for students and practitioners alike. The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and can be applied successfully to even the largest datasets. It begins with a discussion of the map-reduce framework, an important tool for parallelizing algorithms automatically. The authors explain the tricks of locality-sensitive hashing and stream processing algorithms for mining data that arrives too fast for exhaustive processing. Other chapters cover the PageRank idea and related tricks for organizing the Web, the problems of finding frequent itemsets and clustering. This second edition includes new and extended coverage on social networks, machine learning and dimensionality reduction.

Table of Contents

  1. Coverpage
  2. Halftitle page
  3. Title page
  4. Copyright page
  5. Contents
  6. Preface
  7. 1 Data Mining
    1. 1.1 What is Data Mining?
    2. 1.2 Statistical Limits on Data Mining
    3. 1.3 Things Useful to Know
    4. 1.4 Outline of the Book
    5. 1.5 Summary of Chapter 1
    6. 1.6 References for Chapter 1
  8. 2 MapReduce and the New Software Stack
    1. 2.1 Distributed File Systems
    2. 2.2 MapReduce
    3. 2.3 Algorithms Using MapReduce
    4. 2.4 Extensions to MapReduce
    5. 2.5 The Communication Cost Model
    6. 2.6 Complexity Theory for MapReduce
    7. 2.7 Summary of Chapter 2
    8. 2.8 References for Chapter 2
  9. 3 Finding Similar Items
    1. 3.1 Applications of Near-Neighbor Search
    2. 3.2 Shingling of Documents
    3. 3.3 Similarity-Preserving Summaries of Sets
    4. 3.4 Locality-Sensitive Hashing for Documents
    5. 3.5 Distance Measures
    6. 3.6 The Theory of Locality-Sensitive Functions
    7. 3.7 LSH Families for Other Distance Measures
    8. 3.8 Applications of Locality-Sensitive Hashing
    9. 3.9 Methods for High Degrees of Similarity
    10. 3.10 Summary of Chapter 3
    11. 3.11 References for Chapter 3
  10. 4 Mining Data Streams
    1. 4.1 The Stream Data Model
    2. 4.2 Sampling Data in a Stream
    3. 4.3 Filtering Streams
    4. 4.4 Counting Distinct Elements in a Stream
    5. 4.5 Estimating Moments
    6. 4.6 Counting Ones in a Window
    7. 4.7 Decaying Windows
    8. 4.8 Summary of Chapter 4
    9. 4.9 References for Chapter 4
  11. 5 Link Analysis
    1. 5.1 PageRank
    2. 5.2 Efficient Computation of PageRank
    3. 5.3 Topic-Sensitive PageRank
    4. 5.4 Link Spam
    5. 5.5 Hubs and Authorities
    6. 5.6 Summary of Chapter 5
    7. 5.7 References for Chapter 5
  12. 6 Frequent Itemsets
    1. 6.1 The Market-Basket Model
    2. 6.2 Market Baskets and the A-Priori Algorithm
    3. 6.3 Handling Larger Datasets in Main Memory
    4. 6.4 Limited-Pass Algorithms
    5. 6.5 Counting Frequent Items in a Stream
    6. 6.6 Summary of Chapter 6
    7. 6.7 References for Chapter 6
  13. 7 Clustering
    1. 7.1 Introduction to Clustering Techniques
    2. 7.2 Hierarchical Clustering
    3. 7.3 K-means Algorithms
    4. 7.4 The CURE Algorithm
    5. 7.5 Clustering in Non-Euclidean Spaces
    6. 7.6 Clustering for Streams and Parallelism
    7. 7.7 Summary of Chapter 7
    8. 7.8 References for Chapter 7
  14. 8 Advertising on the Web
    1. 8.1 Issues in On-Line Advertising
    2. 8.2 On-Line Algorithms
    3. 8.3 The Matching Problem
    4. 8.4 The Adwords Problem
    5. 8.5 Adwords Implementation
    6. 8.6 Summary of Chapter 8
    7. 8.7 References for Chapter 8
  15. 9 Recommendation Systems
    1. 9.1 A Model for Recommendation Systems
    2. 9.2 Content-Based Recommendations
    3. 9.3 Collaborative Filtering
    4. 9.4 Dimensionality Reduction
    5. 9.5 The NetFlix Challenge
    6. 9.6 Summary of Chapter 9
    7. 9.7 References for Chapter 9
  16. 10 Mining Social-Network Graphs
    1. 10.1 Social Networks as Graphs
    2. 10.2 Clustering of Social-Network Graphs
    3. 10.3 Direct Discovery of Communities
    4. 10.4 Partitioning of Graphs
    5. 10.5 Finding Overlapping Communities
    6. 10.6 Simrank
    7. 10.7 Counting Triangles
    8. 10.8 Neighborhood Properties of Graphs
    9. 10.9 Summary of Chapter 10
    10. 10.10 References for Chapter 10
  17. 11 Dimensionality Reduction
    1. 11.1 Eigenvalues and Eigenvectors
    2. 11.2 Principal-Component Analysis
    3. 11.3 Singular-Value Decomposition
    4. 11.4 CUR Decomposition
    5. 11.5 Summary of Chapter 11
    6. 11.6 References for Chapter 11
  18. 12 Large-Scale Machine Learning
    1. 12.1 The Machine-Learning Model
    2. 12.2 Perceptrons
    3. 12.3 Support-Vector Machines
    4. 12.4 Learning from Nearest Neighbors
    5. 12.5 Comparison of Learning Methods
    6. 12.6 Summary of Chapter 12
    7. 12.7 References for Chapter 12
  19. Index