Data Mining and Analysis

Book Description

The fundamental algorithms in data mining and analysis form the basis for the emerging field of data science, which includes automated methods to analyze patterns and models for all kinds of data, with applications ranging from scientific discovery to business intelligence and analytics. This textbook for senior undergraduate and graduate data mining courses provides a broad yet in-depth overview of data mining, integrating related concepts from machine learning and statistics. The main parts of the book include exploratory data analysis, pattern mining, clustering, and classification. The book lays the basic foundations of these tasks, and also covers cutting-edge topics such as kernel methods, high-dimensional data analysis, and complex graphs and networks. With its comprehensive coverage, algorithmic perspective, and wealth of examples, this book offers solid guidance in data mining for students, researchers, and practitioners alike.

Key features:

• Covers both core methods and cutting-edge research
• Algorithmic approach with open-source implementations
• Minimal prerequisites: all key mathematical concepts are presented, as is the intuition behind the formulas
• Short, self-contained chapters with class-tested examples and exercises allow for flexibility in designing a course and for easy reference
• Supplementary website with lecture slides, videos, project ideas, and more

Table of Contents

  1. Cover
  2. Half-title page
  3. Title
  4. Copyright
  5. Contents
  6. Preface
  7. 1 Data Mining and Analysis
    1. 1.1 Data Matrix
    2. 1.2 Attributes
    3. 1.3 Data: Algebraic and Geometric View
    4. 1.4 Data: Probabilistic View
    5. 1.5 Data Mining
    6. 1.6 Further Reading
    7. 1.7 Exercises
  8. PART ONE: DATA ANALYSIS FOUNDATIONS
    1. 2 Numeric Attributes
      1. 2.1 Univariate Analysis
      2. 2.2 Bivariate Analysis
      3. 2.3 Multivariate Analysis
      4. 2.4 Data Normalization
      5. 2.5 Normal Distribution
      6. 2.6 Further Reading
      7. 2.7 Exercises
    2. 3 Categorical Attributes
      1. 3.1 Univariate Analysis
      2. 3.2 Bivariate Analysis
      3. 3.3 Multivariate Analysis
      4. 3.4 Distance and Angle
      5. 3.5 Discretization
      6. 3.6 Further Reading
      7. 3.7 Exercises
    3. 4 Graph Data
      1. 4.1 Graph Concepts
      2. 4.2 Topological Attributes
      3. 4.3 Centrality Analysis
      4. 4.4 Graph Models
      5. 4.5 Further Reading
      6. 4.6 Exercises
    4. 5 Kernel Methods
      1. 5.1 Kernel Matrix
      2. 5.2 Vector Kernels
      3. 5.3 Basic Kernel Operations in Feature Space
      4. 5.4 Kernels for Complex Objects
      5. 5.5 Further Reading
      6. 5.6 Exercises
    5. 6 High-dimensional Data
      1. 6.1 High-dimensional Objects
      2. 6.2 High-dimensional Volumes
      3. 6.3 Hypersphere Inscribed within Hypercube
      4. 6.4 Volume of Thin Hypersphere Shell
      5. 6.5 Diagonals in Hyperspace
      6. 6.6 Density of the Multivariate Normal
      7. 6.7 Appendix: Derivation of Hypersphere Volume
      8. 6.8 Further Reading
      9. 6.9 Exercises
    6. 7 Dimensionality Reduction
      1. 7.1 Background
      2. 7.2 Principal Component Analysis
      3. 7.3 Kernel Principal Component Analysis
      4. 7.4 Singular Value Decomposition
      5. 7.5 Further Reading
      6. 7.6 Exercises
  9. PART TWO: FREQUENT PATTERN MINING
    1. 8 Itemset Mining
      1. 8.1 Frequent Itemsets and Association Rules
      2. 8.2 Itemset Mining Algorithms
      3. 8.3 Generating Association Rules
      4. 8.4 Further Reading
      5. 8.5 Exercises
    2. 9 Summarizing Itemsets
      1. 9.1 Maximal and Closed Frequent Itemsets
      2. 9.2 Mining Maximal Frequent Itemsets: GenMax Algorithm
      3. 9.3 Mining Closed Frequent Itemsets: Charm Algorithm
      4. 9.4 Nonderivable Itemsets
      5. 9.5 Further Reading
      6. 9.6 Exercises
    3. 10 Sequence Mining
      1. 10.1 Frequent Sequences
      2. 10.2 Mining Frequent Sequences
      3. 10.3 Substring Mining via Suffix Trees
      4. 10.4 Further Reading
      5. 10.5 Exercises
    4. 11 Graph Pattern Mining
      1. 11.1 Isomorphism and Support
      2. 11.2 Candidate Generation
      3. 11.3 The gSpan Algorithm
      4. 11.4 Further Reading
      5. 11.5 Exercises
    5. 12 Pattern and Rule Assessment
      1. 12.1 Rule and Pattern Assessment Measures
      2. 12.2 Significance Testing and Confidence Intervals
      3. 12.3 Further Reading
      4. 12.4 Exercises
  10. PART THREE: CLUSTERING
    1. 13 Representative-based Clustering
      1. 13.1 K-means Algorithm
      2. 13.2 Kernel K-means
      3. 13.3 Expectation-Maximization Clustering
      4. 13.4 Further Reading
      5. 13.5 Exercises
    2. 14 Hierarchical Clustering
      1. 14.1 Preliminaries
      2. 14.2 Agglomerative Hierarchical Clustering
      3. 14.3 Further Reading
      4. 14.4 Exercises and Projects
    3. 15 Density-based Clustering
      1. 15.1 The DBSCAN Algorithm
      2. 15.2 Kernel Density Estimation
      3. 15.3 Density-based Clustering: DENCLUE
      4. 15.4 Further Reading
      5. 15.5 Exercises
    4. 16 Spectral and Graph Clustering
      1. 16.1 Graphs and Matrices
      2. 16.2 Clustering as Graph Cuts
      3. 16.3 Markov Clustering
      4. 16.4 Further Reading
      5. 16.5 Exercises
    5. 17 Clustering Validation
      1. 17.1 External Measures
      2. 17.2 Internal Measures
      3. 17.3 Relative Measures
      4. 17.4 Further Reading
      5. 17.5 Exercises
  11. PART FOUR: CLASSIFICATION
    1. 18 Probabilistic Classification
      1. 18.1 Bayes Classifier
      2. 18.2 Naive Bayes Classifier
      3. 18.3 K Nearest Neighbors Classifier
      4. 18.4 Further Reading
      5. 18.5 Exercises
    2. 19 Decision Tree Classifier
      1. 19.1 Decision Trees
      2. 19.2 Decision Tree Algorithm
      3. 19.3 Further Reading
      4. 19.4 Exercises
    3. 20 Linear Discriminant Analysis
      1. 20.1 Optimal Linear Discriminant
      2. 20.2 Kernel Discriminant Analysis
      3. 20.3 Further Reading
      4. 20.4 Exercises
    4. 21 Support Vector Machines
      1. 21.1 Support Vectors and Margins
      2. 21.2 SVM: Linear and Separable Case
      3. 21.3 Soft Margin SVM: Linear and Nonseparable Case
      4. 21.4 Kernel SVM: Nonlinear Case
      5. 21.5 SVM Training Algorithms
      6. 21.6 Further Reading
      7. 21.7 Exercises
    5. 22 Classification Assessment
      1. 22.1 Classification Performance Measures
      2. 22.2 Classifier Evaluation
      3. 22.3 Bias-Variance Decomposition
      4. 22.4 Further Reading
      5. 22.5 Exercises
  12. Index