Analysis of Multivariate and High-Dimensional Data

Book Description

'Big data' poses challenges that require both classical multivariate methods and contemporary techniques from machine learning and engineering. This modern text equips you for the new world, integrating the old and the new, fusing theory and practice, and bridging the gap to statistical learning. The theoretical framework includes formal statements that set out clearly the guaranteed 'safe operating zone' for the methods and allow you to assess whether the data are in the zone, or near enough. Extensive examples showcase the strengths and limitations of different methods with small classical data, data from medicine, biology, marketing and finance, high-dimensional data from bioinformatics, functional data from proteomics, and simulated data. High-dimension, low-sample-size (HDLSS) data receives special attention. Several data sets are revisited repeatedly to allow comparison of methods. Generous use of colour, algorithms, MATLAB code and problem sets completes the package. Suitable for master's/graduate students in statistics and for researchers in data-rich disciplines.

Table of Contents

  1. Cover
  2. Title Page
  3. Copyright
  4. Dedication
  5. Contents
  6. List of Algorithms
  7. Notation
  8. Preface
  9. I CLASSICAL METHODS
    1. 1 Multidimensional Data
      1. 1.1 Multivariate and High-Dimensional Problems
      2. 1.2 Visualisation
        1. 1.2.1 Three-Dimensional Visualisation
        2. 1.2.2 Parallel Coordinate Plots
      3. 1.3 Multivariate Random Vectors and Data
        1. 1.3.1 The Population Case
        2. 1.3.2 The Random Sample Case
      4. 1.4 Gaussian Random Vectors
        1. 1.4.1 The Multivariate Normal Distribution and the Maximum Likelihood Estimator
        2. 1.4.2 Marginal and Conditional Normal Distributions
      5. 1.5 Similarity, Spectral and Singular Value Decomposition
        1. 1.5.1 Similar Matrices
        2. 1.5.2 Spectral Decomposition for the Population Case
        3. 1.5.3 Decompositions for the Sample Case
    2. 2 Principal Component Analysis
      1. 2.1 Introduction
      2. 2.2 Population Principal Components
      3. 2.3 Sample Principal Components
      4. 2.4 Visualising Principal Components
        1. 2.4.1 Scree, Eigenvalue and Variance Plots
        2. 2.4.2 Two- and Three-Dimensional PC Score Plots
        3. 2.4.3 Projection Plots and Estimates of the Density of the Scores
      5. 2.5 Properties of Principal Components
        1. 2.5.1 Correlation Structure of X and Its PCs
        2. 2.5.2 Optimality Properties of PCs
      6. 2.6 Standardised Data and High-Dimensional Data
        1. 2.6.1 Scaled and Sphered Data
        2. 2.6.2 High-Dimensional Data
      7. 2.7 Asymptotic Results
        1. 2.7.1 Classical Theory: Fixed Dimension d
        2. 2.7.2 Asymptotic Results when d Grows
      8. 2.8 Principal Component Analysis, the Number of Components and Regression
        1. 2.8.1 Number of Principal Components Based on the Likelihood
        2. 2.8.2 Principal Component Regression
    3. 3 Canonical Correlation Analysis
      1. 3.1 Introduction
      2. 3.2 Population Canonical Correlations
      3. 3.3 Sample Canonical Correlations
      4. 3.4 Properties of Canonical Correlations
      5. 3.5 Canonical Correlations and Transformed Data
        1. 3.5.1 Linear Transformations and Canonical Correlations
        2. 3.5.2 Transforms with Non-Singular Matrices
        3. 3.5.3 Canonical Correlations for Scaled Data
        4. 3.5.4 Maximum Covariance Analysis
      6. 3.6 Asymptotic Considerations and Tests for Correlation
      7. 3.7 Canonical Correlations and Regression
        1. 3.7.1 The Canonical Correlation Matrix in Regression
        2. 3.7.2 Canonical Correlation Regression
        3. 3.7.3 Partial Least Squares
        4. 3.7.4 The Generalised Eigenvalue Problem
    4. 4 Discriminant Analysis
      1. 4.1 Introduction
      2. 4.2 Classes, Labels, Rules and Decision Functions
      3. 4.3 Linear Discriminant Rules
        1. 4.3.1 Fisher’s Discriminant Rule for the Population
        2. 4.3.2 Fisher’s Discriminant Rule for the Sample
        3. 4.3.3 Linear Discrimination for Two Normal Populations or Classes
      4. 4.4 Evaluation of Rules and Probability of Misclassification
        1. 4.4.1 Boundaries and Discriminant Regions
        2. 4.4.2 Evaluation of Discriminant Rules
      5. 4.5 Discrimination under Gaussian Assumptions
        1. 4.5.1 Two and More Normal Classes
        2. 4.5.2 Gaussian Quadratic Discriminant Analysis
      6. 4.6 Bayesian Discrimination
        1. 4.6.1 Bayes Discriminant Rule
        2. 4.6.2 Loss and Bayes Risk
      7. 4.7 Non-Linear, Non-Parametric and Regularised Rules
        1. 4.7.1 Nearest-Neighbour Discrimination
        2. 4.7.2 Logistic Regression and Discrimination
        3. 4.7.3 Regularised Discriminant Rules
        4. 4.7.4 Support Vector Machines
      8. 4.8 Principal Component Analysis, Discrimination and Regression
        1. 4.8.1 Discriminant Analysis and Linear Regression
        2. 4.8.2 Principal Component Discriminant Analysis
        3. 4.8.3 Variable Ranking for Discriminant Analysis
    5. Problems for Part I
  10. II FACTORS AND GROUPINGS
    1. 5 Norms, Proximities, Features and Dualities
      1. 5.1 Introduction
      2. 5.2 Vector and Matrix Norms
      3. 5.3 Measures of Proximity
        1. 5.3.1 Distances
        2. 5.3.2 Dissimilarities
        3. 5.3.3 Similarities
      4. 5.4 Features and Feature Maps
      5. 5.5 Dualities for X and X^T
    2. 6 Cluster Analysis
      1. 6.1 Introduction
      2. 6.2 Hierarchical Agglomerative Clustering
      3. 6.3 k-Means Clustering
      4. 6.4 Second-Order Polynomial Histogram Estimators
      5. 6.5 Principal Components and Cluster Analysis
        1. 6.5.1 k-Means Clustering for Principal Component Data
        2. 6.5.2 Binary Clustering of Principal Component Scores and Variables
        3. 6.5.3 Clustering High-Dimensional Binary Data
      6. 6.6 Number of Clusters
        1. 6.6.1 Quotients of Variability Measures
        2. 6.6.2 The Gap Statistic
        3. 6.6.3 The Prediction Strength Approach
        4. 6.6.4 Comparison of k-Statistics
    3. 7 Factor Analysis
      1. 7.1 Introduction
      2. 7.2 Population k-Factor Model
      3. 7.3 Sample k-Factor Model
      4. 7.4 Factor Loadings
        1. 7.4.1 Principal Components and Factor Analysis
        2. 7.4.2 Maximum Likelihood and Gaussian Factors
      5. 7.5 Asymptotic Results and the Number of Factors
      6. 7.6 Factor Scores and Regression
        1. 7.6.1 Principal Component Factor Scores
        2. 7.6.2 Bartlett and Thompson Factor Scores
        3. 7.6.3 Canonical Correlations and Factor Scores
        4. 7.6.4 Regression-Based Factor Scores
        5. 7.6.5 Factor Scores in Practice
      7. 7.7 Principal Components, Factor Analysis and Beyond
    4. 8 Multidimensional Scaling
      1. 8.1 Introduction
      2. 8.2 Classical Scaling
        1. 8.2.1 Classical Scaling and Principal Coordinates
        2. 8.2.2 Classical Scaling with Strain
      3. 8.3 Metric Scaling
        1. 8.3.1 Metric Dissimilarities and Metric Stresses
        2. 8.3.2 Metric Strain
      4. 8.4 Non-Metric Scaling
        1. 8.4.1 Non-Metric Stress and the Shepard Diagram
        2. 8.4.2 Non-Metric Strain
      5. 8.5 Data and Their Configurations
        1. 8.5.1 HDLSS Data and the X and X^T Duality
        2. 8.5.2 Procrustes Rotations
        3. 8.5.3 Individual Differences Scaling
      6. 8.6 Scaling for Grouped and Count Data
        1. 8.6.1 Correspondence Analysis
        2. 8.6.2 Analysis of Distance
        3. 8.6.3 Low-Dimensional Embeddings
    5. Problems for Part II
  11. III NON-GAUSSIAN ANALYSIS
    1. 9 Towards Non-Gaussianity
      1. 9.1 Introduction
      2. 9.2 Gaussianity and Independence
      3. 9.3 Skewness, Kurtosis and Cumulants
      4. 9.4 Entropy and Mutual Information
      5. 9.5 Training, Testing and Cross-Validation
        1. 9.5.1 Rules and Prediction
        2. 9.5.2 Evaluating Rules with the Cross-Validation Error
    2. 10 Independent Component Analysis
      1. 10.1 Introduction
      2. 10.2 Sources and Signals
        1. 10.2.1 Population Independent Components
        2. 10.2.2 Sample Independent Components
      3. 10.3 Identification of the Sources
      4. 10.4 Mutual Information and Gaussianity
        1. 10.4.1 Independence, Uncorrelatedness and Non-Gaussianity
        2. 10.4.2 Approximations to the Mutual Information
      5. 10.5 Estimation of the Mixing Matrix
        1. 10.5.1 An Estimating Function Approach
        2. 10.5.2 Properties of Estimating Functions
      6. 10.6 Non-Gaussianity and Independence in Practice
        1. 10.6.1 Independent Component Scores and Solutions
        2. 10.6.2 Independent Component Solutions for Real Data
        3. 10.6.3 Performance of J for Simulated Data
      7. 10.7 Low-Dimensional Projections of High-Dimensional Data
        1. 10.7.1 Dimension Reduction and Independent Component Scores
        2. 10.7.2 Properties of Low-Dimensional Projections
      8. 10.8 Dimension Selection with Independent Components
    3. 11 Projection Pursuit
      1. 11.1 Introduction
      2. 11.2 One-Dimensional Projections and Their Indices
        1. 11.2.1 Population Projection Pursuit
        2. 11.2.2 Sample Projection Pursuit
      3. 11.3 Projection Pursuit with Two- and Three-Dimensional Projections
        1. 11.3.1 Two-Dimensional Indices: Q_E, Q_C and Q_U
        2. 11.3.2 Bivariate Extension by Removal of Structure
        3. 11.3.3 A Three-Dimensional Cumulant Index
      4. 11.4 Projection Pursuit in Practice
        1. 11.4.1 Comparison of Projection Pursuit and Independent Component Analysis
        2. 11.4.2 From a Cumulant-Based Index to FastICA Scores
        3. 11.4.3 The Removal of Structure and FastICA
        4. 11.4.4 Projection Pursuit: A Continuing Pursuit
      5. 11.5 Theoretical Developments
        1. 11.5.1 Theory Relating to Q_R
        2. 11.5.2 Theory Relating to Q_U and Q_D
      6. 11.6 Projection Pursuit Density Estimation and Regression
        1. 11.6.1 Projection Pursuit Density Estimation
        2. 11.6.2 Projection Pursuit Regression
    4. 12 Kernel and More Independent Component Methods
      1. 12.1 Introduction
      2. 12.2 Kernel Component Analysis
        1. 12.2.1 Feature Spaces and Kernels
        2. 12.2.2 Kernel Principal Component Analysis
        3. 12.2.3 Kernel Canonical Correlation Analysis
      3. 12.3 Kernel Independent Component Analysis
        1. 12.3.1 The F-Correlation and Independence
        2. 12.3.2 Estimating the F-Correlation
        3. 12.3.3 Comparison of Non-Gaussian and Kernel Independent Components Approaches
      4. 12.4 Independent Components from Scatter Matrices (aka Invariant Coordinate Selection)
        1. 12.4.1 Scatter Matrices
        2. 12.4.2 Population Independent Components from Scatter Matrices
        3. 12.4.3 Sample Independent Components from Scatter Matrices
      5. 12.5 Non-Parametric Estimation of Independence Criteria
        1. 12.5.1 A Characteristic Function View of Independence
        2. 12.5.2 An Entropy Estimator Based on Order Statistics
        3. 12.5.3 Kernel Density Estimation of the Unmixing Matrix
    5. 13 Feature Selection and Principal Component Analysis Revisited
      1. 13.1 Introduction
      2. 13.2 Independent Components and Feature Selection
        1. 13.2.1 Feature Selection in Supervised Learning
        2. 13.2.2 Best Features and Unsupervised Decisions
        3. 13.2.3 Test of Gaussianity
      3. 13.3 Variable Ranking and Statistical Learning
        1. 13.3.1 Variable Ranking with the Canonical Correlation Matrix C
        2. 13.3.2 Prediction with a Selected Number of Principal Components
        3. 13.3.3 Variable Ranking for Discriminant Analysis Based on C
        4. 13.3.4 Properties of the Ranking Vectors of the Naive C when d Grows
      4. 13.4 Sparse Principal Component Analysis
        1. 13.4.1 The Lasso, SCoTLASS Directions and Sparse Principal Components
        2. 13.4.2 Elastic Nets and Sparse Principal Components
        3. 13.4.3 Rank One Approximations and Sparse Principal Components
      5. 13.5 (In)Consistency of Principal Components as the Dimension Grows
        1. 13.5.1 (In)Consistency for Single-Component Models
        2. 13.5.2 Behaviour of the Sample Eigenvalues, Eigenvectors and Principal Component Scores
        3. 13.5.3 Towards a General Asymptotic Framework for Principal Component Analysis
    6. Problems for Part III
  12. Bibliography
  13. Author Index
  14. Subject Index
  15. Data Index