Bayesian Reasoning and Machine Learning

Book Description

Machine learning methods extract value from vast data sets quickly and with modest resources. They are established tools in a wide range of industrial applications, including search engines, DNA sequencing, stock market analysis, and robot locomotion, and their use is spreading rapidly. People who know the methods have their choice of rewarding jobs. This hands-on text opens these opportunities to computer science students with modest mathematical backgrounds. It is designed for final-year undergraduates and master's students with limited background in linear algebra and calculus. Comprehensive and coherent, it develops everything from basic reasoning to advanced techniques within the framework of graphical models. Students learn more than a menu of techniques; they develop analytical and problem-solving skills that equip them for the real world. Numerous examples and exercises, both computer-based and theoretical, are included in every chapter. Resources for students and instructors, including a MATLAB toolbox, are available online.

Table of Contents

  1. Cover
  2. Half Title
  3. Title Page
  4. Copyright
  5. Contents
  6. Preface
  7. List of notation
  8. BRMLtoolbox
  9. I: Inference in probabilistic models
    1. 1. Probabilistic reasoning
      1. 1.1 Probability refresher
        1. 1.1.1 Interpreting conditional probability
        2. 1.1.2 Probability tables
      2. 1.2 Probabilistic reasoning
      3. 1.3 Prior, likelihood and posterior
        1. 1.3.1 Two dice: what were the individual scores?
      4. 1.4 Summary
      5. 1.5 Code
      6. 1.6 Exercises
    2. 2. Basic graph concepts
      1. 2.1 Graphs
      2. 2.2 Numerically encoding graphs
        1. 2.2.1 Edge list
        2. 2.2.2 Adjacency matrix
        3. 2.2.3 Clique matrix
      3. 2.3 Summary
      4. 2.4 Code
      5. 2.5 Exercises
    3. 3. Belief networks
      1. 3.1 The benefits of structure
        1. 3.1.1 Modelling independencies
        2. 3.1.2 Reducing the burden of specification
      2. 3.2 Uncertain and unreliable evidence
        1. 3.2.1 Uncertain evidence
        2. 3.2.2 Unreliable evidence
      3. 3.3 Belief networks
        1. 3.3.1 Conditional independence
        2. 3.3.2 The impact of collisions
        3. 3.3.3 Graphical path manipulations for independence
        4. 3.3.4 d-separation
        5. 3.3.5 Graphical and distributional in/dependence
        6. 3.3.6 Markov equivalence in belief networks
        7. 3.3.7 Belief networks have limited expressibility
      4. 3.4 Causality
        1. 3.4.1 Simpson’s paradox
        2. 3.4.2 The do-calculus
        3. 3.4.3 Influence diagrams and the do-calculus
      5. 3.5 Summary
      6. 3.6 Code
      7. 3.7 Exercises
    4. 4. Graphical models
      1. 4.1 Graphical models
      2. 4.2 Markov networks
        1. 4.2.1 Markov properties
        2. 4.2.2 Markov random fields
        3. 4.2.3 Hammersley–Clifford theorem
        4. 4.2.4 Conditional independence using Markov networks
        5. 4.2.5 Lattice models
      3. 4.3 Chain graphical models
      4. 4.4 Factor graphs
        1. 4.4.1 Conditional independence in factor graphs
      5. 4.5 Expressiveness of graphical models
      6. 4.6 Summary
      7. 4.7 Code
      8. 4.8 Exercises
    5. 5. Efficient inference in trees
      1. 5.1 Marginal inference
        1. 5.1.1 Variable elimination in a Markov chain and message passing
        2. 5.1.2 The sum-product algorithm on factor graphs
        3. 5.1.3 Dealing with evidence
        4. 5.1.4 Computing the marginal likelihood
        5. 5.1.5 The problem with loops
      2. 5.2 Other forms of inference
        1. 5.2.1 Max-product
        2. 5.2.2 Finding the N most probable states
        3. 5.2.3 Most probable path and shortest path
        4. 5.2.4 Mixed inference
      3. 5.3 Inference in multiply connected graphs
        1. 5.3.1 Bucket elimination
        2. 5.3.2 Loop-cut conditioning
      4. 5.4 Message passing for continuous distributions
      5. 5.5 Summary
      6. 5.6 Code
      7. 5.7 Exercises
    6. 6. The junction tree algorithm
      1. 6.1 Clustering variables
        1. 6.1.1 Reparameterisation
      2. 6.2 Clique graphs
        1. 6.2.1 Absorption
        2. 6.2.2 Absorption schedule on clique trees
      3. 6.3 Junction trees
        1. 6.3.1 The running intersection property
      4. 6.4 Constructing a junction tree for singly connected distributions
        1. 6.4.1 Moralisation
        2. 6.4.2 Forming the clique graph
        3. 6.4.3 Forming a junction tree from a clique graph
        4. 6.4.4 Assigning potentials to cliques
      5. 6.5 Junction trees for multiply connected distributions
        1. 6.5.1 Triangulation algorithms
      6. 6.6 The junction tree algorithm
        1. 6.6.1 Remarks on the JTA
        2. 6.6.2 Computing the normalisation constant of a distribution
        3. 6.6.3 The marginal likelihood
        4. 6.6.4 Some small JTA examples
        5. 6.6.5 Shafer–Shenoy propagation
      7. 6.7 Finding the most likely state
      8. 6.8 Reabsorption: converting a junction tree to a directed network
      9. 6.9 The need for approximations
        1. 6.9.1 Bounded width junction trees
      10. 6.10 Summary
      11. 6.11 Code
      12. 6.12 Exercises
    7. 7. Making decisions
      1. 7.1 Expected utility
        1. 7.1.1 Utility of money
      2. 7.2 Decision trees
      3. 7.3 Extending Bayesian networks for decisions
        1. 7.3.1 Syntax of influence diagrams
      4. 7.4 Solving influence diagrams
        1. 7.4.1 Messages on an ID
        2. 7.4.2 Using a junction tree
      5. 7.5 Markov decision processes
        1. 7.5.1 Maximising expected utility by message passing
        2. 7.5.2 Bellman’s equation
      6. 7.6 Temporally unbounded MDPs
        1. 7.6.1 Value iteration
        2. 7.6.2 Policy iteration
        3. 7.6.3 A curse of dimensionality
      7. 7.7 Variational inference and planning
      8. 7.8 Financial matters
        1. 7.8.1 Options pricing and expected utility
        2. 7.8.2 Binomial options pricing model
        3. 7.8.3 Optimal investment
      9. 7.9 Further topics
        1. 7.9.1 Partially observable MDPs
        2. 7.9.2 Reinforcement learning
      10. 7.10 Summary
      11. 7.11 Code
      12. 7.12 Exercises
  10. II: Learning in probabilistic models
    1. 8. Statistics for machine learning
      1. 8.1 Representing data
        1. 8.1.1 Categorical
        2. 8.1.2 Ordinal
        3. 8.1.3 Numerical
      2. 8.2 Distributions
        1. 8.2.1 The Kullback–Leibler divergence KL(q|p)
        2. 8.2.2 Entropy and information
      3. 8.3 Classical distributions
      4. 8.4 Multivariate Gaussian
        1. 8.4.1 Completing the square
        2. 8.4.2 Conditioning as system reversal
        3. 8.4.3 Whitening and centring
      5. 8.5 Exponential family
        1. 8.5.1 Conjugate priors
      6. 8.6 Learning distributions
      7. 8.7 Properties of maximum likelihood
        1. 8.7.1 Training assuming the correct model class
        2. 8.7.2 Training when the assumed model is incorrect
        3. 8.7.3 Maximum likelihood and the empirical distribution
      8. 8.8 Learning a Gaussian
        1. 8.8.1 Maximum likelihood training
        2. 8.8.2 Bayesian inference of the mean and variance
        3. 8.8.3 Gauss-gamma distribution
      9. 8.9 Summary
      10. 8.10 Code
      11. 8.11 Exercises
    2. 9. Learning as inference
      1. 9.1 Learning as inference
        1. 9.1.1 Learning the bias of a coin
        2. 9.1.2 Making decisions
        3. 9.1.3 A continuum of parameters
        4. 9.1.4 Decisions based on continuous intervals
      2. 9.2 Bayesian methods and ML-II
      3. 9.3 Maximum likelihood training of belief networks
      4. 9.4 Bayesian belief network training
        1. 9.4.1 Global and local parameter independence
        2. 9.4.2 Learning binary variable tables using a Beta prior
        3. 9.4.3 Learning multivariate discrete tables using a Dirichlet prior
      5. 9.5 Structure learning
        1. 9.5.1 PC algorithm
        2. 9.5.2 Empirical independence
        3. 9.5.3 Network scoring
        4. 9.5.4 Chow–Liu trees
      6. 9.6 Maximum likelihood for undirected models
        1. 9.6.1 The likelihood gradient
        2. 9.6.2 General tabular clique potentials
        3. 9.6.3 Decomposable Markov networks
        4. 9.6.4 Exponential form potentials
        5. 9.6.5 Conditional random fields
        6. 9.6.6 Pseudo likelihood
        7. 9.6.7 Learning the structure
      7. 9.7 Summary
      8. 9.8 Code
      9. 9.9 Exercises
    3. 10. Naive Bayes
      1. 10.1 Naive Bayes and conditional independence
      2. 10.2 Estimation using maximum likelihood
        1. 10.2.1 Binary attributes
        2. 10.2.2 Multi-state variables
        3. 10.2.3 Text classification
      3. 10.3 Bayesian naive Bayes
      4. 10.4 Tree augmented naive Bayes
        1. 10.4.1 Learning tree augmented naive Bayes networks
      5. 10.5 Summary
      6. 10.6 Code
      7. 10.7 Exercises
    4. 11. Learning with hidden variables
      1. 11.1 Hidden variables and missing data
        1. 11.1.1 Why hidden/missing variables can complicate proceedings
        2. 11.1.2 The missing at random assumption
        3. 11.1.3 Maximum likelihood
        4. 11.1.4 Identifiability issues
      2. 11.2 Expectation maximisation
        1. 11.2.1 Variational EM
        2. 11.2.2 Classical EM
        3. 11.2.3 Application to belief networks
        4. 11.2.4 General case
        5. 11.2.5 Convergence
        6. 11.2.6 Application to Markov networks
      3. 11.3 Extensions of EM
        1. 11.3.1 Partial M-step
        2. 11.3.2 Partial E-step
      4. 11.4 A failure case for EM
      5. 11.5 Variational Bayes
        1. 11.5.1 EM is a special case of variational Bayes
        2. 11.5.2 An example: VB for the Asbestos-Smoking-Cancer network
      6. 11.6 Optimising the likelihood by gradient methods
        1. 11.6.1 Undirected models
      7. 11.7 Summary
      8. 11.8 Code
      9. 11.9 Exercises
    5. 12. Bayesian model selection
      1. 12.1 Comparing models the Bayesian way
      2. 12.2 Illustrations: coin tossing
        1. 12.2.1 A discrete parameter space
        2. 12.2.2 A continuous parameter space
      3. 12.3 Occam’s razor and Bayesian complexity penalisation
      4. 12.4 A continuous example: curve fitting
      5. 12.5 Approximating the model likelihood
        1. 12.5.1 Laplace’s method
        2. 12.5.2 Bayes information criterion
      6. 12.6 Bayesian hypothesis testing for outcome analysis
        1. 12.6.1 Outcome analysis
        2. 12.6.2 H_indep: model likelihood
        3. 12.6.3 H_same: model likelihood
        4. 12.6.4 Dependent outcome analysis
        5. 12.6.5 Is classifier A better than B?
      7. 12.7 Summary
      8. 12.8 Code
      9. 12.9 Exercises
  11. III: Machine learning
    1. 13. Machine learning concepts
      1. 13.1 Styles of learning
        1. 13.1.1 Supervised learning
        2. 13.1.2 Unsupervised learning
        3. 13.1.3 Anomaly detection
        4. 13.1.4 Online (sequential) learning
        5. 13.1.5 Interacting with the environment
        6. 13.1.6 Semi-supervised learning
      2. 13.2 Supervised learning
        1. 13.2.1 Utility and loss
        2. 13.2.2 Using the empirical distribution
        3. 13.2.3 Bayesian decision approach
      3. 13.3 Bayes versus empirical decisions
      4. 13.4 Summary
      5. 13.5 Exercises
    2. 14. Nearest neighbour classification
      1. 14.1 Do as your neighbour does
      2. 14.2 K-nearest neighbours
      3. 14.3 A probabilistic interpretation of nearest neighbours
        1. 14.3.1 When your nearest neighbour is far away
      4. 14.4 Summary
      5. 14.5 Code
      6. 14.6 Exercises
    3. 15. Unsupervised linear dimension reduction
      1. 15.1 High-dimensional spaces – low-dimensional manifolds
      2. 15.2 Principal components analysis
        1. 15.2.1 Deriving the optimal linear reconstruction
        2. 15.2.2 Maximum variance criterion
        3. 15.2.3 PCA algorithm
        4. 15.2.4 PCA and nearest neighbours classification
        5. 15.2.5 Comments on PCA
      3. 15.3 High-dimensional data
        1. 15.3.1 Eigen-decomposition for N < D
        2. 15.3.2 PCA via singular value decomposition
      4. 15.4 Latent semantic analysis
        1. 15.4.1 Information retrieval
      5. 15.5 PCA with missing data
        1. 15.5.1 Finding the principal directions
        2. 15.5.2 Collaborative filtering using PCA with missing data
      6. 15.6 Matrix decomposition methods
        1. 15.6.1 Probabilistic latent semantic analysis
        2. 15.6.2 Extensions and variations
        3. 15.6.3 Applications of PLSA/NMF
      7. 15.7 Kernel PCA
      8. 15.8 Canonical correlation analysis
        1. 15.8.1 SVD formulation
      9. 15.9 Summary
      10. 15.10 Code
      11. 15.11 Exercises
    4. Plate Section
    5. 16. Supervised linear dimension reduction
      1. 16.1 Supervised linear projections
      2. 16.2 Fisher’s linear discriminant
      3. 16.3 Canonical variates
        1. 16.3.1 Dealing with the nullspace
      4. 16.4 Summary
      5. 16.5 Code
      6. 16.6 Exercises
    6. 17. Linear models
      1. 17.1 Introduction: fitting a straight line
      2. 17.2 Linear parameter models for regression
        1. 17.2.1 Vector outputs
        2. 17.2.2 Regularisation
        3. 17.2.3 Radial basis functions
      3. 17.3 The dual representation and kernels
        1. 17.3.1 Regression in the dual space
      4. 17.4 Linear parameter models for classification
        1. 17.4.1 Logistic regression
        2. 17.4.2 Beyond first-order gradient ascent
        3. 17.4.3 Avoiding overconfident classification
        4. 17.4.4 Multiple classes
        5. 17.4.5 The kernel trick for classification
      5. 17.5 Support vector machines
        1. 17.5.1 Maximum margin linear classifier
        2. 17.5.2 Using kernels
        3. 17.5.3 Performing the optimisation
        4. 17.5.4 Probabilistic interpretation
      6. 17.6 Soft zero-one loss for outlier robustness
      7. 17.7 Summary
      8. 17.8 Code
      9. 17.9 Exercises
    7. 18. Bayesian linear models
      1. 18.1 Regression with additive Gaussian noise
        1. 18.1.1 Bayesian linear parameter models
        2. 18.1.2 Determining hyperparameters: ML-II
        3. 18.1.3 Learning the hyperparameters using EM
        4. 18.1.4 Hyperparameter optimisation: using the gradient
        5. 18.1.5 Validation likelihood
        6. 18.1.6 Prediction and model averaging
        7. 18.1.7 Sparse linear models
      2. 18.2 Classification
        1. 18.2.1 Hyperparameter optimisation
        2. 18.2.2 Laplace approximation
        3. 18.2.3 Variational Gaussian approximation
        4. 18.2.4 Local variational approximation
        5. 18.2.5 Relevance vector machine for classification
        6. 18.2.6 Multi-class case
      3. 18.3 Summary
      4. 18.4 Code
      5. 18.5 Exercises
    8. 19. Gaussian processes
      1. 19.1 Non-parametric prediction
        1. 19.1.1 From parametric to non-parametric
        2. 19.1.2 From Bayesian linear models to Gaussian processes
        3. 19.1.3 A prior on functions
      2. 19.2 Gaussian process prediction
        1. 19.2.1 Regression with noisy training outputs
      3. 19.3 Covariance functions
        1. 19.3.1 Making new covariance functions from old
        2. 19.3.2 Stationary covariance functions
        3. 19.3.3 Non-stationary covariance functions
      4. 19.4 Analysis of covariance functions
        1. 19.4.1 Smoothness of the functions
        2. 19.4.2 Mercer kernels
        3. 19.4.3 Fourier analysis for stationary kernels
      5. 19.5 Gaussian processes for classification
        1. 19.5.1 Binary classification
        2. 19.5.2 Laplace’s approximation
        3. 19.5.3 Hyperparameter optimisation
        4. 19.5.4 Multiple classes
      6. 19.6 Summary
      7. 19.7 Code
      8. 19.8 Exercises
    9. 20. Mixture models
      1. 20.1 Density estimation using mixtures
      2. 20.2 Expectation maximisation for mixture models
        1. 20.2.1 Unconstrained discrete tables
        2. 20.2.2 Mixture of product of Bernoulli distributions
      3. 20.3 The Gaussian mixture model
        1. 20.3.1 EM algorithm
        2. 20.3.2 Practical issues
        3. 20.3.3 Classification using Gaussian mixture models
        4. 20.3.4 The Parzen estimator
        5. 20.3.5 K-means
        6. 20.3.6 Bayesian mixture models
        7. 20.3.7 Semi-supervised learning
      4. 20.4 Mixture of experts
      5. 20.5 Indicator models
        1. 20.5.1 Joint indicator approach: factorised prior
        2. 20.5.2 Polya prior
      6. 20.6 Mixed membership models
        1. 20.6.1 Latent Dirichlet allocation
        2. 20.6.2 Graph-based representations of data
        3. 20.6.3 Dyadic data
        4. 20.6.4 Monadic data
        5. 20.6.5 Cliques and adjacency matrices for monadic binary data
      7. 20.7 Summary
      8. 20.8 Code
      9. 20.9 Exercises
    10. 21. Latent linear models
      1. 21.1 Factor analysis
        1. 21.1.1 Finding the optimal bias
      2. 21.2 Factor analysis: maximum likelihood
        1. 21.2.1 Eigen-approach likelihood optimisation
        2. 21.2.2 Expectation maximisation
      3. 21.3 Interlude: modelling faces
      4. 21.4 Probabilistic principal components analysis
      5. 21.5 Canonical correlation analysis and factor analysis
      6. 21.6 Independent components analysis
      7. 21.7 Summary
      8. 21.8 Code
      9. 21.9 Exercises
    11. 22. Latent ability models
      1. 22.1 The Rasch model
        1. 22.1.1 Maximum likelihood training
        2. 22.1.2 Bayesian Rasch models
      2. 22.2 Competition models
        1. 22.2.1 Bradley–Terry–Luce model
        2. 22.2.2 Elo ranking model
        3. 22.2.3 Glicko and TrueSkill
      3. 22.3 Summary
      4. 22.4 Code
      5. 22.5 Exercises
  12. IV: Dynamical models
    1. 23. Discrete-state Markov models
      1. 23.1 Markov models
        1. 23.1.1 Equilibrium and stationary distribution of a Markov chain
        2. 23.1.2 Fitting Markov models
        3. 23.1.3 Mixture of Markov models
      2. 23.2 Hidden Markov models
        1. 23.2.1 The classical inference problems
        2. 23.2.2 Filtering p(h_t | v_{1:t})
        3. 23.2.3 Parallel smoothing p(h_t | v_{1:T})
        4. 23.2.4 Correction smoothing
        5. 23.2.5 Sampling from p(h_{1:T} | v_{1:T})
        6. 23.2.6 Most likely joint state
        7. 23.2.7 Prediction
        8. 23.2.8 Self-localisation and kidnapped robots
        9. 23.2.9 Natural language models
      3. 23.3 Learning HMMs
        1. 23.3.1 EM algorithm
        2. 23.3.2 Mixture emission
        3. 23.3.3 The HMM-GMM
        4. 23.3.4 Discriminative training
      4. 23.4 Related models
        1. 23.4.1 Explicit duration model
        2. 23.4.2 Input–output HMM
        3. 23.4.3 Linear chain CRFs
        4. 23.4.4 Dynamic Bayesian networks
      5. 23.5 Applications
        1. 23.5.1 Object tracking
        2. 23.5.2 Automatic speech recognition
        3. 23.5.3 Bioinformatics
        4. 23.5.4 Part-of-speech tagging
      6. 23.6 Summary
      7. 23.7 Code
      8. 23.8 Exercises
    2. 24. Continuous-state Markov models
      1. 24.1 Observed linear dynamical systems
        1. 24.1.1 Stationary distribution with noise
      2. 24.2 Auto-regressive models
        1. 24.2.1 Training an AR model
        2. 24.2.2 AR model as an OLDS
        3. 24.2.3 Time-varying AR model
        4. 24.2.4 Time-varying variance AR models
      3. 24.3 Latent linear dynamical systems
      4. 24.4 Inference
        1. 24.4.1 Filtering
        2. 24.4.2 Smoothing: Rauch–Tung–Striebel correction method
        3. 24.4.3 The likelihood
        4. 24.4.4 Most likely state
        5. 24.4.5 Time independence and Riccati equations
      5. 24.5 Learning linear dynamical systems
        1. 24.5.1 Identifiability issues
        2. 24.5.2 EM algorithm
        3. 24.5.3 Subspace methods
        4. 24.5.4 Structured LDSs
        5. 24.5.5 Bayesian LDSs
      6. 24.6 Switching auto-regressive models
        1. 24.6.1 Inference
        2. 24.6.2 Maximum likelihood learning using EM
      7. 24.7 Summary
      8. 24.8 Code
      9. 24.9 Exercises
    3. 25. Switching linear dynamical systems
      1. 25.1 Introduction
      2. 25.2 The switching LDS
        1. 25.2.1 Exact inference is computationally intractable
      3. 25.3 Gaussian sum filtering
        1. 25.3.1 Continuous filtering
        2. 25.3.2 Discrete filtering
        3. 25.3.3 The likelihood p(v_{1:T})
        4. 25.3.4 Collapsing Gaussians
        5. 25.3.5 Relation to other methods
      4. 25.4 Gaussian sum smoothing
        1. 25.4.1 Continuous smoothing
        2. 25.4.2 Discrete smoothing
        3. 25.4.3 Collapsing the mixture
        4. 25.4.4 Using mixtures in smoothing
        5. 25.4.5 Relation to other methods
      5. 25.5 Reset models
        1. 25.5.1 A Poisson reset model
        2. 25.5.2 Reset-HMM-LDS
      6. 25.6 Summary
      7. 25.7 Code
      8. 25.8 Exercises
    4. 26. Distributed computation
      1. 26.1 Introduction
      2. 26.2 Stochastic Hopfield networks
      3. 26.3 Learning sequences
        1. 26.3.1 A single sequence
        2. 26.3.2 Multiple sequences
        3. 26.3.3 Boolean networks
        4. 26.3.4 Sequence disambiguation
      4. 26.4 Tractable continuous latent variable models
        1. 26.4.1 Deterministic latent variables
        2. 26.4.2 An augmented Hopfield network
      5. 26.5 Neural models
        1. 26.5.1 Stochastically spiking neurons
        2. 26.5.2 Hopfield membrane potential
        3. 26.5.3 Dynamic synapses
        4. 26.5.4 Leaky integrate and fire models
      6. 26.6 Summary
      7. 26.7 Code
      8. 26.8 Exercises
  13. V: Approximate inference
    1. 27. Sampling
      1. 27.1 Introduction
        1. 27.1.1 Univariate sampling
        2. 27.1.2 Rejection sampling
        3. 27.1.3 Multivariate sampling
      2. 27.2 Ancestral sampling
        1. 27.2.1 Dealing with evidence
        2. 27.2.2 Perfect sampling for a Markov network
      3. 27.3 Gibbs sampling
        1. 27.3.1 Gibbs sampling as a Markov chain
        2. 27.3.2 Structured Gibbs sampling
        3. 27.3.3 Remarks
      4. 27.4 Markov chain Monte Carlo (MCMC)
        1. 27.4.1 Markov chains
        2. 27.4.2 Metropolis–Hastings sampling
      5. 27.5 Auxiliary variable methods
        1. 27.5.1 Hybrid Monte Carlo (HMC)
        2. 27.5.2 Swendsen–Wang (SW)
        3. 27.5.3 Slice sampling
      6. 27.6 Importance sampling
        1. 27.6.1 Sequential importance sampling
        2. 27.6.2 Particle filtering as an approximate forward pass
      7. 27.7 Summary
      8. 27.8 Code
      9. 27.9 Exercises
    2. 28. Deterministic approximate inference
      1. 28.1 Introduction
      2. 28.2 The Laplace approximation
      3. 28.3 Properties of Kullback–Leibler variational inference
        1. 28.3.1 Bounding the normalisation constant
        2. 28.3.2 Bounding the marginal likelihood
        3. 28.3.3 Bounding marginal quantities
        4. 28.3.4 Gaussian approximations using KL divergence
        5. 28.3.5 Marginal and moment matching properties of minimising KL(p|q)
      4. 28.4 Variational bounding using KL(q|p)
        1. 28.4.1 Pairwise Markov random field
        2. 28.4.2 General mean-field equations
        3. 28.4.3 Asynchronous updating guarantees approximation improvement
        4. 28.4.4 Structured variational approximation
      5. 28.5 Local and KL variational approximations
        1. 28.5.1 Local approximation
        2. 28.5.2 KL variational approximation
      6. 28.6 Mutual information maximisation: a KL variational approach
        1. 28.6.1 The information maximisation algorithm
        2. 28.6.2 Linear Gaussian decoder
      7. 28.7 Loopy belief propagation
        1. 28.7.1 Classical BP on an undirected graph
        2. 28.7.2 Loopy BP as a variational procedure
      8. 28.8 Expectation propagation
      9. 28.9 MAP for Markov networks
        1. 28.9.1 Pairwise Markov networks
        2. 28.9.2 Attractive binary Markov networks
        3. 28.9.3 Potts model
      10. 28.10 Further reading
      11. 28.11 Summary
      12. 28.12 Code
      13. 28.13 Exercises
  14. Appendix A: Background mathematics
    1. A.1 Linear algebra
    2. A.2 Multivariate calculus
    3. A.3 Inequalities
    4. A.4 Optimisation
    5. A.5 Multivariate optimisation
    6. A.6 Constrained optimisation using Lagrange multipliers
  15. References
  16. Index