Bayesian Speech and Language Processing

Book Description

With this comprehensive guide you will learn how to apply Bayesian machine learning techniques systematically to solve various problems in speech and language processing. A range of statistical models is detailed, from hidden Markov models to Gaussian mixture models, n-gram models, and latent topic models, along with applications including automatic speech recognition, speaker verification, and information retrieval. Approximate Bayesian inference methods, based on MAP, evidence, asymptotic, variational Bayes (VB), and Markov chain Monte Carlo (MCMC) approximations, are presented together with full derivations, useful notation, formulas, and rules. The authors address the difficulties that arise when these methods are applied straightforwardly, and provide detailed examples and case studies to demonstrate how practical Bayesian inference methods can improve the performance of information systems. This is an invaluable resource for students, researchers, and industry practitioners working in machine learning, signal processing, and speech and language processing.

Table of Contents

  1. Cover
  2. Half-title page
  3. Title page
  4. Copyright page
  5. Contents
  6. Preface
  7. Notation and abbreviations
  8. Part I General discussion
    1. 1 Introduction
      1. 1.1 Machine learning and speech and language processing
      2. 1.2 Bayesian approach
      3. 1.3 History of Bayesian speech and language processing
      4. 1.4 Applications
      5. 1.5 Organization of this book
    2. 2 Bayesian approach
      1. 2.1 Bayesian probabilities
        1. 2.1.1 Sum and product rules
        2. 2.1.2 Prior and posterior distributions
        3. 2.1.3 Exponential family distributions
        4. 2.1.4 Conjugate distributions
        5. 2.1.5 Conditional independence
      2. 2.2 Graphical model representation
        1. 2.2.1 Directed graph
        2. 2.2.2 Conditional independence in graphical model
        3. 2.2.3 Observation, latent variable, non-probabilistic variable
        4. 2.2.4 Generative process
        5. 2.2.5 Undirected graph
        6. 2.2.6 Inference on graphs
      3. 2.3 Difference between ML and Bayes
        1. 2.3.1 Use of prior knowledge
        2. 2.3.2 Model selection
        3. 2.3.3 Marginalization
      4. 2.4 Summary
    3. 3 Statistical models in speech and language processing
      1. 3.1 Bayes decision for speech recognition
      2. 3.2 Hidden Markov model
        1. 3.2.1 Lexical unit for HMM
        2. 3.2.2 Likelihood function of HMM
        3. 3.2.3 Continuous density HMM
        4. 3.2.4 Gaussian mixture model
        5. 3.2.5 Graphical models and generative process of CDHMM
      3. 3.3 Forward–backward and Viterbi algorithms
        1. 3.3.1 Forward–backward algorithm
        2. 3.3.2 Viterbi algorithm
      4. 3.4 Maximum likelihood estimation and EM algorithm
        1. 3.4.1 Jensen’s inequality
        2. 3.4.2 Expectation step
        3. 3.4.3 Maximization step
      5. 3.5 Maximum likelihood linear regression for hidden Markov model
        1. 3.5.1 Linear regression for hidden Markov models
      6. 3.6 n-gram with smoothing techniques
        1. 3.6.1 Class-based model smoothing
        2. 3.6.2 Jelinek–Mercer smoothing
        3. 3.6.3 Witten–Bell smoothing
        4. 3.6.4 Absolute discounting
        5. 3.6.5 Katz smoothing
        6. 3.6.6 Kneser–Ney smoothing
      7. 3.7 Latent semantic information
        1. 3.7.1 Latent semantic analysis
        2. 3.7.2 LSA language model
        3. 3.7.3 Probabilistic latent semantic analysis
        4. 3.7.4 PLSA language model
      8. 3.8 Revisit of automatic speech recognition with Bayesian manner
        1. 3.8.1 Training and test (unseen) data for ASR
        2. 3.8.2 Bayesian manner
        3. 3.8.3 Learning generative models
        4. 3.8.4 Sum rule for model
        5. 3.8.5 Sum rule for model parameters and latent variables
        6. 3.8.6 Factorization by product rule and conditional independence
        7. 3.8.7 Posterior distributions
        8. 3.8.8 Difficulties in speech and language applications
  9. Part II Approximate inference
    1. 4 Maximum a-posteriori approximation
      1. 4.1 MAP criterion for model parameters
      2. 4.2 MAP extension of EM algorithm
        1. 4.2.1 Auxiliary function
        2. 4.2.2 A recipe
      3. 4.3 Continuous density hidden Markov model
        1. 4.3.1 Likelihood function
        2. 4.3.2 Conjugate priors (full covariance case)
        3. 4.3.3 Conjugate priors (diagonal covariance case)
        4. 4.3.4 Expectation step
        5. 4.3.5 Maximization step
        6. 4.3.6 Sufficient statistics
        7. 4.3.7 Meaning of the MAP solution
      4. 4.4 Speaker adaptation
        1. 4.4.1 Speaker adaptation by a transformation of CDHMM
        2. 4.4.2 MAP based speaker adaptation
      5. 4.5 Regularization in discriminative parameter estimation
        1. 4.5.1 Extended Baum–Welch algorithm
        2. 4.5.2 MAP interpretation of i-smoothing
      6. 4.6 Speaker recognition/verification
        1. 4.6.1 Universal background model
        2. 4.6.2 Gaussian super vector
      7. 4.7 n-gram adaptation
        1. 4.7.1 MAP estimation of n-gram parameters
        2. 4.7.2 Adaptation method
      8. 4.8 Adaptive topic model
        1. 4.8.1 MAP estimation for corrective training
        2. 4.8.2 Quasi-Bayes estimation for incremental learning
        3. 4.8.3 System performance
      9. 4.9 Summary
    2. 5 Evidence approximation
      1. 5.1 Evidence framework
        1. 5.1.1 Bayesian model comparison
        2. 5.1.2 Type-2 maximum likelihood estimation
        3. 5.1.3 Regularization in regression model
        4. 5.1.4 Evidence framework for HMM and SVM
      2. 5.2 Bayesian sensing HMMs
        1. 5.2.1 Basis representation
        2. 5.2.2 Model construction
        3. 5.2.3 Automatic relevance determination
        4. 5.2.4 Model inference
        5. 5.2.5 Evidence function or marginal likelihood
        6. 5.2.6 Maximum a-posteriori sensing weights
        7. 5.2.7 Optimal parameters and hyperparameters
        8. 5.2.8 Discriminative training
        9. 5.2.9 System performance
      3. 5.3 Hierarchical Dirichlet language model
        1. 5.3.1 n-gram smoothing revisited
        2. 5.3.2 Dirichlet prior and posterior
        3. 5.3.3 Evidence function
        4. 5.3.4 Bayesian smoothed language model
        5. 5.3.5 Optimal hyperparameters
    3. 6 Asymptotic approximation
      1. 6.1 Laplace approximation
      2. 6.2 Bayesian information criterion
      3. 6.3 Bayesian predictive classification
        1. 6.3.1 Robust decision rule
        2. 6.3.2 Laplace approximation for BPC decision
        3. 6.3.3 BPC decision considering uncertainty of HMM means
      4. 6.4 Neural network acoustic modeling
        1. 6.4.1 Neural network modeling and learning
        2. 6.4.2 Bayesian neural networks and hidden Markov models
        3. 6.4.3 Laplace approximation for Bayesian neural networks
      5. 6.5 Decision tree clustering
        1. 6.5.1 Decision tree clustering using ML criterion
        2. 6.5.2 Decision tree clustering using BIC
      6. 6.6 Speaker clustering/segmentation
        1. 6.6.1 Speaker segmentation
        2. 6.6.2 Speaker clustering
      7. 6.7 Summary
    4. 7 Variational Bayes
      1. 7.1 Variational inference in general
        1. 7.1.1 Joint posterior distribution
        2. 7.1.2 Factorized posterior distribution
        3. 7.1.3 Variational method
      2. 7.2 Variational inference for classification problems
        1. 7.2.1 VB posterior distributions for model parameters
        2. 7.2.2 VB posterior distributions for latent variables
        3. 7.2.3 VB–EM algorithm
        4. 7.2.4 VB posterior distribution for model structure
      3. 7.3 Continuous density hidden Markov model
        1. 7.3.1 Generative model
        2. 7.3.2 Prior distribution
        3. 7.3.3 VB Baum–Welch algorithm
        4. 7.3.4 Variational lower bound
        5. 7.3.5 VB posterior for Bayesian predictive classification
        6. 7.3.6 Decision tree clustering
        7. 7.3.7 Determination of HMM topology
      4. 7.4 Structural Bayesian linear regression for hidden Markov model
        1. 7.4.1 Variational Bayesian linear regression
        2. 7.4.2 Generative model
        3. 7.4.3 Variational lower bound
        4. 7.4.4 Optimization of hyperparameters and model structure
        5. 7.4.5 Hyperparameter optimization
      5. 7.5 Variational Bayesian speaker verification
        1. 7.5.1 Generative model
        2. 7.5.2 Prior distributions
        3. 7.5.3 Variational posteriors
        4. 7.5.4 Variational lower bound
      6. 7.6 Latent Dirichlet allocation
        1. 7.6.1 Model construction
        2. 7.6.2 VB inference: lower bound
        3. 7.6.3 VB inference: variational parameters
        4. 7.6.4 VB inference: model parameters
      7. 7.7 Latent topic language model
        1. 7.7.1 LDA language model
        2. 7.7.2 Dirichlet class language model
        3. 7.7.3 Model construction
        4. 7.7.4 VB inference: lower bound
        5. 7.7.5 VB inference: parameter estimation
        6. 7.7.6 Cache Dirichlet class language model
        7. 7.7.7 System performance
      8. 7.8 Summary
    5. 8 Markov chain Monte Carlo
      1. 8.1 Sampling methods
        1. 8.1.1 Importance sampling
        2. 8.1.2 Markov chain
        3. 8.1.3 The Metropolis–Hastings algorithm
        4. 8.1.4 Gibbs sampling
        5. 8.1.5 Slice sampling
      2. 8.2 Bayesian nonparametrics
        1. 8.2.1 Modeling via exchangeability
        2. 8.2.2 Dirichlet process
        3. 8.2.3 DP: Stick-breaking construction
        4. 8.2.4 DP: Chinese restaurant process
        5. 8.2.5 Dirichlet process mixture model
        6. 8.2.6 Hierarchical Dirichlet process
        7. 8.2.7 HDP: Stick-breaking construction
        8. 8.2.8 HDP: Chinese restaurant franchise
        9. 8.2.9 MCMC inference by Chinese restaurant franchise
        10. 8.2.10 MCMC inference by direct assignment
        11. 8.2.11 Relation of HDP to other methods
      3. 8.3 Gibbs sampling-based speaker clustering
        1. 8.3.1 Generative model
        2. 8.3.2 GMM marginal likelihood for complete data
        3. 8.3.3 GMM Gibbs sampler
        4. 8.3.4 Generative process and graphical model of multi-scale GMM
        5. 8.3.5 Marginal likelihood for the complete data
        6. 8.3.6 Gibbs sampler
      4. 8.4 Nonparametric Bayesian HMMs for acoustic unit discovery
        1. 8.4.1 Generative model and generative process
        2. 8.4.2 Inference
      5. 8.5 Hierarchical Pitman–Yor language model
        1. 8.5.1 Pitman–Yor process
        2. 8.5.2 Language model smoothing revisited
        3. 8.5.3 Hierarchical Pitman–Yor language model
        4. 8.5.4 MCMC inference for HPYLM
      6. 8.6 Summary
  10. Appendix A Basic formulas
  11. Appendix B Vector and matrix formulas
  12. Appendix C Probabilistic distribution functions
  13. References
  14. Index