In the previous chapters we introduced the notion of trainable statistical models for speech recognition, in particular focusing on the set of methods and constraints associated with hidden Markov models (HMMs). In both training and recognition phases, the key values that must be estimated from the acoustics are the emission probabilities, also referred to as the acoustic likelihoods. These values are used to derive likelihoods for each model of a complete utterance, in combination with statistical information about the *a priori* probability of word sequences. In other words, the probabilities that the local acoustic measurements were generated by each hypothesized state are ultimately integrated into a global probability that a complete utterance is generated by a complete HMM (either by considering all possible state sequences associated with a model, or by considering only the most likely).
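The two ways of combining local emission probabilities into a global utterance probability can be sketched in a few lines. The following is a minimal illustration, not code from the text: the toy two-state model, its parameters, and the discrete observation symbols are all invented for the example. The `forward` function sums over all state sequences, while `viterbi` keeps only the most likely one.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of obs under the HMM: sum over all state sequences."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Probability of the single most likely state sequence for obs."""
    delta = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        delta = {s: max(delta[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return max(delta.values())

# Hypothetical two-state HMM with discrete (VQ-style) observation symbols.
states = ("s1", "s2")
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.8, "b": 0.2}, "s2": {"a": 0.1, "b": 0.9}}

obs = ("a", "b", "a")
p_all = forward(obs, states, start, trans, emit)   # all paths
p_best = viterbi(obs, states, start, trans, emit)  # best single path
```

By construction the Viterbi score can never exceed the forward probability, since the latter includes the best path among the terms it sums. In practice both recursions are carried out in the log domain to avoid underflow on long utterances.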

In Chapter 26 we provided examples of two common approaches to the estimation of these acoustic probabilities: codebook tables associated with vector-quantized features, giving probabilities for each feature value conditioned on the state; and Gaussians or mixtures of Gaussians associated with one or more states. For both of these examples, EM training is used to maximize the likelihood of the acoustic feature sequences given the models.
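For the second approach, the emission probability of a feature vector given a state is a weighted sum of Gaussian densities. The sketch below (an illustration with invented names and toy parameters, not the chapter's code) computes the likelihood of a feature vector under a mixture of diagonal-covariance Gaussians, the form most commonly used in HMM systems:

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_likelihood(x, weights, means, variances):
    """Emission likelihood p(x | state) under a Gaussian mixture.

    weights must sum to 1; means and variances hold one vector per component.
    """
    return sum(w * math.exp(diag_gaussian_logpdf(x, m, v))
               for w, m, v in zip(weights, means, variances))

# Toy two-component mixture over two-dimensional features.
x = [0.5, -0.2]
p = gmm_likelihood(x,
                   weights=[0.7, 0.3],
                   means=[[0.0, 0.0], [1.0, -1.0]],
                   variances=[[1.0, 1.0], [0.5, 0.5]])
```

During EM training, the component weights, means, and variances are re-estimated so as to increase this likelihood over the training data; the codebook approach replaces the density evaluation with a simple table lookup indexed by the quantized feature.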
