Learning Structured Models for Phone Recognition
Slav Orlinov Petrov, Adam David Pauls and Daniel Klein
Modern speech recognition systems are very complex. A good model must account for the context-sensitive, time-dependent, and speaker-dependent nature of "phones," the basic phonological units of speech. Typically, this variation is manually encoded in the model using domain knowledge, or not modeled at all.
We present a maximally streamlined approach to learning HMM-based acoustic models for automatic speech recognition. In our approach, an initial monophone HMM is iteratively refined using a split-merge EM procedure that makes no assumptions about subphone structure or context-dependent structure and uses only a single Gaussian per HMM state. Despite the much-simplified training process, our acoustic model achieves state-of-the-art results on phone classification (where it outperforms almost all other methods) and competitive performance on phone recognition (where it outperforms standard context-dependent (CD) triphone / subphone / GMM approaches). We also present an analysis of what is and is not learned by our system.
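To make the split and merge steps concrete, here is a minimal Python sketch of one way they could operate on a single diagonal Gaussian state. The abstract does not give these details, so everything here is an assumption for illustration: the function names `split_state` and `merge_states`, the perturbation size `eps`, and the use of occupancy-weighted moment matching for merging are hypothetical and are not taken from the paper. The EM re-estimation that would run between splitting and merging is omitted.

```python
import math

def split_state(mean, var, eps=0.1):
    """Split one diagonal-Gaussian state into two substates.

    Assumption: the two copies keep the parent's variance and have their
    means nudged apart by eps standard deviations, so that a later EM pass
    can pull them toward different parts of the data.
    """
    offsets = [eps * math.sqrt(v) for v in var]
    left = ([m - o for m, o in zip(mean, offsets)], list(var))
    right = ([m + o for m, o in zip(mean, offsets)], list(var))
    return left, right

def merge_states(a, b, weight_a, weight_b):
    """Merge two diagonal-Gaussian states into one by moment matching.

    Assumption: weight_a and weight_b are (hypothetical) state occupancy
    counts; the merged Gaussian matches the mean and variance of the
    corresponding two-component mixture.
    """
    total = weight_a + weight_b
    wa, wb = weight_a / total, weight_b / total
    (mean_a, var_a), (mean_b, var_b) = a, b
    mean = [wa * x + wb * y for x, y in zip(mean_a, mean_b)]
    var = [
        wa * (va + x * x) + wb * (vb + y * y) - m * m
        for x, y, va, vb, m in zip(mean_a, mean_b, var_a, var_b, mean)
    ]
    return mean, var

# Usage: split a 2-dimensional state, then merge the halves back.
left, right = split_state([0.0, 2.0], [1.0, 4.0], eps=0.1)
merged_mean, merged_var = merge_states(left, right, 1.0, 1.0)
```

In a full training loop, a split would be kept only when it improves the model's fit to the data enough to justify the extra state; the criterion for that decision is not stated in the abstract and is left out of this sketch.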
- S. Petrov, A. Pauls, and D. Klein, "Learning Structured Models for Phone Recognition," Proceedings of EMNLP-CoNLL, 2007.