This work further explores the multi-band approach to automatic speech recognition (ASR), using probabilistic graphical models (PGMs) to classify a set of intermediate speech attributes. The term "multi-band approach" refers to processing non-overlapping frequency channels of speech, or sub-bands, independently, in contrast to the conventional "full-band approach," which uses the entire bandwidth of speech as the basis for recognition. A typical multi-band system extracts features from each sub-band, trains classifiers on these sub-band features to estimate phone probabilities, and uses these probabilities to form the best word hypothesis.
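The generic multi-band pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system proposed in this work: the function names `split_into_bands` and `combine_band_posteriors` are hypothetical, the spectrogram is assumed to be a (frames, frequency bins) array, and the per-band posteriors are merged with a simple log-linear (weighted geometric mean) rule rather than with a graphical model.

```python
import numpy as np

def split_into_bands(spectrogram, num_bands):
    """Split a (frames, freq_bins) array into non-overlapping sub-bands."""
    return np.array_split(spectrogram, num_bands, axis=1)

def combine_band_posteriors(band_posteriors, weights=None):
    """Merge per-band class posteriors with a log-linear rule.

    band_posteriors: list of (frames, classes) arrays whose rows sum to 1.
    weights: optional per-band weights (assumed uniform if omitted).
    """
    stacked = np.stack(band_posteriors)           # (bands, frames, classes)
    if weights is None:
        weights = np.full(len(band_posteriors), 1.0 / len(band_posteriors))
    weights = np.asarray(weights, dtype=float)
    # Clip before the log so a zero posterior does not produce -inf.
    log_p = np.log(np.clip(stacked, 1e-12, None))
    combined = np.tensordot(weights, log_p, axes=1)   # (frames, classes)
    # Subtract the row maximum for numerical stability, then renormalize.
    combined -= combined.max(axis=1, keepdims=True)
    probs = np.exp(combined)
    return probs / probs.sum(axis=1, keepdims=True)

# Usage: one frame, two classes, two sub-band classifiers.
informative_band = np.array([[0.9, 0.1]])
uninformative_band = np.array([[0.5, 0.5]])
merged = combine_band_posteriors([informative_band, uninformative_band])
```

In this sketch a band whose classifier is uninformative (for example, because noise has corrupted that frequency region) pulls the merged posterior toward uniform but does not override the decision of the cleaner band.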
Unlike previous multi-band ASR systems, this work uses multi-band graphical models to classify a set of intermediate attributes of speech rather than classifying phonemes directly. Intermediate attributes of speech can be linguistically motivated or derived automatically from the data. Both types of attributes are explored in this work, with more emphasis on automatically derived attributes.
Every element of the proposed system is aimed at improving the robustness of a speech recognizer to unseen noise conditions, i.e., environmental conditions for which the recognizer was not trained. Because humans tolerate noise far better than state-of-the-art speech recognizers do, imitating certain characteristics of human hearing may improve the robustness of ASR systems. Multi-band processing takes its inspiration from the way humans exploit redundant cues found in multiple frequency regions to maintain robust recognition. Classification of intermediate speech attributes is motivated by studies showing that people can discern certain speech attributes, such as voicing and nasality, despite confounding noise. We hypothesize that the proposed multi-band approach will significantly improve recognition performance on noisy speech.