Automatic speech recognition (ASR) provides a natural interface to small form-factor computers (such as PDAs), since keyboards and large displays are absent on these platforms. However, robust, large-vocabulary ASR requires hardware resources far beyond those available on current PDAs. Emerging architectures, such as Vector IRAM at UC Berkeley and Imagine at Stanford, provide a partial solution by delivering very high performance for relatively little power. However, for speech recognition to take advantage of these architectures, the components of the system must be redesigned with the new systems in mind.
We are currently adapting the workstation-based ASR system used at ICSI to run efficiently on these architectures. Two out of the three major components of ICSI's speech system, the acoustic front-end and the phoneme probability estimator, contain computational kernels that are very regular (FFT and matrix-matrix multiply, respectively). These components run extremely efficiently on both architectures. The third component, the decoder, consists of a highly pruned (and therefore irregular) search through all possible utterances. Thus, the primary focus of our current effort is on this portion of the speech system.
Our initial implementation consists of a small vocabulary system. With a small vocabulary, it is not necessary to share state among similar words; rather, one can evaluate all the words separately. This allows an efficient, regular implementation. On IRAM, we arrange batches of words with total length equal to the vector length. On Imagine, we batch words such that the total length will fit in the cluster memory. We are in the process of analyzing the results of this approach.
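The batching step described above can be sketched as a simple greedy packing problem: group words so that the total length of each batch stays within a fixed capacity (the vector length on IRAM, or the cluster-memory budget on Imagine). The following sketch uses a next-fit-decreasing heuristic; all names are illustrative, not taken from the actual ICSI implementation.

```python
def batch_words(word_lengths, capacity):
    """Group (word, length) pairs into batches whose total length <= capacity.

    Greedy next-fit-decreasing: sort longest-first, start a new batch
    whenever the current word would overflow the capacity.
    """
    batches = []
    current, current_len = [], 0
    for word, length in sorted(word_lengths, key=lambda p: -p[1]):
        if current_len + length > capacity:
            batches.append(current)
            current, current_len = [], 0
        current.append(word)
        current_len += length
    if current:
        batches.append(current)
    return batches

# Toy vocabulary with per-word state lengths and a capacity standing in
# for the vector length / cluster-memory limit.
vocab = [("yes", 3), ("no", 2), ("stop", 4), ("go", 2), ("cancel", 6)]
print(batch_words(vocab, capacity=8))
```

A production system would likely use a more careful bin-packing strategy, but the point is that a fixed hardware capacity turns the irregular vocabulary into regular, fully packed work units.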
Future work includes running a large vocabulary system on these architectures. This involves picking a search order that will maximize reuse of state from previous searches (e.g., if the word "architecture" has already been processed, most of the work can be reused for the word "architectural"). Language modeling, beam pruning, and least-upper-bound path calculations may also be accelerated on these architectures.
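The prefix-reuse idea above can be made concrete with a small sketch: if words are visited in lexicographic order, each word shares its longest possible prefix with the previously processed word, and the search state for that prefix can be reused. The function names here are hypothetical, not part of the ICSI decoder.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reusable_work(vocab):
    """Count the characters of search state reusable when the vocabulary
    is processed in sorted (lexicographic) order."""
    reused = 0
    prev = ""
    for word in sorted(vocab):
        reused += shared_prefix_len(prev, word)
        prev = word
    return reused

words = ["architectural", "architecture", "archive"]
print(reusable_work(words))  # "architectur" (11) + "archi" (5) reused
```

Sorting is only one possible search order; the real question is which ordering maximizes total reuse under the decoder's other constraints (pruning, memory layout), but lexicographic order already captures most shared-prefix savings.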
This work further explores the multi-band approach to automatic speech recognition (ASR), using probabilistic graphical models (PGMs) to classify a set of intermediate speech attributes. The term "multi-band approach" refers to an approach that independently processes non-overlapping frequency channels, or bands, of speech, in contrast to the conventional "full-band approach," which uses the entire bandwidth of speech as a basis for recognition. Typically, such systems process speech independently in each sub-band, train classifiers on sub-band features to estimate phone probabilities, and combine these probabilities to form the best word hypothesis.
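The multi-band pipeline just described can be sketched minimally as follows, assuming equal-width non-overlapping sub-bands and a geometric-mean combination of per-band phone posteriors. The per-band classifiers are stubbed out here; a real system would train a classifier (e.g., an MLP or graphical model) per band, and the function names are illustrative.

```python
import numpy as np

def split_bands(spectrogram, n_bands):
    """Split a (freq, time) spectrogram into n_bands non-overlapping sub-bands."""
    return np.array_split(spectrogram, n_bands, axis=0)

def combine_posteriors(band_posteriors):
    """Merge per-band phone posteriors by geometric mean, then renormalize."""
    log_post = np.mean([np.log(p) for p in band_posteriors], axis=0)
    p = np.exp(log_post)
    return p / p.sum()

# Toy example: posteriors over 4 phone classes from 3 sub-band classifiers.
posteriors = [np.array([0.7, 0.1, 0.1, 0.1]),
              np.array([0.6, 0.2, 0.1, 0.1]),
              np.array([0.5, 0.3, 0.1, 0.1])]
print(combine_posteriors(posteriors))
```

The geometric mean is only one combination rule; weighted combinations that downweight noise-corrupted bands are a common refinement in multi-band systems.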
Unlike previous multi-band automatic speech recognition (ASR) systems, this work uses multi-band graphical models to classify a set of intermediate attributes of speech instead of phonemes directly. Intermediate attributes of speech can be linguistically motivated or derived automatically from the data. Both of these types of attributes are explored in this work with more emphasis on automatically-derived attributes.
All of the proposed elements of this system are aimed at the goal of improving the robustness of a speech recognizer to unseen noise conditions, i.e., environmental conditions for which the recognizer was not trained. Because humans exhibit far greater tolerance to noise than state-of-the-art speech recognizers, it may be helpful to imitate certain characteristics of human hearing to improve the robustness of ASR systems. Multi-band processing takes its inspiration from the way humans exploit redundant cues found in multiple frequency regions to maintain robust recognition. Classification of intermediate speech attributes is motivated by studies showing that people can discern certain speech attributes, like voicing and nasality, despite confounding noise. We hypothesize that our proposed multi-band approach will significantly improve the recognizer's performance on noisy speech.
Auditory researchers believe that the human auditory system computes many different representations of sound, reflecting different time and frequency resolutions. However, automatic speech recognition systems tend to be based on a single representation of the short-term speech spectrum.
We are attempting to improve the robustness of automatic speech recognition systems by using a set of two-dimensional Gabor filters with varying extents in time and frequency and varying ripple rates to analyze a spectrogram. These filters have some characteristics in common with the responses of neurons in the auditory cortex of primates, and can also be seen as two-dimensional frequency analyzers.
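A two-dimensional Gabor filter of the kind described above is a sinusoid, with separate spectral and temporal modulation rates, windowed by a 2-D Gaussian envelope. The sketch below is a hedged illustration with assumed parameter names, not the filters used in the actual system.

```python
import numpy as np

def gabor_2d(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Return a real 2-D Gabor filter of shape (n_freq, n_time).

    omega_f / omega_t set the spectral and temporal modulation (ripple)
    rates; sigma_f / sigma_t set the Gaussian extents in frequency and time.
    """
    f = np.arange(n_freq) - n_freq // 2   # spectral axis (channels)
    t = np.arange(n_time) - n_time // 2   # temporal axis (frames)
    F, T = np.meshgrid(f, t, indexing="ij")
    envelope = np.exp(-0.5 * ((F / sigma_f) ** 2 + (T / sigma_t) ** 2))
    carrier = np.cos(omega_f * F + omega_t * T)
    return envelope * carrier

# Build one filter; features come from 2-D correlation of such filters
# with a (freq, time) log-spectrogram.
filt = gabor_2d(n_freq=15, n_time=25, omega_f=0.5, omega_t=0.2,
                sigma_f=3.0, sigma_t=5.0)
print(filt.shape)
```

Varying the extents (sigma_f, sigma_t) and ripple rates (omega_f, omega_t) across a filter bank yields the multiple time-frequency resolutions the auditory evidence suggests.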
Promising results have been obtained in a noisy digit recognition task, especially when this analysis method was combined with more conventional analysis. Work is ongoing in the use of this approach for larger-vocabulary recognition tasks, and in the use of the Gabor filters in a multi-stream, multi-classifier architecture.
We investigate the use of prosody for the detection of frustration and annoyance in natural human-computer dialog. In addition to prosodic features, we examine the contribution of language model information and speaking "style." Results show that a prosodic model can predict whether an utterance is neutral versus "annoyed or frustrated" with an accuracy on par with that of human interlabeler agreement. Accuracy increases when discriminating only "frustrated" from other utterances, and when using only those utterances on which labelers originally agreed. Furthermore, prosodic model accuracy degrades only slightly when using recognized versus true words. Language model features, even if based on true words, are relatively poor predictors of frustration. Finally, we find that hyperarticulation is not a good predictor of emotion; the two phenomena often occur independently.
1. Staff, ICSI, SRI International
2. Staff, ICSI, SRI International