Abstracts for Nelson H. Morgan

The EECS Research Summary for 2003

Large Vocabulary Automatic Speech Recognition on Emerging Architectures

Adam Janin
(Professor Nelson H. Morgan)
(NSF) IIS-0121396 and Swiss Research Network IM2

Automatic speech recognition (ASR) provides a natural interface to small form-factor computers (such as PDAs) since keyboards and large displays are absent on these platforms. However, large vocabulary, robust ASR requires hardware resources far beyond those available on current PDAs. Emerging architectures, such as Vector IRAM at UC Berkeley, and Imagine at Stanford, provide a partial solution by delivering very high performance for relatively little expenditure of power. However, for speech recognition to take advantage of these architectures, the components of the system must be redesigned with the new systems in mind.

We are currently adapting the workstation-based ASR system used at ICSI to run efficiently on these architectures. Two out of the three major components of ICSI's speech system, the acoustic front-end and the phoneme probability estimator, contain computational kernels that are very regular (FFT and matrix-matrix multiply, respectively). These components run extremely efficiently on both architectures. The third component, the decoder, consists of a highly pruned (and therefore irregular) search through all possible utterances. Thus, the primary focus of our current effort is on this portion of the speech system.

Our initial implementation consists of a small vocabulary system. With a small vocabulary, it is not necessary to share state among similar words; rather, one can evaluate all the words separately. This allows an efficient, regular implementation. On IRAM, we arrange batches of words with total length equal to the vector length. On Imagine, we batch words such that the total length will fit in the cluster memory. We are in the process of analyzing the results of this approach.
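The batching idea above can be sketched in a few lines. This is a hypothetical illustration, not ICSI's actual implementation: a greedy packer that groups words so each batch's total length stays within a capacity (the vector length on IRAM, or the cluster memory budget on Imagine). The names `batch_words` and `capacity` are inventions for this sketch.

```python
def batch_words(word_lengths, capacity):
    """Greedily group (word, length) pairs into batches whose total
    length does not exceed `capacity`."""
    batches, current, current_len = [], [], 0
    for word, length in word_lengths:
        if length > capacity:
            raise ValueError(f"word {word!r} alone exceeds capacity")
        # Start a new batch when this word would overflow the current one.
        if current_len + length > capacity:
            batches.append(current)
            current, current_len = [], 0
        current.append(word)
        current_len += length
    if current:
        batches.append(current)
    return batches
```

A greedy first-fit pass like this will not always minimize the number of batches (that is a bin-packing problem), but it keeps the layout regular, which is the property the vector and stream architectures reward.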

Future work includes running a large vocabulary system on these architectures. This involves picking a search order that will maximize reuse of state from previous searches (e.g., if the word "architecture" has already been processed, most of the work can be reused for the word "architectural"). Language modeling, beam pruning, and least-upper-bound path calculations may also be accelerated on these architectures.
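The prefix-reuse idea can be made concrete with a small sketch (hypothetical code, not the actual decoder; real systems share phone-level state rather than characters). Sorting the vocabulary lexicographically places words like "architecture" and "architectural" next to each other, so each word need only process the suffix not shared with its predecessor:

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def reuse_schedule(vocab):
    """Return (word, units_reused_from_previous_word) pairs in an
    order chosen to maximize reuse of earlier search state."""
    schedule, prev = [], ""
    for word in sorted(vocab):
        schedule.append((word, shared_prefix_len(prev, word)))
        prev = word
    return schedule
```

For the vocabulary ["architectural", "architecture", "art"], the schedule reuses 11 units of "architectural" when processing "architecture", so only the final letters need fresh evaluation.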

Send mail to the author: janin@icsi.berkeley.edu

A Multi-Band Approach to Robust Speech Recognition Using Graphical Models for Intermediate Classification

Barry Chen
(Professor Nelson H. Morgan)
DARPA: EARS Novel Approaches

This work further explores the multi-band approach to automatic speech recognition (ASR) using probabilistic graphical models (PGM) to classify a set of intermediate speech attributes. The term "multi-band approach" refers to an approach that independently processes non-overlapping frequency channels, or bands, in speech, in contrast to the conventional "full-band approach," which uses the entire bandwidth of speech as the basis for recognition. Typically, such systems process each sub-band independently, train classifiers on sub-band features to estimate phone probabilities, and combine those probabilities to form the best word hypothesis.
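The band-splitting step described above can be sketched as follows. This is a hedged illustration, not the actual system: it assumes a spectrogram laid out as a (frames x frequency-bins) array and simply partitions the frequency axis into contiguous, non-overlapping sub-bands, each of which would then feed its own classifier.

```python
import numpy as np

def split_into_bands(spectrogram, n_bands):
    """Partition the frequency axis (axis 1) of a (frames x bins)
    spectrogram into n_bands contiguous, non-overlapping sub-bands."""
    return np.array_split(spectrogram, n_bands, axis=1)
```

In a full multi-band system, a classifier trained on each sub-band would emit per-frame phone (or attribute) posteriors, and a combination stage would merge the per-band posteriors before decoding.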

Unlike previous multi-band ASR systems, this work uses multi-band graphical models to classify a set of intermediate attributes of speech instead of phonemes directly. Intermediate attributes of speech can be linguistically motivated or derived automatically from the data. Both types of attributes are explored in this work, with more emphasis on automatically derived attributes.

All of the proposed elements of this system are aimed at the goal of improving the robustness of a speech recognizer to unseen noise conditions, i.e., environmental conditions for which the recognizer was not trained. Because humans exhibit a great amount of tolerance to noise compared to state-of-the-art speech recognizers, it may be helpful to imitate certain characteristics of human hearing to improve the robustness of ASR systems. Multi-band processing takes its inspiration from the way humans exploit redundant cues found in multiple frequency regions to maintain robust recognition. Classification of intermediate speech attributes is motivated by studies showing that people can discern certain speech attributes, like voicing and nasality, despite confounding noise. We hypothesize that our proposed multi-band approach will significantly improve the recognizer's performance on noisy speech.

Send mail to the author: byc@icsi.berkeley.edu

Gabor Filter Analysis for Automatic Speech Recognition

David Gelbart and Michael Kleinschmidt1
(Professor Nelson H. Morgan)
Deutsche Forschungsgemeinschaft, Natural Sciences and Engineering Research Council of Canada, and German Ministry for Education and Research

Auditory researchers believe that the human auditory system computes many different representations of sound, reflecting different time and frequency resolutions. However, automatic speech recognition systems tend to be based on a single representation of the short-term speech spectrum.

We are attempting to improve the robustness of automatic speech recognition systems by using a set of two-dimensional Gabor filters with varying extents in time and frequency and varying ripple rates to analyze a spectrogram [1]. These filters have some characteristics in common with the responses of neurons in the auditory cortex of primates, and can also be seen as two-dimensional frequency analyzers.
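A two-dimensional Gabor filter of the kind described above can be sketched as a Gaussian envelope modulated by a complex plane wave over the time-frequency plane; the envelope widths set the extents in time and frequency, and the carrier frequencies set the ripple rate and direction. This is a generic illustration under those assumptions, not the filter bank used in the cited work, and the function name and parameters are inventions for this sketch.

```python
import numpy as np

def gabor_2d(size_t, size_f, sigma_t, sigma_f, omega_t, omega_f):
    """Complex 2D Gabor filter on a (size_t x size_f) time-frequency grid.

    sigma_t, sigma_f: Gaussian envelope extents in time and frequency.
    omega_t, omega_f: carrier frequencies (radians/sample) along each axis.
    """
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T**2) / (2 * sigma_t**2) - (F**2) / (2 * sigma_f**2))
    carrier = np.exp(1j * (omega_t * T + omega_f * F))
    return envelope * carrier
```

A spectro-temporal feature would then be obtained, for example, as the magnitude of the correlation between such a filter and a spectrogram patch.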

Promising results have been obtained in a noisy digit recognition task [2], especially when this analysis method was combined with more conventional analysis. Work is ongoing in the use of this approach for larger-vocabulary recognition tasks, and in the use of the Gabor filters in a multi-stream, multi-classifier architecture.

[1] M. Kleinschmidt, "Improving Word Accuracy with Gabor Feature Extraction," Forum Acusticum, Seville, Spain, September 2002.
[2] M. Kleinschmidt and D. Gelbart, "Spectro-Temporal Gabor Features as a Front End for Automatic Speech Recognition," Int. Conf. Spoken Language Processing, Denver, CO, September 2002.
1Outside Adviser (non-EECS), University of Oldenburg

More information (http://www.icsi.berkeley.edu/~gelbart) or

Send mail to the author: gelbart@eecs.berkeley.edu

Prosody-based Automatic Detection of Annoyance and Frustration in Human-Computer Dialog

Jeremy Ang, Elizabeth Shriberg1, and Andreas Stolcke2
(Professor Nelson H. Morgan)
(DARPA) ROAR N66001-99-D-8504, DARPA Communicator Project at ICSI and University of Washington, (NASA) NCC 2-1256, and (NSF) IRI-9619921

We investigate the use of prosody for the detection of frustration and annoyance in natural human-computer dialog. In addition to prosodic features, we examine the contribution of language model information and speaking "style." Results show that a prosodic model can predict whether an utterance is neutral versus "annoyed or frustrated" with an accuracy on par with that of human interlabeler agreement. Accuracy increases when discriminating only "frustrated" from other utterances, and when using only those utterances on which labelers originally agreed. Furthermore, prosodic model accuracy degrades only slightly when using recognized versus true words. Language model features, even if based on true words, are relatively poor predictors of frustration. Finally, we find that hyperarticulation is not a good predictor of emotion; the two phenomena often occur independently.

1Staff, ICSI, SRI International
2Staff, ICSI, SRI International

More information (http://www.icsi.berkeley.edu/~jca) or

Send mail to the author: jca@eecs.berkeley.edu