Electrical Engineering and Computer Sciences
UC Berkeley
2008 Research Summary

Phonetic- and Speaker-Discriminant Features for Speaker Recognition

Lara Lynn Stoll, Nikki Mirghafori1, Joe Frankel2 and Nelson Morgan

The speaker recognition task is to decide whether a (previously unseen) test utterance belongs to a given target speaker, for whom only a limited amount of training data is available. One traditionally successful approach to speaker recognition uses low-level cepstral features extracted from speech in a Gaussian mixture model (GMM) system. Instead of using such cepstral features directly, we use a multi-layer perceptron (MLP) to transform the cepstral features into discriminative features better suited for speaker recognition. Two types of MLP output targets are considered: phones (Tandem-MLP) and speakers (Speaker-MLP).
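
As a rough illustration of the baseline approach, the following Python sketch scores a test utterance as a log-likelihood ratio between a target-speaker GMM and a background GMM. All data, model sizes, and the decision threshold are placeholders; a real system would train on cepstral (e.g., MFCC) frames and typically MAP-adapt the target model from a universal background model rather than train it independently.

```python
# Minimal sketch of GMM-based speaker verification (hypothetical data;
# a real system uses MFCC frames and a MAP-adapted universal background model).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
target_feats = rng.normal(0.0, 1.0, size=(2000, 13))    # stand-in for target-speaker cepstra
impostor_feats = rng.normal(0.5, 1.2, size=(8000, 13))  # stand-in for background speech
test_feats = rng.normal(0.0, 1.0, size=(500, 13))       # stand-in for one test utterance

target_gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(target_feats)
background_gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(impostor_feats)

# score() gives the average per-frame log-likelihood; the difference is the
# log-likelihood ratio, compared against a tuned threshold (0.0 is a placeholder).
llr = target_gmm.score(test_feats) - background_gmm.score(test_feats)
print("accept" if llr > 0.0 else "reject", llr)
```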

Originally developed for automatic speech recognition, Tandem/HATS MLP features incorporate longer-term temporal information through the use of MLPs whose outputs are phone posteriors [1,2]. We use the output activations of the Tandem-MLP as features in a GMM speaker recognition system, with the idea that a speaker's phonetic information can be used to distinguish that speaker from others. For the Speaker-MLP, we use the hidden activations as features in a support vector machine (SVM) system, with the intuition that these hidden activations represent a nonlinear mapping of the input cepstral features onto a general set of speaker patterns. We observe that using a smaller set of MLP training speakers, chosen through clustering, yields system performance similar to that of a Speaker-MLP trained with many more speakers.
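
The Speaker-MLP idea can be sketched as follows, under simplifying assumptions: an MLP is trained with speaker targets, its first-layer hidden activations are reused as frame-level features, and utterance-level vectors (here, simple mean-pooling, a stand-in for the actual feature aggregation) feed a target-vs-impostor SVM. All data, dimensions, and the clustering-based speaker selection are omitted or faked for brevity.

```python
# Minimal sketch of the Speaker-MLP pipeline (hypothetical data and sizes).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_frames, n_ceps, n_mlp_speakers = 5000, 13, 50
X = rng.normal(size=(n_frames, n_ceps))              # stand-in cepstral frames
y = rng.integers(0, n_mlp_speakers, size=n_frames)   # speaker labels for MLP training

# MLP trained with speaker targets; only its hidden layer is reused afterward.
mlp = MLPClassifier(hidden_layer_sizes=(100,), activation='tanh',
                    max_iter=50).fit(X, y)

def hidden_activations(frames):
    # Forward pass through the input-to-hidden layer only.
    return np.tanh(frames @ mlp.coefs_[0] + mlp.intercepts_[0])

def utterance_feature(frames):
    # Mean-pool frame-level hidden activations into one vector per utterance.
    return hidden_activations(frames).mean(axis=0)

# Target-vs-impostor SVM trained on utterance-level features (toy labels).
utts = [rng.normal(size=(300, n_ceps)) for _ in range(40)]
feats = np.stack([utterance_feature(u) for u in utts])
labels = np.array([1] * 20 + [0] * 20)  # 1 = target speaker, 0 = impostor
svm = LinearSVC().fit(feats, labels)
print(svm.decision_function(feats[:3]))
```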

On the NIST 2004 Speaker Recognition Evaluation, both the Tandem-GMM and Speaker-SVM systems improve upon a basic GMM baseline, but neither contributes further gains in a score-level combination with a state-of-the-art cepstral GMM system. We believe that the normalization and channel compensation techniques applied to the current state-of-the-art GMM have reduced channel mismatch errors to the point that the contributions of the MLP systems are no longer additive.
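
For concreteness, score-level combination can be as simple as a weighted sum of per-system scores after putting them on a common scale; the sketch below uses made-up trial scores and a made-up fusion weight, and does not reproduce the evaluation's actual fusion or normalizations (e.g., T-norm).

```python
# Minimal sketch of score-level fusion (hypothetical scores and weight).
import numpy as np

cepstral_gmm_scores = np.array([1.2, -0.4, 0.9, -1.1])  # stand-in trial scores
tandem_gmm_scores = np.array([0.8, -0.2, 0.5, -0.7])

def standardize(scores):
    # Zero-mean, unit-variance scaling so the two systems share a scale
    # (a simplification of the normalizations used in practice).
    return (scores - scores.mean()) / scores.std()

w = 0.7  # hypothetical weight favoring the stronger cepstral system
fused = w * standardize(cepstral_gmm_scores) + (1 - w) * standardize(tandem_gmm_scores)
print(fused)
```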

[1] B. Chen, Q. Zhu, and N. Morgan, "Learning Long-Term Temporal Features in LVCSR Using Neural Networks," Proc. Int. Conf. Spoken Language Processing, October 2004.
[2] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, "On Using MLP Features in LVCSR," Proc. Int. Conf. Spoken Language Processing, October 2004.

1International Computer Science Institute (ICSI)
2Centre for Speech Technology Research, Edinburgh, UK