Phonetic- and Speaker-Discriminant Features for Speaker Recognition
Lara Lynn Stoll, Nikki Mirghafori1, Joe Frankel2 and Nelson Morgan
The speaker recognition task is that of deciding whether or not a (previously unseen) test utterance belongs to a given target speaker, for whom there is only a limited amount of training data available. One traditionally successful approach to speaker recognition involves using low-level cepstral features extracted from speech in a Gaussian mixture model (GMM) system. Instead of using such cepstral features directly, we use a multi-layer perceptron (MLP) to transform the cepstral features into discriminative features better suited for speaker recognition. Two types of MLP output targets are considered: phones (Tandem-MLP) and speakers (Speaker-MLP).
Originally developed for automatic speech recognition, Tandem/HATS MLP features incorporate longer term temporal information through the use of MLPs whose outputs are phone posteriors [1,2]. We use the output activations of the Tandem-MLP as features in a GMM speaker recognition system, with the idea that the phonetic information of a speaker can be used to distinguish that speaker from others. For the Speaker-MLP, we use the hidden activations as features in a support vector machine (SVM) system, with the intuition that these hidden activations represent a nonlinear mapping of the input cepstral features into a general set of speaker patterns. We observe that using a smaller set of MLP training speakers, chosen through clustering, yields system performance similar to that of a Speaker-MLP trained with many more speakers.
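As an illustration of the Speaker-MLP idea, the following minimal sketch (not the authors' implementation; all dimensions, weights, and names are illustrative placeholders) shows how per-frame cepstral features could be passed through a trained MLP's hidden layer, with the hidden activations taken as speaker-discriminant features for a downstream SVM:

```python
import numpy as np

# Minimal sketch, assuming a one-hidden-layer Speaker-MLP whose hidden
# activations serve as features. Dimensions and weights are placeholders;
# in practice the weights come from training the MLP with speaker targets.

rng = np.random.default_rng(0)

n_cepstral = 39   # e.g., MFCCs + deltas + double-deltas per frame
n_hidden = 100    # hidden units whose activations become the features

# Randomly initialized weights stand in for a trained Speaker-MLP.
W1 = rng.standard_normal((n_cepstral, n_hidden)) * 0.1
b1 = np.zeros(n_hidden)

def hidden_features(cepstra):
    """Map frames of cepstral features (T x n_cepstral) to hidden
    activations (T x n_hidden), used as inputs to an SVM system."""
    return np.tanh(cepstra @ W1 + b1)  # nonlinear mapping of the input

frames = rng.standard_normal((200, n_cepstral))  # 200 frames of an utterance
feats = hidden_features(frames)
print(feats.shape)  # (200, 100)
```

The Tandem-GMM system is analogous, except that the MLP is trained with phone targets and its output activations (phone posteriors) are used as features for a GMM.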
For the NIST Speaker Recognition Evaluation 2004, both the Tandem-GMM and Speaker-SVM systems improve upon a basic GMM baseline, but yield no further gains when combined at the score level with a state-of-the-art cepstral GMM system. We believe that the normalizations and channel compensation techniques applied in the current state-of-the-art GMM have reduced channel mismatch errors to the point that the contributions of the MLP systems are no longer additive.
- [1] B. Chen, Q. Zhu, and N. Morgan, "Learning Long-Term Temporal Features in LVCSR Using Neural Networks," Proc. Int. Conf. Spoken Language Processing, October 2004.
- [2] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, "On Using MLP Features in LVCSR," Proc. Int. Conf. Spoken Language Processing, October 2004.
1International Computer Science Institute (ICSI)
2Centre for Speech Technology Research, Edinburgh, UK