Electrical Engineering and Computer Sciences

COLLEGE OF ENGINEERING

UC Berkeley

   

2008 Research Summary

Real-Time, Online, Speaker Diarization

Nelson Morgan

Speaker diarization segments an audio recording into speaker-homogeneous regions in order to answer the question "who spoke when?" Most applications of speaker diarization, e.g., automatic speech recognition (ASR), large-volume audio retrieval, and multi-modal meeting event detection, require real-time, online operation. We aim to develop a real-time, online speaker diarization system.

To achieve real-time speaker diarization, we start from a state-of-the-art system that combines agglomerative clustering under the Bayesian Information Criterion (BIC) with Gaussian Mixture Models (GMMs) of frame-based cepstral features (MFCCs). To speed this system up, a fast-match framework has been proposed [1]. The basic idea is to use a computationally cheap method to reduce the hypothesis space explored by the more expensive and more accurate search. Specifically, two fast-match strategies have been developed, one based on the pitch correlogram and one based on KL divergence.
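As an illustration of the fast-match idea, the following minimal sketch ranks candidate cluster pairs by a cheap symmetric KL divergence so that an expensive merge test (such as a BIC comparison of GMMs) would only need to be run on a short list. It assumes each cluster is summarized by a single diagonal-covariance Gaussian; the function names (`diag_gaussian`, `fast_match`) and the `keep` parameter are hypothetical, and the actual procedure in [1] may differ.

```python
import math
from itertools import combinations

def diag_gaussian(frames):
    """Fit a diagonal-covariance Gaussian: per-dimension mean and variance."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so that the divergence below stays finite.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def kl_diag(p, q):
    """KL(p || q) between two diagonal Gaussians given as (mean, var)."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(b / a) + (a + (m1 - m2) ** 2) / b - 1.0
        for m1, a, m2, b in zip(mp, vp, mq, vq)
    )

def symmetric_kl(p, q):
    """Symmetrized divergence, so the order of the two clusters is irrelevant."""
    return kl_diag(p, q) + kl_diag(q, p)

def fast_match(clusters, keep=1):
    """Return the `keep` closest cluster pairs (assumed shortlist size).

    `clusters` maps a cluster name to a list of feature frames. Only the
    returned pairs would be passed on to the expensive merge test.
    """
    models = {name: diag_gaussian(frames) for name, frames in clusters.items()}
    scored = sorted(
        (symmetric_kl(models[a], models[b]), a, b)
        for a, b in combinations(sorted(models), 2)
    )
    return [(a, b) for _, a, b in scored[:keep]]
```

With N clusters, the cheap pass still touches all N(N-1)/2 pairs, but each comparison costs only a few arithmetic operations per feature dimension, whereas the expensive test is limited to the `keep` surviving pairs.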

Batch-mode agglomerative clustering is not feasible for online diarization, because it needs the whole recording before it can start merging clusters. We instead propose to build a generic speaker space from out-of-domain data, in which each speaker serves as an anchor stimulus. A new segment of speech is then represented by its responses to these anchor stimuli, i.e., the likelihood of the speech given each anchor model. We will explore a set of novelty detection algorithms for deciding where the new segment is positioned in this space relative to earlier, already observed speaker segments.
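The anchor-model representation and the novelty decision can be sketched as follows. This is only an illustration under stated assumptions: each anchor is a single diagonal Gaussian rather than a GMM, a segment's response is its average per-frame log-likelihood, and novelty is decided by a simple Euclidean distance threshold to the nearest known speaker centroid. The class name `OnlineDiarizer` and the `threshold` parameter are hypothetical; the novelty detection algorithms to be explored in this project are not specified here.

```python
import math

def avg_loglik(frames, anchor):
    """Average per-frame log-likelihood under a diagonal Gaussian anchor."""
    mean, var = anchor
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def anchor_vector(frames, anchors):
    """Represent a segment by its response to every anchor model."""
    return [avg_loglik(frames, a) for a in anchors]

class OnlineDiarizer:
    """Assign each incoming segment to a known speaker or to a new one."""

    def __init__(self, anchors, threshold):
        self.anchors = anchors
        self.threshold = threshold  # assumed novelty threshold in anchor space
        self.speakers = []          # list of [centroid, segment_count]

    def assign(self, frames):
        vec = anchor_vector(frames, self.anchors)
        best, best_dist = None, float("inf")
        for idx, (centroid, _) in enumerate(self.speakers):
            d = math.dist(vec, centroid)
            if d < best_dist:
                best, best_dist = idx, d
        # Novelty: nothing seen yet, or the nearest known speaker is too far.
        if best is None or best_dist > self.threshold:
            self.speakers.append([vec, 1])
            return len(self.speakers) - 1
        # Otherwise update the running mean of the matched speaker.
        centroid, count = self.speakers[best]
        updated = [(c * count + v) / (count + 1) for c, v in zip(centroid, vec)]
        self.speakers[best] = [updated, count + 1]
        return best
```

Because each segment is processed once and only the speaker centroids are stored, the decision for a segment is available as soon as the segment ends, which is what the online setting requires.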

Furthermore, we are investigating alternative models for speaker diarization, such as the hierarchical Dirichlet process (HDP) [2], a generative model that fits the data and addresses the "K-problem" of choosing the number of clusters, which arises in most clustering problems.
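To illustrate how a nonparametric prior avoids fixing the number of clusters in advance, the sketch below samples cluster assignments from a Chinese restaurant process, the partition prior induced by a single Dirichlet process. It is not an HDP and does not look at any audio data; it only shows that the number of clusters is an outcome of sampling rather than an input. The function name `crp_assignments` and the parameters `alpha` and `seed` are assumptions made for this example.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    Item i joins existing cluster k with probability counts[k] / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    counts = []   # number of items currently in each cluster
    labels = []
    for i in range(n_items):
        r = rng.uniform(0.0, i + alpha)
        cumulative = 0.0
        choice = len(counts)          # default: open a new cluster
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                choice = k
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        labels.append(choice)
    return labels, counts
```

Larger values of `alpha` tend to produce more clusters. In a full model, each cluster would also carry a likelihood term (for example, a speaker model) so that the data, and not only the prior, drive the assignments.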

[1] Y. Huang, O. Vinyals, G. Friedland, C. Muller, N. Mirghafori, and C. Wooters, "A Fast-Match Approach for Robust, Faster than Real-Time Speaker Diarization," Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2007.
[2] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet Processes," Journal of the American Statistical Association, Vol. 101, 2006, pp. 1566-1581.