Audio Diarization for Meetings Speech Processing
Nelson Morgan, Oriol Vinyals and Gerald Friedland
Swiss National Science Foundation and AT&T
Perhaps more than any other domain, meetings represent a rich source of content for spoken language research and technology. Two common (and complementary) forms of meetings speech processing are automatic speech recognition (ASR), which seeks to determine what was said, and speaker diarization, which seeks to determine who spoke when. Because of the complexity of meetings, however, both forms of processing present a number of challenges. In speech recognition, crosstalk speech is often the primary source of errors for audio from the personal microphones worn by meeting participants. In speaker diarization, overlapped speech generates a significant number of errors for most state-of-the-art systems, which are generally unequipped to deal with this phenomenon.
In this work, we seek to address these two issues by employing audio diarization, using an HMM-based segmenter to identify regions of interest (local speech for ASR and overlapped speech for speaker diarization) and thereby improve the performance of the respective systems. A particular focus is the selection of features that work well for these segmentation tasks. In addition, in the case of overlapped speech, we investigate how two processing techniques, overlap detection and exclusion, can use overlap information effectively to improve speaker diarization performance.