Audio-Visual Speaker Diarization
Mary Tai Knox and Nelson Morgan
Swiss National Science Foundation IM2
The goal of speaker diarization is to partition an audio recording into speaker-homogeneous regions. This involves several tasks, including separating the audio into speech and non-speech regions and assigning the appropriate speaker(s) to each speech region.
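To make the task definition concrete, the following is a minimal sketch of what a diarization output might look like. It assumes, purely for illustration, that some upstream system has already labeled each fixed-length audio frame with the set of active speakers (an empty set meaning non-speech); the function name `frames_to_segments`, the `frame_step` parameter, and the speaker IDs are hypothetical and not taken from any particular system.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_step=0.01):
    """Collapse per-frame speaker sets into speaker-homogeneous segments.

    frame_labels: iterable of sets of speaker IDs, one per frame
                  (assumed input format; an empty set marks non-speech).
    frame_step:   frame duration in seconds (assumed constant).
    Returns a list of (start_sec, end_sec, speakers) tuples, where
    speakers is a sorted tuple of the IDs active in that region.
    """
    segments = []
    index = 0  # index of the first frame in the current run
    # groupby merges consecutive frames that carry the same speaker set.
    for speakers, run in groupby(frozenset(f) for f in frame_labels):
        length = sum(1 for _ in run)
        if speakers:  # skip non-speech runs
            start = index * frame_step
            end = (index + length) * frame_step
            segments.append((start, end, tuple(sorted(speakers))))
        index += length
    return segments


if __name__ == "__main__":
    # Hypothetical frame labels: silence, A, A, A and B overlapping, silence, B.
    frames = [set(), {"A"}, {"A"}, {"A", "B"}, set(), {"B"}]
    for start, end, speakers in frames_to_segments(frames, frame_step=0.5):
        print(f"{start:.1f}-{end:.1f} s: {', '.join(speakers)}")
```

Representing each frame as a set of speakers, rather than a single label, leaves room for overlapped speech, which the phrase "speaker(s)" above allows for.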
Speaker diarization provides useful information in many speech-related areas. For example, it could be used to separate automatic speech recognition output by speaker, thereby making the transcript easier to follow. In addition, as remote meetings become more prevalent, speaker diarization could be used to inform remote participants of who is currently speaking.
A significant amount of work has been done in speaker diarization using audio-only data. However, current diarization datasets also include video data. In this project, we hope to exploit information from the video data to improve current state-of-the-art speaker diarization systems.