Speech Detection, Classification, and Processing for Improved Automatic Speech Recognition in Multiparty Meetings
Kofi Agyeman Boakye and Nelson Morgan
The recognition of speech in multiparty meetings presents a number of challenges owing to the complexity of the domain. The present research paradigm involves two major subtasks, distinguished by the sensors used to collect the audio data: individual microphones (headset or lapel) worn by the meeting participants, and distant microphones (tabletop or array) placed at varying locations within the meeting room. Each of these subtasks, in turn, has its own particular challenges in terms of acoustic phenomena that adversely affect recognition performance.
In the case of the individual microphones, crosstalk speech is often the primary source of errors (manifested as high insertion error rates), and as such the segmentation of local speech is of critical importance. This project seeks to investigate the effectiveness of various acoustically derived features for use in a Hidden Markov Model (HMM) based local speech segmenter, focusing on cross-channel features (i.e., features derived from multiple individual channel signals) because of the nature of the crosstalk phenomenon.
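To make the notion of a cross-channel feature concrete, the sketch below computes one simple example: the frame-level log-energy of a target channel minus the maximum log-energy over the other participants' channels. This is an illustration only, not the project's actual feature set; the function names, frame parameters (25 ms frames, 10 ms hop at 16 kHz), and the specific energy-difference formulation are assumptions for the sketch. The intuition is that when the wearer of the target microphone speaks, that channel's energy dominates; during crosstalk, some other channel dominates.

```python
import numpy as np

def frame_log_energies(signal, frame_len=400, hop=160):
    """Frame-level log energies (25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Small floor avoids log(0) on silent frames.
        energies[i] = np.log(np.sum(frame ** 2) + 1e-10)
    return energies

def cross_channel_log_energy_diff(channels, target):
    """For each frame: target channel's log-energy minus the maximum
    log-energy over all other channels. Large positive values suggest
    local (foreground) speech; negative values suggest crosstalk."""
    E = np.array([frame_log_energies(ch) for ch in channels])
    others = np.delete(E, target, axis=0)
    return E[target] - others.max(axis=0)
```

Such a per-frame feature stream could serve as one observation dimension for an HMM segmenter with, say, "local speech" and "non-local" states; the choice of the maximum over other channels (rather than, e.g., their sum) is one of several plausible designs.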
For the distant microphones, it is overlapped speech between participants that generates a significant number of recognition errors. Unlike crosstalk, this speech cannot simply be excluded; it requires additional processing to improve recognition performance. Here, too, the project seeks to investigate features for segmentation, in this case of overlap regions. In addition, processing techniques to improve recognition accuracy in these regions (e.g., binary masking, harmonic magnitude enhancement) are under study.
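The binary-masking idea can be sketched in its oracle form: keep the time-frequency bins of the mixture where the target talker's magnitude exceeds the interfering talker's, and zero the rest. This "ideal binary mask" assumes access to reference signals for each talker, which is only available in oracle experiments; the STFT parameters and helper names below are likewise assumptions for the sketch, not the project's actual implementation.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed short-time Fourier transform (rows = frames)."""
    win = np.hanning(n_fft)
    frames = [win * x[i:i + n_fft]
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def istft(X, n_fft=512, hop=128):
    """Weighted overlap-add inverse STFT with window normalization."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(X) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, spec in enumerate(X):
        frame = np.fft.irfft(spec, n=n_fft)
        out[i * hop : i * hop + n_fft] += win * frame
        norm[i * hop : i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def ideal_binary_mask(target, interferer, mixture):
    """Oracle binary mask: retain mixture TF bins where the target's
    magnitude exceeds the interferer's, zero the remainder."""
    T, I, M = stft(target), stft(interferer), stft(mixture)
    mask = (np.abs(T) > np.abs(I)).astype(float)
    return istft(mask * M)
```

In practice the mask must be estimated rather than computed from references, e.g. from spatial cues across distant microphones, but the oracle version shows the mechanism and gives an upper bound on what masking can achieve.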