Robust Speech Segmentation Through Non-Speech Interval Suppression

Dr. Michael J. Thompson; Dr. Yoon-Seok Kim

Authors

Dr. Michael J. Thompson, Centre for Speech Technology Research, University of Edinburgh, United Kingdom
Dr. Yoon-Seok Kim, Department of Electrical and Computer Engineering, Seoul National University, South Korea

Keywords

Speech segmentation, non-speech interval suppression, speech processing, voice activity detection, audio signal processing, robust segmentation

Abstract

Speech segmentation, the process of dividing continuous speech into meaningful units such as words, syllables, or phonemes, is a fundamental precursor to numerous speech processing applications, including automatic speech recognition (ASR), speaker diarization, and language acquisition studies. Traditional segmentation methods often lack robustness in challenging acoustic environments, particularly when background noise varies and silent intervals are present. This article proposes a framework for robust speech segmentation based on the principle of "Non-Speech Interval Suppression" (NSIS). The method focuses on accurately identifying and eliminating silent or non-speech regions so that the boundaries of spoken content can be delineated precisely. By integrating Voice Activity Detection (VAD) techniques into a structured segmentation pipeline, NSIS aims to improve segmentation accuracy, especially in noisy conditions. The article discusses the framework's potential to yield cleaner speech segments, which could improve performance in downstream speech technologies and provide a more robust foundation for speech analysis.
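
The core idea of suppressing non-speech intervals to expose speech boundaries can be illustrated with a minimal sketch. It is not the authors' NSIS implementation: a fixed short-time energy threshold stands in for the VAD stage, and the function names (frame_energies, speech_segments) and parameters (frame_len, threshold, min_gap) are hypothetical choices made for illustration only.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_segments(samples, frame_len=160, threshold=0.01, min_gap=2):
    """Return (start, end) sample indices of regions judged to be speech.

    Assumption: a frame counts as speech when its energy exceeds a fixed
    threshold. Silent runs shorter than min_gap frames are bridged so that
    brief pauses do not split a segment.
    """
    active = [e > threshold for e in frame_energies(samples, frame_len)]
    segments = []
    start = None   # first frame of the segment currently open, if any
    silence = 0    # length of the current silent run inside that segment
    for i, is_speech in enumerate(active):
        if is_speech:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap:
                end = i - silence + 1  # first frame of the silent run
                segments.append((start * frame_len, end * frame_len))
                start = None
                silence = 0
    if start is not None:
        end = len(active) - silence  # drop any trailing silent frames
        segments.append((start * frame_len, end * frame_len))
    return segments


# Synthetic test signal: silence, a 0.5-amplitude tone, then silence.
signal = (
    [0.0] * 320
    + [0.5 * math.sin(2 * math.pi * n / 16) for n in range(480)]
    + [0.0] * 320
)
segments = speech_segments(signal)
print(segments)  # [(320, 800)]
```

A fixed threshold like this would be fragile under the varied background noise the abstract mentions; a noise-adaptive or model-based VAD would replace the single comparison in speech_segments while the boundary-extraction logic stays the same.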


Frontiers in Emerging Computer Science and Information Technology
