Robust Speech Segmentation Through Non-Speech Interval Suppression
Keywords:
Speech segmentation, non-speech interval suppression, speech processing, voice activity detection, audio signal processing, robust segmentation
Abstract
Speech segmentation, the process of dividing continuous speech into meaningful units such as words, syllables, or phonemes, is a fundamental precursor to numerous speech processing applications, including automatic speech recognition (ASR), speaker diarization, and language acquisition studies. Traditional segmentation methods often lack robustness in challenging acoustic environments, particularly when varied background noise and silent intervals are present. This article proposes and outlines a novel framework for robust speech segmentation based on the principle of "Non-Speech Interval Suppression" (NSIS). The method focuses on accurately identifying and eliminating silent or non-speech regions so that the boundaries of spoken content can be delineated precisely. By integrating advanced Voice Activity Detection (VAD) techniques into a structured segmentation pipeline, NSIS aims to improve segmentation accuracy, especially in noisy conditions. The article discusses the framework's potential to yield cleaner speech segments, which could improve performance in downstream speech technologies and offer a more robust foundation for speech analysis.
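The general idea of suppressing non-speech intervals to expose speech boundaries can be illustrated with a minimal sketch. The abstract does not specify which VAD technique the framework uses, so the example below substitutes a simple short-time energy threshold as a stand-in. The function names, the frame length, the threshold, and the minimum gap length are all hypothetical choices made for illustration, not values from the article.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def speech_segments(samples, frame_len=160, threshold=1e-3, min_gap_frames=3):
    """Return (start_sample, end_sample) pairs for regions judged to be speech.

    Assumption: a frame counts as speech when its energy exceeds `threshold`.
    Non-speech runs shorter than `min_gap_frames` are bridged, so a brief
    pause does not split a segment.
    """
    active = [e > threshold for e in frame_energies(samples, frame_len)]
    segments = []
    start = None  # index of the first frame of the segment being built
    gap = 0       # consecutive non-speech frames seen since the last speech frame
    for i, is_speech in enumerate(active):
        if is_speech:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:
                end = i - gap + 1  # one past the last speech frame
                segments.append((start * frame_len, end * frame_len))
                start = None
                gap = 0
    if start is not None:
        end = len(active) - gap  # drop trailing non-speech frames
        segments.append((start * frame_len, end * frame_len))
    return segments


def tone(n_samples, freq=200.0, rate=8000.0, amp=0.5):
    """Synthetic stand-in for speech: a sine burst."""
    return [amp * math.sin(2 * math.pi * freq * n / rate) for n in range(n_samples)]


# Synthetic signal: silence, burst, silence, burst, silence.
signal = [0.0] * 800 + tone(1600) + [0.0] * 800 + tone(1600) + [0.0] * 800
print(speech_segments(signal))  # [(800, 2400), (3200, 4800)]
```

A fixed energy threshold is the weakest part of this sketch: it works on the clean synthetic signal above but would fail under the varied background noise that the abstract identifies as the main difficulty, which is why a more capable VAD stage would be needed in practice.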
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their articles published in this journal. All articles are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly cited.