

