P. C. Loizou, Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, 2013), pp. 1-10. https://doi.org/10.1201/b14529-1
Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, Signal Process. 32, 1109-1121 (1984). https://doi.org/10.1109/TASSP.1984.1164453
Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process. 23, 7-19 (2015). https://doi.org/10.1109/TASLP.2014.2364452
K. Tan and D. Wang, “A convolutional recurrent neural network for real-time speech enhancement,” Proc. Interspeech, 3229-3233 (2018). https://doi.org/10.21437/Interspeech.2018-1405
C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust text-to-speech,” Proc. SSW9, 145-150 (2016). https://doi.org/10.21437/SSW.2016-24
Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement,” Proc. Interspeech, 2472-2476 (2020). https://doi.org/10.21437/Interspeech.2020-2537
S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” Proc. Interspeech, 3642-3646 (2017). https://doi.org/10.21437/Interspeech.2017-1428
Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process. 27, 1256-1266 (2019). https://doi.org/10.1109/TASLP.2019.2915167
H. Schröter, A. N. Gomez, and T. Gerkmann, “DeepFilterNet2: Towards real-time speech enhancement on embedded devices for full-band audio,” arXiv:2205.05474 (2022). https://doi.org/10.1109/IWAENC53105.2022.9914782
W. Tai, Y. Lei, F. Zhou, G. Trajcevski, and T. Zhong, “DOSE: Diffusion dropout with adaptive prior for speech enhancement,” Proc. NeurIPS, 1-22 (2023).
J.-M. Valin, “A hybrid DSP/deep learning approach to real-time full-band speech enhancement,” Proc. MMSP, 1-5 (2018). https://doi.org/10.1109/MMSP.2018.8547084
A. Li, C. Zheng, L. Zhang, and X. Li, “Glance and gaze: A collaborative learning framework for single-channel speech enhancement,” Appl. Acoust. 187, 108499 (2022). https://doi.org/10.1016/j.apacoust.2021.108499
G. Zhang, L. Yu, C. Wang, and J. Wei, “Multi-scale temporal frequency convolutional network with axial attention for speech enhancement,” Proc. ICASSP, 9122-9126 (2022). https://doi.org/10.1109/ICASSP43922.2022.9746610
H. Dubey, V. Gopal, R. Cutler, A. Aazami, S. Matusevych, S. Braun, S. E. Eskimez, M. Thakker, T. Yoshioka, H. Gamper, and R. Aichner, “ICASSP 2022 deep noise suppression challenge,” Proc. ICASSP, 9271-9275 (2022). https://doi.org/10.1109/ICASSP43922.2022.9747230
10.1109/ICASSP43922.2022.9747230- Publisher :The Acoustical Society of Korea
- Publisher(Ko) :한국음향학회
- Journal Title :The Journal of the Acoustical Society of Korea
- Journal Title(Ko) :한국음향학회지
- Volume : 44
- No :5
- Pages :540-547
- Received Date : 2025-08-24
- Accepted Date : 2025-09-08
- DOI :https://doi.org/10.7776/ASK.2025.44.5.540