A study on deep neural network based speech enhancement by combining outlier-robust loss

Bongsu Jung; Wooil Kim

doi:10.7776/ASK.2025.44.1.066

2025 Vol.44, Issue 1

Research Article

31 January 2025. pp. 66-73

Abstract

Mask-based speech enhancement techniques have proven effective in improving speech quality by separating speech from noise, and many studies have explored restoring the spectrum of noisy speech with Deep Neural Networks (DNNs). One line of research for further improving DNN-based speech enhancement focuses on the loss function. Previous studies proposed training speech enhancement models with a combination of the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) loss in the time domain and the Mean Squared Error (MSE) loss on the magnitude spectrum in the frequency domain. However, the MSE loss is sensitive to outliers: non-uniform noise in the training data acts as outliers and adversely affects the loss. In this study, we replaced the MSE loss with the Huber loss, a robust regression loss function, and with its variant, the Berhu loss, and evaluated their impact on speech enhancement performance. We compared models trained with the proposed loss combinations to models trained with the conventional SI-SNR + MSE combination, using Source-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) as evaluation metrics. Experimental results on the Voice Cloning Toolkit (VCTK) dataset showed that combining SI-SNR with either the Huber loss or the Berhu loss improved all three metrics (SDR, PESQ, and STOI) over the conventional SI-SNR + MSE combination.
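To make the outlier argument concrete, the sketch below compares MSE with the standard textbook forms of the Huber loss and the Berhu (reverse Huber) loss on a list of residuals. The abstract does not state the exact formulation or thresholds used in the study, so the function names and the threshold parameters `delta` and `c` (both set to 1.0 here) are assumptions for illustration only.

```python
def mse(residuals):
    """Mean squared error: every residual is squared."""
    return sum(r * r for r in residuals) / len(residuals)


def huber(residuals, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it.

    delta is an assumed threshold, not a value taken from the paper.
    """
    total = 0.0
    for r in residuals:
        a = abs(r)
        if a <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (a - 0.5 * delta)
    return total / len(residuals)


def berhu(residuals, c=1.0):
    """Berhu (reverse Huber) loss: linear for |r| <= c, quadratic beyond it.

    c is an assumed threshold, not a value taken from the paper.
    """
    total = 0.0
    for r in residuals:
        a = abs(r)
        if a <= c:
            total += a
        else:
            total += (r * r + c * c) / (2.0 * c)
    return total / len(residuals)


# Hypothetical residuals between estimated and clean magnitude values.
clean_like = [0.1, -0.2, 0.05, 0.15]
with_outlier = clean_like + [10.0]  # one large residual acting as an outlier

for name, loss in (("mse", mse), ("huber", huber), ("berhu", berhu)):
    print(name, loss(clean_like), loss(with_outlier))
```

With the single large residual added, the MSE value grows with the square of the outlier, while the Huber value grows only linearly beyond `delta`. The Berhu form is linear for small residuals and quadratic above `c`.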

Keywords

Speech enhancement

Outlier-robust loss

Huber loss

Berhu loss

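Mask-based speech enhancement is effective at improving speech quality by separating noise from speech, and methods that restore the spectrum of noisy speech with a Deep Neural Network (DNN) have been widely studied. Improving the loss function is one line of research for raising the performance of DNN-based speech enhancement. Previous work trained a speech enhancement model by combining a loss based on the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) in the time domain with a Mean Squared Error (MSE) loss on the magnitude spectrum in the frequency domain. The MSE loss used there has a limitation: the non-uniformity of noise in the training data acts as outliers, and the MSE loss reacts sensitively to outlier data. In this study, the MSE loss is replaced with the Huber loss, a robust regression loss that is robust to outliers, and with its variant, the Berhu loss, and the resulting speech enhancement performance is evaluated. Source-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) are used for the comparative evaluation. In experiments on the Voice Cloning Toolkit (VCTK) database, combining the Huber loss or the Berhu loss with the SI-SNR loss, in place of the conventional SI-SNR + MSE combination, improved all three metrics: SDR, PESQ, and STOI.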
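The time-domain term of such a combined loss can be sketched as follows. The block computes SI-SNR using its commonly used definition (zero-mean signals, projection of the estimate onto the target) and adds a spectral term to its negative. The abstract does not specify how the two terms are weighted or how the magnitude spectra are computed, so `combined_loss`, its `weight` parameter, the `mse_mag` helper, and the precomputed magnitude lists are hypothetical.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length waveforms."""
    n = len(target)
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]  # zero-mean estimate
    t = [x - mean_t for x in target]    # zero-mean target

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t) + eps
    scale = dot / target_energy
    s_target = [scale * b for b in t]
    e_noise = [a - s for a, s in zip(e, s_target)]

    num = sum(s * s for s in s_target)
    den = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(num / den + eps)


def mse_mag(est_mag, tgt_mag):
    """Mean squared error between two magnitude lists (hypothetical helper)."""
    return sum((a - b) ** 2 for a, b in zip(est_mag, tgt_mag)) / len(tgt_mag)


def combined_loss(est_wave, tgt_wave, est_mag, tgt_mag, spectral_loss, weight=1.0):
    """Negative SI-SNR plus a weighted spectral loss.

    spectral_loss is any callable taking (est_mag, tgt_mag), e.g. MSE,
    Huber, or Berhu on magnitudes; weight is an assumed balancing factor.
    """
    return -si_snr(est_wave, tgt_wave) + weight * spectral_loss(est_mag, tgt_mag)


# Hypothetical toy waveforms and magnitude values.
tgt_wave = [1.0, -1.0, 0.5, -0.5]
est_wave = [1.0, -1.0, 0.5, -0.4]
tgt_mag = [1.0, 0.5, 0.25]
est_mag = [0.9, 0.6, 0.25]

print(si_snr(est_wave, tgt_wave))
print(combined_loss(est_wave, tgt_wave, est_mag, tgt_mag, mse_mag))
```

Passing the spectral term as a callable keeps the time-domain part unchanged when the MSE term is swapped for a Huber or Berhu term.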

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 44
No: 1
Pages: 66-73
Received Date: 2024-11-14
Accepted Date: 2024-12-31
DOI: https://doi.org/10.7776/ASK.2025.44.1.066

The Journal of the Acoustical Society of Korea. ISSN: 1225-4428 (Print), 2287-3775 (Online). 한국음향학회
