A study on deep neural network based speech enhancement with self-supervised learning representation

Donghee Kim; Taewon Kim; Wooil Kim

doi:10.7776/ASK.2025.44.1.058

2025 Vol.44, Issue 1

Research Article

31 January 2025. pp. 58-65

Abstract

Various studies have applied Self-Supervised Learning (SSL) representations to speech processing. SSL is a form of unsupervised learning in which a model is trained on labels that it generates by itself from a large amount of data. In this process, the model learns the latent features of the input data. In this paper, we propose a speech enhancement method that applies the latent speech representation extracted by SSL to a mask estimation-based U-Net model. The encoder-decoder structure of the U-Net model compresses the input signal and restores it effectively through skip-connections, which improves speech clarity. In the proposed method, the SSL representation is additionally delivered to the skip-connections so that the estimated mask performs better than that of the existing system. Source-to-Distortion Ratio (SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI) were used as performance measures to evaluate speech enhancement, and the proposed method showed improved performance in all three. These results show that the proposed method can improve speech clarity and quality.
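The data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (fuse_skip, apply_mask, sdr_db), the channel-wise concatenation, the nearest-neighbour time alignment, the sigmoid mask, and the simplified energy-ratio SDR are all assumptions chosen for illustration, and the array sizes are arbitrary.

```python
import numpy as np


def fuse_skip(encoder_feat, ssl_feat):
    """Concatenate an encoder feature map with an SSL representation.

    encoder_feat: (enc_channels, frames)
    ssl_feat:     (ssl_channels, ssl_frames)
    The SSL frames are aligned to the encoder time axis by
    nearest-neighbour indexing (an assumption of this sketch).
    """
    frames = encoder_feat.shape[1]
    idx = np.floor(np.arange(frames) * ssl_feat.shape[1] / frames).astype(int)
    return np.concatenate([encoder_feat, ssl_feat[:, idx]], axis=0)


def apply_mask(noisy_mag, mask_logits):
    """Squash logits to a (0, 1) mask and apply it to the noisy magnitudes."""
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return mask * noisy_mag


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over residual energy.

    This is a plain energy ratio, not the full BSS Eval decomposition.
    """
    residual = reference - estimate
    num = np.sum(reference ** 2) + eps
    den = np.sum(residual ** 2) + eps
    return 10.0 * np.log10(num / den)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enc = rng.standard_normal((16, 100))   # hypothetical encoder output
    ssl = rng.standard_normal((768, 50))   # hypothetical SSL features
    fused = fuse_skip(enc, ssl)
    print(fused.shape)                     # (784, 100)

    noisy_mag = np.abs(rng.standard_normal((257, 100)))
    logits = rng.standard_normal((257, 100))  # stand-in for decoder output
    enhanced = apply_mask(noisy_mag, logits)
    print(enhanced.shape)                  # (257, 100)
```

In a trained system the mask logits would come from the U-Net decoder that consumes the fused skip features; here they are random placeholders so that the sketch stays self-contained.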

Keywords

Self-supervised learning representation

U-Net

Skip-connection

Speech enhancement

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 44
No: 1
Pages: 58-65
Received Date: 2024-11-14
Revised Date: 2024-12-27
Accepted Date: 2024-12-31
DOI: https://doi.org/10.7776/ASK.2025.44.1.058

The Journal of the Acoustical Society of Korea (한국음향학회). ISSN: 1225-4428 (Print), 2287-3775 (Online)
