Performance improvement of a speech enhancement model based on two stage training method by transforming latent vectors

Seorim Hwang; Youngcheol Park; Sung Wook Park

doi:10.7776/ASK.2026.45.1.055

2026 Vol.45, Issue 1

Research Article

Korean title: 은닉 벡터 변환을 이용한 두 단계 훈련 방식 기반 음성 향상 모델의 성능 개선 연구

31 January 2026. pp. 55-61

Abstract

This paper presents a two-stage training method for speech enhancement that effectively exploits embedding vectors generated by a large Speech Representation Learning (SRL) model. In Stage 1, a DEMUCS-based conditioned model is trained with SRL embedding vectors so that it approaches near-ideal performance. In Stage 2, a lightweight model that does not include the SRL model is trained to imitate target latent vectors obtained from the Stage-1 model. The latent vectors to be imitated are extracted at the decoder input of the Stage-1 DEMUCS model and then processed into target latent vectors in one of four ways: 1) using the latent vectors directly as targets; 2) using latent vectors conditioned on t-SNE-processed SRL vectors; 3) using latent vectors regularized for maximum entropy; and 4) using scaled latent vectors. We analyze which of these approaches most effectively improves the final inference performance of the Stage-2 model. Experimental results show that the Stage-2 model trained to imitate the scaled target latent vectors yields the most consistent improvement across five objective evaluation metrics, including Perceptual Evaluation of Speech Quality (PESQ), while incurring negligible additional inference complexity. Although the experiments are limited to DEMUCS as the baseline architecture, the results demonstrate that appropriately adjusting the values of the Stage-1 latent vectors alone can improve the performance of the Stage-2 model.

Keywords

Speech enhancement

Two-stage

Latent vector

Representation
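The Stage-2 objective described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the imitation loss, the scaling factor, or the latent dimensions, so everything here is an assumption for illustration: the function names `make_target`, `imitation_loss`, and `total_loss`, the mean-squared-error loss, the `scale` and `weight` values, and the toy array shapes are all hypothetical. Only target variants 1 (direct) and 4 (scaled) are shown; variants 2 and 3 modify how the Stage-1 model itself is conditioned or trained and are not sketched.

```python
import numpy as np


def make_target(latent, mode="scaled", scale=0.5):
    """Turn a Stage-1 latent vector into a Stage-2 training target.

    `mode` and `scale` are hypothetical knobs, not values from the paper.
    """
    if mode == "direct":   # variant 1: use the latent vector as-is
        return latent.copy()
    if mode == "scaled":   # variant 4: rescale the latent values
        return scale * latent
    raise ValueError(f"unsupported mode: {mode}")


def imitation_loss(student_latent, target_latent):
    """Mean squared error between student and target latents (assumed loss)."""
    return float(np.mean((student_latent - target_latent) ** 2))


def total_loss(enhancement_loss, student_latent, target_latent, weight=1.0):
    """Combine an enhancement loss with the weighted imitation term (assumed form)."""
    return enhancement_loss + weight * imitation_loss(student_latent, target_latent)


# Toy stand-ins for the latent vectors at the decoder input.
rng = np.random.default_rng(0)
teacher_latent = rng.standard_normal((4, 16))  # (frames, channels), arbitrary
student_latent = rng.standard_normal((4, 16))

target = make_target(teacher_latent, mode="scaled", scale=0.5)
loss = imitation_loss(student_latent, target)
```

In this sketch the SRL model is needed only to produce `teacher_latent` during training; at inference time the Stage-2 model would run without it, which is consistent with the abstract's claim of negligible additional inference complexity.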

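This paper proposes a two-stage training method for speech enhancement that effectively uses the embedding vectors generated by a large speech representation model based on Speech Representation Learning (SRL). In Stage 1, a model is trained with the embedding vectors so that it approaches ideal performance; in Stage 2, a lightweight model that does not include the SRL model is trained to imitate the latent vectors obtained from the Stage-1 model. The latent vectors generated in Stage 1 are processed in several ways, and we analyze which method is most effective for the inference performance of the Stage-2 model. The latent vectors to be imitated are extracted at the decoder input of DEMUCS, which is adopted as the Stage-1 model, and are processed into target latent vectors in the following four ways: 1) using the latent vectors as targets; 2) generating targets using a low-dimensional embedding obtained with t-SNE; 3) generating targets by applying maximum-entropy regularization during Stage-1 training; and 4) generating targets by scaling. Experimental results confirm that the scaling-based method shows the most consistent improvement across five objective evaluation metrics, including Perceptual Evaluation of Speech Quality (PESQ), and that the Stage-2 model most closely imitates the performance of the Stage-1 model with only a slight increase in inference complexity. Although the experiments are limited to DEMUCS as the base model, the results confirm that appropriately adjusting the values of the Stage-1 latent vectors alone can further improve the performance of the Stage-2 model.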

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 45
No: 1
Pages: 55-61
Received Date: 2025-12-17
Accepted Date: 2026-01-16
DOI: https://doi.org/10.7776/ASK.2026.45.1.055

The Journal of the Acoustical Society of Korea. ISSN: 1225-4428 (Print), 2287-3775 (Online). The Acoustical Society of Korea (한국음향학회)
