Research Article

30 November 2021, pp. 578-586
References
1. F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," Proc. INTERSPEECH, 437-440 (2011). 10.21437/Interspeech.2011-169
2. W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," Proc. ICASSP, 4960-4964 (2016). 10.1109/ICASSP.2016.7472621
3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS, 5998-6008 (2017).
4. T. Hori, R. Astudillo, T. Hayashi, Y. Zhang, S. Watanabe, and J. L. Roux, "Cycle-consistency training for end-to-end speech recognition," Proc. ICASSP, 6271-6275 (2019). 10.1109/ICASSP.2019.8683307
5. M.-K. Baskar, S. Watanabe, R. Astudillo, T. Hori, L. Burget, and J. Cernocky, "Semi-supervised sequence-to-sequence ASR using unpaired speech and text," Proc. INTERSPEECH, 3790-3794 (2019). 10.21437/Interspeech.2019-3167
6. Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, "Unsupervised data augmentation for consistency training," arXiv:1904.12848 (2019).
7. J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, "Large-scale domain adaptation via teacher-student learning," Proc. INTERSPEECH, 2386-2390 (2017). 10.21437/Interspeech.2017-519
8. Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, "Self-training with noisy student improves ImageNet classification," Proc. CVPR, 10687-10698 (2020). 10.1109/CVPR42600.2020.01070
9. N. Jaitly and G. E. Hinton, "Vocal tract length perturbation (VTLP) improves speech recognition," Proc. ICML, 625-660 (2013).
10. D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," Proc. INTERSPEECH, 2613-2617 (2019). 10.21437/Interspeech.2019-2680
11. X. Song, Z. Wu, Y. Huang, D. Su, and H. Meng, "SpecSwap: A simple data augmentation method for end-to-end speech recognition," Proc. INTERSPEECH, 581-585 (2020). 10.21437/Interspeech.2020-2275
12. D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," Proc. ICLR, 1-14 (2014).
13. D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," Proc. ACL, 357-362 (1992). 10.3115/1075527.1075614
14. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," Proc. ICASSP, 5206-5210 (2015). 10.1109/ICASSP.2015.7178964
15. S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," Proc. INTERSPEECH, 2207-2211 (2018). 10.21437/Interspeech.2018-1456
16. L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res. 9, 2579-2605 (2008).
Information
  • Publisher: The Acoustical Society of Korea
  • Publisher (Ko): 한국음향학회
  • Journal Title: The Journal of the Acoustical Society of Korea
  • Journal Title (Ko): 한국음향학회지
  • Volume: 40
  • No.: 6
  • Pages: 578-586
  • Received Date: 2021-09-28
  • Accepted Date: 2021-11-04