A Korean speech recognition based on conformer

Myoung-Wan Koo

doi:10.7776/ASK.2021.40.5.488

All Issue

2021 Vol.40, Issue 5 Preview Page Next Page

Research Article

A Korean speech recognition based on conformer 콘포머 기반 한국어 음성인식

30 September 2021. pp. 488-495

PDF XML

Abstract

We propose a speech recognition system based on conformer. Conformer is known to be convolution-augmented transformer, which combines transfer model for capturing global information with Convolution Neural Network (CNN) for exploiting local feature effectively. The baseline system is developed to be a transfer-based speech recognition using Long Short-Term Memory (LSTM)-based language model. The proposed system is a system which uses conformer instead of transformer with transformer-based language model. When Electronics and Telecommunications Research Institute (ETRI) speech corpus in AI-Hub is used for our evaluation, the proposed system yields 5.7 % of Character Error Rate (CER) while the baseline system results in 11.8 % of CER. Even though speech corpus is extended into other domain of AI-hub such as NHNdiguest speech corpus, the proposed system makes a robust performance for two domains. Throughout those experiments, we can prove a validation of the proposed system.

Keywords

Speech recognition

Deep learning

Conformer

Transformer

본 논문에서는 콘포머 기반 한국어 음성인식 시스템을 제안한다. 콘포머는 트랜스포머 모델에 콘볼루션신경망(Convolution Neural Network, CNN) 기능을 보강한 구조이며 광역 정보를 잘 표현할 수 있는 트랜스포머와 지역 정보를 잘 표현할 수 있는 CNN을 결합한 신경망이다. 음성인식 기본 시스템으로 트랜스포모에 기반한 음성인식시스템을 개발하였으며 언어모델로는 Long Short-Term Memory(LSTM)을 사용하였다. 콘포머 기반 음성인식시스템은 트랜스포머 대신에 콘포머를 사용하였고 언어모델로는 트랜스포머를 이용하였다. 성능 평가를 위해 AI-hub에 있는 Electronics and Telecommunications Research Institute(ETRI) 음성코퍼스를 활용하였으며 트랜스포머 기반 음성인식 시스템은 오인식률이 11.8 %이 되었으며 콘포머 기반 음성인식시스템은 오인식률이 5.7 %가 되었다. AI-hub에 있는 다른 영역의 NHN다이퀘스트 음성 코퍼스를 추가해도 유사한 성능이 유지가 되어 제안된 콘포머 음성인식시스템의 유효성을 입증하였다.

키워드

음성인식

딥 러닝

콘포머

트랜스포머

References

S. Zhao, X. Xiao, Z. Zhang, T. N. T. Nguyen, X. Zhong, B. Ren, L. Wang, D. L. Jones, E. S. Chng, and H. Li, "Robust speech recognition using beamforming with adaptive microphone gains and multichannel noise reductio," Proc. 2015 IEEE ASRU. 460-467 (2015). 10.1109/ASRU.2015.7404831

Y. Tachioka, T. Narita, L. Miura, T. Uramoto, N. Monta, S. Uenohara, K. Furuya, S. Wanatanabe, and J. Le Roux, "Coupled initialization of multi-channel non- negative matrix factorization based on spatial and spectral information," Proc. Interspeech, 2461-2465 (2017). 10.21437/Interspeech.2017-61

D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu, "Deep speech 2: end-to-end speech recognition in English and Mandarim," arXiv:1512.02595v1 (2015).

A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proc. ICML. 1764-1772 (2014).

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," Proc. ICASSP. 4960-4964 (2016). 10.1109/ICASSP.2016.7472621

A. Graves, A. r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," arXiv:1303.5778 (2013). 10.1109/ICASSP.2013.6638947

L. Dong, S. Xu, and B. Xu, "Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition," Proc. ICASSP. 5884-5888 (2018). 10.1109/ICASSP.2018.8462506

A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: a generative model for raw audio," arXiv:1609.03499 (2016).

N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Proc. NIPS 1-11 (2017).

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," arXiv:1612.08083v3 (2017).

P. Ramachandran, B. Zoph, and Q. V. Le, " Swish: A self-gated activation function," arXiv:1710.05941v1 (2017).

S. Kim, S. Bae, and C. Won, "Open-source toolkit for end-to-end Korean speech recognition," Software Impacts, 7, 1-4 (2021). 10.1016/j.simpa.2021.100054

Information

Publisher :The Acoustical Society of Korea
Publisher(Ko) :한국음향학회
Journal Title :The Journal of the Acoustical Society of Korea
Journal Title(Ko) :한국음향학회지
Volume : 40
No :5
Pages :488-495
Received Date : 2021-08-02
Accepted Date : 2021-09-14
DOI :https://doi.org/10.7776/ASK.2021.40.5.488

The Journal of the Acoustical Society of KoreaISSN:1225-4428(Print) 2287-3775(Online)한국음향학회

All Issue