Improving transformer-based acoustic model performance using sequence discriminative training

Chae-Won Lee; Joon-Hyuk Chang

doi:10.7776/ASK.2022.41.3.335

2022 Vol. 41, Issue 3

Research Article

31 May 2022. pp. 335-341

Abstract

In this paper, we adopt the transformer, which shows remarkable performance in natural language processing, as the acoustic model (AM) of a hybrid speech recognition system. The transformer acoustic model uses attention structures to process sequential data and achieves high performance at low computational cost. This paper proposes a method to improve the performance of the transformer AM by applying each of four sequence discriminative training algorithms, a weighted finite-state transducer (wFST)-based training approach used in existing DNN-HMM models. Compared with Cross Entropy (CE) training, sequence discriminative training yields a 5 % relative reduction in Word Error Rate (WER).
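The abstract reports its gain as a relative WER figure. As a minimal sketch of how such a figure is usually computed, the Python below measures WER as word-level edit distance divided by reference length, then compares a baseline with an improved system. The function names and the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion (ref word missing)
                curr[j - 1] + 1,                     # insertion (extra hyp word)
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical illustration only: a baseline at 10.0 % WER and a system at 9.5 % WER
example_reduction = relative_wer_reduction(10.0, 9.5)
example_wer = word_error_rate("a b c d", "a x c")
```

With these illustrative inputs, a drop from 10.0 % to 9.5 % WER is an absolute gain of 0.5 points and a relative reduction of 5 %, which is why relative and absolute figures should not be read interchangeably.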

Keywords

Speech recognition

Transformer

Sequence discriminative training

Weighted finite state transducer

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 41
No: 3
Pages: 335-341
Received Date: 2022-03-21
Revised Date: 2022-05-09
Accepted Date: 2022-05-09
DOI: https://doi.org/10.7776/ASK.2022.41.3.335

The Journal of the Acoustical Society of Korea. ISSN: 1225-4428 (Print), 2287-3775 (Online). The Acoustical Society of Korea.
