Double-attention mechanism of sequence-to-sequence deep neural networks for automatic speech recognition

Dongsuk Yook; Dan Lim; In-Chul Yoo

doi:10.7776/ASK.2020.39.5.476

Research Article

30 September 2020. pp. 476-482

Abstract

Sequence-to-sequence deep neural networks with attention mechanisms have shown superior performance across various domains in which the lengths of the input and output sequences may differ. However, if the input sequence is much longer than the output sequence and the characteristics of the input change within the span that corresponds to a single output token, conventional attention mechanisms are inappropriate because they use only a single context vector for each output token. In this paper, we propose a double-attention mechanism that handles this problem by using two context vectors that separately cover the left and right parts of the input focus. The effectiveness of the proposed method is evaluated in speech recognition experiments on the TIMIT corpus.
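The idea of two context vectors can be illustrated with a minimal sketch. The abstract does not specify how the focus is located or how the left and right parts are weighted, so everything below is an assumption made for illustration: dot-product scoring, taking the focus as the frame with the largest attention weight, assigning the focus frame to the left part, and renormalizing each part separately. The function name `double_attention` and its helpers are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights, computed per dimension."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def renormalize(weights):
    """Rescale weights to sum to 1; an all-zero list is returned unchanged."""
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else list(weights)


def double_attention(query, encoder_states):
    """Return (left_context, right_context, focus) for one decoder step.

    Assumptions (not from the paper): dot-product scores, focus = argmax of
    the attention weights, and the focus frame belongs to the left part.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    focus = max(range(len(weights)), key=weights.__getitem__)

    # Mask the weights so each part only sees its side of the focus.
    left = [w if t <= focus else 0.0 for t, w in enumerate(weights)]
    right = [w if t > focus else 0.0 for t, w in enumerate(weights)]

    left_context = weighted_sum(renormalize(left), encoder_states)
    right_context = weighted_sum(renormalize(right), encoder_states)
    return left_context, right_context, focus
```

A conventional attention step would return the single vector `weighted_sum(weights, encoder_states)`. In this sketch the decoder would instead receive both `left_context` and `right_context`, for example concatenated, so that a change in the input within one output token can be represented by two different summaries. If the focus falls on the last frame, the right part is empty and `right_context` is a zero vector.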

Keywords

Attention

Sequence-to-sequence

Deep neural network

Automatic speech recognition

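Sequence-to-sequence deep neural networks with attention mechanisms perform well when the input and output sequences differ in length. However, when the input sequence is much longer than the output sequence, and when the characteristics of the input span corresponding to a single output value change, conventional attention methods that use a single context vector may not be appropriate. To address this problem, this paper proposes a double-attention mechanism that uses two context vectors, which separately process the left and right parts of the input sequence. The effectiveness of the proposed method was verified through speech recognition experiments on the TIMIT data.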

Keywords

Attention

Sequence-to-sequence

Deep neural network

Speech recognition

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 39
No: 5
Pages: 476-482
Received Date: 2020-07-31
Accepted Date: 2020-09-14
DOI: https://doi.org/10.7776/ASK.2020.39.5.476

The Journal of the Acoustical Society of Korea ISSN:1225-4428(Print) 2287-3775(Online) 한국음향학회지
