Speaker verification system combining attention-long short term memory based speaker embedding and I-vector in far-field and noisy environments

Ara Bae; Wooil Kim

doi:10.7776/ASK.2020.39.2.137

All Issue

2020 Vol.39, Issue 2 Preview Page

Research Article

Speaker verification system combining attention-long short term memory based speaker embedding and I-vector in far-field and noisy environments Attention-long short term memory 기반의 화자 임베딩과 I-vector를 결합한 원거리 및 잡음 환경에서의 화자 검증 알고리즘

31 March 2020. pp. 137-142

PDF XML

Abstract

Many studies based on I-vector have been conducted in a variety of environments, from text- dependent short-utterance to text-independent long-utterance. In this paper, we propose a speaker verification system employing a combination of I-vector with Probabilistic Linear Discriminant Analysis (PLDA) and speaker embedding of Long Short Term Memory (LSTM) with attention mechanism in far-field and noisy environments. The LSTM model’s Equal Error Rate (EER) is 15.52 % and the Attention-LSTM model is 8.46 %, improving by 7.06 %. We show that the proposed method solves the problem of the existing extraction process which defines embedding as a heuristic. The EER of the I-vector/PLDA without combining is 6.18 % that shows the best performance. And combined with attention-LSTM based embedding is 2.57 % that is 3.61 % less than the baseline system, and which improves performance by 58.41 %.

Keywords

Speaker verification

Speaker embedding

Attention mechanism

Far-field and noisy environments

문장 종속 짧은 발화에서 문장 독립 긴 발화까지 다양한 환경에서 I-vector 특징에 기반을 둔 많은 연구가 수행되었다. 본 논문에서는 원거리 잡음 환경에서 녹음한 데이터에서 Probabilistic Linear Discriminant Analysis (PLDA)를 적용한 I-vector와 주의 집중 기법을 접목한 Long Short Term Memory(LSTM) 기반의 화자 임베딩을 추출하여 결합한 화자 검증 알고리즘을 소개한다. LSTM 모델의 Equal Error Rate(EER)이 15.52 %, Attention-LSTM 모델이 8.46 %로 7.06 % 성능이 향상되었다. 이로써 본 논문에서 제안한 기법이 임베딩을 휴리스틱 하게 정의하여 사용하는 기존 추출방법의 문제점을 해결할 수 있는 것을 확인하였다. PLDA를 적용한 I-vector의 EER이 6.18 %로 결합 전 가장 좋은 성능을 보였다. Attention-LSTM 기반 임베딩과 결합하였을 때 EER이 2.57 %로 기존보다 3.61 % 감소하여 상대적으로 58.41 % 성능이 향상되었다.

키워드

화자 검증

화자 임베딩

주의 집중 기법

원거리 잡음 환경

References

P. Kenny, G. Boulianne, P. Oullet, and P. Dumouchel, "Joint factor analysis versus eigenchannes in speaker recognition," IEEE Trans on. Audio, Speech, and Language Processing, 15, 2072-2084 (2007).

10.1109/TASL.2006.881693

D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted gaussian mixture models," Digital Signal Processing, 10, 19-41 (2000).

10.1006/dspr.1999.0361

N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans on. Audio, Speech, and Language Processing, 19, 788-798 (2011).

10.1109/TASL.2010.2064307

E. Variani, X. Lei, E. McDermott, I. Lopez-Moreno, and J. Gonzalez Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," Proc. ICASSP. 4080-4084 (2014).

10.1109/ICASSP.2014.6854363

V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," Proc. Interspeech, 3214-3218 (2015).

Y. Liu, Y. Qian, N. Chen, T. Fu, Y. Zhang, and K. Yu, "Deep feature for text-dependent speaker verification," Speech Communication, 73, 1-13 (2015).

10.1016/j.specom.2015.07.003

D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," Proc. Interspeech, 999-1003 (2017).

10.21437/Interspeech.2017-620

G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, "End-toend text-dependent speaker verification," Proc. IEEE ICASSP. 5115-5119 (2016).

10.1109/ICASSP.2016.7472652

D. Bahdanau, K. Cho, and Y. Bengio. "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473 (2014).

S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," Proc. IEEE 11th ICCV. 1-8 (2007).

10.1109/ICCV.2007.440905223132746PMC3488430

B. Fauve, N. Evans, and J. Mason, "Improving the performance of text-independent short duration SVM- and GMM based speaker verification," Proc. Odyssey, Stellenbosch, 18 (2008).

F. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," arXiv preprint arXiv:1710.10470 (2017).

L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," arXiv preprint rXiv:1710.10467 (2017).

10.1109/ICASSP.2018.846266531949563PMC6962917

Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with lstm," Proc. ICASSP. 5239-5243 (2018).

10.1109/ICASSP.2018.8462628

Information

Publisher :The Acoustical Society of Korea
Publisher(Ko) :한국음향학회
Journal Title :The Journal of the Acoustical Society of Korea
Journal Title(Ko) :한국음향학회지
Volume : 39
No :2
Pages :137-142
Received Date : 2020-01-21
Revised Date : 2020-02-27
Accepted Date : 2020-03-20
DOI :https://doi.org/10.7776/ASK.2020.39.2.137

The Journal of the Acoustical Society of KoreaISSN:1225-4428(Print) 2287-3775(Online)한국음향학회

All Issue