Segment unit shuffling layer in deep neural networks for text-independent speaker verification

Jungwoo Heo; Hye-jin Shim; Ju-ho Kim; Ha-Jin Yu

doi:10.7776/ASK.2021.40.2.148

2021 Vol. 40, Issue 2

Research Article

31 March 2021. pp. 148-154

Abstract

Text-independent speaker verification requires speaker embeddings that do not depend on the spoken text in order to generalize well. However, deep neural networks depend on their training data, so when they repeatedly learn from identical time series they may overfit to text information instead of learning speaker information. To prevent this overfitting, we propose a segment unit shuffling layer that divides the input layer or a hidden layer into segments along the time axis and randomly rearranges them, thereby mixing the time-series information. Because the layer can be applied to hidden layers as well as to the input layer, it can serve as a generalization technique in the hidden layers, which is known to be more effective than generalization at the input layer, and it can be used together with data augmentation. In addition, the degree of distortion can be controlled by adjusting the segment unit size. Applying the proposed segment unit shuffling layer improves text-independent speaker verification performance compared to the baseline.
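The divide-and-rearrange operation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `segment_shuffle`, the parameters `segment_size` and `rng`, the `(time, features)` array layout, and the use of a uniform random permutation are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def segment_shuffle(x, segment_size, rng=None):
    """Split x (time, features) into consecutive segments along the time
    axis and return them concatenated in a random order.

    Hypothetical helper: the name, signature, and array layout are
    illustrative assumptions, not taken from the paper.
    """
    if segment_size < 1:
        raise ValueError("segment_size must be a positive integer")
    rng = np.random.default_rng() if rng is None else rng
    num_frames = x.shape[0]
    # Split points at segment_size, 2 * segment_size, ...; the final
    # segment is shorter when num_frames is not a multiple of segment_size.
    boundaries = np.arange(segment_size, num_frames, segment_size)
    segments = np.split(x, boundaries, axis=0)
    # Assumption: segments are reordered by a uniform random permutation.
    order = rng.permutation(len(segments))
    return np.concatenate([segments[i] for i in order], axis=0)


# Toy input: 10 frames with 3 features; each frame holds its own index.
features = np.repeat(np.arange(10)[:, None], 3, axis=1)
shuffled = segment_shuffle(features, segment_size=4,
                           rng=np.random.default_rng(0))
print(shuffled[:, 0])  # frame indices after shuffling segments of 4 frames
```

In this sketch, frames inside a segment keep their original order, so a smaller `segment_size` breaks up more of the time-series structure, while a `segment_size` at least as long as the input leaves it unchanged. The same function could be applied to a hidden-layer activation with a time axis, under the same layout assumption.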

Keywords

Text-independent speaker verification

Deep neural network

Speaker embedding

Shuffling generalization

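In text-independent speaker verification research, extracting speaker features that are independent of the text information is essential for improving generalization performance. However, because deep neural networks depend on their training data, repeatedly learning the same time-series information can cause them to overfit to text information instead of learning speaker information. To prevent this overfitting, this paper proposes a segment unit shuffling layer that divides the input layer or a hidden layer along the time axis and randomly rearranges the segments, shuffling the order of the time-series information. Because the segment unit shuffling layer can be applied to hidden layers as well as to the input layer, it can be used as a generalization technique in the hidden layers, which is known to be more effective than generalization techniques at the input layer, and it can also be applied together with existing data augmentation methods. Furthermore, the degree of shuffling can be controlled by adjusting the segment unit size. We confirmed that applying the proposed method improves text-independent speaker verification performance.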

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 40
No: 2
Pages: 148-154
Received Date: 2021-02-15
Accepted Date: 2021-03-15
DOI: https://doi.org/10.7776/ASK.2021.40.2.148

The Journal of the Acoustical Society of Korea (한국음향학회), ISSN: 1225-4428 (Print), 2287-3775 (Online)
