Proposal of speaker change detection system considering speaker overlap

Jisu Park; Young-Sun Yun; Shin Cha; Jeon Gue Park

doi:10.7776/ASK.2021.40.5.466


2021 Vol.40, Issue 5

Research Article


30 September 2021. pp. 466-472


Abstract

Speaker Change Detection (SCD) refers to finding the moment at which the main speaker changes from one person to the next in a spoken conversation. Speaker change detection is made difficult by overlapping speakers, inaccurate labeling, and data imbalance. To address these problems, utterances from the TIMIT corpus, which is widely used in speech recognition, were artificially concatenated to obtain a sufficient amount of training data, and speaker changes were detected after overlapping speech had been identified. In this paper, we propose a speaker change detection system that takes speaker overlap into account, and we evaluate and verify its performance using various approaches. As a result, we propose a detection system with a structure similar to X-Vector to remove speaker overlap regions, and a Bi-LSTM model for speaker change detection. The experimental results show relative performance improvements of 4.6 % and 13.8 %, respectively, over the baseline system. Based on these results, we also conclude that a more robust speaker change detection system could be built by additionally taking text and speaker information into account.
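The abstract does not spell out how the TIMIT utterances are concatenated or labeled, so the following is only a minimal sketch of one plausible way to generate frame-level training labels from artificially concatenated utterances. Everything in it is an assumption made for illustration: the function name `synthesize_labels`, the `(speaker id, length in frames)` utterance format, the random overlap between consecutive utterances, and the `tolerance` window that marks frames near a change point as positive (a common way to soften class imbalance, not necessarily the authors' choice).

```python
import random
from typing import List, Tuple

# Hypothetical input format: (speaker id, utterance length in frames).
Utterance = Tuple[str, int]


def synthesize_labels(
    utterances: List[Utterance],
    max_overlap: int,
    tolerance: int,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Place utterances one after another, letting each one start up to
    `max_overlap` frames before the previous one ends, and return
    (overlap_labels, change_labels, start_frames)."""
    rng = random.Random(seed)
    starts: List[int] = []
    cursor = 0      # end frame (exclusive) of the most recent utterance
    prev_len = 0
    for i, (_, length) in enumerate(utterances):
        if i == 0:
            start = 0
        else:
            # The overlap cannot be longer than either neighbouring utterance.
            overlap = rng.randint(0, min(max_overlap, length, prev_len))
            start = cursor - overlap
        starts.append(start)
        cursor = start + length
        prev_len = length
    total = cursor

    # Count how many utterances are active in each frame.
    active = [0] * total
    for (_, length), start in zip(utterances, starts):
        for t in range(start, start + length):
            active[t] += 1
    overlap_labels = [1 if n >= 2 else 0 for n in active]

    # Mark frames within `tolerance` of a speaker change as positive.
    change_labels = [0] * total
    for i in range(1, len(utterances)):
        if utterances[i][0] != utterances[i - 1][0]:
            lo = max(0, starts[i] - tolerance)
            hi = min(total, starts[i] + tolerance + 1)
            for t in range(lo, hi):
                change_labels[t] = 1
    return overlap_labels, change_labels, starts
```

In a two-stage pipeline of the kind the abstract describes, the overlap labels would supervise the overlap detector and the change labels would supervise the change detector, with frames flagged as overlapped removed before the second stage. Calling `synthesize_labels([("A", 5), ("B", 4)], max_overlap=0, tolerance=1)` gives no overlapped frames and a three-frame positive window centred on the start of the second utterance.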

Keywords

Speaker overlap detection

Speaker representation

Speaker change detection

Deep neural networks


Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 40
No: 5
Pages: 466-472
Received Date: 2021-07-09
Accepted Date: 2021-08-26
DOI: https://doi.org/10.7776/ASK.2021.40.5.466

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online), 한국음향학회
