Improving end-to-end speaker diarization with a contrastive center loss for discriminative embedding space

Donghee Kim; Wooil Kim

doi:10.7776/ASK.2025.44.5.525


2025 Vol.44, Issue 5

Research Article

Improving end-to-end speaker diarization with a contrastive center loss for discriminative embedding space

30 September 2025. pp. 525-532


Abstract

Speaker diarization, a technology for processing multi-speaker speech in speech-based systems, plays a crucial role in applications such as call center conversation analysis, automatic meeting transcription, and broadcast content processing. Diarization performance significantly affects the overall quality of such systems, making its improvement a key research topic in this field. We propose an approach that improves speaker diarization performance by applying a contrastive center loss function to Single-Label Self-Attentive End-to-End Neural Diarization (SL-SA-EEND), an end-to-end diarization model formulated as single-label classification. The proposed method strengthens the discriminative power of each class by keeping embeddings of the same class close together and pushing embeddings of different classes far apart in the embedding space. This lets the model learn discriminative features for each class, which is intended to improve classification performance. Experimental results show that, compared to the baseline SL-SA-EEND system, the proposed method improved the Diarization Error Rate (DER) by 25.53 % on a simulated dataset and by 11.88 % on the CALLHOME dataset. Finally, we visualize the embedding space with and without the contrastive center loss function, demonstrating its effectiveness in speaker diarization systems formulated as a classification task.
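To make the intra-class and inter-class distance idea concrete, the following is a minimal sketch of a contrastive center loss in the general form given by Qi and Su (2017): for each embedding, the squared distance to its own class center is divided by the summed squared distances to all other class centers plus a constant. It is an illustration under assumptions, not the authors' implementation. The function names, the fixed 2-D centers, the toy embeddings, and the value delta = 1.0 are all hypothetical; in actual training the centers would be learned, and the abstract does not state how this term is weighted or combined with the classification loss in SL-SA-EEND.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_center_loss(embeddings, labels, centers, delta=1.0):
    """Contrastive center loss over a batch (illustrative sketch).

    embeddings: list of vectors, one per frame
    labels:     list of class indices, one per frame
    centers:    list of class-center vectors, indexed by class
    delta:      constant that keeps the denominator positive (assumed value)
    """
    total = 0.0
    for x, y in zip(embeddings, labels):
        # Distance to the center of the frame's own class (should be small).
        intra = squared_distance(x, centers[y])
        # Summed distance to every other class center (should be large).
        inter = sum(
            squared_distance(x, c) for j, c in enumerate(centers) if j != y
        )
        total += intra / (inter + delta)
    return 0.5 * total


# Toy example: two classes with fixed, hypothetical centers.
centers = [[0.0, 0.0], [4.0, 0.0]]
labels = [0, 0, 1, 1]

# Embeddings clustered tightly around their own centers.
tight = [[0.1, 0.0], [-0.1, 0.0], [4.1, 0.0], [3.9, 0.0]]
# Embeddings that drift toward the midpoint between the two centers.
loose = [[1.5, 0.0], [2.0, 0.0], [2.5, 0.0], [2.0, 0.0]]

loss_tight = contrastive_center_loss(tight, labels, centers)
loss_loose = contrastive_center_loss(loose, labels, centers)
print(loss_tight < loss_loose)
```

In this toy setup the tightly clustered embeddings yield the lower loss, which is the behavior the abstract describes: minimizing the term pulls same-class embeddings toward their center while pushing them away from the centers of other classes.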

Keywords

Speaker Diarization

Self-Attentive End-to-End Neural Diarization (SA-EEND)

Single-Label-SA-EEND

Contrastive center loss

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 44
No: 5
Pages: 525-532
Received Date: 2025-08-05
Accepted Date: 2025-09-04
DOI: https://doi.org/10.7776/ASK.2025.44.5.525

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online), 한국음향학회
