Research Article

31 May 2022, pp. 351-358
References
1. B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Korean voice conversion experiments using CC-GAN and VAW-GAN" (in Korean), Proc. Speech Communication and Signal Processing, 36, 39 (2019).
2. B. Jang, H. Seo, I.-C. Yoo, and D. Yook, "CycleVAE based many-to-many voice conversion experiments using Korean speech corpus" (in Korean), J. Acoust. Soc. Suppl. 2(s) 40, 79 (2021).
3. I.-C. Yoo, K. Lee, S.-G. Leem, H. Oh, B. Ko, and D. Yook, "Speaker anonymization for personal information protection using voice conversion techniques," IEEE Access, 8, 198637-198645 (2020). 10.1109/ACCESS.2020.3035416
4. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Proc. NIPS, 2672-2680 (2014).
5. D. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv:1312.6114 (2013).
6. J. Zhu, T. Park, P. Isola, and A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," Proc. IEEE Int. Conf. Computer Vision, 2242-2251 (2017). 10.1109/ICCV.2017.244
7. T. Kaneko and H. Kameoka, "CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks," Proc. EUSIPCO, 2114-2118 (2018). 10.23919/EUSIPCO.2018.8553236
8. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion," Proc. IEEE ICASSP, 6820-6824 (2019). 10.1109/ICASSP.2019.8682897
9. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "CycleGAN-VC3: Examining and improving CycleGAN-VCs for Mel-spectrogram conversion," Proc. Interspeech, 2017-2021 (2020). 10.21437/Interspeech.2020-2280
10. D. Yook, I.-C. Yoo, and S. Yoo, "Voice conversion using conditional CycleGAN," Proc. Int. Conf. CSCI, 1460-1461 (2018). 10.1109/CSCI46756.2018.00290
11. S. Lee, B. Ko, K. Lee, I.-C. Yoo, and D. Yook, "Many-to-many voice conversion using conditional cycle-consistent adversarial networks," Proc. IEEE ICASSP, 6279-6283 (2020). 10.1109/ICASSP40776.2020.9053726
12. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks," Proc. IEEE Workshop on SLT, 266-273 (2018). 10.1109/SLT.2018.8639535
13. T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, "StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion," Proc. Interspeech, 679-683 (2019). 10.21437/Interspeech.2019-2236
14. C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from non-parallel corpora using variational autoencoder," Proc. APSIPA, 1-6 (2016). 10.1109/APSIPA.2016.7820786
15. A. Oord and O. Vinyals, "Neural discrete representation learning," Proc. NIPS, 6309-6318 (2017).
16. C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, "Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks," Proc. Interspeech, 3364-3368 (2017). 10.21437/Interspeech.2017-63
17. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 27, 1432-1443 (2019). 10.1109/TASLP.2019.2917232
18. P. Tobing, Y. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "Non-parallel voice conversion with cyclic variational autoencoder," Proc. Interspeech, 674-678 (2019). 10.21437/Interspeech.2019-2307
19. D. Yook, S.-G. Leem, K. Lee, and I.-C. Yoo, "Many-to-many voice conversion using cycle-consistent variational autoencoder with multiple decoders," Proc. Odyssey: The Speaker and Language Recognition Workshop, 215-221 (2020). 10.21437/Odyssey.2020-31
20. B. Ko, Many-to-many voice conversion using cycle-consistency for Korean speech (in Korean) (Master's thesis, Korea University, 2020).
21. M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. on Information and Systems, 99, 1877-1884 (2016). 10.1587/transinf.2015EDP7457
22. D. Kingma and J. Ba, "Adam: A method for stochastic optimization," Proc. ICLR, 1-13 (2015).
23. T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. on Audio, Speech, and Lang. Process. 15, 2222-2235 (2007). 10.1109/TASL.2007.907344
24. S. Takamichi, T. Toda, A. Black, G. Neubig, S. Sakti, and S. Nakamura, "Postfilters to modify the modulation spectrum for statistical parametric speech synthesis," IEEE/ACM Trans. on Audio, Speech, and Lang. Process. 24, 755-767 (2016). 10.1109/TASLP.2016.2522655
Information
  • Publisher: The Acoustical Society of Korea
  • Publisher (Korean): 한국음향학회
  • Journal Title: The Journal of the Acoustical Society of Korea
  • Journal Title (Korean): 한국음향학회지
  • Volume: 41
  • No.: 3
  • Pages: 351-358
  • Received Date: 2022-03-16
  • Revised Date: 2022-04-29
  • Accepted Date: 2022-05-13