A Korean menu-ordering sentence text-to-speech system using conformer-based FastSpeech2

Yerin Choi; JaeHoo Jang; Myoung-Wan Koo

doi:10.7776/ASK.2022.41.3.359


2022 Vol. 41, Issue 3

Research Article


31 May 2022. pp. 359-366


Abstract

In this paper, we present a Korean menu-ordering sentence Text-to-Speech (TTS) system using Conformer-based FastSpeech2. The Conformer is a convolution-augmented transformer originally proposed for speech recognition. By combining the two structures, it extracts both local and global features more effectively. It comprises two half-step feed-forward modules, one at the front and one at the end, sandwiching a multi-head self-attention module and a convolution module. We introduce the Conformer to Korean TTS because it is known to work well in Korean speech recognition. To compare a transformer-based TTS model with a Conformer-based one, we train both FastSpeech2 and Conformer-based FastSpeech2. We collected a phoneme-balanced data set and used it to train our models. The corpus comprises not only general conversation but also menu-ordering conversation consisting mainly of loanwords, and it is intended to address the degradation of current Korean TTS models on loanwords. With synthesized speech generated using Parallel WaveGAN, the Conformer-based FastSpeech2 achieved a superior Mean Opinion Score (MOS) of 4.04. We confirm that model performance improved when the same structure was changed from transformer to Conformer in Korean TTS.
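The sandwich arrangement described in the abstract can be written compactly. The following is a sketch based on the block definition in the original Conformer speech-recognition paper, not on details given in this abstract. The symbols $x$, $\tilde{x}$, $x'$, $x''$, and $y$ (block input, intermediate activations, and block output) and the subscripts on the two feed-forward modules are notation introduced here. Whether the TTS model in this paper keeps every detail, such as the final layer normalization, is an assumption.

```latex
\begin{align}
\tilde{x} &= x + \tfrac{1}{2}\,\mathrm{FFN}_1(x) \\
x' &= \tilde{x} + \mathrm{MHSA}(\tilde{x}) \\
x'' &= x' + \mathrm{Conv}(x') \\
y &= \mathrm{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\mathrm{FFN}_2(x'')\right)
\end{align}
```

Each feed-forward module adds only half of its output to the residual path, which is why the abstract refers to them as half modules.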

Keywords

Text-to-Speech (TTS)

Speech synthesis

Conformer

Deep learning

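In this paper, we propose a Korean menu speech synthesizer using Conformer-based FastSpeech2. The Conformer, originally proposed for speech recognition, combines a convolutional neural network with a transformer so that both global and local information can be extracted well. To this end, the feed-forward network is split in half and placed at the very beginning and the very end, forming a macaron structure that wraps the multi-head self-attention module and the convolution module. In this study, we introduce the Conformer architecture, whose strong performance has been confirmed in Korean speech recognition, into Korean speech synthesis. For comparison with an existing speech synthesis model, we trained a transformer-based FastSpeech2 and a Conformer-based FastSpeech2. The data set was built in-house with the phoneme distribution taken into account. In particular, we built not only a general-conversation corpus but also a corpus specialized for food-ordering sentences and used it for speech synthesis training. This remedies the problems that existing speech synthesis systems have with loanword pronunciation. When synthesized speech was generated with Parallel WaveGAN and evaluated, the Conformer-based FastSpeech2 achieved a superior MOS of 4.04. This study confirms that, in a Korean speech synthesis model, performance improves when the same structure is changed from transformer to Conformer.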
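The MOS figure reported in the abstract is an average of listener ratings. As a minimal sketch of how such a score is commonly computed, the function below averages ratings on a 1 to 5 scale and adds a normal-approximation 95% confidence interval. The function name `mean_opinion_score`, the 1 to 5 scale, the confidence interval, and the example ratings are all assumptions made for illustration; the abstract does not describe the evaluation protocol, and the ratings are not the paper's data.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings, invented for illustration only (not the paper's data).
example_ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4]
mos, ci = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to judge whether a difference between two systems is larger than the listener-to-listener variation.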



Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 41
No: 3
Pages: 359-366
Received Date: 2022-03-21
Revised Date: 2022-04-29
Accepted Date: 2022-05-23
DOI: https://doi.org/10.7776/ASK.2022.41.3.359

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online), 한국음향학회
