Research Article
Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” Proc. ICLR, 1-15 (2021).
X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He, F. Soong, T. Qin, S. Zhao, and T.-Y. Liu, “NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality,” IEEE Trans. Pattern Anal. Mach. Intell. 46, 4234-4245 (2024). doi: 10.1109/TPAMI.2024.3356232
Y. A. Li, C. Han, V. S. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” Adv. Neural Inf. Process. Syst. 36, 19594-19621 (2023).
Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, Y. Zhang, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” Proc. ICML (PMLR 80), 5180-5189 (2018).
D. Diatlova and V. Shutov, “EmoSpeech: Guiding FastSpeech2 towards emotional text-to-speech,” Proc. 12th ISCA Speech Synthesis Workshop (SSW12), 106-111 (2023). doi: 10.21437/SSW.2023-17
J. H. Turner, Human Emotions: A Sociological Theory (Routledge, London, 2007), pp. 1-256. doi: 10.4324/9780203961278
O. Kwon, C.-R. Kim, and G. Kim, “Factors affecting the intensity of emotional expressions in mobile communications,” Online Inf. Rev. 37, 114-131 (2013). doi: 10.1108/14684521311311667
T. Li, S. Yang, L. Xue, and L. Xie, “Controllable emotion transfer for end-to-end speech synthesis,” Proc. ISCSLP, 1-5 (2021). doi: 10.1109/ISCSLP49672.2021.9362069
S. Wang, J. Guðnason, and D. Borth, “Fine-grained emotional control of text-to-speech: Learning to rank inter- and intra-class emotion intensities,” Proc. ICASSP, 1-5 (2023). doi: 10.1109/ICASSP49357.2023.10097118
D.-H. Cho, H.-S. Oh, S.-B. Kim, S.-H. Lee, and S.-W. Lee, “EmoSphere-TTS: Emotional style and intensity modeling via spherical emotion vector for controllable emotional text-to-speech,” Proc. Interspeech, 1810-1814 (2024). doi: 10.21437/Interspeech.2024-398
K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Speech synthesis with mixed emotions,” IEEE Trans. Affective Comput. 14, 3120-3134 (2023). doi: 10.1109/TAFFC.2022.3233324
X. Zhu, S. Yang, G. Yang, and L. Xie, “Controlling emotion strength with relative attribute for end-to-end speech synthesis,” Proc. IEEE ASRU, 192-199 (2019). doi: 10.1109/ASRU46091.2019.9003829
J. Park, J. Park, Z. Xiong, N. Lee, J. Cho, S. Oymak, K. Lee, and D. Papailiopoulos, “Can mamba learn how to learn? A comparative study on in-context learning tasks,” Proc. ICML (PMLR 235), 39793-39812 (2024).
A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” Proc. ICLR, 1-36 (2024).
L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv:2401.09417 (2024).
Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, “VMamba: Visual state space model,” arXiv:2401.10166 (2024).
K. Miyazaki, Y. Masuyama, and M. Murata, “Exploring the capability of mamba in speech applications,” arXiv:2406.16808 (2024). doi: 10.21437/Interspeech.2024-994
X. Jiang, Y. A. Li, A. N. Florea, C. Han, and N. Mesgarani, “Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis,” arXiv:2407.09732 (2024). doi: 10.1109/ICASSP49660.2025.10889391
X. Zhang, Q. Zhang, H. Liu, T. Xiao, X. Qian, B. Ahmed, E. Ambikairajah, H. Li, and J. Epps, “Mamba in speech: Towards an alternative to self-attention,” arXiv:2405.12609 (2024). doi: 10.1109/TASLPRO.2025.3566210
M. H. Erol, A. Senocak, J. Feng, and J. S. Chung, “Audio mamba: Bidirectional state space model for audio representation learning,” IEEE Signal Process. Lett. 31, 2975-2979 (2024). doi: 10.1109/LSP.2024.3483009
D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, “Hungry hungry hippos: Towards language modeling with state space models,” Proc. ICLR, 1-27 (2023).
A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” Proc. ICLR, 1-32 (2022).
K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice conversion: Theory, databases and ESD,” Speech Commun. 137, 1-18 (2022). doi: 10.1016/j.specom.2021.11.006
G. Kim, D. K. Han, and H. Ko, “SpecMix: A mixed sample data augmentation method for training with time-frequency domain features,” Proc. Interspeech, 546-550 (2021). doi: 10.21437/Interspeech.2021-103
S. Mun, S. Park, D. K. Han, and H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM hyper-plane,” Proc. DCASE, 93-97 (2017).
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” Proc. ICLR, 1-13 (2018).
D. Parikh and K. Grauman, “Relative attributes,” Proc. ICCV, 503-510 (2011). doi: 10.1109/ICCV.2011.6126281
C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” Proc. ICASSP, 21-25 (2021). doi: 10.1109/ICASSP39728.2021.9413901
A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” Proc. Interspeech, 5036-5040 (2020). doi: 10.21437/Interspeech.2020-3015
C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text-to-speech synthesizers,” arXiv:2301.02111 (2023).
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” Proc. 9th SSW, 125 (2016).
S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Netw. 107, 3-11 (2018). doi: 10.1016/j.neunet.2017.12.012
Y. Kim, K. Ko, J. Lee, and H. Ko, “CAS-TJ: Channel attention shuffle and temporal jigsaw for audio classification,” Appl. Acoust. 233, 110590 (2025). doi: 10.1016/j.apacoust.2025.110590
S. Lee, D. K. Han, and H. Ko, “Multimodal emotion recognition fusion analysis adapting BERT with heterogeneous feature unification,” IEEE Access 9, 94557-94572 (2021). doi: 10.1109/ACCESS.2021.3092735
T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, and X. Tan, “ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit,” Proc. ICASSP, 7654-7658 (2020). doi: 10.1109/ICASSP40776.2020.9053512
R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” Proc. ICASSP, 6199-6203 (2020). doi: 10.1109/ICASSP40776.2020.9053795
- Publisher : The Acoustical Society of Korea
- Publisher (Ko) : 한국음향학회
- Journal Title : The Journal of the Acoustical Society of Korea
- Journal Title (Ko) : 한국음향학회지
- Volume : 44
- No. : 5
- Pages : 496-507
- Received Date : 2025-06-17
- Accepted Date : 2025-08-11
- DOI : https://doi.org/10.7776/ASK.2025.44.5.496


