Research Article

30 September 2025, pp. 496-507
References
1. Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” Proc. ICLR, 1-15 (2021).
2. X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He, F. Soong, T. Qin, S. Zhao, and T.-Y. Liu, “NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality,” IEEE Trans. Pattern Anal. Mach. Intell. 46, 4234-4245 (2024). doi: 10.1109/TPAMI.2024.3356232
3. Y. A. Li, C. Han, V. S. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” Adv. Neural Inf. Process. Syst. 36, 19594-19621 (2023).
4. Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, Y. Zhang, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” Proc. ICML (PMLR 80), 5180-5189 (2018).
5. D. Diatlova and V. Shutov, “EmoSpeech: Guiding FastSpeech2 towards emotional text-to-speech,” Proc. 12th ISCA Speech Synthesis Workshop (SSW12), 106-111 (2023). doi: 10.21437/SSW.2023-17
6. J. H. Turner, Human Emotions: A Sociological Theory (Routledge, London, 2007), pp. 1-256. doi: 10.4324/9780203961278
7. O. Kwon, C.-R. Kim, and G. Kim, “Factors affecting the intensity of emotional expressions in mobile communications,” Online Inf. Rev. 37, 114-131 (2013). doi: 10.1108/14684521311311667
8. T. Li, S. Yang, L. Xue, and L. Xie, “Controllable emotion transfer for end-to-end speech synthesis,” Proc. ISCSLP, 1-5 (2021). doi: 10.1109/ISCSLP49672.2021.9362069
9. S. Wang, J. Guðnason, and D. Borth, “Fine-grained emotional control of text-to-speech: Learning to rank inter- and intra-class emotion intensities,” Proc. ICASSP, 1-5 (2023). doi: 10.1109/ICASSP49357.2023.10097118
10. D.-H. Cho, H.-S. Oh, S.-B. Kim, S.-H. Lee, and S.-W. Lee, “EmoSphere-TTS: Emotional style and intensity modeling via spherical emotion vector for controllable emotional text-to-speech,” Proc. Interspeech, 1810-1814 (2024). doi: 10.21437/Interspeech.2024-398
11. K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Speech synthesis with mixed emotions,” IEEE Trans. Affective Comput. 14, 3120-3134 (2023). doi: 10.1109/TAFFC.2022.3233324
12. X. Zhu, S. Yang, G. Yang, and L. Xie, “Controlling emotion strength with relative attribute for end-to-end speech synthesis,” Proc. IEEE ASRU, 192-199 (2019). doi: 10.1109/ASRU46091.2019.9003829
13. J. Park, J. Park, Z. Xiong, N. Lee, J. Cho, S. Oymak, K. Lee, and D. Papailiopoulos, “Can mamba learn how to learn? A comparative study on in-context learning tasks,” Proc. ICML (PMLR 235), 39793-39812 (2024).
14. A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” Proc. ICLR, 1-36 (2024).
15. L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv:2401.09417 (2024).
16. Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, “VMamba: Visual state space model,” arXiv:2401.10166 (2024).
17. K. Miyazaki, Y. Masuyama, and M. Murata, “Exploring the capability of mamba in speech applications,” arXiv:2406.16808 (2024). doi: 10.21437/Interspeech.2024-994
18. X. Jiang, Y. A. Li, A. N. Florea, C. Han, and N. Mesgarani, “Speech slytherin: Examining the performance and efficiency of mamba for speech separation, recognition, and synthesis,” arXiv:2407.09732 (2024). doi: 10.1109/ICASSP49660.2025.10889391
19. X. Zhang, Q. Zhang, H. Liu, T. Xiao, X. Qian, B. Ahmed, E. Ambikairajah, H. Li, and J. Epps, “Mamba in speech: Towards an alternative to self-attention,” arXiv:2405.12609 (2024). doi: 10.1109/TASLPRO.2025.3566210
20. M. H. Erol, A. Senocak, J. Feng, and J. S. Chung, “Audio mamba: Bidirectional state space model for audio representation learning,” IEEE Signal Process. Lett. 31, 2975-2979 (2024). doi: 10.1109/LSP.2024.3483009
21. D. Y. Fu, T. Dao, K. K. Saab, A. W. Thomas, A. Rudra, and C. Ré, “Hungry hungry hippos: Towards language modeling with state space models,” Proc. ICLR, 1-27 (2023).
22. A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” Proc. ICLR, 1-32 (2022).
23. K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice conversion: Theory, databases and ESD,” Speech Commun. 137, 1-18 (2022). doi: 10.1016/j.specom.2021.11.006
24. G. Kim, D. K. Han, and H. Ko, “SpecMix: A mixed sample data augmentation method for training with time-frequency domain features,” Proc. Interspeech, 546-550 (2021). doi: 10.21437/Interspeech.2021-103
25. S. Mun, S. Park, D. K. Han, and H. Ko, “Generative adversarial network based acoustic scene training set augmentation and selection using SVM Hyper-Plane,” Proc. DCASE, 93-97 (2017).
26. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” Proc. ICLR, 1-13 (2018).
27. D. Parikh and K. Grauman, “Relative attributes,” Proc. ICCV, 503-510 (2011). doi: 10.1109/ICCV.2011.6126281
28. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” Proc. ICASSP, 21-25 (2021). doi: 10.1109/ICASSP39728.2021.9413901
29. A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang, “Conformer: Convolution-augmented transformer for speech recognition,” Proc. Interspeech, 5036-5040 (2020). doi: 10.21437/Interspeech.2020-3015
30. C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text-to-speech synthesizers,” arXiv:2301.02111 (2023).
31. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” Proc. 9th SSW, 125 (2016).
32. S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Netw. 107, 3-11 (2018). doi: 10.1016/j.neunet.2017.12.012
33. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv:1607.06450 (2016).
34. Y. Kim, K. Ko, J. Lee, and H. Ko, “CAS-TJ: Channel attention shuffle and temporal jigsaw for audio classification,” Appl. Acoust. 233, 110590 (2025). doi: 10.1016/j.apacoust.2025.110590
35. S. Lee, D. K. Han, and H. Ko, “Multimodal emotion recognition fusion analysis adapting BERT with heterogeneous feature unification,” IEEE Access 9, 94557-94572 (2021). doi: 10.1109/ACCESS.2021.3092735
36. T. Hayashi, R. Yamamoto, K. Inoue, T. Yoshimura, S. Watanabe, T. Toda, K. Takeda, Y. Zhang, and X. Tan, “ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit,” Proc. ICASSP, 7654-7658 (2020). doi: 10.1109/ICASSP40776.2020.9053512
37. R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” Proc. ICASSP, 6199-6203 (2020). doi: 10.1109/ICASSP40776.2020.9053795
38. T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS challenge 2022,” Proc. Interspeech, 4521-4525 (2022). doi: 10.21437/Interspeech.2022-439
39. Resemblyzer, https://github.com/resemble-ai/Resemblyzer (Last viewed September 17, 2025).
40. Parselmouth, https://github.com/YannickJadoul/Parselmouth (Last viewed September 17, 2025).

Information
  • Publisher: The Acoustical Society of Korea
  • Publisher (Korean): 한국음향학회
  • Journal Title: The Journal of the Acoustical Society of Korea
  • Journal Title (Korean): 한국음향학회지
  • Volume: 44
  • Issue: 5
  • Pages: 496-507
  • Received Date: 2025-06-17
  • Accepted Date: 2025-08-11