S. Katkov, A. Liotta, and A. Vietti, “Evaluating the robustness of ASR systems in adverse acoustic conditions,” Proc. 5th IDSTA, 76-80 (2024). doi: 10.1109/IDSTA62194.2024.10746999
V. K. Singh, K. Sharma, and S. N. Sur, “A survey on preprocessing and classification techniques for acoustic scene,” Expert Syst. Appl. 229, 120520 (2023). doi: 10.1016/j.eswa.2023.120520
T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” Proc. Interspeech, 3586-3590 (2015). doi: 10.21437/Interspeech.2015-711
L. Wijayasingha and J. A. Stankovic, “Robustness to noise for speech emotion classification using CNNs and attention mechanisms,” Smart Health 19, 100165 (2021). doi: 10.1016/j.smhl.2020.100165
S. E. Bou-Ghazale and K. Assaleh, “A robust endpoint detection of speech for noisy environments with application to automatic speech recognition,” Proc. IEEE ICASSP, IV-3808 (2002). doi: 10.1109/ICASSP.2002.5745486
M. N. Ali, A. Brutti, and D. Falavigna, “Enhancing embeddings for speech classification in noisy conditions,” Proc. Interspeech, 2933-2937 (2022). doi: 10.21437/Interspeech.2022-10707
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” Proc. IEEE ICASSP, 5206-5210 (2015). doi: 10.1109/ICASSP.2015.7178964
C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” The Centre for Speech Technology Research (CSTR), University of Edinburgh, Tech. Rep., 2017.
T.-P. Doan, H. Dinh-Xuan, T. Ryu, I. Kim, W. Lee, K. Hong, and S. Jung, “Trident of Poseidon: A generalized approach for detecting deepfake voices,” Proc. CCS, 2222-2235 (2024). doi: 10.1145/3658644.3690311
J. S. Garofolo, L. F. Lamel, W. M. Fisher, D. S. Pallett, N. L. Dahlgren, V. Zue, and J. G. Fiscus, “TIMIT acoustic-phonetic continuous speech corpus,” National Institute of Standards and Technology (NIST), Tech. Rep., 1993.
D. Snyder, G. Chen, and D. Povey, “MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484 (2015).
K. J. Piczak, “ESC: Dataset for environmental sound classification,” Proc. 23rd ACM Int. Conf. Multimedia, 1015-1018 (2015). doi: 10.1145/2733373.2806390
Simple Auto-Tune in Python, https://github.com/JanWilczek/python-auto-tune (Last viewed October 10, 2024).
T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” Proc. IEEE ICASSP, 5220-5224 (2017). doi: 10.1109/ICASSP.2017.7953152
J. B. Awotunde, R. O. Ogundokun, F. E. Ayo, and O. E. Matiluko, “Speech segregation in background noise based on deep learning,” IEEE Access 8, 169568-169575 (2020). doi: 10.1109/ACCESS.2020.3024077
S. Rosenzweig, S. Schwär, J. Driedger, and M. Müller, “Adaptive pitch-shifting with applications to intonation adjustment in a cappella recordings,” Proc. 24th DAFx, 121-128 (2021). doi: 10.23919/DAFx51585.2021.9768268
X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, and Z. H. Ling, “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Comput. Speech Lang. 64, 101114 (2020). doi: 10.1016/j.csl.2020.101114
B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” arXiv:2005.07143 (2020). doi: 10.21437/Interspeech.2020-2650
H. J. Heo, U. H. Shin, R. Lee, Y. Cheon, and H. M. Park, “NeXt-TDNN: Modernizing multi-scale temporal convolution backbone for speaker verification,” Proc. IEEE ICASSP, 11186-11190 (2024). doi: 10.1109/ICASSP48485.2024.10447037
J. Thienpondt and K. Demuynck, “ECAPA2: A hybrid neural network architecture and training strategy for robust speaker embeddings,” Proc. IEEE ASRU, 1-8 (2023). doi: 10.1109/ASRU57964.2023.10389750
J. W. Jung, H. S. Heo, H. Tak, H. J. Shim, J. S. Chung, B. J. Lee, and N. Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” Proc. IEEE ICASSP, 6367-6371 (2022). doi: 10.1109/ICASSP43922.2022.9747766
H. Tak, M. Todisco, X. Wang, J. W. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” arXiv:2202.12233 (2022). doi: 10.21437/Odyssey.2022-16
D. T. Truong, R. Tao, T. Nguyen, H. T. Luong, K. A. Lee, and E. S. Chng, “Temporal-channel modeling in multi-head self-attention for synthetic speech detection,” arXiv:2406.17376 (2024). doi: 10.21437/Interspeech.2024-659
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Proc. NeurIPS, 12449-12460 (2020).
- Publisher: The Acoustical Society of Korea
- Publisher (Ko): 한국음향학회
- Journal Title: The Journal of the Acoustical Society of Korea
- Journal Title (Ko): 한국음향학회지
- Volume: 44
- No.: 5
- Pages: 533-539
- Received Date: 2025-08-01
- Revised Date: 2025-08-28
- Accepted Date: 2025-09-01
- DOI: https://doi.org/10.7776/ASK.2025.44.5.533


