A study on two stage acoustic classification neural network training algorithm from pretrained models for small scale data environments

Seunghyeon Shin; Minhan Kim; Seokjin Lee

doi:10.7776/ASK.2025.44.3.270


2025 Vol.44, Issue 3

Research Article


31 May 2025. pp. 270-280


Abstract

Training data directly affects the performance of a neural network trained by machine learning. When training data is limited, larger neural networks perform worse than simpler ones. To address this data scarcity, we propose a two-stage method that combines a feature extraction network and a classifier network with pretrained models. We evaluated the proposed method against conventional networks on small-scale datasets. At similar complexity levels, the proposed method achieved better classification performance. It also improved performance for complex models, at sizes where conventionally trained networks of similar complexity typically degrade, which shows that the method is effective under data constraints.
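The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: the frozen random projection standing in for the pretrained model, the linear feature extractor, the softmax classifier, the synthetic data, and every name and hyperparameter below are assumptions chosen only to keep the example self-contained. Stage 1 fits a small feature extractor to the outputs of the frozen "pretrained" model; stage 2 trains a classifier on the frozen extractor's features using a small labeled set.

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_EMB, N_CLASSES = 40, 16, 3  # toy sizes, chosen only for illustration

# Hypothetical stand-in for a large pretrained model: a frozen nonlinear map.
TEACHER_W = rng.normal(size=(N_IN, N_EMB)) / np.sqrt(N_IN)
CLASS_MEANS = rng.normal(size=(N_CLASSES, N_IN))


def pretrained_embed(x):
    """Frozen 'pretrained' embedding; it is never updated below."""
    return np.tanh(x @ TEACHER_W)


def make_clips(n):
    """Synthetic labeled inputs: class-dependent mean plus unit noise."""
    y = rng.integers(0, N_CLASSES, size=n)
    return CLASS_MEANS[y] + rng.normal(size=(n, N_IN)), y


def train_extractor(x, targets, lr=0.01, steps=500):
    """Stage 1: fit a small linear extractor to the pretrained outputs."""
    w = np.zeros((x.shape[1], targets.shape[1]))
    losses = []
    for _ in range(steps):
        err = x @ w - targets
        losses.append(float(np.mean(err ** 2)))
        # Gradient of the squared error, up to a constant factor.
        w -= lr * x.T @ err / len(x)
    return w, losses


def train_classifier(feats, y, lr=0.1, steps=300):
    """Stage 2: softmax classifier on the frozen extractor's features."""
    w = np.zeros((feats.shape[1], N_CLASSES))
    b = np.zeros(N_CLASSES)
    onehot = np.eye(N_CLASSES)[y]
    for _ in range(steps):
        logits = feats @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(y)  # cross-entropy gradient w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b


x_train, y_train = make_clips(60)  # deliberately small training set
x_test, y_test = make_clips(200)

# Stage 1 needs no labels: the targets are the pretrained model's outputs.
w_ext, stage1_losses = train_extractor(x_train, pretrained_embed(x_train))
# Stage 2 keeps the extractor fixed and trains only the classifier.
w_clf, b_clf = train_classifier(x_train @ w_ext, y_train)

pred = np.argmax(x_test @ w_ext @ w_clf + b_clf, axis=1)
accuracy = float(np.mean(pred == y_test))
print(f"stage-1 loss: {stage1_losses[0]:.3f} -> {stage1_losses[-1]:.3f}")
print(f"test accuracy: {accuracy:.2f}")
```

In this sketch the pretrained model is only needed during stage 1, so the deployed network consists of the small extractor and the classifier alone. Whether the paper's networks share that property is not stated in the abstract.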

Keywords

Machine learning

Small scale data

Pretrained model

Acoustic scene classification

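When a neural network is trained by machine learning, the training data has a direct and large effect on performance. When the amount of available training data is limited and the number of parameters in the network exceeds a certain scale, training is adversely affected, and performance falls below that of a network with fewer parameters. To mitigate this problem caused by data scarcity, this paper proposes a neural network structure composed of two stages, a feature extraction network and a classifier network, together with a method that trains the feature extraction network using the outputs of a large neural network model that has already been pretrained. The proposed method and conventional networks were trained on small-scale data, and their acoustic scene classification performance and network complexity were measured. At a similar level of complexity, the proposed method achieved better classification results than conventionally trained networks. It also trained effectively and improved classification performance at network scales where the classification performance of conventionally trained networks degrades.

Keywords

Machine learning

Small scale data

Pretrained model

Acoustic scene recognition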


Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 44
No: 3
Pages: 270-280
Received Date: 2025-03-24
Revised Date: 2025-04-22
Accepted Date: 2025-04-23
DOI: https://doi.org/10.7776/ASK.2025.44.3.270

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online), 한국음향학회
