
2025, Vol. 44, Issue 2

Research Article

31 March 2025. pp. 132-143
References
1. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," Proc. 40th ICML, 28492-28518 (2023).

2. Y. Gong, S. Khurana, L. Karlinsky, and J. Glass, "Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers," Proc. Interspeech, 2798-2802 (2023). doi:10.21437/Interspeech.2023-2193

3. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Adv. Neural Inf. Process. Syst. 33, 12449-12460 (2020).

4. W. N. Hsu, B. Bolte, Y. H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Trans. Audio Speech Lang. Process. 29, 3451-3460 (2021). doi:10.1109/TASLP.2021.3122291

5. S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "CNN architectures for large-scale audio classification," Proc. IEEE ICASSP, 131-135 (2017). doi:10.1109/ICASSP.2017.7952132

6. Y. Gong, Y. A. Chung, and J. Glass, "AST: Audio spectrogram transformer," Proc. Interspeech, 571-575 (2021). doi:10.21437/Interspeech.2021-698

7. P. Y. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, "Masked autoencoders that listen," Adv. Neural Inf. Process. Syst. 35, 28708-28720 (2022).

8. J. Kim, K. Min, M. Jung, and S. Chi, "Occupant behavior monitoring and emergency event detection in single-person households using deep learning-based sound recognition," Build. Environ. 181, 107092 (2020). doi:10.1016/j.buildenv.2020.107092

9. J. Sharma, O. C. Granmo, and M. Goodwin, "Emergency detection with environment sound using deep convolutional neural networks," Proc. 5th ICICT, 144-154 (2020). doi:10.1007/978-981-15-5859-7_14

10. Y. J. Jeong, Y. A. Jung, S. H. Kim, and D. H. Kim, "Implementation of integrated platform of risk prevention and STT service for the deaf using deep learning," J. Digit. Contents Soc. 23, 1459-1467 (2022). doi:10.9728/dcs.2022.23.8.1459

11. D. Macháček, R. Dabre, and O. Bojar, "Turning Whisper into real-time transcription system," Proc. IJCNLP-AACL Demos, 17-24 (2023). doi:10.18653/v1/2023.ijcnlp-demo.3

12. D. Liu, G. Spanakis, and J. Niehues, "Low-latency sequence-to-sequence speech recognition and translation by partial hypothesis selection," Proc. Interspeech, 3620-3624 (2020). doi:10.21437/Interspeech.2020-2897

13. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," Adv. Neural Inf. Process. Syst. 31, 6000-6010 (2017).

Information
  • Publisher: The Acoustical Society of Korea
  • Publisher (Ko): 한국음향학회
  • Journal Title: The Journal of the Acoustical Society of Korea
  • Journal Title (Ko): 한국음향학회지
  • Volume: 44
  • No.: 2
  • Pages: 132-143
  • Received Date: 2025-01-03
  • Accepted Date: 2025-02-27