Combining deep learning-based online beamforming with spectral subtraction for speech recognition in noisy environments

Sung-Wook Yoon; Oh-Wook Kwon

doi:10.7776/ASK.2021.40.5.439

2021 Vol.40, Issue 5

Research Article

30 September 2021. pp. 439-451

Abstract

We propose a deep learning-based beamformer combined with spectral subtraction for continuous speech recognition in noisy environments. Conventional beamforming systems have mostly been evaluated on pre-segmented audio signals, typically generated on a computer by mixing speech and noise so that they overlap throughout. In real environments, however, utterances occur sparsely along the time axis, and conventional beamforming systems degrade when noise-only signals without speech are input. To alleviate this drawback, we combine an online beamforming algorithm with spectral subtraction. We construct a Continuous Speech Enhancement (CSE) evaluation set to evaluate the online beamforming algorithm in noisy environments. The evaluation set is built by mixing sparsely occurring speech utterances from the CHiME3 evaluation set with continuously played CHiME3 background noise and background music from MUSDB. Using a Kaldi-based toolkit and the Google web speech recognizer as speech recognition back-ends, we confirm that the proposed online beamforming algorithm with spectral subtraction performs better than the baseline online algorithm.
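As a rough illustration of the spectral subtraction step named in the abstract, the sketch below applies generic magnitude-domain subtraction to a spectrogram such as a beamformer output. The abstract does not specify which variant the authors use or how it is coupled to the beamformer, so the function name, the over-subtraction factor `alpha`, the spectral `floor`, and the idea of a fixed per-bin noise estimate are all assumptions made for illustration.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.01):
    """Generic magnitude spectral subtraction (illustrative, not the paper's exact method).

    noisy_mag: array of shape (frames, bins), magnitude spectrogram of the input
    noise_mag: array of shape (bins,), assumed noise magnitude estimate
    alpha:     over-subtraction factor (hypothetical default)
    floor:     spectral floor as a fraction of the input magnitude (hypothetical default)
    """
    subtracted = noisy_mag - alpha * noise_mag
    # Clamp to a small fraction of the input so magnitudes never go negative.
    return np.maximum(subtracted, floor * noisy_mag)

# Toy input: one frame containing speech plus noise, one noise-only frame.
noise = np.array([1.0, 2.0, 0.5])
frames = np.array([[3.0, 2.5, 0.5],
                   [1.0, 2.0, 0.5]])
enhanced = spectral_subtraction(frames, noise)
```

In this toy case the noise-only frame is pushed down to the spectral floor, which is the behavior one would want during the speech-free stretches the abstract describes.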

Keywords

Online beamforming

Deep learning

Spectral subtraction

Continuous speech enhancement

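In this paper, we propose a beamformer that combines a deep learning-based online beamforming algorithm with spectral subtraction for continuous speech enhancement in real environments. Existing beamforming systems have mostly been evaluated using pre-segmented audio signals generated on a computer by mixing speech and noise so that they fully overlap. In real environments, however, speech utterances occur sparsely along the time axis, so the performance of existing beamforming algorithms degrades when noise signals without speech are input to the system. To mitigate this effect, we combine a deep learning-based online beamforming algorithm with spectral subtraction. To evaluate the online beamforming algorithm in noisy environments, we constructed a continuous speech enhancement set. The evaluation set was built by mixing speech utterances extracted from the CHiME3 evaluation set with CHiME3 background noise and continuously played background music extracted from MUSDB. A Kaldi-based toolkit and the Google web speech recognizer were used as speech recognizers. We confirmed that the proposed online beamforming algorithm with spectral subtraction improves performance over the baseline beamforming algorithm.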
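The construction of an evaluation signal with sparsely occurring utterances over continuous background noise can be sketched as follows. This is a single-channel toy under stated assumptions: the function name, the onset positions, and the per-utterance SNR scaling are hypothetical, and the abstract does not state the SNRs, spacing, or multi-channel handling used for the actual CSE set.

```python
import numpy as np

def mix_sparse_utterances(noise, utterances, onsets, snr_db):
    """Add utterances at the given sample onsets on top of continuous noise.

    Each utterance is scaled so that its power relative to the noise
    underneath it equals snr_db. All parameters are illustrative.
    """
    mixture = noise.astype(float).copy()
    for utt, start in zip(utterances, onsets):
        segment = noise[start:start + len(utt)]
        if len(segment) < len(utt):
            raise ValueError("utterance runs past the end of the noise")
        speech_power = np.mean(utt.astype(float) ** 2)
        noise_power = np.mean(segment.astype(float) ** 2)
        gain = np.sqrt(noise_power / speech_power * 10 ** (snr_db / 10))
        mixture[start:start + len(utt)] += gain * utt
    return mixture

# Synthetic stand-ins for real recordings (random signals, not CHiME3 or MUSDB data).
rng = np.random.default_rng(0)
noise = rng.standard_normal(8000)
utterances = [rng.standard_normal(1000), rng.standard_normal(1500)]
onsets = [1000, 5000]
mixture = mix_sparse_utterances(noise, utterances, onsets, snr_db=5.0)
```

Samples outside the utterance regions remain pure noise, which reproduces the noise-only stretches that a continuously running beamformer has to cope with.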

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 40
No: 5
Pages: 439-451
Received Date: 2021-06-10
Accepted Date: 2021-08-09
DOI: https://doi.org/10.7776/ASK.2021.40.5.439

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online), 한국음향학회
