Speech extraction based on AuxIVA with weighted source variance and noise dependence for robust speech recognition

Ui-Hyeop Shin; Hyung-Min Park

doi:10.7776/ASK.2022.41.3.326

2022 Vol.41, Issue 3

Research Article

31 May 2022. pp. 326-334

Abstract

In this paper, we propose a speech enhancement algorithm as a pre-processing step for robust speech recognition in noisy environments. Auxiliary-function-based Independent Vector Analysis (AuxIVA) is performed with a weighted covariance matrix whose weights come from time-varying variances, scaled by target masks that represent the time-frequency contributions of the target speech. The mask estimates can be obtained from a Neural Network (NN) pre-trained for speech extraction, or from diffuseness based on the Coherent-to-Diffuse power Ratio (CDR), which finds the direct-sound component of the target speech. In addition, the outputs for omni-directional noise are closely chained by sharing the time-varying variances, similarly to independent subspace analysis or IVA. The AuxIVA-based speech extraction method is also performed in the Independent Low-Rank Matrix Analysis (ILRMA) framework by extending the Non-negative Matrix Factorization (NMF) for noise outputs to Non-negative Tensor Factorization (NTF), which maintains the inter-channel dependency among the noise output channels. Experimental results on the CHiME-4 dataset demonstrate the effectiveness of the presented algorithms.
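The weighted covariance matrix and the demixing-filter update described above can be sketched for a single frequency bin as follows. This is only an illustrative sketch: the abstract does not give the exact weighting formula, so the form `mask / |y|^2` in the hypothetical helper `target_weights` is an assumption, and the update in `ip_update` is the standard iterative-projection step of AuxIVA rather than the authors' exact procedure. All function names are hypothetical.

```python
import numpy as np

def target_weights(y_target, mask, eps=1e-6):
    """Per-frame weights for the target output of one frequency bin.

    ASSUMPTION: the time-varying variance is estimated as |y|^2 and the
    mask scales the weight as mask / variance. The abstract does not
    state the exact formula.
    """
    power = np.maximum(np.abs(y_target) ** 2, eps)
    return np.maximum(mask, eps) / power

def weighted_covariance(X, weights):
    """V = (1/T) * sum_t weights[t] * x_t x_t^H for X of shape (T, M)."""
    return (X.T * weights) @ X.conj() / X.shape[0]

def ip_update(W, V, k):
    """Standard AuxIVA iterative-projection update of row k of W.

    Rows of W are w_k^H, so the outputs are y_t = W @ x_t.
    """
    M = W.shape[0]
    w = np.linalg.solve(W @ V, np.eye(M)[:, k])   # w_k = (W V)^{-1} e_k
    w /= np.sqrt(np.real(w.conj() @ V @ w))       # enforce w_k^H V w_k = 1
    W_new = W.copy()
    W_new[k] = w.conj()
    return W_new
```

Given the frames `X` of one bin (shape `(T, M)`) and a mask of length `T`, the target output is `X @ W[0]`, and one update of the target filter is `ip_update(W, weighted_covariance(X, target_weights(X @ W[0], mask)), 0)`.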

Keywords

Independent vector analysis

Variance

Adaptive filtering

Robust speech recognition

Speech extraction

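This paper proposes a target speech enhancement method used as a pre-processing step for robust speech recognition in environments with background noise. Based on Auxiliary-function-based Independent Vector Analysis (AuxIVA), the weights in the weighted covariance matrix are determined by time-varying variances. The magnitude of these variances is scaled by a mask that represents the time-frequency contribution of the target speech. Such masks can be estimated from a neural network trained for speech enhancement, or from diffuseness used to find the contribution of the direct component from the target speaker. In addition, the outputs for the surrounding noise are made mutually dependent by adopting multidimensional independent component analysis, so that the noise components can be extracted stably. The AuxIVA-based target speech extraction algorithm can also be carried out in the Independent Low-Rank Matrix Analysis (ILRMA) framework by extending Non-negative Matrix Factorization (NMF) to Non-negative Tensor Factorization (NTF) for the noise. This extension preserves the inter-channel dependency among the noise output channels. Experimental results on the CHiME-4 dataset demonstrate the effectiveness of the presented algorithms.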
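The dependency among noise outputs can be illustrated with a shared time-varying variance. The sketch below is a minimal example under stated assumptions: the abstract says only that the noise outputs share their time-varying variances, so averaging the power over both noise channels and frequency bins is an assumption, and the function names `shared_noise_variance` and `noise_weights` are hypothetical.

```python
import numpy as np

def shared_noise_variance(Y_noise, eps=1e-6):
    """One variance per frame, shared by all noise outputs.

    Y_noise: complex array of shape (K, F, T) holding K noise outputs,
    F frequency bins and T frames.
    ASSUMPTION: the shared variance is the mean power over noise
    channels and frequency bins.
    """
    power = np.abs(Y_noise) ** 2
    return np.maximum(power.mean(axis=(0, 1)), eps)

def noise_weights(Y_noise, eps=1e-6):
    """Covariance weights 1 / variance, identical for every noise output."""
    return 1.0 / shared_noise_variance(Y_noise, eps)
```

Because every noise channel receives the same per-frame weight, all noise filters are updated from the same weighted covariance matrix, which is what ties the noise outputs together in this sketch.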

Information

Publisher: The Acoustical Society of Korea
Publisher (Ko): 한국음향학회
Journal Title: The Journal of the Acoustical Society of Korea
Journal Title (Ko): 한국음향학회지
Volume: 41
No: 3
Pages: 326-334
Received Date: 2022-03-21
Revised Date: 2022-05-03
Accepted Date: 2022-05-11
DOI: https://doi.org/10.7776/ASK.2022.41.3.326

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online), 한국음향학회