Noise Spectrum Estimation Using Line Spectral Frequencies for Robust Speech Recognition

Gil-Jin Jang; Jeong-Sik Park; Sanghun Kim

Gil-Jin Jang¹

Jeong-Sik Park²^*

Sanghun Kim³

¹Ulsan National Institute of Science and Technology (UNIST)

²Mokwon University, Daejeon, South Korea

³Speech and Language Information Research Department, Electronics and Telecommunications Research Institute

*Corresponding author

ABSTRACT

This paper presents a novel method for reliably estimating the noise spectral magnitude for acoustic background noise suppression when only a single microphone recording is available. The proposed method derives noise estimates from spectral magnitudes measured at line spectral frequencies (LSFs), based on the observation that adjacent LSFs lie near the peak frequencies and isolated LSFs lie close to the relatively flat valleys of LPC spectra. The parameters used in the proposed method are LPC coefficients, their corresponding LSFs, and the gain of the LPC residual signals, so it is well suited to LPC-based speech coders.

Keywords

Line spectral frequencies (LSF)

Noise suppression

Speech recognition

I. Introduction
II. Noise Spectrum Estimation
III. Proposed Noise Suppression Method
IV. Experimental Results
4.1 Database
4.2 Implementation Details
4.3 Illustrations of Noise Suppression Results
4.4 Speech Recognition Performance Comparison
V. Concluding Remarks

I. Introduction

Spectral subtraction ^[1] assumes that the noise is additive and changes slowly over time, so that the noise spectrum can be approximated by the average spectrum over non-speech periods. Errors in estimating the true noise spectrum lead directly to either speech attenuation or insufficient noise suppression, hence the performance depends closely on how reliable the noise spectrum estimates are. Most conventional methods rely on a voice activity detector (VAD), which decides whether the current input frame contains speech, so that the noise estimate is updated only when background noise alone is present. However, VAD performance varies considerably across noise conditions.

This paper proposes a novel procedure for noise spectral magnitude estimation that eliminates the need for a VAD in an efficient manner. From the basic LSF derivation formulae, it is observed that the local maxima of LPC spectra lie near adjacently located LSFs ^[2,3], while the relatively flat valleys lie around isolated LSFs. In the proposed method, the spectral magnitudes at the LSFs are regarded as representatives of the peaks and valleys of the corresponding LPC spectra, and are used to estimate the noise spectral magnitude. Without deciding whether the current analysis frame contains noise only, the distribution of the log spectral magnitudes at the LSFs is modeled by a mixture of two Gaussian probability density functions. The Gaussian with the smaller mean is taken as the noise distribution, and its mean is adopted as the noise spectral estimate. An online adaptation algorithm for the Gaussian parameters is also proposed so that real-time inputs can be handled; the noise Gaussian mean is updated at every time frame. A time-domain Wiener filter that suppresses the estimated noise spectral magnitude is computed for every time frame and frequency band, and applied to the input speech signal. The required parameters are LPC coefficients, LSFs, and excitation gains, all of which are available in most LP vocoders. The proposed method can therefore be integrated into LP vocoders with much less additional overhead than conventional noise suppression methods.

To assess the validity of the proposed method, automatic speech recognition experiments are carried out on the speech separation challenge database. The results show a significant improvement in speech recognition rates, with relatively less speech distortion, compared with the ETSI frontend and TIA’s EVRC standard noise suppression.

II. Noise Spectrum Estimation

The proposed method makes use of the properties of LPC analysis. The input speech signal is decomposed into a spectral envelope and an excitation signal, such that

$$x(n) = \sum_{k=1}^{p} a_k\, x(n-k) + g\, e(n), \qquad (1)$$

where $n$ is the sample index, $x(n)$ is the sampled input speech, $a_k$ are the prediction filter coefficients of order $p$, $e(n)$ is the excitation signal, and $g$ is a scalar gain chosen so that $e(n)$ has unit variance. Equation 1 is equivalently expressed in the frequency domain as

$$X(z) = \frac{g}{A(z)}\, E(z), \qquad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}, \qquad (2)$$

where $X(z)$ and $E(z)$ are the z-transforms of $x(n)$ and $e(n)$. $E(z)$ is spectrally flattened, so that $1/A(z)$ should contain most of the spectral envelope of the given input speech frame. For transmission purposes, $A(z)$ is expressed by the two reciprocal polynomials ^[4]:

$$P(z) = A(z) + z^{-(p+1)} A(z^{-1}), \qquad Q(z) = A(z) - z^{-(p+1)} A(z^{-1}). \qquad (3)$$

The angular positions of the unit-circle roots of these two auxiliary polynomials are called line spectral frequencies (LSFs). They are known to be efficient for coding LPC coefficients owing to their stability and low sensitivity to quantization error ^[5,6].
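As an illustration of how LSFs can be located, the following is a minimal sketch, not the authors' implementation. It assumes the sign convention $A(z) = 1 - \sum_k a_k z^{-k}$, builds the symmetric polynomial $P(z) = A(z) + z^{-(p+1)}A(z^{-1})$ and the antisymmetric polynomial $Q(z) = A(z) - z^{-(p+1)}A(z^{-1})$, and finds their zeros on the unit circle in $(0, \pi)$ by a grid search followed by bisection. The function name `lsf_from_lpc`, the grid size, and the test filter are hypothetical choices.

```python
import math


def lsf_from_lpc(a, grid=2048):
    """Return the LSFs (radians, ascending) of A(z) = 1 - sum_k a[k-1] z^-k."""
    c = [1.0] + [-ak for ak in a] + [0.0]      # A(z) coefficients, padded to degree p+1
    m = len(c) - 1                              # m = p + 1
    p_coef = [c[k] + c[m - k] for k in range(m + 1)]   # symmetric P(z)
    q_coef = [c[k] - c[m - k] for k in range(m + 1)]   # antisymmetric Q(z)

    # With the linear phase exp(-j*w*m/2) removed, P is purely real and
    # Q purely imaginary on the unit circle, so real zero finding suffices.
    def f_p(w):
        return sum(pk * math.cos(w * (m / 2 - k)) for k, pk in enumerate(p_coef))

    def f_q(w):
        return sum(qk * math.sin(w * (m / 2 - k)) for k, qk in enumerate(q_coef))

    def zeros(f):
        found = []
        ws = [math.pi * i / grid for i in range(1, grid)]   # open interval (0, pi)
        for lo, hi in zip(ws, ws[1:]):
            f_lo = f(lo)
            if f_lo * f(hi) < 0:                # a sign change brackets one zero
                for _ in range(60):             # refine by bisection
                    mid = 0.5 * (lo + hi)
                    f_mid = f(mid)
                    if f_lo * f_mid <= 0:
                        hi = mid
                    else:
                        lo, f_lo = mid, f_mid
                found.append(0.5 * (lo + hi))
        return found

    return sorted(zeros(f_p) + zeros(f_q))


# Second-order resonator with poles at 0.9 * exp(+-j*pi/4).
a = [2 * 0.9 * math.cos(math.pi / 4), -0.81]
lsf = lsf_from_lpc(a)
```

For this resonator the two LSFs bracket the pole angle $\pi/4$, which matches the observation that closely spaced LSFs surround a spectral peak. The trivial roots at $z = \pm 1$ fall on the interval endpoints and are excluded by the open grid.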

At each LSF, one of the auxiliary polynomials, $P(z)$ or $Q(z)$, is zero, so the magnitude of the LPC polynomial, $|A(e^{j\omega})|$, is close to a local minimum, since $|P(e^{j\omega})|$ and $|Q(e^{j\omega})|$ are monotonic between any pair of neighboring LSFs. Figure 1 illustrates the behavior of $|A(e^{j\omega})|$ at the LSFs. The two dotted lines in the figure are the frequency responses of $P(z)$ and $Q(z)$, the black solid line is the magnitude of the LP filter response, $|A(e^{j\omega})|$, and the lightly colored line is the spectral envelope approximated by $1/|A(e^{j\omega})|$. A pre-emphasis filter has been applied to boost high-frequency energies. Downward triangles are drawn on $|A(e^{j\omega})|$, and upward triangles on $1/|A(e^{j\omega})|$, at the root frequencies of $P(z)$ and $Q(z)$.

Fig. 1. Properties of LPC spectrum at line spectral frequencies.

As adjacent LSFs become closer, for example around 500 Hz, the magnitude of the LPC polynomial, $|A(e^{j\omega})|$, decreases and hence the spectral envelope $1/|A(e^{j\omega})|$ becomes more resonant around those frequencies ^[2,3]. However, when a single LSF is isolated, far from its neighbors, the magnitudes of the two auxiliary polynomials change slowly, so that the envelope is relatively flat. Therefore the LPC spectra at the LSFs represent either the spectral magnitude of speech at formant frequencies, between closely located LSFs, or the background noise spectral magnitude, at isolated LSFs. These properties are implicitly exploited by the proposed noise suppression procedure.

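By the definition of the discrete-time Fourier transform, the frequency response of the LPC polynomial $A(z)$ at frequency $\omega$ is expressed by

$$A(\omega) = 1 - \sum_{k=1}^{p} a_k\, e^{-j\omega k}, \qquad (4)$$

where $j = \sqrt{-1}$. The smoothed spectral magnitude at LSF $\omega_i$ at frame $t$, denoted by $S_t(\omega_i)$, is approximated by the product of the LPC spectral magnitude $1/|A_t(\omega_i)|$ and the frame gain, and is expressed in the log domain by

$$\log S_t(\omega_i) = \log g_t - \log |A_t(\omega_i)|, \qquad (5)$$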

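where $g_t$ is the gain at frame $t$ defined in Equation 2, which represents the relative magnitude difference across analysis frames. To model the global frequency characteristics of the input sounds, we approximate the long-term average of the spectral magnitude $S_t(\omega)$ by the long-term average frequency response of the LPC filters, denoted by $\bar{A}_t(\omega)$, which is updated at every frame by

$$\bar{A}_t(\omega) = (1-\eta)\, \bar{A}_{t-1}(\omega) + \eta\, A_t(\omega), \qquad (6)$$

starting from an initial value $\bar{A}_0(\omega)$, where $A_t(\omega)$ is the LPC frequency response at frame $t$. The adaptation rate $\eta$ is fixed to a value that gave good performance in our experiments. The LP spectral envelope is then normalized by the long-term average, and its log is approximated by the following equation:

$$\log \tilde{S}_t(\omega_i) = \log g_t - \log |A_t(\omega_i)| + \log |\bar{A}_t(\omega_i)|. \qquad (7)$$

By using $\tilde{S}_t(\omega_i)$ instead of $S_t(\omega_i)$, we can disregard the global shape of the noise, and a single noise estimate can be used regardless of frequency.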
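The normalization step can be sketched as follows. This is an illustrative reading of Equations 4 to 7 under stated assumptions, not the authors' code: the log spectral magnitude is taken as $\log g - \log|A(e^{j\omega})|$ with $A(z) = 1 - \sum_k a_k z^{-k}$, and the long-term average is assumed to be an exponentially weighted average of the LPC frequency response. The function names, the adaptation rate 0.1, and the flat initial average are hypothetical.

```python
import cmath
import math


def lpc_response(a, w):
    """A(e^{jw}) for A(z) = 1 - sum_k a[k-1] z^-k (assumed sign convention)."""
    return 1.0 - sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))


def log_magnitude(a, gain, w):
    """Natural-log envelope magnitude: log g - log|A(e^{jw})|."""
    return math.log(gain) - math.log(abs(lpc_response(a, w)))


def update_average(avg, a, w, rate):
    """Exponentially weighted long-term average of the LPC response (assumed form)."""
    return (1.0 - rate) * avg + rate * lpc_response(a, w)


def normalized_log_magnitude(a, gain, avg, w):
    """Log magnitude with the long-term envelope divided out."""
    return log_magnitude(a, gain, w) + math.log(abs(avg))


a, gain, w = [0.5], 2.0, 0.0
avg = 1.0 + 0.0j                      # flat initial average (assumption)
for _ in range(200):                  # stationary input: the same filter every frame
    avg = update_average(avg, a, w, rate=0.1)
y = normalized_log_magnitude(a, gain, avg, w)
```

For a stationary input the long-term average converges to the frame response, so the envelope cancels and only the log gain remains. This is the sense in which a single noise estimate can serve all frequencies.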

The distribution of the log spectral magnitudes at the LSFs is shown in Figure 2. The x-axis shows the quantized histogram intervals of the log spectral magnitude, and the y-axis shows the number of frames whose value falls in each interval. The sources are male speech and car factory noise from the Signal Processing Information Base (SPIB) database, available at http://spib.rice.edu/. Since the spectral energy of the factory noise is nearly stationary over time, its distribution has a pronounced peak between -2 and -1 on the x-axis. The speech spectral magnitude is more scattered and varies widely. The mixed distribution, shown by lightly colored bars, has a peak near that of the factory noise, and the proportion of low-energy components is greatly reduced. This is because the noise spectra, being consistent over time, mask the small spectral magnitudes of the speech signal. In the high-energy region (greater than -1.5 in Figure 2), where only speech is present, the spectral energy distribution of the speech signal dominates the mixed distribution.

Fig. 2. Distribution of log of LPC spectrum at LSFs multiplied by excitation gain.

III. Proposed Noise Suppression Method

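On the globally whitened log spectral magnitudes of Equation 7, a mixture of two Gaussian probability density functions is used to approximate the mean spectral magnitude of the noise. For each LSF $\omega_i$, the normalized log spectral magnitude of Equation 7 is written as $y$ for compact notation. Denoting by $\theta_N = \{\mu_N, \sigma_N^2\}$ the set of parameters of the noise Gaussian, and by $\theta_S = \{\mu_S, \sigma_S^2\}$ that of the speech Gaussian, the posterior probability of $y$ belonging to the noise Gaussian is expressed by

$$P(N \mid y) = \frac{P(N)\, p(y \mid \theta_N)}{P(N)\, p(y \mid \theta_N) + P(S)\, p(y \mid \theta_S)}, \qquad (8)$$

where $P(N)$ and $P(S)$ are the prior probabilities of noise and speech presence, with the constraint that $P(N) + P(S) = 1$. The likelihood of $y$ given a set of parameters $\theta = \{\mu, \sigma^2\}$ is modeled by a univariate Gaussian density function:

$$p(y \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right). \qquad (9)$$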
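The posterior of Equation 8 with the Gaussian likelihood of Equation 9 is a direct application of Bayes' rule, and can be sketched as follows. The function names and all numeric parameter values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def noise_posterior(y, noise, speech, prior_noise):
    """P(noise | y); `noise` and `speech` are (mean, variance) pairs."""
    num = prior_noise * gaussian_pdf(y, *noise)
    den = num + (1.0 - prior_noise) * gaussian_pdf(y, *speech)
    return num / den


noise_params = (-1.0, 0.25)    # hypothetical noise Gaussian: narrow, low mean
speech_params = (1.0, 1.0)     # hypothetical speech Gaussian: broad, higher mean
p_low = noise_posterior(-1.0, noise_params, speech_params, prior_noise=0.5)
p_high = noise_posterior(2.0, noise_params, speech_params, prior_noise=0.5)
```

A value near the noise mean receives a posterior close to one, while a high-energy value is attributed almost entirely to speech.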

The Gaussian parameters are updated online by the following adaptation rules:

(10)

where a positive constant is step size, and are Gaussian parameters of the previous frame. is the computed adaptation rate for noise at frame and LSF. The same formula for speech can be derived by substituting with . The adaptation rules for speech parameters are:

(11)
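To make the idea of frame-by-frame adaptation concrete, the sketch below uses one common stochastic-approximation form of online mixture updating. It is an assumption for illustration and may differ from the rules of Equations 10 and 11: each Gaussian is moved toward the new observation with a rate equal to a fixed step size times its posterior probability. The names `online_update`, `step`, and `var_floor`, and the synthetic input, are hypothetical.

```python
import math


def gaussian_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def online_update(y, params, step=0.05, var_floor=1e-3):
    """Update params in place; params maps class name -> [mean, variance, prior]."""
    like = {k: p[2] * gaussian_pdf(y, p[0], p[1]) for k, p in params.items()}
    total = sum(like.values())
    for k, p in params.items():
        post = like[k] / total                   # posterior of class k
        rate = step * post                       # per-class adaptation rate
        delta = y - p[0]
        p[0] += rate * delta                     # mean
        p[1] = max((1.0 - rate) * p[1] + rate * delta * delta, var_floor)
        p[2] = (1.0 - step) * p[2] + step * post # prior; the priors keep summing to 1
    return params


params = {"noise": [-2.0, 1.0, 0.5], "speech": [2.0, 1.0, 0.5]}
for i in range(2000):
    # Synthetic input alternating between a tight low cluster and a looser high one.
    y = -1.0 + 0.1 * math.sin(i) if i % 2 == 0 else 1.0 + 0.3 * math.cos(i)
    online_update(y, params)
```

No frame is ever labeled as noise or speech; the soft posterior weights decide how much each Gaussian adapts, which is how a VAD can be avoided.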

From the mean of the noise Gaussian in Equation 10, , noise spectral magnitude at frame is approximated by

. (12)

A Wiener filter suppressing the noise estimate from the spectral magnitude of the mixture signal is derived by

(13)

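The gain is then floored so that it never falls below a given limit,

$$\tilde{W}_t(\omega) = \max\{W_t(\omega),\, W_{\min}\}, \qquad (14)$$

where $W_t(\omega)$ is the Wiener filter gain of Equation 13 and the nonnegative constant $W_{\min}$ is the minimum Wiener filter gain.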
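A floored suppression gain can be sketched as follows. The raw gain used here, $\max(1 - \hat{N}/S,\, 0)$ in the magnitude domain, is an assumption standing in for Equation 13, chosen because it is zero whenever the observed magnitude is below the noise estimate. The floor of -13 dB is the value quoted in the experiments. The function name `wiener_gain` is hypothetical.

```python
import math


def wiener_gain(y, noise_mean, floor_db=-13.0):
    """Gain for a natural-log magnitude y, given the noise Gaussian mean."""
    raw = max(1.0 - math.exp(noise_mean - y), 0.0)   # assumed magnitude-domain gain
    floor = 10.0 ** (floor_db / 20.0)                # -13 dB is about 0.224
    return max(raw, floor)


g_below = wiener_gain(-2.0, noise_mean=-1.0)   # below the noise mean: floored
g_above = wiener_gain(10.0, noise_mean=-1.0)   # far above the noise mean: nearly 1
```

The floor limits how much any band can be attenuated, trading residual noise for less speech distortion.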

IV. Experimental Results

4.1 Database

The proposed method is compared with conventional methods in terms of automatic speech recognition performance on the speech separation challenge (SSC) database ^[7]. The database is designed for assessing the effect of a noise suppression algorithm on a simple speech recognition task. Talkers say short sentences of exactly 6 words, in the format “command color preposition letter number adverb”, for example, “bin blue at F 2 now”. All of the original sound files are sampled at 25 kHz, and they are downsampled to 8 kHz since the EVRC and ETSI standards support 8 kHz only.

The database has a training set consisting of 17,000 utterances (500 x 34 talkers). All training sound files are recorded in a quiet environment, i.e., without any background noise. The HMM models are obtained with HTK (hidden Markov model toolkit) ^[8], as suggested by the coordinators of SSC. The adopted features are 12 MFCCs plus log energy, together with their velocities and accelerations, resulting in a 39-dimensional vector every 10 ms. A separate testing set of 600 utterances is also provided. There is no overlap between the training and the testing data. The original recordings do not contain environmental noise. Noisy data files are generated by adding speech-shaped noise (ssn), which is Gaussian random noise whose frequency response is shaped by the average spectrum of general speech signals. The simulated SNRs (signal-to-noise ratios) are clean (∞), 6, 0, -6, and -12 dB, resulting in 3,000 (600 x 5) test sound files.
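Mixing noise into clean speech at a prescribed SNR amounts to scaling the noise so that the power ratio matches the target. The following is a minimal sketch of that step, assuming SNR is defined over the whole utterance. The function name `mix_at_snr` and the toy signals are hypothetical, and this is not the tool used to build the test set.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scale): noise is scaled so speech/noise power equals snr_db."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    scale = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    mixture = [s + scale * n for s, n in zip(speech, noise)]
    return mixture, scale


speech = [math.sin(0.1 * n) for n in range(1000)]     # toy stand-in for speech
noise = [(-1.0) ** n for n in range(1000)]            # toy stand-in for noise
mixture, scale = mix_at_snr(speech, noise, snr_db=6.0)
```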

4.2 Implementation Details

The analysis settings of the proposed method are: sampling frequency 8 kHz, shift size 10 ms (80 samples), analysis frame length 20 ms (160 samples), and Hamming windowing in LPC analysis of order 10, which results in 10 LSFs at every frame. A pre-emphasis filter is used before the analysis to boost high-frequency energies. In re-synthesis, time-domain filters of order 48 are derived from the Wiener filters in Equation 14 and applied to the input frames, and the resulting frames are overlap-added using trapezoidal windows with 24-sample overlaps between neighboring frames. Among commercial standards, the noise suppression frontends in the EVRC ^[9] and ETSI ^[11,12] standards are compared with the proposed method. They support the 8 kHz sampling rate only, and use mel-warped filter bank energies for voice activity detection and for deriving noise estimates. The source codes of the two methods are publicly available from the distributors.
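The framing and LPC analysis settings above can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is an illustrative sketch rather than the authors' implementation: pre-emphasis is omitted because its coefficient is not given here, the gain is assumed to be the RMS of the prediction residual, and the input is synthetic noise rather than speech. All function names are hypothetical.

```python
import math
import random

FS, SHIFT, FRAME_LEN, ORDER = 8000, 80, 160, 10   # values quoted in the text


def split_frames(signal):
    """Overlapping 20 ms frames taken every 10 ms."""
    return [signal[s:s + FRAME_LEN]
            for s in range(0, len(signal) - FRAME_LEN + 1, SHIFT)]


def levinson(r, order):
    """Levinson-Durbin recursion: predictor coefficients a_1..a_p and residual energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def lpc_frame(frame, order=ORDER):
    """Hamming-windowed LPC analysis of one frame; returns (coefficients, gain)."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    x = [f * w for f, w in zip(frame, window)]
    r = [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(order + 1)]
    a, err = levinson(r, order)
    return a, math.sqrt(err / n)      # assumed gain: RMS of the prediction residual


rng = random.Random(0)
signal, prev = [], 0.0
for _ in range(FS):                   # one second of synthetic AR(1) noise
    prev = 0.9 * prev + rng.gauss(0.0, 1.0)
    signal.append(prev)
coeffs = [lpc_frame(f) for f in split_frames(signal)]
```

An all-zero frame would make the first autocorrelation value zero and must be handled separately in practice.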

4.3 Illustrations of Noise Suppression Results

Figure 3 illustrates the noise spectrum estimation procedure for a mixture of male speech and factory noise. The x-axis is the log spectral magnitude $y$ at LSF $\omega_i$ and frame $t$, and the y-axis shows both the normalized histogram and the Wiener filter gain. The distribution of $y$ is displayed by histogram bars, and the estimated Gaussian density functions are overlaid as solid curves. The Gaussian mixture model yields a noise mean and a speech mean of the log spectral magnitudes. The left curve is the noise Gaussian, $p(y \mid \theta_N)$, and the right is the speech Gaussian, $p(y \mid \theta_S)$; their mean values are indicated by wedged vertical lines. The Wiener filter obtained by Equation 13 is plotted as a thick dashed line. The Wiener filter gain (before flooring by Equation 14) is zero when $y$ is smaller than the noise Gaussian mean $\mu_N$. The SNR of the input mixture is approximated by the distance between the two Gaussian means, $\mu_S - \mu_N$.

Fig. 3. Noise spectral mean and Wiener filter estimation result by a mixture of Gaussian density functions, for an additive mixture of male speech and factory noise.

The distribution has a sharp peak around -1, which is well approximated by the noise Gaussian. A smaller peak is located around 0 and is approximated by the speech Gaussian. The spectral magnitude of the speech signal varies much more than that of the noise, so its peak location is not as distinct. The distribution in Figure 4, for male speech only, does not have a sharp peak, and the noise Gaussian mean is around -3 with a much larger variance. The estimated SNR, given by the difference between the speech and noise Gaussian means, is 2.67 in log magnitude, i.e., 23.2 dB, which implies that the input signal is almost clean. The noise mean estimate is shifted to the left by about 14 dB compared with Figure 3, while the speech Gaussian means in the two figures are very close.
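The conversion between log-magnitude differences and decibels can be checked numerically. The sketch below assumes that the log spectral magnitudes are natural logarithms of magnitude (not power), which is consistent with the quoted pair of values 2.67 and 23.2 dB. The function name is hypothetical.

```python
import math


def log_magnitude_to_db(delta):
    """Convert a natural-log magnitude difference to decibels: 20*log10(e**delta)."""
    return 20.0 * delta / math.log(10.0)


snr_db = log_magnitude_to_db(2.67)
```

Under this convention one unit on the log-magnitude axis corresponds to about 8.69 dB.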

Fig. 4. Noise spectral mean and Wiener filter estimation result by a mixture of Gaussian density functions, for male speech only.

4.4 Speech Recognition Performance Comparison

HMM models trained on the clean training dataset are used for the experiments. The ssn mixtures at various SNRs are processed by the EVRC standard, the ETSI standard, and the proposed method. Speech recognition rates are compared in Table 1, where the “Bypass” row gives the results without any processing. The proposed method significantly outperforms all the others at 6 dB and 0 dB, and is slightly worse than ETSI at -6 dB and -12 dB. One explanation is that the proposed method limits the minimum Wiener filter gain to -13 dB in order to keep the loss of intelligibility reasonable. ETSI has been developed for high speech recognition performance in adverse environments ^[11], so it is expected to perform well in harsh noise conditions. Although the proposed method does not have such features, its performance is not degraded in clean conditions.

Table 1. Comparison of speech recognition performances on the testing set with additive speech-shaped Gaussian noise (ssn).
Methods	clean	6 dB	0 dB	-6 dB	-12 dB
Bypass	97.6 %	60.7 %	27.1 %	12.8 %	11.2 %
EVRC	96.7 %	68.3 %	32.1 %	14.4 %	12.2 %
ETSI	96.9 %	76.4 %	43.4 %	21.6 %	14.1 %
Proposed	97.7 %	81.1 %	49.9 %	21.3 %	11.8 %

Since ssn is an artificial noise, a number of real noise cases are evaluated as well. From the AURORA2 database ^[12], 8 different noise sources (airport, babble, car, exhibition, restaurant, street, subway, and train) are chosen and added to the clean test files. The measured speech recognition rates are given in Table 2. The noise mixing levels of 12 dB, 6 dB, and 0 dB are used as column indexes. Clean condition results are the same as in Table 1, and negative SNRs are not considered since the speech recognition rates are too low to be meaningful. The row indexes are the noise suppression methods: bypass (no processing), EVRC standard, ETSI standard, and the proposed method. The last rows summarize the speech recognition rates averaged over the 8 noises. The top 2 highest values are boldfaced in each mixing SNR.

In terms of average recognition rates, the proposed method achieves higher rates than the other methods for the 6 dB and 0 dB SNR mixtures, by up to 7.7 percentage points. At 12 dB SNR, the improvement over ETSI is about 3 percentage points; EVRC is the best, but its margin over the proposed method is only 0.1 percentage points. The proposed method is within the top 2 in all 3 SNR conditions. EVRC works well at relatively higher SNRs (12 dB and 6 dB), and ETSI is better suited to the lower SNR case (0 dB), whereas the proposed method gives decent speech recognition performance at all noise levels. In terms of noise types, the proposed method significantly improves the performance with airport, car, street, and train noises, and performs about the same as ETSI and EVRC with babble and restaurant noises. EVRC is slightly better with exhibition and subway noises, but the difference is not large. In summary, the speech recognition results indicate that the proposed method is stable and compares favorably with the conventional methods across various noise types and noise levels.

Table 2. Comparison of speech recognition performances on AURORA2 database.
Noise	Methods	12 dB	6 dB	0 dB
airport	bypass	85.1 %	65.2 %	37.6 %
	EVRC	89.4 %	69.0 %	35.8 %
	ETSI	86.9 %	68.4 %	39.3 %
	Proposed	90.3 %	74.7 %	48.1 %
babble	bypass	81.4 %	56.6 %	28.6 %
	EVRC	88.4 %	65.1 %	34.3 %
	ETSI	85.0 %	63.2 %	34.6 %
	Proposed	87.6 %	68.5 %	39.2 %
car	bypass	79.4 %	54.4 %	22.3 %
	EVRC	88.3 %	60.8 %	24.6 %
	ETSI	85.0 %	61.1 %	30.8 %
	Proposed	89.1 %	72.8 %	42.4 %
exhibition	bypass	73.7 %	41.9 %	18.4 %
	EVRC	83.4 %	64.8 %	31.0 %
	ETSI	82.1 %	59.6 %	30.2 %
	Proposed	83.7 %	60.5 %	30.7 %
restaurant	bypass	83.3 %	58.3 %	29.1 %
	EVRC	87.9 %	65.5 %	33.6 %
	ETSI	85.7 %	63.9 %	35.4 %
	Proposed	86.0 %	67.9 %	37.3 %
street	bypass	77.1 %	52.8 %	26.4 %
	EVRC	85.3 %	61.2 %	29.9 %
	ETSI	80.8 %	58.9 %	30.5 %
	Proposed	85.1 %	65.0 %	37.7 %
subway	bypass	76.8 %	43.9 %	19.1 %
	EVRC	85.6 %	65.9 %	32.3 %
	ETSI	81.3 %	60.1 %	30.6 %
	Proposed	85.1 %	63.0 %	31.1 %
train	bypass	86.9 %	67.1 %	38.5 %
	EVRC	89.7 %	67.9 %	35.6 %
	ETSI	87.9 %	71.7 %	41.5 %
	Proposed	90.7 %	77.4 %	51.6 %
Average	bypass	80.5 %	55.0 %	27.5 %
	EVRC	87.3 %	65.0 %	32.1 %
	ETSI	84.3 %	63.4 %	34.1 %
	Proposed	87.2 %	68.7 %	39.8 %
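The “Average” rows of Table 2 can be reproduced from the per-noise rows. The short check below copies the Proposed and EVRC rows from the table and averages them per SNR; the variable names are hypothetical.

```python
# Recognition rates (%) at 12 dB, 6 dB, and 0 dB, copied from Table 2.
proposed = {
    "airport": (90.3, 74.7, 48.1), "babble": (87.6, 68.5, 39.2),
    "car": (89.1, 72.8, 42.4), "exhibition": (83.7, 60.5, 30.7),
    "restaurant": (86.0, 67.9, 37.3), "street": (85.1, 65.0, 37.7),
    "subway": (85.1, 63.0, 31.1), "train": (90.7, 77.4, 51.6),
}
evrc = {
    "airport": (89.4, 69.0, 35.8), "babble": (88.4, 65.1, 34.3),
    "car": (88.3, 60.8, 24.6), "exhibition": (83.4, 64.8, 31.0),
    "restaurant": (87.9, 65.5, 33.6), "street": (85.3, 61.2, 29.9),
    "subway": (85.6, 65.9, 32.3), "train": (89.7, 67.9, 35.6),
}


def column_means(rows):
    """Average each SNR column over the noise types."""
    return [sum(col) / len(col) for col in zip(*rows.values())]


avg_proposed = column_means(proposed)
avg_evrc = column_means(evrc)
```

The unrounded means agree with the tabulated averages to within rounding, so the quoted gaps between methods are based on one-decimal averages.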

V. Concluding Remarks

A novel method is proposed to reduce near-stationary acoustic noise added to a speech signal recorded by a single microphone. The noise spectral magnitude is estimated by the smaller mean of a two-component Gaussian mixture distribution for globally flattened LPC spectra at line spectral frequencies, and a Wiener filter suppressing the estimated noise is derived and applied to the input speech signal. The proposed method has several advantages.

Improved noise suppression performance: It suppresses additive noise with much less degradation of speech recognition performance than the conventional methods, as shown by automatic speech recognition experiments on simulated mixtures of clean speech signals and diverse kinds of real noises at various SNRs. The characteristics of LPC spectral envelopes at line spectral frequencies are actively exploited to estimate the noise spectral magnitude, and the need for voice activity detection is eliminated.

Computational efficiency: The ETSI and EVRC standards use fixed filter bank energies, and require many control parameters for reliable voice activity detection. The proposed algorithm does not use fixed filter banks; instead, variable filter banks are chosen according to the line spectral frequencies. At the 8 kHz sampling frequency, the number of filter banks is 23 for EVRC and 16 for ETSI, but the number of line spectral frequencies is only 10, which makes the proposed method much more computationally efficient.

Good match to voice coders: The algorithm is quite simple and requires only LPC coefficients, line spectral frequencies, and excitation gains, which are all available in LPC-based voice coders such as QCELP. When embedded in a voice coder, it can be implemented much more efficiently.

Acknowledgements

This research was supported by the Ministry of Knowledge Economy, Korea (2008-S-019-02, Development of Portable Korean-English Automatic Speech Translation Technology), and by the 2010 Research Fund of the UNIST (Ulsan National Institute of Science and Technology).

References

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979. doi:10.1109/TASSP.1979.1163209

[2] Mi Suk Lee, Hong Kook Kim, Seung Ho Choi, and Hwang Soo Lee, "On the use of LSF intermodel interlacing property for spectral quantization," in Proc. 1999 IEEE Workshop on Speech Coding, June 1999, pp. 43-45.

[3] Mi Suk Lee, Hong Kook Kim, and Hwang Soo Lee, "A new distortion measure for spectral quantization based on the LSF intermodel interlacing property," Speech Communication, vol. 35, no. 3-4, pp. 191-202, October 2001. doi:10.1016/S0167-6393(00)00080-7

[4] Peter Kabal and Ravi Prakash Ramachandran, "The computation of line spectral frequencies using Chebyshev polynomials," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 34, no. 6, pp. 1419-1426, December 1986. doi:10.1109/TASSP.1986.1164983

[5] Ahmet M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, John Wiley & Sons, Inc., New York, USA, 1994.

[6] Tom Bäckström and Carlo Magi, "Properties of line spectrum pair polynomials -- a review," Signal Processing, vol. 86, pp. 3286-3298, 2006. doi:10.1016/j.sigpro.2006.01.010

[7] Martin Cooke and Te-Won Lee, "Speech separation challenge," INTERSPEECH, 2006.

[8] Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xun-ying (Andrew) Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland, "Hidden Markov model toolkit (HTK) version 3.4," December 2006.

[9] Telecommunications Industry Association (TIA), "Enhanced variable rate codec, speech service option 3 for wideband spread spectrum digital systems," Tech. Rep., TIA/EIA/IS-127, September 1996.

[10] 3rd Generation Partnership Project, "AMR speech codec," Tech. Rep., 3GPP TS 26.071, V6.0.0, December 2004.

[11] European Telecommunications Standards Institute (ETSI), "Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithm," Tech. Rep., ES 202 050 v1.1.1, October 2002.

[12] David Pearce and Hans-Günter Hirsch, "The AURORA experimental framework for the performance evaluations of speech recognition systems under noisy condition," in Proc. ICSLP, Beijing, China, October 2000.

The Journal of the Acoustical Society of Korea, ISSN: 1225-4428 (Print), 2287-3775 (Online). Acoustical Society of Korea (한국음향학회)
