
I. Introduction

The assumption of spectral subtraction [1] is that the noise is additive and changes slowly over time, so that the noise spectrum can be approximated by an average spectrum taken over non-speech periods. The error in estimating the true noise spectrum directly results in either speech attenuation or insufficient noise suppression, hence the performance is closely tied to how reliable the noise spectrum estimates are. Most conventional methods rely on detecting whether the instantaneous input frame contains speech, using a voice activity detector (VAD), and update the noise estimates only when background noise alone is present. However, the performance of the VAD varies considerably across noise conditions.

This paper proposes a novel procedure for noise spectral magnitude estimation which also eliminates the use of a VAD in a very efficient manner. From the basic LSF derivation formulae it is observed that the local maxima of LPC spectra lie near closely spaced LSFs [2,3], while relatively flattened spectral valleys lie around isolated LSFs. In the proposed method the spectral magnitudes at the LSFs are taken as representatives of the peaks and valleys of the corresponding LPC spectra, and participate in estimating the noise spectral magnitude. Without any decision on whether the current analysis frame contains noise only, the distribution of the log spectral magnitudes at the LSFs is modeled by a mixture of two Gaussian probability density functions. The Gaussian with the smaller mean is taken as the noise distribution, and its mean is adopted as the noise spectral estimate. An online adaptation algorithm for the Gaussian parameters is also proposed so that the method can handle real-time inputs; the noise Gaussian mean is updated at every time frame. A time-domain Wiener filter suppressing the estimated noise spectral magnitude is computed for every time frame and frequency band, and applied to the input speech signal. The required parameters are the LPC coefficients, the LSFs, and the excitation gains, all of which are available in most LP vocoders. Therefore the proposed method can be easily integrated into LP vocoders with much less additional overhead than the other conventional noise suppression methods.

To assess the validity of the proposed method, automatic speech recognition experiments are carried out on the speech separation challenge database. The results show significant improvements in speech recognition rates, with less speech distortion, when compared to the ETSI frontend and TIA's EVRC standard noise suppression.

II. Noise Spectrum Estimation

The proposed method makes use of the properties of LPC analysis. The input speech signal is decomposed into a spectral envelope and an excitation signal, such that

s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + g\, e(n),  (1)

where n is the sample index, s(n) is the sampled input speech, a_i are the prediction filter coefficients of order p, e(n) is the excitation signal, and g is a scalar gain chosen so that e(n) has unit variance. Equation (1) is equivalently expressed in the frequency domain as

S(z) = \frac{g\, E(z)}{A(z)}, \quad A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i},  (2)

where S(z) and E(z) are the z-transforms of s(n) and e(n). E(z) is spectrally flattened, so 1/A(z) should contain most of the spectral envelope of the given input speech frame. For transmission purposes, A(z) is expressed by the two reciprocal polynomials [4]:

P(z) = A(z) + z^{-(p+1)} A(z^{-1}), \quad Q(z) = A(z) - z^{-(p+1)} A(z^{-1}), \quad A(z) = \frac{P(z) + Q(z)}{2}  (3)

The roots of these two auxiliary polynomials are called line spectral frequencies (LSFs); they lie on the unit circle and are known to be the most efficient representation for coding LPC coefficients owing to their guaranteed stability and low sensitivity to quantization error [5,6].
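As a concrete illustration of Eqs. (1) and (2), the LPC decomposition can be sketched with the autocorrelation method. This is a minimal NumPy Levinson-Durbin implementation for illustration only, not the paper's code; the gain convention (rms of the prediction residual) is an assumption:

```python
import numpy as np

def levinson(x, p=10):
    """Order-p LPC by the autocorrelation method (Levinson-Durbin).

    Returns (A, g): A = [1, A_1, ..., A_p] are the coefficients of the
    inverse filter A(z) = 1 + sum_j A_j z^-j, so the predictor
    coefficients of Eq. (1) are a_j = -A_j; g is the excitation gain,
    taken here as the rms of the prediction residual (an assumed
    convention)."""
    n = len(x)
    # Biased autocorrelation r[0..p]
    r = np.array([x[: n - k] @ x[k:] for k in range(p + 1)], dtype=float)
    A = np.zeros(p + 1)
    A[0] = 1.0
    E = r[0]                      # prediction error energy
    for i in range(1, p + 1):
        acc = r[i] + A[1:i] @ r[i - 1:0:-1]
        k = -acc / E              # reflection coefficient
        # A_j <- A_j + k * A_{i-j} for j = 1..i (A_i becomes k)
        A[1:i + 1] = A[1:i + 1] + k * A[i - 1::-1][:i]
        E *= 1.0 - k * k
    g = np.sqrt(E / n)
    return A, g
```

Fitting an order-2 model to a synthetic AR(2) signal recovers the generating coefficients, which is a quick sanity check on the sign conventions above.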

At the LSFs either P(e^{jω}) or Q(e^{jω}) is zero, so |A(e^{jω})| is close to a local minimum, since |P(e^{jω})| and |Q(e^{jω})| are monotonic between any pair of neighboring LSFs. Figure 1 illustrates the behavior of the LPC spectrum at the LSFs. The two dotted lines in the figure are the frequency responses of P(z) and Q(z), the black solid line is the magnitude of the LP filter response 1/|A(e^{jω})|, and the lightly colored line is the approximated spectral envelope. A pre-emphasis filter has been applied to boost high-frequency energies. Downward and upward triangles mark the root frequencies of P(z) and Q(z), respectively.


Fig. 1. Properties of LPC spectrum at line spectral frequencies.

As adjacent LSFs become closer, for example around 500 Hz in Figure 1, |A(e^{jω})| decreases and the envelope becomes more resonant around those frequencies [2,3]. However, when a single LSF is isolated, far from its neighbors, |P(e^{jω})| and |Q(e^{jω})| change slowly, so the envelope is relatively flat there. Therefore the LPC spectral magnitudes at the LSFs represent either the speech spectral magnitude at formant frequencies (between closely spaced LSFs) or the background noise spectral magnitude (at isolated LSFs). These properties are implicitly exploited by the proposed noise suppression procedure.
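The construction in Equation (3) can be sketched numerically: build the reciprocal polynomials from the inverse-filter coefficients and take the angles of their unit-circle roots. This sketch uses generic root finding via np.roots; a production coder would instead use the Chebyshev-polynomial search of [4]:

```python
import numpy as np

def lsf_from_lpc(A):
    """Line spectral frequencies from inverse-filter coefficients
    A = [1, A_1, ..., A_p] of A(z) = 1 + sum_j A_j z^-j.

    Builds P(z) = A(z) + z^-(p+1) A(1/z) and
           Q(z) = A(z) - z^-(p+1) A(1/z)   (Eq. (3))
    and returns the angles of their unit-circle roots in (0, pi),
    sorted. The trivial roots at z = +/-1 (and conjugate duplicates,
    plus any LSF pathologically close to 0 or pi) are dropped."""
    a = np.concatenate([A, [0.0]])   # coefficients of z^(p+1) * A(z), descending powers
    P = a + a[::-1]                  # symmetric polynomial
    Q = a - a[::-1]                  # antisymmetric polynomial
    w = []
    for poly in (P, Q):
        ang = np.angle(np.roots(poly))
        w.extend(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])
    return np.sort(np.array(w))
```

For a stable order-p filter this yields p interlaced frequencies, matching the p LSFs referred to in the text.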

By the definition of the discrete Fourier transform, the frequency response of 1/A(z) at frequency ω is expressed by

\frac{1}{A(e^{j\omega})} = \frac{1}{1 - \sum_{i=1}^{p} a_i e^{-j\omega i}},  (4)

where ω is the angular frequency in radians. The smoothed spectral magnitude at the k-th LSF ω_k of frame t, denoted by Y_t(k), is approximated by the product of the LPC spectral magnitude and the frame gain, expressed in the log domain by

Y_t(k) = \log g_t + \log \left| \frac{1}{A_t(e^{j\omega_k})} \right|,  (5)

where g_t is the gain at frame t defined in Equation (2), representing the relative magnitude differences across analysis frames. To model the global frequency characteristics of the input sounds, the long-term average of the log spectral magnitude, denoted by \bar{Y}_t(ω), is approximated by the long-term average frequency response of the LPC filters and updated instantaneously by

\bar{Y}_t(\omega) = (1 - \lambda)\, \bar{Y}_{t-1}(\omega) + \lambda \log \left| \frac{1}{A_t(e^{j\omega})} \right|,  (6)

with a fixed initial value \bar{Y}_0(ω). The chosen adaptation rate λ gives good performance in our experiments. The LP spectral envelope is then normalized by the long-term average, and its log is approximated by the following equation:

\hat{Y}_t(k) = Y_t(k) - \bar{Y}_t(\omega_k)  (7)

By using \hat{Y}_t(k) instead of Y_t(k), the global spectral shape of the noise is disregarded, and a single noise estimate can be used regardless of frequency.
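Equations (4)-(7) amount to evaluating the log LPC envelope at the LSFs, adding the log frame gain, and subtracting a running long-term average. The following sketch makes that concrete; the frequency grid, nearest-bin lookup, and the numeric value of λ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def log_mag_at_lsf(A, g, w):
    """Eqs. (4)-(5): log smoothed spectral magnitude at LSFs w,
    log(g) + log|1/A(e^{jw})|, for inverse filter A = [1, A_1, ..., A_p]."""
    p = len(A) - 1
    # A(e^{jw}) evaluated as a sum of complex exponentials
    E = np.exp(-1j * np.outer(w, np.arange(p + 1)))
    return np.log(g) - np.log(np.abs(E @ A))

class LongTermWhitener:
    """Eqs. (6)-(7): running long-term average of the log LPC envelope
    on a fixed frequency grid, used to whiten per-frame values.
    Grid size and adaptation rate lam are assumed constants."""
    def __init__(self, n_bins=128, lam=0.01):
        self.grid = np.linspace(0.0, np.pi, n_bins)
        self.avg = np.zeros(n_bins)
        self.lam = lam

    def update_and_whiten(self, A, g, w):
        p = len(A) - 1
        Eg = np.exp(-1j * np.outer(self.grid, np.arange(p + 1)))
        env = -np.log(np.abs(Eg @ A))          # log|1/A| on the grid, Eq. (6) input
        self.avg = (1 - self.lam) * self.avg + self.lam * env
        y = log_mag_at_lsf(A, g, w)
        bins = np.clip(np.searchsorted(self.grid, w), 0, len(self.grid) - 1)
        return y - self.avg[bins]              # Eq. (7)
```

With a flat filter (A = [1]) the envelope term vanishes, so the whitened value reduces to the log frame gain, which is a convenient sanity check.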

The distribution of the log spectral magnitudes at the LSFs is shown in Figure 2. The x-axis shows quantized histogram intervals of the log spectral magnitude, and the y-axis is the number of frames whose value falls in each interval. The sources are male speech and factory noise from the Signal Processing Information Base (SPIB) database, available at http://spib.rice.edu/. Since the spectral energy of the factory noise is nearly stationary over time, there is a significant peak between -2 and -1 on the x-axis. The speech spectral magnitude is relatively scattered and varies widely. The mixed distribution, shown by the lightly colored bars, has a peak around that of the factory noise, and the proportion of small-energy components is greatly reduced. This is because the noise spectra, being consistent over time, conceal the tiny spectral magnitudes of the speech signal. In the high-energy region (greater than -1.5 in Figure 2), where only speech is present, the spectral energy distribution of the speech signal dominates the mixed distribution.


Fig. 2. Distribution of log of LPC spectrum at LSFs multiplied by excitation gain.

III. Proposed Noise Suppression Method

On the globally whitened log spectral magnitudes, a mixture of two Gaussian probability density functions is used to approximate the mean spectral magnitude of the noise. For each LSF ω_k, the substitution y = \hat{Y}_t(k) is used for compact notation. Denoting by θ_N the parameter set of the noise Gaussian and by θ_S that of the speech Gaussian, the posterior probability of y belonging to the noise Gaussian is expressed by

p(N \mid y) = \frac{P_N\, p(y \mid \theta_N)}{P_N\, p(y \mid \theta_N) + P_S\, p(y \mid \theta_S)},  (8)

where P_N and P_S are the prior probabilities of noise and speech presence, with the constraint P_N + P_S = 1. The likelihood of y given a parameter set θ = (μ, σ²) is modeled by a univariate Gaussian density function:

p(y \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y - \mu)^2}{2\sigma^2} \right).  (9)

The Gaussian parameters are updated online by the following adaptation rules:

\rho_t(k) = \eta\, p(N \mid y), \quad \mu_N \leftarrow \mu_N + \rho_t(k)\,(y - \mu_N), \quad \sigma_N^2 \leftarrow \sigma_N^2 + \rho_t(k)\left( (y - \mu_N)^2 - \sigma_N^2 \right)  (10)

where the positive constant η is a step size, and μ_N and σ_N² are the Gaussian parameters of the previous frame. ρ_t(k) is the computed adaptation rate for noise at frame t and the k-th LSF. The same formulae for speech are derived by substituting p(N | y) with p(S | y) = 1 - p(N | y). The adaptation rules for the speech parameters are:
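The updates of Eqs. (8)-(11) can be sketched as an online two-component mixture maintained per LSF channel. The exact update form, the step size, the initialization, and the variance floor below are assumptions made for illustration, not the paper's settings:

```python
import numpy as np

class DualGaussianTracker:
    """Online two-Gaussian mixture for one LSF channel (Eqs. (8)-(11)).
    Component 0 tracks noise, component 1 tracks speech; updates are
    posterior-weighted stochastic steps (an assumed update form)."""
    def __init__(self, mu_n=-1.0, mu_s=1.0, var=1.0, eta=0.05):
        self.mu = np.array([mu_n, mu_s])     # [noise, speech] means
        self.var = np.array([var, var])
        self.prior = np.array([0.5, 0.5])    # P_N, P_S with P_N + P_S = 1
        self.eta = eta                       # step size (assumed value)

    def _lik(self, y):
        # Eq. (9): univariate Gaussian densities for both components
        return np.exp(-(y - self.mu) ** 2 / (2 * self.var)) \
            / np.sqrt(2 * np.pi * self.var)

    def update(self, y):
        # Eq. (8): posterior of each component given y
        post = self.prior * self._lik(y)
        post /= post.sum()
        rho = self.eta * post                # per-component adaptation rate
        # Eqs. (10)-(11): posterior-weighted mean/variance updates
        self.mu += rho * (y - self.mu)
        self.var += rho * ((y - self.mu) ** 2 - self.var)
        self.var = np.maximum(self.var, 1e-4)   # variance floor (assumed)
        self.prior += self.eta * (post - self.prior)
        return post[0]                        # p(N | y)
```

Fed with samples from two well-separated clusters, the two means converge to the cluster centers, mirroring how the smaller mean ends up tracking the stationary noise peak in Figure 2.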

\rho'_t(k) = \eta\, p(S \mid y), \quad \mu_S \leftarrow \mu_S + \rho'_t(k)\,(y - \mu_S), \quad \sigma_S^2 \leftarrow \sigma_S^2 + \rho'_t(k)\left( (y - \mu_S)^2 - \sigma_S^2 \right)  (11)

From the mean μ_N of the noise Gaussian in Equation (10), the noise spectral magnitude at frame t is approximated by

|\hat{N}_t(\omega)| = \exp\!\left( \mu_N + \bar{Y}_t(\omega) \right).  (12)

A Wiener filter suppressing the noise estimate from the spectral magnitude of the mixture signal is derived as

H_t(\omega) = \frac{|Y_t(\omega)| - |\hat{N}_t(\omega)|}{|Y_t(\omega)|},  (13)

It is then floored so that it always stays above a certain limit,

\tilde{H}_t(\omega) = \max\!\left( H_t(\omega),\, H_{\min} \right),  (14)

where the nonnegative constant H_min is the minimum Wiener filter gain.
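Equations (13)-(14) reduce to a magnitude-subtraction gain with a floor. A minimal sketch, using the -13 dB floor mentioned in Section 4.4 (about 0.224 in linear scale) as the default; the clamping of negative gains to zero is an assumed detail:

```python
import numpy as np

def wiener_gain(mix_mag, noise_mag, h_min=0.224):
    """Eqs. (13)-(14): spectral-subtraction-style Wiener gain with a floor.

    mix_mag:   spectral magnitude of the input mixture, |Y_t(w)|
    noise_mag: estimated noise spectral magnitude, |N_t(w)|
    h_min:     minimum gain; 0.224 ~ -13 dB (assumed from Sec. 4.4)
    """
    # Negative gains (noise estimate above the mixture) clamp to zero
    h = np.maximum(mix_mag - noise_mag, 0.0) / np.maximum(mix_mag, 1e-12)
    return np.maximum(h, h_min)   # Eq. (14) floor
```

The floor keeps some residual noise audible rather than introducing musical-noise artifacts from hard-zeroed bands.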

IV. Experimental Results

4.1 Database

The proposed method is compared with the conventional methods via automatic speech recognition performance on the speech separation challenge (SSC) database [7]. The database is designed for assessing the effect of a noise suppression algorithm on a simple speech recognition task. Talkers say short sentences of exactly 6 words in the format "command color preposition letter number adverb", for example, "bin blue at F 2 now". All of the original sound files are sampled at 25 kHz; they are downsampled to 8 kHz since the EVRC and ETSI standards support 8 kHz only.

The database has a training set of 17,000 utterances (500 utterances x 34 talkers). All training sound files are recorded in a quiet environment, i.e., without any background noise. The HMM models are obtained with HTK (hidden Markov model toolkit) [8] as suggested by the coordinators of the SSC. The adopted feature is 12 MFCCs plus log energy, together with their velocities and accelerations, resulting in a 39-dimensional vector every 10 ms. A separate testing set of 600 utterances is also provided; there is no overlap between the training and testing data. The original recordings do not contain environmental noise. Noisy data files are generated by adding speech-shaped noise (ssn): Gaussian random noise whose frequency response is shaped by the average spectrum of general speech signals. The simulated SNRs (signal-to-noise ratios) are clean (∞), 6, 0, -6, and -12 dB, resulting in 3,000 (600 x 5) test sound files.
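The noisy test files are described as clean speech plus noise scaled to a target SNR. A sketch of the mixing convention (the power-ratio definition used here is an assumption, since the paper does not spell it out):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db,
    then add it to `speech`. Both inputs are 1-D sample arrays of
    equal length."""
    ps = np.mean(speech ** 2)          # speech power
    pn = np.mean(noise ** 2)           # noise power before scaling
    scale = np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

By construction, the power ratio of the two added components matches the requested SNR exactly.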

4.2 Implementation Details

The analysis settings of the proposed method are: sampling frequency 8 kHz, shift size 10 ms (80 samples), analysis frame length 20 ms (160 samples), and Hamming windowing with LPC analysis of order 10, which yields 10 LSFs per frame. A pre-emphasis filter is applied before the analysis to boost high-frequency energies. In re-synthesis, time-domain filters of order 48 are derived from the Wiener filters in Equation (14) and applied to the input frames, and the resulting frames are overlap-added using trapezoidal windows with 24-sample overlaps between neighboring frames. Among commercial standards, the noise suppression frontends of the EVRC [9] and ETSI [11,12] standards are compared with the proposed method. They support the 8 kHz sampling rate only, and mel-warped filter bank energies are used for voice activity detection and for deriving noise estimates. The source code of both methods is publicly available from the distributors.
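One way to realize "time-domain filters of order 48 derived from the Wiener filters" is a windowed-IFFT (frequency-sampling) design. The paper does not specify its design method, so the following is an assumed sketch:

```python
import numpy as np

def fir_from_gains(gains, order=48):
    """Derive an (order+1)-tap time-domain filter from real, nonnegative
    Wiener gains sampled uniformly on [0, pi] (length n_fft/2 + 1).

    Windowed-IFFT design: zero-phase impulse response via irfft, shifted
    to be causal, truncated, and tapered. This design choice is an
    assumption; Sec. 4.2 only states the filter order (48)."""
    h = np.fft.irfft(gains)            # even-symmetric (zero-phase) response
    h = np.roll(h, order // 2)         # shift: causal, linear phase
    h = h[: order + 1] * np.hamming(order + 1)   # truncate and taper
    return h
```

A unity gain curve should come back as a pure delay of order/2 samples, which the test below checks; real gain curves then just shape this response.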

4.3 Illustrations of Noise Suppression Results

Figure 3 illustrates the noise spectrum estimation procedure for a mixture of male speech and factory noise. The x-axis is the log spectral magnitude \hat{Y}_t(k) at the k-th LSF and frame t, and the y-axis shows both the normalized histogram and the Wiener filter gain. The distribution of \hat{Y}_t(k) is displayed by histogram bars, and the estimated Gaussian density functions are overdrawn on them as solid curves. The Gaussian mixture model yields a noise mean and a speech mean of the log spectral magnitudes: the left Gaussian is the noise Gaussian θ_N, the right one is the speech Gaussian θ_S, and their mean values are indicated by wedged vertical lines. The Wiener filter obtained by Equation (13) is plotted as a thick dashed line. The Wiener filter gain (before flooring by Equation (14)) is zero when \hat{Y}_t(k) is smaller than the noise Gaussian mean μ_N. The SNR of the input mixture is approximated by the distance between the two Gaussian means, μ_S - μ_N.


Fig. 3. Noise spectral mean and Wiener filter estimation result by mixture of Gaussian density functions, for an additive mixture of male speech and factory noise.

The distribution has a sharp peak around -1, which is well approximated by the noise Gaussian. A smaller peak is located around 0 and is approximated by the speech Gaussian. The spectral magnitude of the speech signal varies much more than that of the noise, so its peak location is not as distinct as that of the noise. The distribution in Figure 4, male speech only, does not have a sharp peak, and the noise Gaussian mean is around -3 with a much bigger variance. The estimated SNR is μ_S - μ_N = 2.67, i.e., 23.2 dB, which implies that the input signal is almost clean. The noise mean estimate is shifted to the left by about 14 dB when compared to Figure 3, while the speech Gaussian means of both figures are very close.
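The quoted equivalence μ_S - μ_N = 2.67 = 23.2 dB is consistent if the log magnitudes are natural logarithms, since a difference d of natural-log magnitudes corresponds to 20·log10(e^d) dB:

```python
import math

def nat_log_diff_to_db(d):
    """Convert a difference of natural-log magnitudes (e.g. mu_S - mu_N)
    to decibels: 20 * log10(exp(d)) = 20 * d / ln(10)."""
    return 20.0 * d / math.log(10.0)
```

Plugging in d = 2.67 gives approximately 23.2 dB, matching the figure.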


Fig. 4. Noise spectral mean and Wiener filter estimation result by mixture of Gaussian density functions for male speech only.

4.4 Speech Recognition Performance Comparison

HMM models trained on the clean training dataset are used for the experiments. The ssn mixtures at various SNRs are processed by the EVRC standard, the ETSI standard, and the proposed method. Speech recognition rates are compared in Table 1, where the "bypass" row gives the results without any processing. The proposed method significantly outperforms all the others at 6 dB and 0 dB, and is slightly worse than ETSI in the -6 dB and -12 dB conditions. One explanation is that the proposed method limits the minimum Wiener filter gain to -13 dB to keep the intelligibility loss reasonable. ETSI was developed for high speech recognition performance in adverse environments [11], so it is expected to perform well in harsh noise conditions. Although the proposed method does not have such features, its performance is not degraded in the clean condition.

Table 1. Comparison of speech recognition performances on the testing set with additive speech-shaped Gaussian noise (ssn).

Methods    clean    6 dB     0 dB     -6 dB    -12 dB
Bypass     97.6 %   60.7 %   27.1 %   12.8 %   11.2 %
EVRC       96.7 %   68.3 %   32.1 %   14.4 %   12.2 %
ETSI       96.9 %   76.4 %   43.4 %   21.6 %   14.1 %
Proposed   97.7 %   81.1 %   49.9 %   21.3 %   11.8 %

Since ssn is an artificial noise, a number of real noise cases are evaluated as well. From the AURORA2 database [12], 8 different noise sources (airport, babble, car, exhibition, restaurant, street, subway, and train) are chosen and added to the clean test files. The measured speech recognition rates are in Table 2. The noise mixing levels, 12 dB, 6 dB, and 0 dB, are used as column indexes. Clean-condition results are the same as in Table 1, and negative SNRs are not considered since the recognition rates there are too low to be meaningful. The row indexes are the noise suppression methods: bypass (no processing), the EVRC standard, the ETSI standard, and the proposed method. The last row group summarizes the recognition rates averaged over the 8 noises. The top 2 values are boldfaced for each mixing SNR.

In terms of average recognition rates, the proposed method achieves higher rates than all the other methods for the 6 dB and 0 dB SNR mixtures, by up to 7.7 %. At 12 dB SNR the improvement over ETSI is about 3 %~4 %, and EVRC is the best, but its margin over the proposed method is only 0.1 %. The proposed method is always within the top 2 in all 3 SNR conditions. EVRC works well at relatively higher SNRs (12 dB and 6 dB), and ETSI is better suited to lower-SNR cases (0 dB); the proposed method, however, delivers consistently good recognition performance at all noise levels. In terms of noise types, the proposed method significantly improves the performance with airport, car, street, and train noises; it performs about the same as ETSI and EVRC with babble and restaurant noises; and EVRC is slightly better with exhibition and subway noises, though the difference is small. In summary, the speech recognition results show that the proposed method is quite stable and much better than the conventional methods across noise types and noise levels.

Table 2. Comparison of speech recognition performances on the AURORA2 database.

Noise        Methods    12 dB    6 dB     0 dB
airport      bypass     85.1 %   65.2 %   37.6 %
             EVRC       89.4 %   69.0 %   35.8 %
             ETSI       86.9 %   68.4 %   39.3 %
             Proposed   90.3 %   74.7 %   48.1 %
babble       bypass     81.4 %   56.6 %   28.6 %
             EVRC       88.4 %   65.1 %   34.3 %
             ETSI       85.0 %   63.2 %   34.6 %
             Proposed   87.6 %   68.5 %   39.2 %
car          bypass     79.4 %   54.4 %   22.3 %
             EVRC       88.3 %   60.8 %   24.6 %
             ETSI       85.0 %   61.1 %   30.8 %
             Proposed   89.1 %   72.8 %   42.4 %
exhibition   bypass     73.7 %   41.9 %   18.4 %
             EVRC       83.4 %   64.8 %   31.0 %
             ETSI       82.1 %   59.6 %   30.2 %
             Proposed   83.7 %   60.5 %   30.7 %
restaurant   bypass     83.3 %   58.3 %   29.1 %
             EVRC       87.9 %   65.5 %   33.6 %
             ETSI       85.7 %   63.9 %   35.4 %
             Proposed   86.0 %   67.9 %   37.3 %
street       bypass     77.1 %   52.8 %   26.4 %
             EVRC       85.3 %   61.2 %   29.9 %
             ETSI       80.8 %   58.9 %   30.5 %
             Proposed   85.1 %   65.0 %   37.7 %
subway       bypass     76.8 %   43.9 %   19.1 %
             EVRC       85.6 %   65.9 %   32.3 %
             ETSI       81.3 %   60.1 %   30.6 %
             Proposed   85.1 %   63.0 %   31.1 %
train        bypass     86.9 %   67.1 %   38.5 %
             EVRC       89.7 %   67.9 %   35.6 %
             ETSI       87.9 %   71.7 %   41.5 %
             Proposed   90.7 %   77.4 %   51.6 %
Average      bypass     80.5 %   55.0 %   27.5 %
             EVRC       87.3 %   65.0 %   32.1 %
             ETSI       84.3 %   63.4 %   34.1 %
             Proposed   87.2 %   68.7 %   39.8 %

V. Concluding Remarks

A novel method is proposed to reduce near-stationary acoustic noise added to a speech signal recorded by a single microphone. The noise spectral magnitude is estimated as the smaller mean of a two-Gaussian mixture distribution fitted to the globally flattened LPC spectra at the line spectral frequencies, and a Wiener filter suppressing the estimated noise is derived and applied to the input speech signal. The proposed method has several advantages.

Improved noise suppression performance: It suppresses additive noise with much less speech recognition performance degradation than the conventional methods, as demonstrated by automatic speech recognition experiments on simulated mixtures of clean speech and diverse kinds of real noise at various SNRs. The characteristics of the LPC spectral envelope at the line spectral frequencies are actively exploited to estimate the noise spectral magnitude, and the need for voice activity detection is eliminated.

Computational efficiency: The ETSI and EVRC standards use fixed filter bank energies and many control parameters for reliable voice activity detection. The proposed algorithm does not use fixed filter banks; instead, variable bands are chosen according to the line spectral frequencies. At the 8 kHz sampling frequency, the number of filter banks is 23 for EVRC and 16 for ETSI, whereas the number of line spectral frequencies is only 10, which makes the proposed method much more computationally efficient.

Good match to voice coders: The algorithm is quite simple and requires only LPC coefficients, line spectral frequencies, and excitation gains, all of which are available in LPC-based voice coders such as QCELP. Embedded within a voice coder, it can therefore be implemented very efficiently.

Acknowledgements

This research was supported by the Ministry of Knowledge Economy, Korea (2008-S-019-02, Development of Portable Korean-English Automatic Speech Translation Technology), and by the 2010 Research Fund of the UNIST (Ulsan National Institute of Science and Technology).

References

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979. doi:10.1109/TASSP.1979.1163209
[2] M. S. Lee, H. K. Kim, S. H. Choi, and H. S. Lee, "On the use of LSF intermodel interlacing property for spectral quantization," in Proc. 1999 IEEE Workshop on Speech Coding, June 1999, pp. 43-45.
[3] M. S. Lee, H. K. Kim, and H. S. Lee, "A new distortion measure for spectral quantization based on the LSF intermodel interlacing property," Speech Communication, vol. 35, no. 3-4, pp. 191-202, October 2001. doi:10.1016/S0167-6393(00)00080-7
[4] P. Kabal and R. P. Ramachandran, "The computation of line spectral frequencies using Chebyshev polynomials," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 34, no. 6, pp. 1419-1426, December 1986. doi:10.1109/TASSP.1986.1164983
[5] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, John Wiley & Sons, New York, USA, 1994.
[6] T. Bäckström and C. Magi, "Properties of line spectrum pair polynomials -- a review," Signal Processing, vol. 86, pp. 3286-3298, 2006. doi:10.1016/j.sigpro.2006.01.010
[7] M. Cooke and T.-W. Lee, "Speech separation challenge," INTERSPEECH, 2006.
[8] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, "Hidden Markov model toolkit (HTK) version 3.4," December 2006.
[9] Telecommunications Industry Association (TIA), "Enhanced variable rate codec, speech service option 3 for wideband spread spectrum digital systems," Tech. Rep. TIA/EIA/IS-127, September 1996.
[10] 3rd Generation Partnership Project, "AMR speech codec," Tech. Rep. 3GPP TS 26.071, V6.0.0, December 2004.
[11] European Telecommunications Standards Institute (ETSI), "Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithm," Tech. Rep. ES 202 050 v1.1.1, October 2002.
[12] D. Pearce and H.-G. Hirsch, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Proc. ICSLP, Beijing, China, October 2000.