I. Introduction
II. Noise Spectrum Estimation
III. Proposed Noise Suppression Method
IV. Experimental Results
4.1 Database
4.2 Implementation Details
4.3 Illustrations of Noise Suppression Results
4.4 Speech Recognition Performance Comparison
V. Concluding Remarks
I. Introduction
Spectral subtraction [1] assumes that the noise is additive and changes slowly over time, so that the noise spectrum can be approximated by an average spectrum over non-voice periods. Errors in estimating the true noise spectrum lead directly to either voice attenuation or insufficient noise suppression, hence the performance is closely related to how reliable the noise spectrum estimates are. Most conventional methods rely on a voice activity detector (VAD), which decides whether the instantaneous input frame contains speech and enables updating the noise estimates only when background noise alone is present. However, the performance of the VAD varies considerably across noise conditions.
This paper proposes a novel procedure for noise spectral magnitude estimation that eliminates the use of a VAD in a very efficient manner. From the basic LSF derivation formulae it is observed that the local maxima of LPC spectra are near adjacently located LSFs [2,3], and relatively flat valleys across frequency are around isolated LSFs. In the proposed method the spectral magnitudes at LSFs are considered as representatives of the peaks and valleys of the corresponding LPC spectra, and participate in estimating the noise spectral magnitude. Without determining whether the current analysis frame contains noise only, the distribution of the log spectral magnitudes at LSFs is modeled by a mixture of two Gaussian probability density functions. The Gaussian with the smaller mean is taken as the noise distribution, and its mean is adopted as the noise spectral estimate. An online adaptation algorithm for the parameters of the Gaussian distributions is also proposed so that it can handle real-time inputs; the noise Gaussian mean is updated at every time frame. A time-domain Wiener filter suppressing the estimated amount of noise spectral magnitude is computed for every time frame and frequency band, and applied to the input speech signal. The required parameters are LPC coefficients, LSFs, and excitation gains, which are all available in most LP vocoders. Therefore the proposed method can be integrated into LP vocoders with much less additional overhead than other conventional noise suppression methods.
To assess the validity of the proposed method, automatic speech recognition experiments are carried out on the speech separation challenge database. Results show significant improvement in speech recognition rates with relatively less speech distortion when compared to the ETSI front end and TIA’s EVRC standard noise suppression.
II. Noise Spectrum Estimation
The proposed method makes use of the properties of LPC analysis. The input speech signal is decomposed into a spectral envelope and an excitation signal, such that
$x(n) = \sum_{k=1}^{p} a_k x(n-k) + g\,e(n)$, (1)
where $n$ is a digitized sample index, $x(n)$ is the sampled input speech, $a_k$ are the prediction filter coefficients of order $p$, $e(n)$ is the excitation signal, and $g$ is a scalar gain so that $e(n)$ has unit variance. Equation 1 is equivalently expressed in the frequency domain as
$X(z) = \frac{g\,E(z)}{A(z)}, \quad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}$, (2)
where $X(z)$ and $E(z)$ are z-transforms of $x(n)$ and $e(n)$. $E(z)$ is spectrally flattened, so that $1/A(z)$ should contain most of the spectral envelope of the given input speech frame. For transmission purposes, $A(z)$ is expressed by the two reciprocal polynomials [4]:
$P(z) = A(z) + z^{-(p+1)} A(z^{-1}), \quad Q(z) = A(z) - z^{-(p+1)} A(z^{-1})$. (3)
The root frequencies of these two auxiliary polynomials are called line spectral frequencies (LSFs), and are known to be most efficient in coding LPC coefficients due to their stability and low sensitivity to quantization error [5,6].
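The construction of the two auxiliary polynomials and the search for their root frequencies can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the sign convention A(z) = 1 - sum_k a_k z^-k, the function names are hypothetical, and the predictor coefficients in the example are made up. It relies on the fact that a symmetric (or antisymmetric) polynomial reduces to a real cosine (or sine) series on the unit circle, whose sign changes can be located by bisection.

```python
import math

def lsf_polynomials(a):
    """Coefficient lists of P(z) and Q(z) in powers of z^-1.

    `a` holds predictor coefficients a_1..a_p, assuming A(z) = 1 - sum_k a_k z^-k.
    """
    c = [1.0] + [-ak for ak in a] + [0.0]              # A(z) padded to degree p + 1
    m = len(c) - 1
    p_poly = [c[k] + c[m - k] for k in range(m + 1)]   # symmetric coefficients
    q_poly = [c[k] - c[m - k] for k in range(m + 1)]   # antisymmetric coefficients
    return p_poly, q_poly

def unit_circle_zeros(coeffs, trig, grid=4096):
    """Zeros in (0, pi) of sum_k coeffs[k] * trig(w * (k - m/2)), by bisection."""
    m = len(coeffs) - 1

    def f(w):
        return sum(ck * trig(w * (k - m / 2.0)) for k, ck in enumerate(coeffs))

    eps = 1e-6                                         # avoid trivial roots at 0 and pi
    ws = [eps + (math.pi - 2 * eps) * i / grid for i in range(grid + 1)]
    zeros = []
    for lo, hi in zip(ws[:-1], ws[1:]):
        f_lo = f(lo)
        if f_lo * f(hi) < 0.0:                         # sign change brackets a root
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if f_lo * f(mid) <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f(mid)
            zeros.append(0.5 * (lo + hi))
    return zeros

def line_spectral_frequencies(a):
    """LSFs in radians, sorted in ascending order."""
    p_poly, q_poly = lsf_polynomials(a)
    # P is symmetric (cosine series), Q is antisymmetric (sine series).
    return sorted(unit_circle_zeros(p_poly, math.cos) +
                  unit_circle_zeros(q_poly, math.sin))

# Illustrative second-order predictor (made-up values, stable filter).
lsfs = line_spectral_frequencies([0.9, -0.5])
```

For an order-10 analysis the same routine would return 10 frequencies per frame; production coders typically use faster Chebyshev-based root searches instead of a dense grid.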
At the LSFs either $P(z)$ or $Q(z)$ is zero, so $|A(z)|$ is close to its local minima, since $|P(z)|$ and $|Q(z)|$ are monotonic between any pair of neighboring LSFs. Figure 1 illustrates the behavior of the LPC spectrum at LSFs. The two dotted lines in the figure are the frequency responses of the auxiliary polynomials $P(z)$ and $Q(z)$, the black solid line is the magnitude of the LP filter response expressed by $1/|A(z)|$, and the lightly colored line is an approximation of the spectral envelope. A pre-emphasis filter has been applied to boost high-frequency energies. Downward and upward triangles are drawn on the curves at the root frequencies of $P(z)$ and $Q(z)$.
Fig. 1. Properties of LPC spectrum at line spectral frequencies.
As adjacent LSFs become closer, for example around 500 Hz, the magnitude of the LPC polynomial, $|A(z)|$, decreases and hence the LPC spectrum becomes more resonant around those frequencies [2,3]. However, when a single LSF is isolated, far from its neighbors, the magnitudes of the auxiliary polynomials, $|P(z)|$ and $|Q(z)|$, change slowly so that $|A(z)|$ is relatively flat. Therefore the LPC spectra at LSFs represent either the spectral magnitude of speech at formant frequencies between closely located LSFs, or the background noise spectral magnitude at isolated LSFs. These properties are implicitly exploited by the proposed noise suppression procedure.
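The relation between the LPC polynomial and its two auxiliary polynomials at an LSF can be checked numerically. The sketch below assumes the convention A(z) = 1 - sum_k a_k z^-k, so that A(z) is the average of the two auxiliary polynomials; the predictor values and the helper name `poly_response` are illustrative only. At a root of one auxiliary polynomial, the magnitude of A(z) equals half the magnitude of the other one.

```python
import cmath
import math

def poly_response(coeffs, w):
    """Evaluate sum_k coeffs[k] * z^-k at z = exp(jw)."""
    return sum(ck * cmath.exp(-1j * w * k) for k, ck in enumerate(coeffs))

a = [0.9, -0.5]                                   # made-up predictor, order 2
c = [1.0] + [-ak for ak in a] + [0.0]             # A(z), padded to degree p + 1
m = len(c) - 1
p_poly = [c[k] + c[m - k] for k in range(m + 1)]  # symmetric auxiliary polynomial
q_poly = [c[k] - c[m - k] for k in range(m + 1)]  # antisymmetric auxiliary polynomial

w_p = math.acos(0.7)   # root frequency of the symmetric polynomial for this predictor
w_q = math.acos(0.2)   # root frequency of the antisymmetric polynomial

# At w_p the symmetric polynomial vanishes, so |A| = |Q| / 2 there.
mag_a_at_p = abs(poly_response(c, w_p))
half_q_at_p = 0.5 * abs(poly_response(q_poly, w_p))

# LPC spectral magnitude 1/|A| sampled at the two LSFs.
lpc_spectrum_at_lsfs = [1.0 / abs(poly_response(c, w)) for w in (w_p, w_q)]
```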
By the definition of the discrete Fourier transform, the frequency response of $A(z)$ at frequency $\omega$ is expressed by
$A(e^{j\omega}) = 1 - \sum_{k=1}^{p} a_k e^{-j\omega k}$, (4)
where $j = \sqrt{-1}$. The smoothed spectral magnitude at the $i$-th LSF at frame $t$, denoted by $S_t(\omega_i)$, is approximated by multiplication of its LPC spectral magnitude and frame gain and expressed in log domain by
$\log S_t(\omega_i) = \log g_t - \log |A_t(e^{j\omega_i})|$, (5)
where $g_t$ is the gain at frame $t$ defined in Equation 2, which represents the relative magnitude difference across analysis frames. To model the global frequency characteristics of the input sounds, we approximate the long-term average of $S_t(\omega)$
by the long-term average frequency response of LPC filters, denoted by
, is updated instantaneously by
, (6)
with an initial value
. The adaptation rate
gives a good performance in our experiments. The LP spectral envelope is then normalized by long-term average, and its log is approximated by the following equation:
(7)
By using
instead of
, we can disregard the global shape of the noise, and a single noise estimate can be used regardless of frequencies.
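The normalization step described above can be sketched with a first-order recursive average. This is an assumption-laden illustration: the exact update rule, the initial value, and the adaptation rate of the paper are not reproduced here, so `rate=0.01` and `initial=1.0` are placeholders, and the class and function names are hypothetical. The sketch tracks one fixed frequency point; a full implementation would keep one tracker per frequency.

```python
import math

class LongTermAverage:
    """Recursive average of an LPC spectral magnitude at one frequency.

    `rate` and `initial` are placeholder values, not the paper's settings.
    """

    def __init__(self, rate=0.01, initial=1.0):
        self.rate = rate
        self.value = initial

    def update(self, magnitude):
        # Move the running average a small step toward the new observation.
        self.value += self.rate * (magnitude - self.value)
        return self.value

def normalized_log_magnitude(gain, lpc_magnitude, long_term_magnitude):
    """Log of (gain * LPC magnitude) after dividing by the long-term average."""
    return math.log(gain) + math.log(lpc_magnitude) - math.log(long_term_magnitude)

# A stationary input: the average converges and the normalized log approaches 0.
tracker = LongTermAverage()
for _ in range(2000):
    tracker.update(2.0)
whitened = normalized_log_magnitude(1.0, 2.0, tracker.value)
```

With the global spectral shape divided out this way, a stationary background sits near zero in the log domain at every frequency, which is what allows a single noise estimate to serve all LSFs.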
The distribution of the log spectral magnitudes at LSFs is shown in Figure 2. The x-axis is divided into quantized histogram intervals of the log spectral magnitude, and the y-axis is the number of frames whose x-value falls in each interval. The sources are male speech and car factory noise from the Signal Processing Information Base (SPIB) database, available at http://spib.rice.edu/. Since the spectral energy of the factory noise is nearly stationary over time, there is a significant peak between -2 and -1 on the x-axis. The speech spectral magnitude is relatively scattered and varies a lot. The mixed distribution, expressed by lightly colored bars, has a peak around that of the factory noise, and the portion of small-energy components is greatly reduced. This is because the noise spectra, being consistent over time, conceal the tiny spectral magnitudes of the speech signal. In the high-energy regions (greater than -1.5 in Figure 2), where only speech is present, the spectral energy distribution of the speech signal dominates the mixed distribution.
Fig. 2. Distribution of log of LPC spectrum at LSFs multiplied by excitation gain.
III. Proposed Noise Suppression Method
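On the globally whitened log spectral magnitudes in Equation 7, a mixture of two Gaussian probability density functions is used to approximate the mean spectral magnitude of noise. For each LSF, $\omega_i$, the whitened log spectral magnitude at frame $t$ is written as $y_{t,i}$ for compact notation. Denoting $\theta_N = \{\mu_N, \sigma_N^2\}$ as the set of parameters for the noise Gaussian, and $\theta_S = \{\mu_S, \sigma_S^2\}$ for the speech Gaussian, the posterior probability of $y_{t,i}$ belonging to the noise Gaussian is expressed by
$p(N \mid y_{t,i}) = \frac{P(N)\, p(y_{t,i} \mid \theta_N)}{P(N)\, p(y_{t,i} \mid \theta_N) + P(S)\, p(y_{t,i} \mid \theta_S)}$, (8)
where $P(N)$ and $P(S)$ are the prior probabilities of noise and speech presence, with a constraint that $P(N) + P(S) = 1$. The likelihood of $y_{t,i}$ given a set of parameters, $\theta = \{\mu, \sigma^2\}$, is modeled by a univariate Gaussian density function:
$p(y_{t,i} \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_{t,i} - \mu)^2}{2\sigma^2}\right)$. (9)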
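The posterior computation for a two-component mixture can be illustrated as below. This is a minimal sketch of Bayes' rule with univariate Gaussian likelihoods; the function names and the numeric parameters in the example are illustrative and not taken from the experiments.

```python
import math

def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def noise_posterior(y, noise, speech, prior_noise=0.5):
    """Posterior probability that y came from the noise Gaussian.

    `noise` and `speech` are (mean, variance) tuples; the speech prior is
    1 - prior_noise so that the two priors sum to one.
    """
    weighted_noise = prior_noise * gaussian_pdf(y, *noise)
    weighted_speech = (1.0 - prior_noise) * gaussian_pdf(y, *speech)
    return weighted_noise / (weighted_noise + weighted_speech)

# Illustrative parameters: noise centered at -1, speech at 0.5, equal variances.
noise_params = (-1.0, 0.25)
speech_params = (0.5, 0.25)
at_noise_mean = noise_posterior(-1.0, noise_params, speech_params)
at_midpoint = noise_posterior(-0.25, noise_params, speech_params)
at_speech_mean = noise_posterior(0.5, noise_params, speech_params)
```

With equal priors and variances the posterior is exactly one half at the midpoint between the two means, and it approaches one or zero as the observation moves toward the noise or speech mean.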
The Gaussian parameters are updated online by the following adaptation rules:
(10)
where a positive constant
is step size,
and
are Gaussian parameters of the previous frame.
is the computed adaptation rate for noise at frame
and
LSF. The same formula for speech can be derived by substituting
with
. The adaptation rules for speech parameters are:
(11)
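One common way to realize such an online adaptation is a posterior-weighted stochastic update, sketched below. This is a generic scheme given under stated assumptions and is not claimed to match Equations 10 and 11: the class name, the step size 0.05, the initial parameters, the variance floor, and the synthetic input are all illustrative choices.

```python
import math

class OnlineTwoGaussian:
    """Online two-component Gaussian mixture (index 0: noise, index 1: speech)."""

    def __init__(self, noise_mean=-2.0, speech_mean=1.0, var=1.0, step=0.05):
        self.mean = [noise_mean, speech_mean]
        self.var = [var, var]
        self.prior = [0.5, 0.5]
        self.step = step

    def _pdf(self, y, k):
        v = self.var[k]
        return math.exp(-0.5 * (y - self.mean[k]) ** 2 / v) / math.sqrt(2.0 * math.pi * v)

    def update(self, y):
        """Adapt both Gaussians to one observation; return the noise posterior."""
        like = [self.prior[k] * self._pdf(y, k) for k in (0, 1)]
        total = like[0] + like[1] + 1e-300          # guard against underflow
        for k in (0, 1):
            post = like[k] / total
            rate = self.step * post                 # per-component adaptation rate
            delta = y - self.mean[k]
            self.mean[k] += rate * delta
            # Variance floor (assumed) keeps the density well defined.
            self.var[k] = max(self.var[k] + rate * (delta * delta - self.var[k]), 1e-4)
            self.prior[k] += self.step * (post - self.prior[k])
        return like[0] / total

# Synthetic input: a low cluster near -1 (noise-like) and a high one near 0.5.
model = OnlineTwoGaussian()
for i in range(2000):
    y = -1.0 + 0.2 * math.sin(i) if i % 2 == 0 else 0.5 + 0.3 * math.cos(i)
    model.update(y)
```

The component that settles on the lower cluster plays the role of the noise Gaussian, so its mean serves as the running noise estimate without any explicit voice activity decision.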
From the mean of the noise Gaussian in Equation 10,
, noise spectral magnitude at frame
is approximated by
. (12)
A Wiener filter suppressing the noise estimate from the spectral magnitude of the mixture signal is derived by
(13)
Then it is floored so that it should be always higher than a certain limit,
, (14)
where a nonnegative constant
is a minimum Wiener filter gain.
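The flooring step can be sketched as follows. The raw gain used here is an assumed magnitude-domain form, one minus the ratio of the noise estimate to the observed magnitude, clipped at zero; it is consistent with the gain being zero below the noise mean but is not claimed to be the exact Equation 13. The default floor of -13 dB is the value quoted in the experiments, interpreted here as an amplitude gain, which is also an assumption.

```python
import math

def wiener_gain(log_magnitude, noise_log_mean, floor_db=-13.0):
    """Floored suppression gain from natural-log spectral magnitudes."""
    floor = 10.0 ** (floor_db / 20.0)                      # amplitude-domain floor
    # Assumed raw gain: 1 - N/S, zero when the input is below the noise mean.
    raw = max(0.0, 1.0 - math.exp(noise_log_mean - log_magnitude))
    return max(raw, floor)

gain_below_noise = wiener_gain(-2.0, -1.0)   # input weaker than the noise estimate
gain_strong_speech = wiener_gain(5.0, -1.0)  # input far above the noise estimate
```

The floor trades residual noise for less speech distortion: components judged to be noise are attenuated by at most 13 dB instead of being removed entirely.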
IV. Experimental Results
4.1 Database
The proposed method is compared to the conventional methods by automatic speech recognition performance on the speech separation challenge (SSC) database [7]. The database is designed for assessing the effect of a noise suppression algorithm on a simple speech recognition task. Talkers say short sentences of exactly 6 words, whose format is “command color preposition letter number adverb”, for example, “bin blue at F 2 now”. All of the original sound files are sampled at 25 kHz, and they are downsampled to 8 kHz since the EVRC and ETSI standards support 8 kHz only.
The database has a training set, which consists of 17,000 utterances (500 x 34 talkers). All training sound files are recorded in a quiet environment, i.e., without any background noise. The HMM models are obtained by HTK (hidden Markov model toolkit) [8] as suggested by the coordinators of SSC. The adopted feature is 12 MFCCs plus log energy, plus their velocities and accelerations, resulting in a 39-dimensional vector every 10 ms. A separate testing set of 600 utterances is also provided. There is no overlap between the training and the testing data. The original recordings do not contain environmental noise. Noisy data files are generated by adding speech-shaped noise (ssn), which is Gaussian random noise whose frequency response is shaped by the average spectrum of general speech signals. The simulated SNRs (signal-to-noise ratios) are clean (∞), 6, 0, -6, and -12 dB, resulting in 3,000 (600 x 5) test sound files.
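Generating a noisy file at a prescribed SNR amounts to scaling the noise before adding it. The sketch below assumes that SNR is defined from average sample power over the whole utterance, which may differ from the tool actually used to build the test set; the function name and the synthetic signals are illustrative.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the power ratio equals snr_db.

    Returns the mixture and the scale factor applied to the noise.
    """
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mixture = [s + scale * n for s, n in zip(speech, noise)]
    return mixture, scale

# Synthetic stand-ins for a speech file and a noise file of equal length.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]
mixture, scale = mix_at_snr(speech, noise, -6.0)

# Verify the achieved SNR from the scaled noise.
speech_power = sum(s * s for s in speech) / len(speech)
scaled_noise_power = sum((scale * n) ** 2 for n in noise) / len(noise)
measured_snr_db = 10.0 * math.log10(speech_power / scaled_noise_power)
```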
4.2 Implementation Details
The analysis settings of the proposed method are: sampling frequency 8 kHz, shift size 10 ms (80 samples), analysis frame length 20 ms (160 samples), and Hamming windowing in LPC analysis of order 10, which results in 10 LSFs at every frame. A pre-emphasis filter is used before the analysis to boost high-frequency energies. In re-synthesis, time-domain filters of order 48 are derived from the Wiener filters in Equation 14 and applied to the input frames, and the resulting frames are overlap-added using trapezoidal windows with 24-sample overlaps between neighboring frames. Among commercial standards, the noise suppression front ends in the EVRC [9] and ETSI [11,12] standards are compared with the proposed method. They support an 8 kHz sampling rate only, and mel-warped filter bank energies are used in voice activity detection and in deriving noise estimates. The source code of both methods is publicly available from the distributors.
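The overlap-add step with trapezoidal windows can be sketched as below. The window layout is an assumption: each output frame spans the 80-sample shift plus a 24-sample tail, with a linear rising ramp at the start and a complementary falling ramp at the end, so that overlapping ramps sum to one. The function names are hypothetical.

```python
def trapezoid_window(hop=80, overlap=24):
    """Trapezoidal window of length hop + overlap with complementary ramps."""
    ramp = [(k + 0.5) / overlap for k in range(overlap)]
    return ramp + [1.0] * (hop - overlap) + [1.0 - r for r in ramp]

def overlap_add(frames, hop=80):
    """Sum equal-length frames placed hop samples apart."""
    length = hop * (len(frames) - 1) + len(frames[0])
    out = [0.0] * length
    for i, frame in enumerate(frames):
        for k, value in enumerate(frame):
            out[i * hop + k] += value
    return out

# Overlap-adding the bare windows shows the total weight applied to each sample.
window = trapezoid_window()
total_weight = overlap_add([window] * 5)
```

Away from the first rising ramp and the last falling ramp the weights add up to one, so filtered frames are cross-faded without changing the overall signal level.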
4.3 Illustrations of Noise Suppression Results
Figure 3 illustrates the noise spectrum estimation procedure for a mixture of male speech and factory noise. The x-axis is the log spectral magnitude at each LSF and frame, and the y-axis shows both the normalized histogram and the Wiener filter gain. The distribution of the log spectral magnitudes is displayed by histogram bars, and the estimated Gaussian density functions are overdrawn on them by solid curves. The Gaussian mixture model generates a noise mean and a speech mean of log spectral magnitudes. The left curve is the noise Gaussian and the right is the speech Gaussian, whose mean values are indicated by wedged vertical lines. The Wiener filter obtained by Equation 13 is plotted by a thick, dashed line. The Wiener filter gain (before flooring by Equation 14) is zero when the log spectral magnitude is smaller than the mean of the noise Gaussian. The SNR of the input mixture is approximated by the distance between the two Gaussian means.
Fig. 3. Noise spectral mean and Wiener filter estimation result by mixture of Gaussian density functions, for an additive mixture of male speech and factory noise.
The distribution has a sharp peak around -1, which is well approximated by the noise Gaussian. A smaller peak is located around 0, approximated by the speech Gaussian. The spectral magnitude of the speech signal varies much more than that of the noise, so its peak location is not as distinct. The distribution in Figure 4, for male speech only, does not have a sharp peak, and the noise Gaussian mean is around -3 with a much bigger variance. The estimated SNR, the distance between the two Gaussian means, is 2.67 in the log domain, or 23.2 dB, which implies that the input signal is almost clean. The noise mean estimate is shifted to the left by about 14 dB when compared to Figure 3, while the speech Gaussian means of both figures are very close.
Fig. 4. Noise spectral mean and Wiener filter estimation result by mixture of Gaussian density functions for male speech only.
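The conversion between the log-domain distance of the two Gaussian means and a decibel value can be reproduced with a one-line function. It assumes that the log spectral magnitudes are natural logarithms of amplitudes, which is consistent with the quoted pair of 2.67 and 23.2 dB; the function name is illustrative.

```python
import math

def log_magnitude_to_db(value):
    """Convert a natural-log amplitude difference to decibels."""
    return 20.0 * math.log10(math.e) * value

# Distance between the speech and noise Gaussian means quoted for Figure 4.
snr_db = log_magnitude_to_db(2.67)
```

One unit on the x-axis of Figures 3 and 4 therefore corresponds to roughly 8.7 dB under this assumption.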
4.4 Speech Recognition Performance Comparison
HMM models trained on the clean training dataset are used for the experiments. The ssn mixtures of various SNRs are processed by the EVRC standard, the ETSI standard, and the proposed method. Speech recognition rates are compared in Table 1, where the “bypass” columns are the results without any processing. The proposed method significantly outperforms all the others at 6 dB and 0 dB, and is slightly worse than ETSI in the -6 dB and -12 dB conditions. One explanation is that the proposed method limits the minimum Wiener filter gain to -13 dB to keep the intelligibility loss reasonable. ETSI has been developed for high speech recognition performance in adverse environments [11], so it is expected to perform well in harsh noisy conditions. Although the proposed method does not have such features, its performance is not degraded in clean conditions.
Since ssn is an artificial noise, a number of real noise cases are evaluated as well. From the AURORA2 database [12], 8 different noise sources (airport, babble, car, exhibition, restaurant, street, subway, and train) are chosen and added to the clean test files. The measured speech recognition rates are in Table 2. The noise mixing levels, 12 dB, 6 dB, and 0 dB, are used as column indexes. Clean condition results are the same as in Table 1, and negative SNRs are not considered since the speech recognition rates are too low to be meaningful. Row indexes are the noise suppression methods: bypass (no processing), the EVRC standard, the ETSI standard, and the proposed method. The last row summarizes speech recognition rates averaged over the 8 noises. The two highest values are boldfaced for each mixing SNR.
In terms of average recognition rates, the proposed method always achieves higher recognition rates than the other methods with 6 dB and 0 dB SNR mixtures, by up to 7.7 %. At 12 dB SNR, the improvement over ETSI is about 3 % to 4 %, and EVRC is the best, but its difference from the proposed method is only 0.1 %. The proposed method is always within the top 2 in all 3 SNR conditions. EVRC works well with relatively higher SNRs (12 dB and 6 dB), and ETSI is better suited to lower SNR cases (0 dB), whereas the proposed method maintains decent speech recognition performance at all noise levels. In terms of noise types, the proposed method significantly improves the performance with airport, car, street, and train noises, performs about the same as ETSI and EVRC with babble and restaurant noises, and is slightly worse than EVRC with exhibition and subway noises, although the difference is small. In summary, the speech recognition results show that the proposed method is quite stable and performs better than the conventional methods across various noise types and noise levels.
V. Concluding Remarks
A novel method to reduce near-stationary acoustic noise added to a speech signal recorded by a single microphone is proposed. The noise spectral magnitude is estimated by the smaller mean of a two-Gaussian mixture distribution for globally flattened LPC spectra at line spectral frequencies, and a Wiener filter suppressing the estimated noise is derived and applied to the input speech signal. The proposed method has several advantages.
Improved noise suppression performance: It suppresses additive noise with much less degradation of speech recognition performance when compared to the conventional methods, as shown by automatic speech recognition experiments on simulated mixtures of clean speech signals and diverse kinds of real noises at various SNRs. The characteristics of LPC spectral envelopes at line spectral frequencies are actively exploited to estimate the noise spectral magnitude, and the need for voice activity detection has been eliminated.
Computational efficiency: The ETSI and EVRC standards use fixed filter bank energies, and there are many control parameters for reliable voice activity detection. The proposed algorithm does not use fixed filter banks; instead, variable filter banks are chosen according to the line spectral frequencies. At an 8 kHz sampling frequency, the number of filter banks is 23 for EVRC and 16 for ETSI, but the number of line spectral frequencies is only 10, which makes the proposed method much more computationally efficient.
Good match to voice coders: The algorithm is quite simple and requires only LPC coefficients, line spectral frequencies, and excitation gains, which are all available in LPC-based voice coders such as QCELP. When embedded in a voice coder, it can be implemented much more efficiently.






