5.1.11 Background noise energy estimation

3GPP TS 26.445, Codec for Enhanced Voice Services (EVS); Detailed algorithmic description (Release 15)

The background noise energy is estimated (updated) in two stages. In the first stage, noise energy is updated only for critical bands where the current frame signal energy is less than the previously estimated background noise energy. This stage is called the downward noise energy update. In the second stage, noise energy is updated if the signal characteristics are statistically close to the model of background noise. Therefore, in the second stage, noise energy can be updated regardless of the current frame signal energy.

5.1.11.1 First stage of noise energy update

The total noise energy per frame is computed as follows:

(101)

where is the estimated noise energy in the ith critical band of the previous frame.

The noise energy per critical band is initialized to 0.0035 dB. The updated noise energy in the ith critical band, denoted , is computed as follows:

(102)

where corresponds to the energy per critical band calculated in the second spectral analysis in the previous frame, and is the estimated noise energy per critical band also in the previous frame. Noise energy is then updated only in critical bands that have lower energy than the background noise energy. That is

(103)

The superscript [0] in the above equation is used to stress that it corresponds to the current frame.
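The downward update rule can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name and list-based interface are hypothetical, and the candidate values are taken as an input because the smoothing formula of equation (102) is not reproduced here.

```python
def downward_noise_update(noise_prev, candidate):
    """First-stage (downward) noise update, one value per critical band.

    noise_prev: noise energy per band estimated in the previous frame.
    candidate:  pre-calculated updated noise energy per band (assumed to
                come from equation (102); its formula is not modelled here).
    A band is updated only if the candidate lies below the previous
    noise estimate, so the estimate can only decrease in this stage.
    """
    if len(noise_prev) != len(candidate):
        raise ValueError("both inputs must cover the same critical bands")
    return [c if c < n else n for n, c in zip(noise_prev, candidate)]


# Band 0 is lowered; bands 1 and 2 keep their previous estimate.
updated = downward_noise_update([1.0, 2.0, 3.0], [0.5, 2.5, 3.0])
```

Bands left unchanged here are the ones that the second stage may later increase.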

Another feature used in noise estimation and SAD is an estimate of the frame-to-frame energy variation. The absolute energy difference between the current and the last frame is calculated:

. (104)

where the superscript [-1] has been used to denote the previous frame. The frame energy variation is then used to update the feature

(105)

Other energy features that are updated before the SAD and the second stage of the noise estimation are first initialized during the first two frames after encoder initialization. The initialization is done as follows

(106)

After the two frames of initialization the total frame energy is smoothed by means of LP filtering. That is:

(107)

The features and are envelope tracking features of the frame energy and are used to create the long-term minimum energy and an estimate of the energy dynamics . That is

, (108)

To calculate the following processing is applied:

(109)

where is the number of frames since the last harmonic event from the previous frame. See clause 5.1.11.3.2 for details about its computation. The new value of is then used to update its long-term value through an AR process. That is

(110)

where the parameter is set as follows

(106)

The energy dynamics feature is just an LP-filtered version of the difference between and . That is

(107)

5.1.11.2 Second stage of noise energy update

In the second stage of the noise energy update, the critical bands not updated in the first stage are updated only if the current frame is inactive. However, the SAD decision obtained in clause 5.1.12, which is based on the SNR per critical band, is not used for determining whether the current frame is inactive and whether the noise energy is to be updated. Another decision is performed based on other parameters not directly dependent on the SNR per critical band. The basic parameters used for the noise update decision are:

– pitch stability

– signal non-stationarity

– normalized correlation (voicing)

– ratio between 2nd‑order and 16th‑order LP residual error energies

These parameters generally have low sensitivity to noise level variations. Another set of parameters is calculated to cover harmonic (tonal) signals and, in particular, music. These parameters prevent the noise energy from being updated when strong harmonicity or tonality is detected, even when the signal energy is low. The parameters related to the detection of tonal signals are

– spectral diversity

– complementary non-stationarity

– HF energy

– tonal stability

The reason for not using the SAD decision for noise update is to make the noise estimation robust to rapidly changing noise levels. If the SAD decision were used for the noise update, a sudden increase in noise level would cause an increase of SNR even for inactive speech frames, preventing the noise estimator from updating, which in turn would keep the SNR high in the following frames. Consequently, the noise update would be blocked and some other logic would be needed to resume the noise adaptation.

5.1.11.2.1 Basic parameters for noise energy update

The pitch stability counter is computed as

(108)

where d[0], d[1] and d[-1] are the OL pitch lags for the first half-frame, the second half-frame and the second half-frame of the previous frame, respectively. The pitch stability is true if the value of pc is less than 12. Further, for frames with low voicing, pc is directly set to 12 to indicate pitch instability. That is

if then pc = 12, (109)

where are the normalized raw correlations as defined in clause 5.1.10.7 and re is a correction added to the normalized correlation in order to compensate for the decrease of normalized correlation in the presence of background noise, defined in clause 5.1.10.6. The voicing threshold thCpc = 0.52 for WB inputs, and thCpc = 0.65 for NB inputs.
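The pitch stability logic can be illustrated with a short sketch. Two parts of it are assumptions, because the corresponding formulas are not reproduced in this text: the counter is taken as the sum of absolute differences between consecutive OL pitch lags, and the low-voicing test is taken as "mean normalized correlation plus re is below thCpc". The function names and arguments are hypothetical.

```python
PITCH_UNSTABLE = 12  # pc values of 12 or more indicate pitch instability


def pitch_stability_counter(d_prev, d0, d1, mean_corr, r_e, wideband=True):
    """Return the pitch stability counter pc for one frame.

    d_prev, d0, d1: OL pitch lags d[-1], d[0], d[1].
    mean_corr:      average normalized raw correlation (assumed input).
    r_e:            noise-dependent correction added to the correlation.
    """
    th_cpc = 0.52 if wideband else 0.65  # voicing thresholds from the text
    # ASSUMPTION: pc sums the lag changes between consecutive half-frames.
    pc = abs(d0 - d_prev) + abs(d1 - d0)
    # ASSUMPTION: low voicing is detected on the corrected mean correlation.
    if mean_corr + r_e < th_cpc:
        pc = PITCH_UNSTABLE
    return pc


def pitch_is_stable(pc):
    """Pitch stability is true if pc is less than 12."""
    return pc < PITCH_UNSTABLE
```

With lags 50, 52 and 53 and strong voicing the counter is 3 (stable); with the same lags and weak voicing it is forced to 12 (unstable).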

Signal non-stationarity is analysed based on the product of ratios between the current frame energy per critical band and its long-term average per critical band. The average long-term energy per critical band is calculated as

, for i = bmin to bmax, (110)

where bmin = 0 and bmax = 19 in case of WB signals, and bmin = 1 and bmax = 16 in case of NB signals. The update factor is a linear function of the relative frame energy, defined in clause 5.1.5.2 and it is given as follows

, constrained by (111)

where all negative values of are replaced by 0. The frame non-stationarity is then given by the product of the ratios between the frame energy and its long-term average calculated in the previous frame. That is

(112)

The voicing factor for noise update is given by

(113)

The ratio between the LP residual energy after 2nd‑order and 16th‑order analysis is given by

(114)

where E(2) and E(16) are the LP residual energies after 2nd‑order and 16th‑order analysis, computed in the Levinson-Durbin recursion (see clause 5.1.9.4). This ratio reflects the fact that, to represent a signal spectral envelope, a higher order of LP is generally needed for a speech signal than for noise. In other words, the ratio between E(2) and E(16) is expected to be lower for noise than for active speech.

5.1.11.2.2 Spectral diversity

The basic parameters for noise estimation have their limitations for certain music signals, such as piano concerts or instrumental rock and pop. Spectral diversity gives information about significant spectral changes. The changes are tracked in the frequency domain in critical bands by comparing energies in the first spectral analysis of the current frame with the second spectral analysis two frames ago. The energy per critical band corresponding to the first spectral analysis of the current frame is denoted as and is defined in clause 5.1.5.2. Let the energy per critical band corresponding to the second spectral analysis two frames ago be denoted as . For all bands higher than 9, the maximum and the minimum of the two energies are found as

, for i = 10,..,bmax, (115)

where bmax = 19 in case of WB signals, and bmax = 16 in case of NB signals. The energy ratio is then calculated as

, for i = 10,..,bmax. (116)

The spectral diversity is then calculated as the normalized weighted sum of the ratios in all critical bands with the weight itself being the maximum energy . That is

(117)

The spectral diversity is used as an auxiliary parameter for the complementary non‑stationarity described below.
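The normalized weighted sum can be sketched as follows. The weighting and the band range follow the description above, but the form of the per-band energy ratio is an assumption (maximum over minimum), since equation (116) is not reproduced in this text. The function name, argument names and the small floor on the denominator are hypothetical.

```python
def spectral_diversity(e_cur, e_old, b_max=19, first_band=10):
    """Sketch of the spectral diversity over bands first_band..b_max.

    e_cur: energy per critical band, first spectral analysis, current frame.
    e_old: energy per critical band, second spectral analysis, two frames ago.
    b_max: 19 for WB signals, 16 for NB signals.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(first_band, b_max + 1):
        e_max = max(e_cur[i], e_old[i])
        e_min = min(e_cur[i], e_old[i])
        # ASSUMPTION: the ratio is max/min; the floor only avoids
        # division by zero in this illustration.
        ratio = e_max / max(e_min, 1e-12)
        weighted_sum += e_max * ratio  # weight is the maximum energy
        weight_total += e_max
    return weighted_sum / weight_total
```

Under this assumed ratio, an unchanged spectrum gives a value of 1, and a uniform doubling of the energy in every band gives 2.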

5.1.11.2.3 Complementary non-stationarity

The complementary non-stationarity is motivated by the fact that the non-stationarity described in clause 5.1.11.2.1 and calculated in equation (112) is low when a sharp energy attack in a harmonic signal is followed by a slow energy decay. In this case, the average long-term energy per critical band, , slowly increases after the attack whereas the current energy per critical band, , slowly decreases. At a certain point (a few frames after the attack frame) they are the same, yielding only a small value of the nonstat parameter. This wrongly indicates an absence of active signal to the noise estimation logic. It may lead to a false update of the background noise and consequently a collapse of the SAD algorithm.

To overcome this problem, there is an alternative calculation of the average long-term energy per critical band. It is calculated in the same way as in equation (110) but with a different factor. That is

, for i = bmin to bmax. (118)

where is initialized to 0.03. The update factor and reset to 0 if pdiv > 5. The complementary non-stationarity parameter is then calculated in the same way as nonstat but using instead of . That is:

(119)

The complementary non-stationarity must be used by the noise estimation logic only in certain signal passages. These are characterized by the parameter which can be described as the average non-binary decision combined from non-stationarity and tonal stability. That is

if nonstat > thstat OR ptonal = 1 then otherwise

where is in the range [0; 1] and ptonal is the tonal stability described in clause 5.1.11.2.5 and defined in equation (125).

5.1.11.2.4 HF energy content

The HF energy content represents another parameter, which is used for the detection of certain noise‑like musical signals such as cymbals or low-frequency drums. This parameter is calculated as

, constrained by (120)

but only for frames that have at least a minimal HF energy, i.e. when both the numerator and the denominator of the above equation are higher than 100. If this is not fulfilled, . Finally, the long-term value of this parameter is calculated as

(121)

where is initialized to zero.

5.1.11.2.5 Tonal stability

The tonal stability exploits the harmonic spectral structure of certain musical signals. In the spectrum of such signals there are tones which are stable over several consecutive frames. To exploit this feature, it is necessary to track the positions and shapes of strong spectral peaks. The tonal stability is based on a correlation between the spectral peaks in the current frame and the past frame. The input to the algorithm is an average logarithmic energy spectrum, defined as

, , (127)

where is defined in clause 5.1.5.2 and the superscripts [0] and [1] denote the first and the second spectral analysis, respectively. In the following text, the term "spectrum" will refer to the average logarithmic energy spectrum, as defined by the above equation.

The tonal stability is calculated in three stages. In the first stage, indices of local minima of the spectrum are searched in a loop and stored as imin. This is described by the following equation

, , (128)

The index 0 is added to if . Consequently, the index 127 is added to , if . Let us denote the total number of minima found as Nmin. The second stage consists of calculating a spectral floor and its subtraction from the spectrum. The spectral floor is a piece-wise linear function which runs through the detected local minima. Every piece between two consecutive minima and can be described by a linear function as

, , (129)

where k is the slope of the line and . The slope is calculated by

(130)

Thus, the spectral floor is a logical connection of all pieces. The leading bins of the spectrum up to and the terminating bins of the spectrum from are set to the spectral values themselves, i.e.

(131)

Finally, the spectral floor is subtracted from the spectrum by

, (132)

and the result is the residual spectrum. The calculation of the spectral floor and its subtraction is illustrated in the following figure.

Figure 9 : Spectral floor in the tonal stability
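The first two stages can be sketched as follows. This is an illustration under stated assumptions: a bin is treated as a local minimum when it is strictly lower than both neighbours, the first and last bins qualify when they are lower than their single neighbour, and the function names are hypothetical. The exact conditions of equation (128) are not reproduced in this text.

```python
def find_local_minima(spec):
    """Return indices of local minima of the spectrum (needs >= 2 bins)."""
    last = len(spec) - 1
    minima = []
    if spec[0] < spec[1]:          # ASSUMED edge rule for index 0
        minima.append(0)
    for i in range(1, last):
        if spec[i - 1] > spec[i] < spec[i + 1]:
            minima.append(i)
    if spec[last] < spec[last - 1]:  # ASSUMED edge rule for the last index
        minima.append(last)
    return minima


def subtract_spectral_floor(spec):
    """Build the piece-wise linear floor through the minima and subtract it."""
    minima = find_local_minima(spec)
    # Leading and terminating bins: the floor equals the spectrum itself,
    # so the residual is zero there.
    floor = list(spec)
    for a, b in zip(minima, minima[1:]):
        slope = (spec[b] - spec[a]) / (b - a)
        for j in range(a, b):
            floor[j] = spec[a] + slope * (j - a)
    residual = [s - f for s, f in zip(spec, floor)]
    return minima, floor, residual


# One peak at bin 2, delimited by the minima at bins 1 and 3.
minima, floor, residual = subtract_spectral_floor([3.0, 1.0, 4.0, 1.0, 3.0])
```

In the example, the residual is non-zero only at the peak bin, which is the shape that the subsequent correlation stage operates on.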

The third stage of the tonal stability calculation is the calculation of the correlation map and the long-term correlation map. This is again a piece-wise operation. The correlation map is created on a peak-by-peak basis where each two consecutive minima delimit one peak. Let us denote the residual spectrum of the previous frame as . For every peak in the current residual spectrum, normalized correlation is calculated with the previous residual spectrum. The correlation operation takes into account all indices (bins) of that peak delimited by two consecutive minima, i.e.

, (122)

where the leading bins up to and the terminating bins from are set to zero. The figure below shows a graphical representation of the correlation map.

Figure 10 : Correlation map in the tonal stability calculation

The correlation map of the current frame is used to update its long-term value, which can be expressed as

, (123)

where . If any value of exceeds the threshold of 0.95, the flag fstrong is set to one, otherwise it is set to zero. The long-term correlation map is initialized to zero for all k. Finally, all bins of are summed together by

(124)

In case of NB signals, the correlation map in higher bands is very low due to missing spectral content. To overcome this deficiency, msum is multiplied by 1.53.

The decision about tonal stability is taken by subjecting msum to an adaptive threshold thtonal. This threshold is initialized to 56 and it is updated in every frame by

if msum > 56 then thtonal = thtonal – 0.2 otherwise thtonal = thtonal + 0.2

and is upper limited by 60 and lower limited by 49. Thus, it decreases when the summed correlation map is relatively high, indicating a good tonal segment, and increases otherwise. When the threshold is lower, more frames will be classified as tonal, especially at the end of active music periods. Therefore, the adaptive threshold may be viewed as a hangover.

The ptonal parameter is set to one whenever msum is higher than thtonal or when the flag fstrong is set to one. That is:

if msum > thtonal OR fstrong = 1 then ptonal = 1 otherwise ptonal = 0 (125)
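The adaptive threshold and the decision of equation (125) can be sketched together. The constants come from the text above; the function name is hypothetical, and updating the threshold before taking the decision within a frame is an assumption, since the text does not state the order.

```python
TH_INIT = 56.0               # initial value of the adaptive threshold
TH_MIN, TH_MAX = 49.0, 60.0  # lower and upper limits
TH_STEP = 0.2


def update_tonal_decision(m_sum, th_tonal, f_strong):
    """Return (new th_tonal, p_tonal) for one frame.

    m_sum:    summed long-term correlation map.
    th_tonal: adaptive threshold carried over from the previous frame.
    f_strong: 1 if any long-term correlation value exceeded 0.95.
    """
    # The threshold moves down on tonal-looking frames, up otherwise.
    if m_sum > 56.0:
        th_tonal -= TH_STEP
    else:
        th_tonal += TH_STEP
    th_tonal = min(TH_MAX, max(TH_MIN, th_tonal))
    # ASSUMPTION: the decision uses the freshly updated threshold.
    p_tonal = 1 if (m_sum > th_tonal or f_strong == 1) else 0
    return th_tonal, p_tonal
```

Because each step is only 0.2, it takes 35 consecutive tonal frames to move the threshold from 56 to its lower limit of 49, which is what gives the threshold its hangover character.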

5.1.11.2.6 High frequency dynamic range

From the residual spectrum as described in equation 116, another parameter is computed. This parameter, called the high frequency dynamic, is derived from the high band spectral dynamic of the residual spectrum and is used to set the high frequency dynamic range flag, which is used inside the GSC to decide about the number of subframes and the bit allocation. The high frequency dynamic is computed as the average of the last 40 bins of the residual spectrum:

(126)

The high frequency dynamic range flag is then set depending on the past values and the current high frequency dynamic as:

(127)

Where represents the frame at time t and represents the average high frequency dynamic at when the last time the flag was set to 0.

5.1.11.2.7 Combined decision for background noise energy update

The noise energy update decision is controlled through the logical combination of the parameters and flags described in the previous sections. The combined decision is a state variable denoted pnup which is initially set to 6, and which is decremented by 1 if an inactive frame is detected or incremented by 2 if an active frame is detected. Further, pnup is bounded by 0 and 6. The following diagram shows the conditions under which the state variable pnup is incremented by 2 in each frame.

Figure 11 : Incrementing the state variable for background noise energy update

where, for WB signals, thsta = 350000, thCnorm = 0.85 and thresid = 1.6, and for NB signals, thsta = 500000, thCnorm = 0.7 and thresid = 10.4. If pnup is not incremented by any of the conditions in the above diagram, it is automatically decremented by 1. Therefore, it takes at least 6 frames before pnup reaches 0, which signals to the subsequent logic that background noise energy can be updated. The final decision about background noise energy update is described in the subsequent section.
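The behaviour of the state variable can be sketched as a small saturating counter. The function name is hypothetical, and the `active` argument stands in for the outcome of the conditions in Figure 11, which are not modelled here.

```python
PNUP_INIT = 6
PNUP_MIN, PNUP_MAX = 0, 6


def update_pnup(p_nup, active):
    """Advance the noise-update state variable by one frame.

    active: True if any incrementing condition of Figure 11 is met
            (assumed to be evaluated elsewhere).
    """
    p_nup = p_nup + 2 if active else p_nup - 1
    return max(PNUP_MIN, min(PNUP_MAX, p_nup))


# Starting from the initial value, six inactive frames are needed
# before the counter reaches 0 and the update may be allowed.
p = PNUP_INIT
for _ in range(6):
    p = update_pnup(p, active=False)
```

The asymmetric steps (+2 versus -1) mean that a single active frame cancels two inactive ones, so the counter reacts faster to activity than to pauses.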

5.1.11.3 Energy-based parameters for noise energy update

The parameters in this section are used in addition to those described in the previous section to control when it is possible and safe to allow the noise estimate sub-bands to be increased according to the pre-calculated noise estimate from equation (102).

5.1.11.3.1 Closeness to current background estimate

Similar to and the parameter represents a spectral difference. The difference is that it is the closeness/variation compared to the current background noise estimate that is measured. The calculation of the feature also differs depending on whether it is made during initialization, that is , or during normal operation. During initialization the comparison is made using a constant, which is the initialization value for the sub-band energies, as shown in

(128)

This is done to reduce the effect of decision errors in the background noise estimation during initialization. After the initialization period the calculation is made using the current background noise estimate of the respective sub-band, according to:

(129)

It is worth noting that the calculation of does not depend on the bandwidth, as it is made over the same sub-bands regardless of the input bandwidth.

5.1.11.3.2 Features related to last correlation or harmonic event

Two related features are created which relate to the occurrence of frames where correlation or harmonic events are detected. The first is a counter, , that keeps track of how many frames have passed since the last frame where a correlation or harmonic event occurred. That is, if a correlation or harmonic event is detected the counter is reset, otherwise it is incremented by one, according to:

(141)

where is the normalized correlation in the first or the second half-frame and is the result of the tonal detection in clause 5.1.11.2.5. If the counter is larger than 1 it is limited to 1 if or if AND is 1. Depending on the estimated short-term variance of the input frame energy, the current value of the counter can be reduced to one quarter of its value (or 1 if it was less than 4). The reduction is made for frames where and the short-term variance estimate of the frame energy is larger than 8.0. The other feature is the long-term measure of the relative occurrence of correlation or tonal frames. It is represented as a scalar value, , which is updated using a first-order AR process with different time constants depending on whether the current frame is classified as a correlation/tonal frame or not, according to:

(142)

where the test, , represents a detection of a correlation/tonal event.

5.1.11.3.3 Energy-based pause detection

To improve the tracking of the background noise, the energy pause detector monitors the number of frames since the frame energy got close to the long-term minimum frame energy estimate. For inactive frames the counter is 0 or higher, , where positive integers represent the number of frames since the start of the current pause. When active content is detected the counter is set and kept at, . Initially =0 so the detector is in an inactive state and checks for an energy increase relative to the long-term minimum energy tracker that could trigger a transition to an active state:

(130)

If the detector is in an active state the detector checks if the frame energy once again has come close to the long term minimum energy

(131)

The final step in the update of this parameter is to increment the counter if the detector is in an inactive state

(132)

5.1.11.3.4 Long-term linear prediction efficiency

This section describes how the residual energies from the linear prediction analysis made in clause 5.1.9 can be used to create a long term feature that can be used to better determine when the input signal is active content or background noise based on the input signal alone.

The analysis provides several new features by analysing the linear prediction gain going from 0th-order to 2nd-order linear prediction and going from 2nd-order to 16th-order prediction. It starts with the 2nd-order prediction residual energy, which is compared to the 0th-order prediction residual energy, that is, the energy of the input signal. For a more stable long-term feature the gain is calculated and limited as

(133)

where is the energy of the input signal and is the residual energy after the second-order linear prediction (see clause 5.1.9.4). The limited prediction gain is then filtered in two steps to create a long-term estimate of this gain. The first is made using

(134)

and typically this will become either 0 or 8 depending on the type of background noise in the input once there is a segment of background only input. A second feature is then created using the difference between the first long term feature and the frame by frame limited prediction gain according to:

. (148)

This gives an indication of the current frame's prediction gain compared to the long-term gain. This difference is used to create a second long-term feature, using a filter whose coefficient depends on whether the long-term difference is higher or lower than the currently estimated average difference, according to

. (149)

This second long-term feature is then combined with the frame difference to prevent the filtering from masking occasional high frame differences. The final parameter is the maximum of the frame and the long-term version of the feature

. (150)

The feature created using the difference between 2nd order prediction and 16th order prediction is analysed slightly differently. The first step here is also to calculate prediction gain as

(151)

where represents the residual energy after a 2nd order linear prediction and is the residual energy after a 16th order linear prediction, see clause 5.1.9.4. This limited prediction gain is then used for two long term estimates of this gain, one where the filter coefficient differs if the long term estimate is to be increased or not as shown in

. (152)

The second long term estimate uses a constant filter coefficient, according to

. (153)

For most types of background signals both will be close to 0, but have different responses to content where the 16th order linear prediction is needed (typically for speech and other active content). The first will usually be higher than the second. This difference between the long term features is measured according to

(154)

which is used as an input to the filter which creates the third long term feature according to

. (155)

Also, this filter uses different filter coefficients depending on if the third long term signal is to be increased or not. Also here the long term signal is combined with the input signal to prevent the filtering from masking occasional high inputs for the current frame. The final parameter is then the maximum of the frame and the long term version of the feature

. (156)

Note that also some of the other calculated features in this sub section are used in the combination logic for the noise estimation, , , , , and.

5.1.11.3.5 Additional long-term parameters used for noise estimation

Some additional parameters are processed to create long-term estimates. Three of them measure the relation of the current frame's energy to the energy of the noise estimate. The first calculates the difference between the current frame energy and the level of the current noise estimate, which is then filtered to build a long-term estimate according to

(135)

Another feature is a long-term estimate of how often the current frame energy is close to the level of the background estimate, using:

(136)

The third estimate is a second-order estimate of the number of frames for which the current input has been close to the noise estimate. This is simply a counter that is reset if the long-term estimate is higher than a threshold and incremented otherwise, as shown in

(137)

The last additional feature is a long-term estimate of the difference between the current frame energy and the long-term minimum energy feature. It is obtained by low-pass filtering the calculated energy difference according to

(138)

5.1.11.4 Decision logic for noise energy update

Already in the first step of the noise estimation (see clause 5.1.11.1), the current noise estimate has been reduced in sub-bands where the background noise energy was higher than the sub-band energy for the current frame. The decision logic described in this subsection shows how it is decided when to update the background noise estimate and how large that update should be allowed to be by setting the step size, . The update is adapted based on the earlier described features or combinations thereof.

Every frame an attempt is made to adjust the background noise estimate upwards, where it is important not to do the update in active content. Several conditions are evaluated in order to decide if an update is possible and how large an allowed update should be. As it is always allowed to make downwards updates it is equally important that possible updates are not prevented for extended times as this will affect the efficiency of the SAD. The noise update uses a flag to keep track of the number of prevented noise updates, , the same flag is also used to indicate that no update has taken place. The counter is initialized to the value 0 to indicate that no update has been done so far. When updates are successful it is set to 1 and for failed updates the counter is incremented by 1.

The major decision step in the noise update logic is whether an update is to be made or not and this is formed by evaluation of the following logical expression

(161)

where ensures that it is safe to do an update provided that any of the four pause detectors, , , , and indicate that an update is allowed. Note that the last term in the condition is not combined with as it handles the noise estimation during initialization.

Starting with the mask which ensures that the normal updates only can occur when the current frame energy is close to the estimated long-term minimum energy, (see clause 5.1.11.1), is adjusted with a level dependent scaling of the estimated frame energy variations, , according to

(162)

The first pause detector is based on the metric control logic described in subclause 5.1.11.2.7, when is 0 updates are allowed, that is

(163)

The second pause detector allows for updates for low energy frames if the estimated signal dynamics is high and a sufficient number of frames have passed since the last correlation event, that is

(139)

The third pause detector allows updates when several consecutive frames in a row are similar in energy to the current low-level frame,

(140)

The last detector is itself a combination of a mask and two pause detectors and mainly uses the additional features described in subclause 5.1.11.3.4. The detector is evaluated using

(141)

where is the mask for the detector and and are the additional detectors. For this detector the following seven flags are first evaluated. The first flag signals that the frame energy is close to the background noise energy, where the threshold is adapted to the estimated frame-to-frame energy variations, as

(142)

The second flag signals a high linear prediction gain with 2nd order model for a stationary signal, and is defined as follows:

(168)

The third flag signals that there is a low linear prediction gain for 16th order linear prediction

(143)

The fourth flag signals that the current frame has low spectral fluctuation

(144)

The fifth flag signals that the long term correlation is low

(145)

The sixth flag signals low long term correlation value including the current frame

(146)

The seventh and last flag signals a non-speech like input signal

(147)

Using the above flags it is possible to express the mask as

(148)

The two additional detectors and , are also those built using sub detectors and additional conditions. Starting with the sub detectors are:

(149)

where the combination metrics and are combinations where the maximum of a number of metrics are used for the comparison

(150)

(151)

For the sub detector

(152)

where the combination metric is calculated as

(153)

The last term in the handles the special conditions of noise update during the initialization, which occurs during the first 150 frames after the codec start. Also the initialization flag is evaluated as a combination of two flags according to

(154)

where the first flag tests for the initialization period and a sufficient number of frames without a correlation event, according to

(155)

The second flag evaluates a number of earlier calculated features against initialization specific thresholds according to

(156)

Every frame an attempt is made to adjust the background noise estimate upwards. As it is important not to do the update in active content, several conditions are evaluated in order to decide if an update is possible and how large an update should be allowed. At the same time, it is important that possible updates are not prevented for extended times. The noise update uses a flag to keep track of the number of prevented noise updates. The same flag is also used to indicate that no update has taken place. The flag is initialized to the value 0 to indicate that no update has been done so far. When updates are successful it is set to 1, and for failed updates the counter is incremented by 1.

If the above condition is evaluated to 0, the noise estimation only checks if the current content might be music by evaluating the following condition

. (183)

If this is evaluated to 1 the sub-band noise level estimates are reduced. This is done to recover from noise updates made before or during music. The reduction is made per sub-band depending on if the current estimate is high enough, according to

. (184)

and is updated according to the definition in equation (198) before noise estimation is terminated for this frame.

The following steps are taken when is evaluated to 1. First the step size, , is initially set to 0, before the process of determining if the noise update should be set to 1.0, 0.1, or 0.01. For the update to be set to 1.0 the following condition

(157)

and any of the following conditions

(158)

(159)

(188)

need to evaluate to 1. When this happens, is also set to 1 before the noise estimation for the current frame is updated using the previously calculated new value, according to

, (189)

where is the pre-calculated new noise estimate from subclause 5.1.11.1. The noise estimation procedure is done for the current frame after the in equation (198) is updated.

If the above condition has failed then the is set to 0.1 if any of the four following conditions are met

(160)

(161)

(192)

. (193)

If the has been set to 0.1 it will be reduced to 0.01 if

(194)

and if the following condition is met

. (195)

If the is set to 0.1 or 0.01, is set to 1 before the noise estimation for the current frame is made according to

, (196)

and the noise estimation procedure is done for the current frame after is updated in equation (198).

If the conditions to set the to 0.1 or 0.01 have failed, the step size is still 0 and noise update has potentially failed. After testing if the following condition is true

(197)

the variable is incremented to keep track of potentially failed updates and the noise estimation is done after the following update of .

In all cases the noise estimation updates end with an update of which is the long-term estimate of how frequently noise estimation is possible, according to

(198)

and where is calculated in clause 5.1.11.2.6.