5.1.12 Signal activity detection

3GPP TS 26.445, Release 15: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description

In this module active signal is detected in each frame and the main flags for external use are the three flags, and the combined . These flags are set to one for the active signal, which is any useful signal bearing some meaningful information. Otherwise, they are set to zero indicating an inactive signal, which has no meaningful information. The inactive signal is mostly a pause or background noise. The three flags represent different trade-offs between quality and efficiency, and are used respectively by various subsequent processing modules.

The entire signal activity detection (SAD) module described in this section consists of three sub-SAD modules. Two of them, SAD1 and SAD2, operate on the spectral analysis of the signal sampled at 12.8 kHz; see subclauses 5.1.12.1 and 5.1.12.2, respectively, for detailed descriptions. The third module, SAD3, operates on the CLDFB, which runs at the input sampling frequency; see subclause 5.1.12.6. A preliminary activity decision, , is first obtained by combining SAD1 and SAD2 for input with bandwidth greater than NB, or directly from SAD1 for NB input. This preliminary decision is then combined with the decision output of SAD3, depending on the codec mode of operation and the input signal characteristics. The resulting decision is then fed to a DTX hangover module to produce the final output .

Internally the flag is used to always produce a flag with DTX hangover whether DTX is on or off. When this no longer is needed and DTX is on replaces the combined to reduce the number of variables used externally.

5.1.12.1 SAD1 module

The SAD1 module is a sub-band SNR based SAD with hangover that uses significance thresholds to reduce the number of false detections caused by energy variations in the background noise. During the SAD initialization period, the variables are set as follows

(162)

The output of the SAD1 module is two binary flags (signal activity decisions) and . The difference between them is due to the setting of parameters for the significance thresholds. The first binary decision is used by the speech/music classification algorithm described in clause 5.1.13.5. The second binary decision is developed further and leads to the final SAD1 decision, . Note that all decisions can be modified by the subsequent modules.

The spectral analysis described in clause 5.1.5 is performed twice per frame. Let and denote the energy per critical band for the first and second spectral analysis, respectively (as computed in clause 5.1.5.2). The average energy per critical band for the whole frame and part of the previous frame is computed as

(163)

where denotes the energy per critical band from the second analysis of the previous frame, and hereafter denote respectively the minimum and the maximum critical band involved in the computation, where= 1, = 16 for NB input signals and = 0, = 19 for WB signals (see Table 2 in subclause 5.1.5.1). The signal-to-noise ratio (SNR) per critical band is then computed as

(164)

where is estimated noise energy per critical band, as explained in clause 5.1.11.1. The average SNR per frame, in dB, is then computed using significance thresholds with two different settings

(165)

where , , and are control parameters that differ between codec modes and sampling rates.

Table 6: Control parameters for the significance thresholds for different bandwidths

| Bandwidth | | | | |
|---|---|---|---|---|
| NB | 2.65 | 0.05 | 1.75 | 0.25 |
| WB | 2.5 | 0.2 | 1.3 | 0.8 |
| SWB | 2.5 | 0.2 | 1.75 | 0.25 |

The signal activity is detected by comparing the two average SNRs per frame to a certain threshold. The first is used without hangover, while the second has a hangover period added to prevent frequent switching at the end of an active speech period. The threshold is a function of the long-term SNR and the estimated frame-to-frame energy variations, mainly the variation in noise, but without the need to identify noise frames. The initial estimate of the long-term SNR is given by

(166)

where is the long-term active signal energy, calculated in equation (167), and is the long-term noise energy, calculated in equation (168). If this estimate is lower than the signal dynamics estimate calculated in equation (169), the estimate is adjusted according to

(170)

The energy variation is calculated in equation (105) in clause 5.1.11.1.

The threshold is calculated in three steps: an initial value followed by two sequential modifications. The initial value is calculated as

(171)

where the function parameters are set according to the current input bandwidth, as summarized in the following table

Table 7: Functional parameters for the initial calculation for different bandwidths

| Bandwidth | | | | |
|---|---|---|---|---|
| NB | 0.1 | 16.0 | 4.00 | 1.15 |
| WB and SWB | 0.1 | 16.1 | 2.05 | 1.65 |

If the estimated SNR conditions are good, i.e. if , the threshold is updated and upper limited for certain low-level NB signals. That is

(172)

5.1.12.1.1 SNR outlier filtering

The average SNR per frame, , that is estimated as shown in equation (202) is updated such that any sudden instantaneous SNR variations in certain sub-bands do not cause spurious deviations in the average SNR from the long term behaviour. A set of bands and SNRs per band are determined and accumulated based on noise characteristics as shown in equations (209), (210). The critical band that contains the maximum average SNR is identified initially as the outlier band whose index is represented as, , and the outlier band SNR is given by,

(173)

(174)
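The initial outlier selection described above amounts to locating the critical band with the largest SNR. The following is a minimal sketch of that step only; the function name and the list-based representation of the per-band SNRs are illustrative assumptions, and the subsequent filtering of equations (175) to (177) is not reproduced.

```python
def find_outlier_band(snr_per_band):
    """Return (index, snr) of the critical band with the largest SNR.

    snr_per_band is a hypothetical list of per-band SNR values; the first
    maximum is returned if several bands share the same value.
    """
    if not snr_per_band:
        raise ValueError("at least one band is required")
    index = max(range(len(snr_per_band)), key=lambda i: snr_per_band[i])
    return index, snr_per_band[index]


# Example: band 2 carries the largest SNR and is flagged as the outlier.
outlier_index, outlier_snr = find_outlier_band([1.5, 3.0, 12.0, 2.5])
```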

The background noise energy is accumulated in bands through and in bands through .

(175)

(176)

The average SNR, , is modified for WB and SWB signals through outlier filtering as follows,

(177)

The outlier filtering parameters used in updating the average SNR are listed in the table below.

Table 8: SNR outlier filtering parameters

| Parameter | Value |
|---|---|
| MAX_SNR_OUTLIER_1 | 10 |
| MAX_SNR_OUTLIER_2 | 25 |
| MAX_SNR_OUTLIER_3 | 50 |
| SNR_OUTLIER_WGHT_1 | 1.0 |
| SNR_OUTLIER_WGHT_2 | 1.01 |
| SNR_OUTLIER_WGHT_3 | 1.02 |
| OUTLIER_THR_1 | 10 |
| OUTLIER_THR_2 | 6 |
| Maximum outlier band index (MAX_SNR_OUTLIER_IND) | 17 |
| TH_CLEAN | 35 |

Based on the outlier band estimated in equation (207), a weighting is determined as per equation (211) and applied to the SNRs per band (through outlier filtering, by subtracting the SNR in the outlier band) or to the average SNR. The threshold, , is updated based on the outlier filtering and further statistics from background noise level variations, the previous frame coder type, and the weighting of the SNR per band. The threshold update is not performed when the long-term SNR, , is below the clean speech threshold, TH_CLEAN = 35 dB.

(178)

where the smoothed average SNR, , is calculated after the SNR outlier filtering is performed in equation (179).

(180)

The updated threshold, , as shown in equation (181) and the updated average SNR, , as shown in equation (182) are used in signal activity detection logic as described in Clause 5.1.12.3.

5.1.12.2 SAD2 module

The SAD2 module is also a sub-band SNR based SAD and makes an activity decision for each frame by measuring the frame’s modified segmental SNR. The output of SAD2 module is a binary flag which is set to 1 for active frame and set to 0 for inactive frame. For each frame, the SNR per critical band is first computed. The average energy per critical band for the whole frame and part of the previous frame is computed as

(183)

where denotes the energy per critical band from the second spectral analysis of the previous frame, and denote respectively the energy per critical band for the first and second spectral analysis of the current frame, = 0, = 19. More weighting is given to the energy of the second spectral analysis for the current frame if the energy of the second spectral analysis is higher than the first spectral analysis. This is designed to improve the detection of signal onsets. The SNR per critical band is then computed as

(184)

where is the estimated noise energy per critical band, as described in clause 5.1.11. The SNR per critical band is then converted to a logarithmic domain as

(185)

The log SNR per critical band is then modified by

(186)

where is the modified SNR per critical band, is an offset value which is a function of the critical band and the long-term SNR of the input signal as calculated in equation (166), summation of is constrained to be not greater than 2, and is an exponential factor used to re-shape the mapping function between and , is also a function of the long-term SNR of the input signal. The offset value is determined as shown in the following table

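Table 9: Determination of the offset value

| Long-term SNR | Critical band i < 2 | 2 ≤ i < 7 | 7 ≤ i < 18 | 18 ≤ i |
|---|---|---|---|---|
| > 24 | 0 | 0 | 0 | 0 |
| > 18 and ≤ 24 | 0.1 | 0.2 | 0.2 | 0.2 |
| ≤ 18 | 0.2 | 0.4 | 0.3 | 0.4 |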

and is determined as

(187)
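The offset lookup of Table 9 can be sketched as a small function. This is an illustration under stated assumptions: the table columns are taken to be critical band ranges (below 2, 2 to 6, 7 to 17, 18 and above) and the rows long-term SNR ranges (above 24, above 18 up to 24, 18 and below); the strictness of the range boundaries was lost in the source table and is inferred here so that the ranges do not overlap. The function name is hypothetical.

```python
def snr_offset(band, long_term_snr):
    """Look up the Table 9 offset for a critical band (0..19) and long-term SNR."""
    if not 0 <= band <= 19:
        raise ValueError("critical band index must be in 0..19")
    # Column selection by critical band index (assumed boundaries).
    if band < 2:
        column = 0
    elif band < 7:
        column = 1
    elif band < 18:
        column = 2
    else:
        column = 3
    # Row selection by long-term SNR (assumed boundaries).
    if long_term_snr > 24:
        row = (0.0, 0.0, 0.0, 0.0)
    elif long_term_snr > 18:
        row = (0.1, 0.2, 0.2, 0.2)
    else:
        row = (0.2, 0.4, 0.3, 0.4)
    return row[column]
```

For example, a frame with a long-term SNR of 20 would receive an offset of 0.2 in critical band 5 under these assumptions.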

The modified segmental SNR is then computed as

(188)

and a relaxed modified segmental SNR is also computed. The procedure of calculating the relaxed modified segmental SNR is similar to the calculation of the modified segmental SNR with the only difference being that, besides , another offset value is also added to the log SNR per critical band when calculating the relaxed modified SNR per critical band. The relaxed modified SNR per critical band is therefore computed as

(189)

where is a function of the critical band and is determined as

(190)

The relaxed modified segmental SNR is used in the hangover process at a later stage of the algorithm.

A further enhancement (increase in value) is made to the modified segmental SNR if an unvoiced signal is detected. Unvoiced signal is detected if both of the two critical bands covering the highest frequency range have SNRs greater than a threshold of 5, i.e. if and . In this case, the contributions of the overall modified segmental SNR from the two critical bands is boosted. The boost is performed over the two critical bands where the number of critical bands is extended from two to eight and the corresponding modified segmental SNR is re-computed over the extended bands as

(191)

where multiplication by 20/26 effectively performs the mapping of the modified segmental SNR calculated on the extended scale back onto the same scale as if it were computed over the original 20 critical bands. The re-computation of modified segmental SNR is only conducted if the computed value is greater than before. If no unvoiced signal of above type is detected, , which is the number of critical bands whose SNR is greater than a threshold of 2 is determined. If >13, a second type unvoiced signal is detected, and if the long-term SNR of the input signal is further below a threshold of 24, the modified segmental SNR in this case is re-computed as

(192)

where and is limited to be a positive value.
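The two unvoiced detection rules described above can be sketched as follows. This is a hedged illustration: the function name and return convention are hypothetical, the per-band SNR list is assumed to be ordered from the lowest to the highest critical band, and the re-computation of the modified segmental SNR in equations (191) and (192) is not reproduced.

```python
def detect_unvoiced(band_snr, long_term_snr):
    """Return (kind, recompute_type2).

    kind is 1 if both of the two highest critical bands have SNR > 5,
    2 if more than 13 bands have SNR > 2, and 0 otherwise.
    recompute_type2 is True when a type-2 detection coincides with a
    long-term SNR below 24, the case in which the text re-computes the
    modified segmental SNR.
    """
    if len(band_snr) < 2:
        raise ValueError("at least two critical bands are required")
    if band_snr[-1] > 5 and band_snr[-2] > 5:
        return 1, False
    bands_above_two = sum(1 for snr in band_snr if snr > 2)
    if bands_above_two > 13:
        return 2, long_term_snr < 24
    return 0, False
```

Note that the first rule takes precedence: the band count is only evaluated when the first type of unvoiced signal has not been detected, as stated in the text.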

The primary signal activity decision is made in SAD2 by comparing the modified segmental SNR to a decision threshold . The decision threshold is a piece wise linear function of the long-term SNR of the input signal and is determined as

(193)

If the modified segmental SNR is greater than the decision threshold, the activity flag is set to 1 and a counter of consecutive active frames, , used by SAD2 is incremented by 1. In addition, if the current frame is in either a soft or a hard hangover period, as described later in this subclause, the corresponding hangover period elapses by 1. If the modified segmental SNR is not greater than the decision threshold, the consecutive active frames counter is set to 0 and the setting of is further evaluated by a hangover process.

The hangover scheme used by SAD2 consists of a soft hangover process followed by a hard hangover process. The soft hangover is designed to prevent low-level voiced signals during a speech offset from being cut. Within the soft hangover period, SAD2 operates in an offset working state in which the relaxed modified segmental SNR calculated earlier is compared to the decision threshold (in the normal working state, the modified segmental SNR is used). If the relaxed modified segmental SNR is greater than the decision threshold, the activity flag is set to 1 and the soft hangover period elapses by 1. Otherwise, the soft hangover period is quit and the setting of is finally evaluated by a hard hangover process. Within the hard hangover period, the activity flag is forced to 1 and the hard hangover period elapses by 1.

The soft hangover period is initialized if the number of consecutive voiced frames exceeds 3. The frame is considered a voiced frame if the pitch correlation is not low and the pitch stays relatively stable, that is, if and where , , are respectively the normalized pitch correlation for the second half of the previous frame, the first half of the current frame and the second half of the current frame as calculated, is the noise correction factor, and , , , are respectively the OL pitch lag for the second half of the previous frame, the first half of the current frame, the second half of the current frame and the look-ahead, as described in subclause 5.1.10.

The value to which the soft hangover period is initialized is a function of the long-term SNR of the input signal and the noise fluctuation computed in equation (194), and is determined as

Table 10: Determination of the soft hangover period initialization length

1

1

2

1

3

4

The hard hangover period is initialized if the consecutive active frames counter reaches a threshold of 3. The value to which the hard hangover period is initialized is also a function of the long-term SNR of the input signal and the noise fluctuation, and is determined as

Table 11: Determination of the hard hangover period initialization length

1

1

2

1

1

3
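The soft and hard hangover logic of SAD2 can be sketched as a small state machine. This is an illustration under several assumptions, not the normative procedure: the class and attribute names are hypothetical, the initialization lengths of Tables 10 and 11 are passed in by the caller, "elapses by 1" is modelled as decrementing a remaining-frames counter, the soft period is decremented before the hard period when both are running during an active frame, and the hard period is re-armed on every frame where the active run is at least 3.

```python
class Sad2Hangover:
    """Sketch of the SAD2 soft/hard hangover (all names are assumptions)."""

    def __init__(self, soft_init, hard_init):
        self.soft_init = soft_init  # value from Table 10, supplied by caller
        self.hard_init = hard_init  # value from Table 11, supplied by caller
        self.soft_left = 0          # remaining soft hangover frames
        self.hard_left = 0          # remaining hard hangover frames
        self.active_run = 0         # consecutive active frames
        self.voiced_run = 0         # consecutive voiced frames

    def update(self, msnr, relaxed_msnr, threshold, voiced):
        self.voiced_run = self.voiced_run + 1 if voiced else 0
        if msnr > threshold:
            flag = 1
            self.active_run += 1
            # The hangover period currently in effect elapses by one frame.
            if self.soft_left > 0:
                self.soft_left -= 1
            elif self.hard_left > 0:
                self.hard_left -= 1
        else:
            flag = 0
            self.active_run = 0
            if self.soft_left > 0:
                if relaxed_msnr > threshold:
                    flag = 1
                    self.soft_left -= 1
                else:
                    self.soft_left = 0  # quit the soft hangover
            if flag == 0 and self.hard_left > 0:
                flag = 1
                self.hard_left -= 1
        # (Re)initialize the hangover periods.
        if self.voiced_run > 3:
            self.soft_left = self.soft_init
        if self.active_run >= 3:
            self.hard_left = self.hard_init
        return flag


hangover = Sad2Hangover(soft_init=2, hard_init=1)
flags = [hangover.update(10.0, 10.0, 5.0, True) for _ in range(4)]
flags.append(hangover.update(4.0, 6.0, 5.0, False))  # kept by soft hangover
flags.append(hangover.update(4.0, 4.0, 5.0, False))  # kept by hard hangover
flags.append(hangover.update(4.0, 4.0, 5.0, False))  # hangover exhausted
```

In the usage example, four voiced active frames arm both hangover periods, after which one frame is kept active by the soft hangover and one by the hard hangover before the flag drops to 0.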

The noise fluctuation is estimated over background frames declared as inactive by the final SAD flag of SAD2, , by measuring the moving average of the segmental SNR in the logarithm domain. The noise fluctuation is computed as

(195)

where denotes the noise fluctuation of the previous frame, is the forgetting factor controlling the update rate of the moving average filter and is set to 0.99 for an increasing update ( when ) and 0.9992 for a decreasing update ( when ). is constrained by for decreasing updates and for increasing updates. To speed up the initialization of noise fluctuation, for the first 50 background frames, is set to 0.9 for increasing updates and 0.95 for decreasing updates, is constrained by for decreasing updates and for increasing updates.
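The asymmetric update rates described above can be illustrated with a first-order recursive average. This is a sketch under assumptions: equation (195) is not reproduced here, so the recursion form, the interpretation of an "increasing update" as the new input exceeding the previous estimate, and all names are hypothetical; the range constraints mentioned in the text are omitted.

```python
def update_noise_fluctuation(previous, log_seg_snr, background_frames_seen):
    """One update of the noise fluctuation estimate on a background frame.

    previous               -- estimate from the previous background frame
    log_seg_snr            -- current log-domain segmental SNR (assumed input)
    background_frames_seen -- number of background frames processed so far
    """
    increasing = log_seg_snr > previous
    if background_frames_seen < 50:
        # Faster adaptation during the first 50 background frames.
        forgetting = 0.9 if increasing else 0.95
    else:
        forgetting = 0.99 if increasing else 0.9992
    return forgetting * previous + (1.0 - forgetting) * log_seg_snr
```

With these factors the estimate rises faster than it falls, and both directions adapt more quickly during initialization.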

5.1.12.3 Combined decision of SAD1 and SAD2 modules for WB and SWB signals

The decision of the SAD1 module is modified by the decision of the SAD2 module for WB and SWB signals.

For the decision logic is direct if the average SNR per frame is larger than the SAD decision threshold and if the final SAD2 flag is set to 1. That is,

(196)

Likewise, for the decision logic is direct if the average SNR per frame is larger than the SAD decision threshold and if the final SAD2 flag is set to 1. That is,

(197)

otherwise, is set to 0 and the hangover logic decides if should be set to active or not.

The hangover logic works as a state machine that keeps track of the number of frames since the last active primary decision and if a sufficient number of consecutive active frames have occurred in a row to allow the final decision to remain active even if the primary decision has already gone inactive. Thus, if there has not been a sufficient number of primary decisions in a row there is no hangover addition and the final decision is set to inactive, that is is set to 0 if is 0.

The hangover length depends on , initially set to 0 frames and if it is set to 4 frames and if it is set to 3 frames. The counting of hangover frames is reset only if at least 3 consecutive active speech frames () were present, meaning that no hangover is used if the primary decision was active in only one or two adjacent frames. This avoids adding the hangover after short energy bursts in the acoustic signal, which would increase the average data rate in DTX operation.
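The hangover state machine described above can be sketched as follows. The sketch is illustrative: the class name and interface are hypothetical, and the hangover length (0, 3 or 4 frames depending on conditions that are not reproduced here) is supplied by the caller on each frame.

```python
class HangoverCounter:
    """Sketch of the SAD1 hangover logic (names are assumptions)."""

    def __init__(self):
        self.active_run = 0          # consecutive active primary decisions
        self.frames_since_run = None  # None until a run of 3 has occurred

    def update(self, primary_active, hangover_length):
        if primary_active:
            self.active_run += 1
            if self.active_run >= 3:
                # Only a run of at least 3 frames resets the hangover count.
                self.frames_since_run = 0
            return 1
        self.active_run = 0
        if (self.frames_since_run is not None
                and self.frames_since_run < hangover_length):
            self.frames_since_run += 1
            return 1  # final decision kept active by the hangover
        return 0


counter = HangoverCounter()
burst = [counter.update(p, 4) for p in (1, 1, 0)]
speech = [counter.update(p, 3) for p in (1, 1, 1, 0, 0, 0, 0, 0)]
```

In the usage example, the two-frame burst is followed immediately by an inactive final decision, whereas the three-frame run is extended by three hangover frames.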

5.1.12.4 Final decision of the SAD1 module for NB signals

Similarly to the WB case, two primary decisions are generated, but in this case there is no dependency on the SAD2 module. If , the frame is declared as active and the primary SAD flag, , is set to 1. Otherwise, is set to 0.

Compared to WB, the final SAD decision for NB differs in how the primary decision is made and how the hangover is handled. For NB signals, the hangover has a window of 8 frames after the last run of three consecutive active primary decisions, that is . During this hangover period, the SAD decision is not automatically set to active; instead, the threshold is decreased by 5.2 if , and by 2 if . The SAD decision is then made by comparing the average SNR to the corrected threshold following condition (172). Again, the counting of hangover frames is reset only if at least 3 consecutive active frames were present.

5.1.12.5 Post-decision parameter update

After the final decision is formed, some SAD1-related long-term parameters are updated according to the primary and final decisions.

(198)

These are then used to form a measure of the long term primary activity of

(199)

Similar metrics are generated for the

(200)

These are then used to form a measure of the long term primary activity of

(201)

To keep track of the history of the decisions, the registers are updated so that keeps track of the latest 50 frames with regard to the decisions, by removing the oldest decision and adding the latest, and by updating so that it reflects the current number of active frames in the registers.

Similarly, to keep track of the history of the decisions, the registers are updated so that keeps track of the latest 16 frames with regard to the decisions, by removing the oldest decision and adding the latest, and by updating so that it reflects the current number of active frames in the registers.
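The decision history registers described above behave like fixed-length sliding windows with a running count of active frames. The following is a minimal sketch of that bookkeeping; the class name and the use of a double-ended queue are illustrative assumptions rather than the layout used by the codec.

```python
from collections import deque


class DecisionRegister:
    """Keeps the latest `length` binary decisions and their active count."""

    def __init__(self, length):
        self.history = deque([0] * length, maxlen=length)
        self.active_count = 0

    def push(self, decision):
        oldest = self.history[0]          # decision about to be dropped
        latest = 1 if decision else 0
        self.history.append(latest)       # maxlen discards the oldest entry
        self.active_count += latest - oldest
        return self.active_count


register_50 = DecisionRegister(50)   # window of the latest 50 frames
register_16 = DecisionRegister(16)   # window of the latest 16 frames
for _ in range(20):
    register_16.push(1)
    register_50.push(1)
```

After 20 active frames the 16-frame register saturates at 16 active entries while the 50-frame register holds 20.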

When the SAD decisions and the speech/music classifier decision have been made and the noise estimation for the current frame has been completed, the long-term estimate of the active speech level, , and the long-term noise level estimate can be updated. During the first four frames, both variables are initialized using mainly the current input as follows

(202)

where is the total sub-band noise level after the update for the current frame. For the long-term noise level estimate, the initialization uses a different filter coefficient during the remainder of the 150-frame initialization period, as follows

(203)

For the active speech level, the update after the initial four frames only occurs if is 1 and the from the speech/music classifier is 0, where is generated based on the features calculated using equation (329) in subclause 5.1.13.6.3. If those conditions are met, the speech level estimate is updated according to

(204)

5.1.12.6 SAD3 module

The SAD3 module is shown in Figure 12. The processing steps are described as follows:

  1. Extract features of the signal according to the sub-band signals from CLDFB.
  2. Calculate some SNR parameters according to the extracted features of the signal and make a decision of background music.
  3. Make a pre-decision of SAD3 according to the features of the signal, the SNR parameters, and the output flag of the decision of background music and then output a pre-decision flag.
  4. The output of SAD3 is generated through the addition of SAD3 hangover.

Figure 12: Block diagram of SAD3

5.1.12.6.1 Sub-band FFT

The sub-band FFT is used to obtain the spectrum amplitude of the signal. Let X[k,l] denote the output of the CLDFB applied to the lth sample in the kth sub-band. X[k,l] is converted from the time domain into a frequency-domain representation by FFT as follows:

(235)

The spectrum amplitude of each sample is computed in the following steps:

Step 1: Compute the energy of as follows:

(205)

where , are the real part and the imaginary part of , respectively.

Step 2: If k is even, the spectrum amplitude, denoted by , is computed by

(206)

If k is odd, the spectrum amplitude is computed by

(207)

5.1.12.6.2 Computation of signal features

In the pre-decision, the energy features (EF), spectral centroid features (SCF), and time-domain stability features (TSF) of the current frame are computed from the sub-band signal; the spectral flatness features (SFF) and tonality features (TF) are computed from the spectrum amplitude.

5.1.12.6.2.1 Computation of EF

The energy features of the current frame are computed by using the sub-band signal. The energy of background noise of the current frame, including both the energy of background noise over individual sub-band and the energy of background noise over all sub-bands, is estimated with the updated flag of background noise, the energy features of the current frame, and the energy of background noise over all sub-bands of the previous frame. The energy of background noise of the current frame will be used to compute the SNR parameters of the next frame (see subclause 5.1.12.6.3). The energy features include the energy parameters of the current frame and the energy of background noise. The energy parameters of the frame are the weighted or non-weighted sum of energies of all sub-bands.

The frame energy is computed by:

(208)

The energy of sub-band divided non-uniformly is computed by:

(209)

Where is the sub-band division indices of . The sub-bands based on this kind of division are also called SNR sub-bands and are used to compute the SNR of sub-band. is the number of SNR sub-bands.

The energy of sub-band background noise of the current frame is computed by:

(241)

Where is the energy of sub-band background noise of the previous frame.

The energy of background noise over all sub-bands is computed according to the background update flag, the energy features of the current frame and the tonality signal flag, and it is defined as follows:

(210)

If certain conditions that include at least that the background update flag is 1 and the tonality signal flag is 0 are met, and are computed by:

(211)

(212)

Otherwise, and are computed by:

(213)

(214)

Where and are the sum of and the counter of , respectively. The superscript [-1] denotes the previous frame and [0] denotes the current frame.

5.1.12.6.2.2 Computation of SCF

The spectral centroid features are the ratio of the weighted sum to the non-weighted sum of energies of all sub-bands or partial sub-bands, or the value is obtained by applying a smooth filter to this ratio. The spectral centroid features can be obtained in the following steps:

a) Divide the sub-bands for computing the spectral centroids as shown in Table 12.

Table 12: Sub-band division for computing spectral centroids

| Spectral centroid feature number (i) | | |
|---|---|---|
| 2 | 0 | 9 |
| 3 | 1 | 23 |

b) Compute two spectral centroid features, i.e.: the spectral centroid in the first interval and the spectral centroid in the second interval, by using the sub-band division for computing spectral centroids in Step a) and the following equation:

(247)

c) Smooth the spectral centroid in the second interval, , to obtain the smoothed spectral centroid in the second interval by

(248)
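The spectral centroid described above, a ratio of a weighted to a non-weighted sum of sub-band energies, can be sketched as follows. This is an illustration under assumptions: equation (247) is not reproduced here, so the weight of sub-band k is assumed to be k + 1, the two numeric columns of Table 12 are assumed to be the first and last sub-band of each interval (inclusive), any stabilizing constants are omitted, and all names are hypothetical.

```python
# Intervals from Table 12, keyed by spectral centroid feature number.
CENTROID_INTERVALS = {2: (0, 9), 3: (1, 23)}


def spectral_centroid(sub_band_energy, start, end):
    """Energy-weighted mean sub-band position over [start, end]."""
    bands = range(start, end + 1)
    total = sum(sub_band_energy[k] for k in bands)
    if total <= 0.0:
        return 0.0  # no energy in the interval
    weighted = sum((k + 1) * sub_band_energy[k] for k in bands)
    return weighted / total


flat_energy = [1.0] * 24
centroid_first = spectral_centroid(flat_energy, *CENTROID_INTERVALS[2])
centroid_second = spectral_centroid(flat_energy, *CENTROID_INTERVALS[3])
```

For a flat energy distribution the centroid falls in the middle of each interval; energy concentrated in higher sub-bands moves it upward.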

5.1.12.6.2.3 Computation of SFF

Spectral Flatness Features are the ratio of the geometric mean to the arithmetic mean of certain spectrum amplitude, or this ratio multiplied by a factor. The spectrum amplitude, is smoothed as follows:

(215)

where and are the smoothed spectrum amplitude of the current frame and the previous frame, respectively. is the number of spectrum amplitude.

Then the smoothed spectrum amplitude is divided into three frequency regions as shown in Table 12 and the spectral flatness features are computed for these frequency regions.

Table 12: Sub-band division for computing spectral flatness

| Spectral flatness number (k) | | |
|---|---|---|
| 0 | 5 | 19 |
| 1 | 20 | 39 |
| 2 | 40 | 64 |

The spectral flatness features are the ratio of the geometric mean to the arithmetic mean of the spectrum amplitude or the smoothed spectrum amplitude.

Let +1 be the number of the spectrum amplitudes used to compute the spectral flatness feature. We have

(250)

The spectral flatness features of the current frame are further smoothed as follows:

(251)

Where and are the smoothed spectral flatness features of the current frame and the previous frame respectively.
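The ratio of geometric to arithmetic mean that defines the spectral flatness can be sketched per frequency region as follows. The sketch assumes that the two numeric columns of the flatness table are the first and last amplitude index of each region (inclusive); the function and constant names are hypothetical, and the smoothing of equations (215) and (251) is not included.

```python
import math

# Regions from the spectral flatness table, keyed by flatness number.
FLATNESS_REGIONS = {0: (5, 19), 1: (20, 39), 2: (40, 64)}


def spectral_flatness(amplitude, start, end):
    """Geometric mean divided by arithmetic mean over [start, end]."""
    values = amplitude[start:end + 1]
    if not values or any(v <= 0.0 for v in values):
        return 0.0  # the geometric mean vanishes if any amplitude is zero
    log_mean = sum(math.log(v) for v in values) / len(values)
    arithmetic_mean = sum(values) / len(values)
    return math.exp(log_mean) / arithmetic_mean


flat_spectrum = [2.0] * 65
peaky_spectrum = [1.0] * 65
peaky_spectrum[10] = 100.0  # a single strong peak in region 0
flatness_flat = spectral_flatness(flat_spectrum, *FLATNESS_REGIONS[0])
flatness_peaky = spectral_flatness(peaky_spectrum, *FLATNESS_REGIONS[0])
```

A flat spectrum yields a value close to 1, while a spectrum dominated by a single peak yields a value well below 1, which is why low flatness is used as an indicator of tonal content.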

5.1.12.6.2.4 Computation of TSF

The time-domain stability features are the ratio of the variance of the sum of energy amplitudes to the expectation of the squared sum of energy amplitudes, or this ratio multiplied by a factor. The time-domain stability features are computed with the energy features of the most recent N frame. Let the energy of the nth frame be . The energy amplitude of is computed by

(252)

By adding together the energy amplitudes of two adjacent frames from the current frame to the Nth previous frame, N/2 sums of energy amplitudes are obtained as

(253)

Where is the energy amplitude of the current frame for k = 0 and the energy amplitude of the previous frames for k < 0.

Then the ratio of the variance to the average energy of the N/2 recent sums is computed and the time-domain stability is obtained as follows:

(254)

Note that the value of N is different when computing different time-domain stabilities.
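The time-domain stability computation can be sketched as follows. This is an illustration under assumptions: equations (252) to (254) are not reproduced here, so the energy amplitude is assumed to be the square root of the frame energy, the pairs are formed from non-overlapping adjacent frames, and any stabilizing constants are omitted. The function name is hypothetical.

```python
import math


def time_domain_stability(frame_energies):
    """Variance of pairwise amplitude sums divided by their mean square.

    frame_energies holds the energies of the N most recent frames,
    with N even.
    """
    n = len(frame_energies)
    if n < 2 or n % 2 != 0:
        raise ValueError("an even number of frames (at least 2) is required")
    amplitudes = [math.sqrt(e) for e in frame_energies]
    # N/2 sums of the amplitudes of two adjacent frames.
    sums = [amplitudes[i] + amplitudes[i + 1] for i in range(0, n, 2)]
    mean = sum(sums) / len(sums)
    variance = sum((s - mean) ** 2 for s in sums) / len(sums)
    mean_square = sum(s * s for s in sums) / len(sums)
    return variance / mean_square if mean_square > 0.0 else 0.0
```

A stationary signal gives a value near 0, whereas strongly fluctuating frame energies push the value towards 1.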

5.1.12.6.2.5 Computation of TF

The tonality features are computed with the spectrum amplitudes. More specifically, they are obtained by computing the correlation coefficient of the amplitude difference of two adjacent frames, or with a further smoothing of the correlation coefficient, in the following steps:

a) Compute the spectrum-amplitude difference of two adjacent spectrum amplitudes in the current frame. If the difference is smaller than 0, set it to 0.

(255)

b) Compute the correlation coefficient between the non-negative amplitude difference of the current frame obtained in Step a) and the non-negative amplitude difference of the previous frame to obtain the first tonality features as follows:

(256)

where is the amplitude difference of the previous frame.

Various tonality features can be computed as follows:

(257)

where are tonality features of the previous frame.
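Steps a) and b) can be sketched as follows. This is an illustration under assumptions: equations (255) and (256) are not reproduced here, so the difference is taken as the next amplitude minus the current one, and the correlation coefficient is assumed to be the normalized cross-correlation without mean removal. The function names are hypothetical, and the smoothing of equation (257) is not included.

```python
import math


def non_negative_differences(amplitude):
    """Differences of adjacent spectrum amplitudes, with negatives set to 0."""
    return [max(0.0, amplitude[k + 1] - amplitude[k])
            for k in range(len(amplitude) - 1)]


def tonality(diff_current, diff_previous):
    """Normalized correlation between two difference vectors."""
    numerator = sum(a * b for a, b in zip(diff_current, diff_previous))
    denominator = math.sqrt(sum(a * a for a in diff_current)
                            * sum(b * b for b in diff_previous))
    return numerator / denominator if denominator > 0.0 else 0.0


current = non_negative_differences([1.0, 3.0, 2.0, 5.0])
```

When the spectral peaks stay in place from frame to frame, as for a steady tone, the two difference vectors coincide and the value approaches 1; unrelated spectra give values near 0.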

5.1.12.6.3 Computation of SNR parameters

The SNR parameters of the current frame are computed with the background energy estimated from the previous frame, the energy parameters and the energy of the SNR sub-bands of the current frame.

The SNR of all sub-bands is computed by:

(258)

The average total SNR of all sub-bands is computed by:

(259)

where N is number of the most recent frames and is of the ith frame.

The frequency-domain SNR is computed by:

(260)

where is the number of SNR sub-band and is the SNR of the ith sub-band by:

(261)

The first long-time SNR is computed by:

(262)

The computation method of and can be found in subclause 5.1.12.6.6.

The second long-time SNR is obtained by accordingly adjusting a parameter  associated with as follows:

(263)

where:

(264)

where is the long-time background spectral centroid. If the current frame is active frame and the background-update flag is 1, the long-time background spectral centroid of the current frame is updated as follows:

(265)

where is the long-time background spectral centroid of the previous frame.

The initial long-time frequency-domain SNR of the current frame is computed by:

(266)

where and are respectively the frequency-domain SNR accumulator and frequency-domain SNR counter when the current frame is pre-decided as active sound, and and are respectively accumulator and counter when the current frame is pre-decided as inactive sound. The superscript [-1] denotes the previous frame. The details of computation can be found in Steps e) and i) of subclause 5.1.12.6.6.

The smoothed average long-time frequency-domain SNR is computed by:

(267)

The long-time frequency-domain SNR is computed by:

(268)

where MAX_LF_SNR is the maximum of .

5.1.12.6.4 Decision of background music

With the energy features, , , , and of the current frame, the tonality signal flag of the current frame is computed and used to determine whether the current frame is tonal signal. If it is a tonal signal, the current frame is music and the following procedure is carried out:

a) Suppose the current frame is a non-tonal signal, and a flag is used to indicate whether the current frame is a tonal frame. If = 1, the current frame is a tonal frame. If = 0, the current frame is a non-tonal frame.

b) If >0.6 or its smoothed value is greater than 0.86, go to Step c). Otherwise, go to Step d).

c) Verify the following three conditions:

(1) The time-domain stability feature is smaller than 0.072;

(2) The spectral centroid feature is greater than 1.2;

(3) One of three spectral flatness features is smaller than its threshold, .

If all the above conditions are met, the current frame is considered as a tonal frame and the flag is set to 1. Then go to Step d).

d) Update the tonal level feature according to the flag . The initial value of is set in the region [0, 1] when the active-sound detector begins to work.

(269)

Where and are respectively the tonal level of the current frame and the previous frame.

e) Determine whether the current frame is a tonal signal according to the updated and set the tonality signal flag .

If is greater than 0.5, the current frame is determined as a tonal signal. Otherwise, the current frame is determined as a non-tonal signal.

(270)
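The tonal frame test of Steps b) and c) and the final thresholding of Step e) can be sketched as follows. The sketch is illustrative: the function and parameter names are hypothetical, the spectral flatness thresholds are not given in the text above and are therefore supplied by the caller, and the tonal level update of equation (269) is not reproduced.

```python
def is_tonal_frame(tonality, tonality_smoothed, stability, centroid,
                   flatness, flatness_thresholds):
    """Steps b) and c): decide whether the current frame is a tonal frame."""
    if not (tonality > 0.6 or tonality_smoothed > 0.86):
        return False
    one_flatness_low = any(f < t for f, t in zip(flatness, flatness_thresholds))
    return stability < 0.072 and centroid > 1.2 and one_flatness_low


def tonality_signal_flag(tonal_level):
    """Step e): the frame is a tonal signal if the tonal level exceeds 0.5."""
    return 1 if tonal_level > 0.5 else 0


# Placeholder thresholds for illustration only, not values from the text.
example_thresholds = (0.5, 0.5, 0.5)
tonal = is_tonal_frame(0.7, 0.5, 0.05, 1.5, (0.4, 0.9, 0.9), example_thresholds)
```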

5.1.12.6.5 Decision of background update flag

The background update flag is used to indicate whether the energy of background noise is updated and its value is 1 or 0. When this flag is 1, the energy of background noise is updated. Otherwise, it is not updated.

The initial background update flag of the current frame is computed by using the energy features, the spectral centroid features, the time-domain stability features, the spectral flatness features, and the tonality features of the current frame. The initial background update flag is updated with the VAD decision, the tonality features, the SNR parameters, the tonality signal flag, and the time-domain stability features of the current frame to obtain the final background update flag. With the obtained background update flag, background noise is detected.

First, suppose the current frame is background noise. If any one of the following conditions is met, the current frame is not noise signal.

a) The time-domain stability > 0.12;

b) The spectral centroid > 4.0 and the time-domain stability > 0.04;

c) The tonality feature > 0.5 and the time-domain stability > 0.1;

d) The spectral flatness of each sub-band or the average obtained by smoothing the spectral flatness is smaller than its specified threshold, or one of three spectral flatness features is smaller than its threshold: ;

e) The energy of the current frame is greater than a specified threshold: , where is the long time smoothed energy of the previous frame and of kth frame is computed : ;

f) The tonality features are greater than their corresponding thresholds: >0.60 OR >0.86;

g) The initial background update flag can be obtained in Steps a) – f). The initial background update flag is then updated. When the SNR parameters, the tonality features, and the time-domain stability features are smaller than their corresponding thresholds : <0.3 AND <1.2 AND<0.5 AND <0.1 and both the combined and are set to 0, the background update flag is updated to 1.
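Conditions a) to f) for the initial background update flag can be sketched as follows. This is an illustration under assumptions: the thresholds of conditions d) and e) are not given in the text above, so their outcomes are passed in as booleans; the same tonality feature is assumed in conditions c) and f); and all names are hypothetical. The subsequent update of Step g) is not reproduced.

```python
def initial_background_update_flag(stability, centroid, tonality,
                                   tonality_smoothed,
                                   flatness_below_threshold,
                                   energy_above_threshold):
    """Return 1 if the frame is still assumed to be background noise."""
    not_noise = (
        stability > 0.12                               # condition a)
        or (centroid > 4.0 and stability > 0.04)       # condition b)
        or (tonality > 0.5 and stability > 0.1)        # condition c)
        or flatness_below_threshold                    # condition d)
        or energy_above_threshold                      # condition e)
        or tonality > 0.60 or tonality_smoothed > 0.86 # condition f)
    )
    return 0 if not_noise else 1
```

The frame starts out as presumed background noise, and a single satisfied condition is enough to clear the flag.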

5.1.12.6.6 SAD3 Pre-decision

The SAD3 decision is computed with the tonality signal flag, the SNR parameters, the spectral centroid features, and the energy features. The SAD3 decision is made in the following steps:

a) Obtain the second long-time SNR by computing and adjusting the ratio of the average energy of long-time active frames to the average energy of long-time background noise for the previous frame;

b) Compute the average of for a number of recent frames to obtain ;

c) Compute the SNR threshold for making SAD3 decision, denoted by, with the spectral centroid features , the second long-time SNR , the long-time frequency-domain SNR , the number of previous continuous active frames , and the number of previous continuous noise frames . Set the initial value of to. First, adjust with the spectral centroid features, if the spectral centroids are located in the different regions, an appropriate offset may be added to. Then, is further adjusted according to , , , and . When is greater than its threshold, the SNR threshold is appropriately decreased. When is greater than its threshold, the SNR threshold is appropriately increased. If is greater than a specified threshold, the SNR threshold may be accordingly adjusted.

d) Make an initial VAD decision with the SAD3 decision threshold and the SNR parameters such as and of the current frame. First is set to 0. If >, or , is set to 1. The initial VAD decision can be used to compute the average energy of long-time active frames . The value of is used to make SAD3 decision for the next frame.

(271)

where and is computed by:

(272)

(273)

e) Update the initial SAD3 decision according to the tonality signal flag, the average total SNR of all sub-bands, the spectral centroids, and the second long-time SNR. If the tonality signal flag is 1, is set to 1. The parameters and are updated by:

(274)

(275)

If >(B+ *A), where A and B are two constants, is set to 1. If any one of the following conditions is met:

condition 1:

condition 2:

is set to 1, where , and are the thresholds.

f) Update the number of hangover frames for active sound according to the decision result, the long-time SNR, and the average total SNR of all sub-bands for several previous frames, together with the SNR parameters and the SAD3 decision for the current frame (see subclause 5.1.12.6.7 for details);

g) Add the active-sound hangover according to the decision result and the number of hangover frames for active sound of the current frame to make the SAD3 decision;

h) Make a combined decision with and . The output flag of this decision is named combined (see subclause 5.1.12.7);

i) After Steps g) and h), the average energy of long-time background noise, denoted by, can be computed with the SAD decisions combined and . is used to make the SAD decision for the next frame. If both combined and are 0, , are updated and is computed as follows:

(276)

(277)

(278)

where and is computed by:

(279)

(280)

The functions of the Pre-decision module are described in Steps a) – e) in this subclause.

5.1.12.6.7 SAD3 Hangover

The long-time SNR and the average total SNR of all sub-bands are computed with the sub-band signal (see subclauses 5.1.12.6.2.1 and 5.1.12.6.3). The current number of hangover frames for active sound is updated according to the SAD3 decision of several previous frames, , , other SNR parameters, and the SAD3 decision of the current frame. The precondition for updating the current number of hangover frames for active sound is that the flag of active sound indicates that the current frame is active sound. If both the number of previous continuous active frames <8 and <4.0, the current number of hangover frames for active sound is updated by subtracting from the minimum number of continuous active frames. Suppose the minimum number of continuous active frames is 8. The updated number of hangover frames for active sound, denoted by , is computed as follows:

(281)

Otherwise, if both > 0.9 and > 50, the number of hangover frames for active sound is set according to the value of . Otherwise, this number of hangover frames is not updated.

is set to 0 for the first frame. For the second and all subsequent frames, is updated according to the previous combined as follows:

If the previous combined is 1, is increased by 1;

If the previous combined is 0, is set to 0.
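The counter update described above can be sketched as follows. The function and argument names are hypothetical, and the sketch assumes that the counter being updated is the number of previous continuous active frames and that the previous combined decision is available as a 0/1 flag.

```python
def update_continuous_active_frames(count, prev_combined_flag, is_first_frame):
    """Return the updated count of continuous active frames.

    count: value carried over from the previous frame.
    prev_combined_flag: combined SAD decision of the previous frame (0 or 1).
    is_first_frame: True only for the very first frame.
    """
    if is_first_frame:
        return 0                # the counter starts at 0
    if prev_combined_flag == 1:
        return count + 1        # previous frame was active: extend the run
    return 0                    # previous frame was inactive: reset the run


# Usage: run the counter over a short sequence of previous decisions.
count = 0
history = []
for i, prev in enumerate([0, 1, 1, 1, 0, 1]):
    count = update_continuous_active_frames(count, prev, is_first_frame=(i == 0))
    history.append(count)
```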

5.1.12.7 Final SAD decision

The feature parameters mentioned above are divided into two categories. The first feature category includes the number of continuous active frames, the average total SNR of all sub-bands, and the tonality signal flag . is the average of SNR over all sub-bands for a predetermined number of frames. The second feature category includes the flag of noise type, the smoothed average long-time frequency-domain SNR in a predetermined period of time, the number of continuous noise frames, and the frequency-domain SNR.

First, the parameters in the first and second feature categories and and are obtained. The first and second feature categories are used for the SAD detection.

The combined decision is made in the following steps:

  1. Compute the energy of background noise over all sub-bands for the previous frame with the background update flag, the energy parameters, and the tonality signal flag of the previous frame and the energy of background noise over all sub-bands of the previous 2 frames. Computing the background update flag is described in subclause 5.1.12.6.5.
  2. Compute the above-mentioned with the energy of background noise over all sub-bands of the previous frame and the energy parameters of the current frame.
  3. Determine the flag of noise type according to the above-mentioned parameters and . First, the noise type is set to non-silence. Then, when is greater than the first preset threshold and is greater than the second preset threshold, the flag of noise type is set to silence.

Then, the features in the first and second feature categories, and are used for active-sound detection in order to make the combined decision of SAD.

When the input sampling frequency is 16 kHz or 32 kHz, the decision procedure is carried out as follows:

a) Select as the initial value of the combined;

b) If the noise type is silence, the frequency-domain SNR is greater than 0.2, and the combined is set to 0, is selected as the output of the SAD, combined . Otherwise, go to Step c).

c) If the smoothed average long-time frequency-domain SNR is smaller than 10.5 or the noise type is not silence, go to Step d). Otherwise, the initial value of the combined in Step a) is still selected as the decision result of the SAD;

d) If any one of the following conditions is met, the result of a logical operation OR of and is used as the output of the SAD. Otherwise, go to Step e):

Condition 1: The average total SNR of all sub-bands is greater than the first threshold, e.g. 2.2;

Condition 2: The average total SNR of all sub-bands is greater than the second threshold, e.g. 1.5 and the number of continuous active frames is greater than 40;

Condition 3: The tonality signal flag is set to 1.

e) When the input sampling frequency is 32 kHz: If the noise type is silence, is selected as the output of the SAD and the decision procedure is completed. Otherwise, the initial value of the combined in Step a) is still selected as the decision result of the SAD. When the input sampling frequency is 16 kHz: is selected as the output of the SAD and the decision procedure is completed.

When the input sampling frequency is neither 16 kHz nor 32 kHz, the procedure of the combined decision is performed as follows:

a) Select as the initial value of the combined;

b) If the noise type is silence, go to Step c). Otherwise, go to Step d);

c) If the smoothed average long-time frequency-domain SNR is greater than 12.5 and =0, the combined is set to. Otherwise, the initial value of combined in Step a) is selected as the decision result of the SAD;

d) If any one of the following conditions is met, the result of a logical operation OR of and is used as the output of the final SAD, combined . Otherwise, the initial value of combined in Step a) is selected as the decision result of the SAD;

Condition 1: The average total SNR of all sub-bands is greater than 2.0;

Condition 2: The average total SNR of all sub-bands is greater than 1.5 and the number of continuous active frames is greater than 30;

Condition 3: The tonality signal flag is set to 1.
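Conditions 1 to 3 in Step d) differ between the two procedures only in their thresholds, which the following sketch makes explicit. It is an illustration under stated assumptions, not the codec's code: the function and parameter names are hypothetical, and the sketch covers only the test that decides whether the logical OR of the two decisions is used. It does not model which two decision flags are ORed or the remaining steps of either procedure.

```python
def or_condition_met(avg_total_snr, continuous_active_frames, tonality_flag, fs_hz):
    """Return True if any of Conditions 1-3 of Step d) holds.

    avg_total_snr: average total SNR of all sub-bands.
    continuous_active_frames: number of continuous active frames.
    tonality_flag: tonality signal flag (0 or 1).
    fs_hz: input sampling frequency in Hz.
    """
    if fs_hz in (16000, 32000):
        snr_high, snr_low, min_active = 2.2, 1.5, 40   # 16 kHz / 32 kHz procedure
    else:
        snr_high, snr_low, min_active = 2.0, 1.5, 30   # all other sampling rates
    cond1 = avg_total_snr > snr_high
    cond2 = avg_total_snr > snr_low and continuous_active_frames > min_active
    cond3 = tonality_flag == 1
    return cond1 or cond2 or cond3
```

For example, an average total SNR of 2.1 satisfies Condition 1 at 48 kHz but not at 16 kHz, where the threshold is 2.2.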

After the combined is obtained by using the above-mentioned method, it needs to be modified as follows:

a) Compute the number of background-noise updates, according to the background update flag, specifically:

When the current frame is indicated as background noise by the background update flag and is smaller than 1000, increases by 1. Note that is set to zero at the initialization of the codec.

b) Compute the number of modified frames for active sound, according to the SAD3 decision , the number of background-noise updates , and the number of hangover frames for active sound , specifically:

When the current frame is indicated as active sound by and is smaller than 12, is selected as max(20, ).

c) Compute the final decision of SAD for the current frame according to the number of modified frames for active sound and the combined , specifically:

When the current frame is indicated as inactive sound by the combined and is greater than 0, the final decision of SAD for the current frame, the combined , is modified to active sound and decreases by 1.
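Steps a) to c) of this modification can be sketched as a small per-frame state update. All names below are hypothetical. The sketch also assumes that the quantity compared against 12 in step b) is the number of background-noise updates and that the flag tested in step b) is the SAD3 decision, which is how the step descriptions read but is not stated symbol by symbol in this text.

```python
from dataclasses import dataclass


@dataclass
class SadModifierState:
    bg_update_count: int = 0   # number of background-noise updates, 0 at codec init
    modified_frames: int = 0   # number of modified frames for active sound


def modify_combined_decision(state, bg_update_flag, sad3_flag,
                             hangover_frames, combined_flag):
    """Apply steps a)-c) for one frame and return the final SAD decision."""
    # a) Count background-noise updates, saturating at 1000.
    if bg_update_flag == 1 and state.bg_update_count < 1000:
        state.bg_update_count += 1
    # b) Early in operation, reserve frames that may be forced to active.
    if sad3_flag == 1 and state.bg_update_count < 12:
        state.modified_frames = max(20, hangover_frames)
    # c) Turn an inactive decision into active while reserved frames remain.
    if combined_flag == 0 and state.modified_frames > 0:
        combined_flag = 1
        state.modified_frames -= 1
    return combined_flag


# Usage: an active frame followed by an inactive one near start-up.
state = SadModifierState()
first = modify_combined_decision(state, 0, 1, 5, 1)
second = modify_combined_decision(state, 0, 0, 0, 0)
```

In this reading, the forced-active frames only occur while fewer than 12 background-noise updates have been made, so the modification mainly affects the start of operation.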

5.1.12.8 DTX hangover addition

For better DTX performance, a version of the combined is generated through the addition of hangover. Two concurrent hangover logics can extend the active period: one for DTX in general, and one that adds additional DTX hangover specifically in the case of music.

During the SAD initialization period the following variables are set as follows

(216)

The general DTX hangover works in the same way as the SAD1 hangover; the main difference is the hangover length. Here, too, the initial DTX hangover length depends on : initially the hangover is set to 2 frames, and if the current input bandwidth is NB and or it is set to 3. A number of steps then follow that may modify this start value, depending on other input signal characteristics and the codec mode.

The first two modifications increase the hangover length if there has already been high activity: additional activity after a long burst has little effect on the total activity but can better cover short pauses. If there have been 12 or more active frames from the primary detector in SAD1 during the last 16 frames, that is , the allowed number of hangover frames is increased by 2 frames. Similarly, if there have been 40 or more active frames for the final decision of SAD1 during the last 50 frames, that is , the allowed number of hangover frames is increased by 5 frames. At this point the allowed number of hangover frames may have been increased by 7 frames over the initial value, so to limit the total number of hangover frames it is limited to . The hangover addition is also limited if the primary activity becomes low, with different limits for different codec conditions: for the AMR_WB_IO core the limit is 2 frames, the same limit is used for high SNR with WB or SWB input, and in other conditions the limit is 3 frames. The limit is applied if the primary activity or if the .
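The two activity-based increases can be sketched as below. The function and parameter names are hypothetical. The sketch deliberately stops before the upper limit and the low-activity limits, since their values and trigger conditions are not reproduced in this text.

```python
def allowed_dtx_hangover(initial_frames, primary_active_last_16, final_active_last_50):
    """Return the allowed DTX hangover length after the two activity-based increases.

    initial_frames: start value (2 or 3 frames according to the text).
    primary_active_last_16: active frames from the SAD1 primary detector
        among the last 16 frames.
    final_active_last_50: active frames from the SAD1 final decision
        among the last 50 frames.
    """
    allowed = initial_frames
    if primary_active_last_16 >= 12:
        allowed += 2            # high recent primary activity
    if final_active_last_50 >= 40:
        allowed += 5            # high longer-term final activity
    # The subsequent limiting steps are intentionally not modelled here.
    return allowed
```

With both conditions met the result is 7 frames above the start value, which matches the maximum increase stated in the text.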

The DTX hangover can also be reduced if the final decision from SAD3 already includes a long hangover.

According to the noise type in SAD3, the decrement of the DTX hangover is set as shown in Table 13.

Table 13: Setting of the decrement of the DTX hangover

Bandwidth    Silence-type noise    Non-silence-type noise
NB           0                     1
WB           2                     3
SWB          2                     1
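The values of Table 13 can be held in a simple lookup, sketched below. The dictionary and function names are hypothetical; only the numeric values come from the table.

```python
# Decrement of the DTX hangover per bandwidth:
# (silence-type noise, non-silence-type noise), as listed in Table 13.
DTX_HANGOVER_DECREMENT = {
    "NB": (0, 1),
    "WB": (2, 3),
    "SWB": (2, 1),
}


def dtx_hangover_decrement(bandwidth, is_silence_noise):
    """Return the Table 13 decrement for a bandwidth and SAD3 noise type."""
    silence, non_silence = DTX_HANGOVER_DECREMENT[bandwidth]
    return silence if is_silence_noise else non_silence
```

A bandwidth that is not listed in Table 13 raises a KeyError in this sketch rather than assuming a default value.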

As for the hangover in SAD1, the counting of DTX hangover frames is reset only if there are at least 3 consecutive active speech frames () or if the SAD1 final decision has been active for over 45 of the 50 latest frames.

The music hangover starts counting music hangover frames when AND AND AND , at which point the for the next 15 frames or until the hangover is terminated by the hangover termination logic, which can be triggered by the flag described below.

The DTX hangover, and the hangover described in subclause 5.1.12.3 that applies when the decisions from SAD1 and SAD2 are combined, may be terminated early. Early hangover termination helps to increase the system capacity by saving unnecessary hangover frames. At each hangover frame, the comfort noise that will be produced at the decoder side is estimated at the encoder side, assuming that the current hangover frame would be encoded as the first SID frame after an active burst. If the estimated comfort noise is found to be close to the noise characteristic maintained in the local CNG module at the encoder side, no further hangover frames are considered necessary and the hangover is terminated. Otherwise, the hangover continues as long as the initial hangover length has not been reached.

Specifically, the energy and the LSP spectrum of the comfort noise which will be produced at the decoder side are estimated at the encoder side. The energy of the current frame excitation is calculated

(217)

which is then converted to log domain

(218)

where is the LP excitation of the current frame calculated in subclause 5.6.2.1.5, is the frame length, and is limited to a non-negative value. An age-weighted average energy, , is calculated from the hangover frames, excluding the current frame, in the same way as described in subclause 6.7.2.1.2. The , together with the energy of the current frame excitation, is used to compute the estimated excitation energy for the comfort noise, .

(219)

where is a smoothing factor: = 0.8 if , the number of hangover frames used for the calculation, is less than 3; otherwise, = 0.95. The estimated excitation energy for the comfort noise is then converted to the log domain

(220)

where is bounded to a non-negative value. An average LSP vector, , is calculated over the same hangover frames as the age-weighted average energy, in the same way as described in subclause 6.7.2.1.2. The , together with the end-frame LSP vector of the current frame , is used to compute the estimated LSP vector for the comfort noise, .

(221)

A set of energy and LSP difference parameters are calculated. The difference between the current frame log excitation energy and the log hangover average excitation energy is calculated.

(222)

The difference between the current frame end-frame LSP vector and the hangover average LSP vector is calculated.

(223)

where is the order of LP filter. The difference between the estimated log excitation energy for the comfort noise and the current log excitation energy for the comfort noise kept in the local CNG module is calculated.

(224)

where is the comfort noise excitation energy kept in the local CNG module as calculated in subclause 5.6.2.1.6. The difference between the estimated LSP vector for the comfort noise and the current LSP vector for the comfort noise kept in the local CNG module is calculated.

(225)

where is the comfort noise LSP vector kept in the local CNG module as calculated in subclause 5.6.2.1.4. The maximum difference per LSP element between the estimated LSP vector for the comfort noise and the current LSP vector for the comfort noise kept in the local CNG module is calculated.

(226)

The hangover termination flag is set to 1 if and and and and when operating in VBR mode, or if and and and and when operating in non-VBR mode. Otherwise is set to 0. A set to 1 means the current frame can be encoded as a SID frame even if it is still in the hangover period. To prevent CNG on short pauses between speech utterances, the actual encoding of the SID frame is delayed by one frame.