5.1.13 Coding mode determination


To get the maximum encoding performance, the LP-based core uses a signal classification algorithm with six distinct coding modes tailored for each class of signal, namely the Unvoiced Coding (UC) mode, Voiced Coding (VC) mode, Transition Coding (TC) mode, Audio Coding (AC) mode, Inactive Coding (IC) mode and Generic Coding (GC) mode. The signal classification algorithm uses several parameters, some of them being optimized separately for NB and WB inputs.

Figure 13 shows a simplified high-level diagram of the signal classification procedure. In the first step, the SAD decision is queried to determine whether the current frame is active or inactive. In the case of an inactive frame, the IC mode is selected and the procedure is terminated. In the IC mode, the inactive signal is encoded either in the transform domain by means of the AVQ technology or in the time/transform domain by means of the GSC technology, described below. In the case of active frames, the speech/music classification algorithm is run to decide whether the current frame shall be coded with the AC mode. The AC mode has been specifically designed to efficiently encode generic audio signals, particularly music. It uses a hybrid encoding technique, called the Generic Signal audio Coder (GSC), which combines an LP-based coder operated in the time domain with a transform-domain coder. If the frame is not classified as “audio”, the classification algorithm continues with selecting unvoiced frames to be encoded with the UC mode. The UC mode is designed to encode unvoiced frames. In the UC mode, the adaptive codebook is not used and the excitation is composed of two vectors selected from a linear Gaussian codebook.

If the frame is not classified as unvoiced, then detection of stable voiced frames is applied. Quasi-periodic segments are encoded with the VC mode. VC selection is conditioned by a smooth pitch evolution. It uses ACELP technology, but given that the pitch evolution is smooth throughout the frame, more bits are assigned to the algebraic codebook than in the GC mode.

The TC mode has been designed to enhance the codec performance in the presence of frame erasures by limiting the usage of past information [19]. To minimize its impact on clean-channel performance at the same time, it is used only on the most critical frames from a frame-erasure point of view, namely voiced frames following voiced onsets.

If a frame is not classified in one of the above coding modes, it is likely to contain a non-stationary speech segment and is encoded using a generic ACELP model (GC).

Figure 13: High-level diagram of the coding mode determination procedure

The selection of the coding modes is not uniform across the bitrates and input signal bandwidth. These differences will be described in detail in the subsequent sections. The classification algorithm starts with setting the current mode to GC.

5.1.13.1 Unvoiced signal classification

The unvoiced parts of the signal are characterized by a missing periodic component. The classification of unvoiced frames exploits the following parameters:

– voicing measures

– spectral tilt measures

– sudden energy increase from a low level to detect plosives

– total frame energy difference

– energy decrease after spike

5.1.13.1.1 Voicing measure

The normalized correlation, used to determine the voicing measure, is computed as part of the OL pitch searching module described in clause 5.1.10. The average normalized correlation is then calculated as

(293)

where is defined in subclause 5.1.11.3.2.

5.1.13.1.2 Spectral tilt

The spectral tilt parameter contains information about frequency distribution of energy. The spectral tilt is estimated in the frequency domain as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies, and is computed twice per frame.

The energy in high frequencies is computed as the average of the energies in the last two critical bands

(294)

where are the critical band energies, computed in subclause 5.1.5.2 and is the maximum useful critical band (= 19 for WB inputs and = 16 for NB inputs).

The energy in low frequencies is computed as the average of the energies in the first 10 critical bands for WB signals and in 9 critical bands for NB signals. The middle critical bands have been excluded from the computation to improve the discrimination between frames with high-energy concentration in low frequencies (generally voiced) and with high-energy concentration in high frequencies (generally unvoiced). In between, the energy content is not informative for any of the classes and increases the decision uncertainty.

The energy in low frequencies is computed differently for voiced half-frames with short pitch periods and for other inputs. For voiced female speech segments, the harmonic structure of the spectrum is exploited to increase the voiced-unvoiced discrimination. These frames are characterized by the following condition

(295)

where are computed as defined in subclause 5.1.10.4 and the noise correction factor as defined in subclause 5.1.10.6. For these frames, the low-frequency energy is computed bin-wise and only frequency bins sufficiently close to the speech harmonics are taken into account in the summation. That is

(296)

where is the first bin (= 1 for WB inputs and = 3 for NB inputs) and are the bin energies, as defined in subclause 5.1.5.2, in the first 25 frequency bins (the DC component is not considered). Note that these 25 bins correspond to the first 10 critical bands and that the first 2 bins not included in the case of NB input constitute the first critical band. In the summation above, only the terms related to the bins close to the pitch harmonics are considered. So is set to 1, if the distance between the nearest harmonics is not larger than a certain frequency threshold (50 Hz) and is set to 0 otherwise. The counter is the number of the non-zero terms in the summation. In other words, only bins closer than 50 Hz to the nearest harmonics are taken into account. Thus, only high-energy terms will be included in the sum if the structure is harmonic at low frequencies. On the other hand, if the structure is not harmonic, the selection of the terms will be random and the sum will be smaller. Thus, even unvoiced sounds with high energy content in low frequencies can be detected. This processing cannot be done for longer pitch periods, as the frequency resolution is not sufficient. For frames not satisfying the condition (295), the low frequency energy is computed per critical band as

(297)

for WB and NB inputs, respectively. The resulting low- and high-frequency energies are obtained by subtracting the estimated noise energy from the values calculated above. That is

(298)

(299)

where is the average noise energy in critical bands 18 and 19 for WB inputs, and 15 and 16 for NB inputs, and is the average noise energy in the first 10 critical bands for WB inputs and in critical bands 1-9 for NB inputs. They are computed similarly to the two equations above. The estimated noise energies have been integrated into the tilt computation to account for the presence of background noise.

Finally, the spectral tilt is given by

(300)

For NB signals, the missing bands are compensated for by multiplying by 6. Note that the spectral tilt computation is performed twice per frame to obtain two values, corresponding to the two spectral analyses per frame. The average spectral tilt used in unvoiced frame classification is given by

(301)

where is the tilt in the second half of the previous frame.
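The following C fragment is a minimal, non-normative sketch of the tilt estimate described above, assuming the critical-band energies and the corresponding estimated noise energies are available from the spectral analysis of subclause 5.1.5.2. The band indexing and the NB compensation factor follow the text; the function name, interface and everything not quoted in the text are assumptions.

#include <float.h>

double spectral_tilt(const double *ECB,   /* critical band energies             */
                     const double *NCB,   /* estimated noise energies per band  */
                     int nb_band_max,     /* 19 for WB inputs, 16 for NB inputs */
                     int wb)              /* 1 = WB input, 0 = NB input         */
{
    int i, n_low = wb ? 10 : 9;           /* first 10 bands (WB) or bands 1-9 (NB) */
    int first = wb ? 0 : 1;

    /* energy in high frequencies: average of the last two critical bands */
    double Eh = 0.5 * (ECB[nb_band_max - 1] + ECB[nb_band_max]);
    double Nh = 0.5 * (NCB[nb_band_max - 1] + NCB[nb_band_max]);

    /* energy in low frequencies (non-harmonic case): average of the first bands */
    double El = 0.0, Nl = 0.0;
    for (i = first; i < first + n_low; i++) {
        El += ECB[i];
        Nl += NCB[i];
    }
    El /= n_low;
    Nl /= n_low;

    /* subtract the estimated noise energies */
    Eh -= Nh;
    El -= Nl;

    /* tilt = ratio of low-frequency to high-frequency energy */
    double tilt = El / (Eh > DBL_MIN ? Eh : DBL_MIN);
    if (!wb)
        tilt *= 6.0;                       /* compensation for the missing NB bands */
    return tilt;
}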

5.1.13.1.3 Sudden energy increase from a low energy level

The maximum energy increase from a low signal level is evaluated on eight short-time segments having the length of 32 samples. The energy increase is then computed as the ratio of two consecutive segments provided that the first segment energy was sufficiently low. For better resolution of the energy analysis, a second pass is computed where the segmentation is done with a 16 sample offset. The short-time maximum energies are computed as

(302)

where corresponds to the last segment of the previous frame, and corresponds to the current frame. The second set of maximum energies is computed by shifting the speech indices in equation (303) by 16 samples. That is

(304)

(305)

The maximum energy variation is computed as follows:

(306)
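A minimal sketch of one pass of the segmentation described above is given below. The per-segment energy is taken here as the maximum squared sample, and E_LOW_THR is a placeholder for the "sufficiently low" first-segment condition; neither is reproduced from the specification. The second pass of the text would repeat the same computation with the segmentation shifted by 16 samples.

#define SEG_LEN   32
#define N_SEG     8
#define E_LOW_THR 1e-3          /* placeholder low-energy threshold */

static double seg_energy(const float *s)
{
    double e = 0.0;
    for (int n = 0; n < SEG_LEN; n++)
        if ((double)s[n] * s[n] > e)
            e = (double)s[n] * s[n];
    return e;
}

/* s points to the current frame; s[-SEG_LEN..-1] must hold the last segment of
 * the previous frame. */
double max_energy_variation(const float *s)
{
    double dE = 0.0;
    double prev = seg_energy(s - SEG_LEN);         /* last segment of the previous frame */
    for (int j = 0; j < N_SEG; j++) {
        double cur = seg_energy(s + j * SEG_LEN);
        if (prev < E_LOW_THR && cur / (prev + 1e-12) > dE)
            dE = cur / (prev + 1e-12);             /* ratio of consecutive segment energies */
        prev = cur;
    }
    return dE;
}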

5.1.13.1.4 Total frame energy difference

The classification of unvoiced frames is further improved by taking into account the difference of total frame energy. This difference is calculated as

where is the total frame energy calculated in subclause 5.1.5.2 and is the total frame energy in the previous frame.

5.1.13.1.5 Energy decrease after spike

The detection of energy decrease after a spike prevents the UC mode on significant temporal events followed by relatively rapid energy decay. Typical examples of such signals are castanets.

The detection of the energy decrease is triggered by detecting a sudden energy increase from a low level as described in subclause 5.1.13.1.3. The maximum energy variation must be higher than 30 dB. Further, for NB inputs the mean correlation must not be too high, i.e. the condition must be satisfied too.

The energy decrease after a spike is searched within 10 overlapped short-time segments (of both sets of energies) following the detected maximum energy variation. Let’s call the index for which was found, and the corresponding set of energies . If , then the searched interval is for both sets. If , then the searched interval is for the 2nd set, but for the 1st set of energies.

The energy decrease after a spike is then searched as follows. As the energy can further increase beyond the segment for which was found, the energy increase is tracked beyond that segment to find the last segment with energy still monotonically increasing. Let’s denote the energy of that segment . Starting from that segment until the end of the searched interval, the minimum energy is then determined. The detection of an energy decrease after spike is based on the ratio of the maximum and minimum energies

(307)

This ratio is then compared to a threshold of 21 dB for NB inputs and 30 dB for other inputs.

The detection of energy decrease after a spike further uses a hysteresis in the sense that UC is prevented not only in the frame where is above the threshold (), but also in the next frame (). In subsequent frames (), the hysteresis is reset () only if the following condition is met:

. (308)

Given that the searched interval of overlapped segments is always 10, it can happen that the detection cannot be completed in the current frame if a rapid energy increase happens towards the frame end. In that case, the detection is completed in the next frame, however, as far as the hysteresis logic is concerned, the detection of energy decrease after a spike still pertains to the current frame.

5.1.13.1.6 Decision about UC mode

To classify frames for encoding with UC mode, several conditions need to be met. As the UC mode does not use the adaptive codebook and no long-term prediction is thus exploited, it is necessary to make sure that only frames without periodic content are coded with this mode. The decision logic is somewhat different for WB and NB inputs and is described for both cases separately.

For WB inputs, all of the following conditions need to be satisfied to select the UC mode for the current frame.

  1. Normalized correlation is low:
  2. Energy is concentrated in high frequencies.
  3. The current frame is not in a segment following voiced offset:

where is the raw coding mode selected in the previous frame (described later in this document).

  4. There is no sudden energy increase:
  5. The current frame is not in a decaying segment following sharp energy spike:

For NB inputs, the following conditions need to be satisfied to classify the frame for NB UC coding.

  1. Normalized correlation is low:
  2. Energy is concentrated in high frequencies.
  3. The current frame is not in a segment following voiced offset:

where is the raw coding mode selected in the previous frame (described later in this document).

  4. There is no sudden energy increase:
  5. The current frame is not in a decaying segment following sharp energy spike:

5.1.13.2 Stable voiced signal classification

The second step in the signal classification algorithm is the selection of stable voiced frames, i.e. frames with high periodicity and smooth pitch contour. The classification is mainly based on the results of the fractional open-loop pitch search described in section 5.1.10.9. As the fractional open-loop pitch search is done in a similar way as the closed-loop pitch search, it is assumed that if the open-loop search gives a smooth pitch contour within predefined limits, the optimal closed-loop pitch search would give similar results and limited quantization range can then be used. The frames are classified into the VC mode if the fractional open-loop pitch analysis yields a smooth contour of pitch evolution over all four subframes. The pitch smoothness condition is satisfied if , for i = 0, 1, 2, where is the fractional open-loop pitch lag found in subframe i (see section 5.1.10.9 for more details). Furthermore, in order to select VC mode for the current frame the maximum normalized correlation must be greater than 0.605 in each of the four subframes. Finally, the spectral tilt must be higher than 4.0.
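An illustrative check of the VC selection rule described above is sketched below. The pitch-smoothness limit PIT_DELTA_THR is a placeholder (its value is not reproduced here); the correlation threshold of 0.605 and the tilt threshold of 4.0 are those quoted in the text. All names are ours.

#include <math.h>
#include <stdbool.h>

#define PIT_DELTA_THR 3.0f      /* placeholder smoothness limit on |d[i+1]-d[i]| */

bool select_vc_mode(const float dfr[4],   /* fractional OL pitch per subframe            */
                    const float corr[4],  /* max normalized correlation per subframe     */
                    float tilt)           /* average spectral tilt                       */
{
    for (int i = 0; i < 3; i++)
        if (fabsf(dfr[i + 1] - dfr[i]) >= PIT_DELTA_THR)
            return false;                 /* pitch contour not smooth                    */
    for (int i = 0; i < 4; i++)
        if (corr[i] <= 0.605f)
            return false;                 /* correlation not high enough                 */
    return tilt > 4.0f;                   /* energy concentrated in low frequencies      */
}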

The decision about VC mode is further improved for frames with stable short pitch evolution and high correlation (e.g. female or child voices or opera voices). Pitch smoothness is again satisfied if , for i = 0, 1, 2. High correlation is achieved in frames for which the mean value of the normalized correlation in all four subframes is higher than 0.95 and the mean value of the smoothed normalized correlation is higher than 0.97. That is

(309)

The smoothing of the normalized correlation is done as follows

(310)

Finally, VC mode is also selected in frames for which the flag = Stab_short_pitch_flag = flag_spitch has been previously set to 1 in the module described in sub-clause 5.1.10.8. Further, when the signal has very high pitch correlation, is also set to 1 so that the VC mode is maintained to avoid selecting Audio Coding (AC) mode later, as follows,

if (flag_spitch == 1 OR

    (dpit1 <= 3 AND dpit2 <= 3 AND dpit3 <= 3 AND  > 0.95 AND  > 0.97))

{

    VC = 1;

    flag_spitch = 1;

}

wherein , , , and are defined in subclause 5.1.10.8.

The decision taken so far (i.e. after UC and VC mode selection) is called the “raw” coding mode, denoted. The value of this variable from the current frame and from the previous frame is used in other parts of the codec.

5.1.13.3 Signal classification for FEC

This subclause describes the refinement of the signal classification algorithm described in the previous section in order to improve the codec’s performance for noisy channels. The classification used to select UC and VC frames cannot be directly used in the FEC as the purpose of the classification is not the same. Instead, better performance could be achieved by tuning both classification aspects separately.

The basic idea behind using a different signal classification approach for FEC is the fact that the ideal concealment strategy is different for quasi-stationary speech segments and for speech segments with rapidly changing characteristics. Whereas the best processing of erased frames in non-stationary speech segments can be summarized as a rapid drop of energy, in the case of quasi-stationary signal, the speech-encoding parameters do not vary dramatically and can be kept practically unchanged during several adjacent erased frames before being damped. Also, the optimal method for a signal recovery following an erased block of frames varies with the classification of the speech signal.

Furthermore, this special classification information is also used to select frames to be encoded with the TC mode (see subclause 5.1.13.4).

To distinguish the signal classification algorithm for FEC from the signal classification algorithm for coding mode determination (described earlier in subclauses 5.1.13.1 and 5.1.13.2), we will refer here to “signal class” rather than “coding mode” and denote it .

5.1.13.3.1 Signal classes for FEC

The frame classification is done with the concealment and recovery strategy in mind. In other words, any frame is classified in such a way that the concealment can be optimal if the following frame is missing, and that the recovery can be optimal if the previous frame was lost. Some of the classes used in the FEC do not need to be transmitted, as they can be deduced without ambiguity at the decoder. The following distinct classes are used and defined as follows:

• INACTIVE CLASS comprises all inactive frames. Note, that this class is used only in the decoder.

• UNVOICED CLASS comprises all unvoiced speech frames and all frames without active speech. A voiced offset frame can also be classified as UNVOICED CLASS if its end tends to be unvoiced and the concealment designed for unvoiced frames can be used for the following frame in case it is lost.

• UNVOICED TRANSITION CLASS comprises unvoiced frames with a possible voiced onset at the end. The onset is however still too short or not built well enough to use the concealment designed for voiced frames. The UNVOICED TRANSITION CLASS can only follow a frame classified as UNVOICED CLASS or UNVOICED TRANSITION CLASS.

• VOICED TRANSITION CLASS comprises voiced frames with relatively weak voiced characteristics. Those are typically voiced frames with rapidly changing characteristics (transitions between vowels) or voiced offsets lasting the whole frame. The VOICED TRANSITION CLASS can only follow a frame classified as VOICED TRANSITION CLASS, VOICED CLASS or ONSET CLASS.

• VOICED CLASS comprises voiced frames with stable characteristics. This class can only follow a frame classified as VOICED TRANSITION CLASS, VOICED CLASS or ONSET CLASS.

• ONSET CLASS comprises all voiced frames with stable characteristics following a frame classified as UNVOICED CLASS or UNVOICED TRANSITION CLASS. Frames classified as ONSET CLASS correspond to voiced onset frames where the onset is already sufficiently built for the use of the concealment designed for lost voiced frames. The concealment techniques used for frame erasures following the ONSET CLASS are the same as those following the VOICED CLASS. The difference is in the recovery strategy.

• AUDIO CLASS comprises all frames with harmonic or tonal content, especially music. Note that this class is used only in the decoder.

5.1.13.3.2 Signal classification parameters

The following parameters are used for the classification at the encoder: normalized correlation, , spectral tilt measure, , pitch stability counter, pc, relative frame energy, , and zero crossing counter, zc. The computation of these parameters which are used to classify the signal is explained below.

The normalized correlation, used to determine the voicing measure, is computed as part of the OL pitch analysis module described in subclause 5.1.10. The average correlation is defined as

(310)

where and are the normalized correlation of the second half-frame and the look ahead, respectively.

The spectral tilt measure, , is computed as the average (in dB) of both frame tilt estimates, as described in subclause 5.1.13.1.2. That is

(311)

The pitch stability counter, pc, assesses the variation of the pitch period. It is computed as follows:

(312)

where the values , and correspond to the three OL pitch estimates evaluated in each frame (see subclause 5.1.10).

The last parameter is the zero-crossing rate, zc, computed on the second half of the current speech frame and the look-ahead. Here, the zero-crossing counter, zc, counts the number of times the signal sign changes from positive to negative during that interval. The zero-crossing rate is calculated as follows

(313)

where the function sign[.] returns +1 if the value is positive or -1 if it is negative.
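A simple sketch of the zero-crossing counter is shown below; the input buffer is assumed to cover exactly the second half of the current frame plus the look-ahead, and only positive-to-negative sign changes are counted, as stated above.

int zero_crossing_count(const float *s, int len)
{
    int zc = 0;
    for (int n = 1; n < len; n++)
        if (s[n - 1] > 0.0f && s[n] <= 0.0f)   /* sign changes from positive to negative */
            zc++;
    return zc;
}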

5.1.13.3.3 Classification procedure

The classification parameters are used to define a function of merit, . For that purpose, the classification parameters are first scaled between 0 and 1 so that each parameter translates to 0 for an unvoiced signal and to 1 for a voiced signal. Each parameter, , is scaled by a linear function as follows:

(314)

and clipped between 0 and 1 (except for the relative energy which is clipped between 0.5 and 1). The function coefficients, and , have been found experimentally for each of the parameters so that the signal distortion due to the concealment and recovery techniques used in the presence of frame erasures is minimal. The function coefficients used in the scaling process are summarized in the following table.

Table 15: Coefficients of the scaling function for FEC signal classification

description                     multiplicative coefficient    additive coefficient
normalized correlation          2.857                         -1.286
spectral tilt                   0.04167                       0
pitch stability counter         -0.07143                      1.857
relative frame energy           0.05                          0.45
zero-crossing counter           -0.04                         2.4

The function of merit has been defined as

(315)

where the superscript s indicates the scaled version of the parameters. The classification is then done using the function of merit, , and following the rules summarized in the following table.

Table 16: Rules for FEC signal classification

previous class                                         rule    selected class
VOICED CLASS, ONSET CLASS, VOICED TRANSITION CLASS             VOICED CLASS
                                                               VOICED TRANSITION CLASS
                                                               UNVOICED CLASS
UNVOICED CLASS, UNVOICED TRANSITION CLASS                      ONSET CLASS
                                                               UNVOICED TRANSITION CLASS
                                                               UNVOICED CLASS

For the purpose of FEC signal classification, all inactive speech frames, unvoiced speech frames and frames with very low energy are directly classified as UNVOICED CLASS. This is done by checking the following condition

(316)

which has precedence over the rules defined in the above table.
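The following sketch applies the scaling of equation (314) with the coefficients of table 15 and the clipping limits stated in the text (0 to 1, except the relative energy which is clipped to 0.5 to 1). The exact combination of the scaled parameters into the merit function of equation (315) is not reproduced in this document; the plain average used below is only an illustrative placeholder, and all names are ours.

typedef struct {
    double norm_corr;     /* normalized correlation       */
    double tilt;          /* spectral tilt (dB)           */
    double pitch_stab;    /* pitch stability counter pc   */
    double rel_energy;    /* relative frame energy        */
    double zero_cross;    /* zero-crossing counter zc     */
} fec_params;

static double scale_clip(double p, double k, double c, double lo, double hi)
{
    double ps = k * p + c;                /* linear scaling of equation (314) */
    return ps < lo ? lo : (ps > hi ? hi : ps);
}

double fec_merit(const fec_params *p)
{
    double s[5];
    s[0] = scale_clip(p->norm_corr,   2.857,   -1.286, 0.0, 1.0);
    s[1] = scale_clip(p->tilt,        0.04167,  0.0,   0.0, 1.0);
    s[2] = scale_clip(p->pitch_stab, -0.07143,  1.857, 0.0, 1.0);
    s[3] = scale_clip(p->rel_energy,  0.05,     0.45,  0.5, 1.0);
    s[4] = scale_clip(p->zero_cross, -0.04,     2.4,   0.0, 1.0);

    /* placeholder merit: simple average of the scaled parameters */
    return (s[0] + s[1] + s[2] + s[3] + s[4]) / 5.0;
}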

5.1.13.4 Transient signal classification

As a compromise between the clean-channel performance of the codec and its robustness to channel errors, the use of the TC mode is limited only to a single frame following voiced onsets and to transitions between two different voiced segments. Voiced onsets and transitions are the most problematic parts from the frame erasure point of view. Therefore, the frame after the voiced onset and voiced transitions must be as robust as possible. If the transition/onset frame is lost, the following frame is encoded using the TC mode, without the use of the past excitation signal, and the error propagation is broken.

The TC mode is selected according to the counter of frames from the last detected onset/transition . The onset/transition detection logic is described by the state machine in the following diagram.

Figure 14 : TC onset/transition state machine

In the above logic, the counter is set to 0 for all inactive frames and for frames for which the FEC signal class is either UNVOICED CLASS or UNVOICED TRANSITION CLASS. When the first onset frame is encountered, the counter is set to 1. This is again determined by the FEC signal class. The onset/transition frame is always coded with the GC mode, i.e. the coding mode is set to GC in this frame. The next active frame after the onset frame increases the counter to 2, which means that there is a transition and the TC mode is selected. The counter is set to -1 if none of the above situations occurs, waiting for the next inactive frame. Naturally, the GC/TC mode is selected by the above logic only when the current frame is an active frame, i.e. when .
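The counter update of figure 14 can be summarised by the following non-normative sketch. The variable names and the enumeration constants are ours; the FEC signal classes are those of subclause 5.1.13.3.

enum fec_class { UNVOICED_CLASS, UNVOICED_TRANSITION_CLASS, VOICED_TRANSITION_CLASS,
                 VOICED_CLASS, ONSET_CLASS };
enum coding_mode { GC_MODE, TC_MODE };

/* tc_cnt is the counter of frames from the last detected onset/transition. */
void update_tc_counter(int active, enum fec_class cls, int *tc_cnt, enum coding_mode *mode)
{
    if (!active || cls == UNVOICED_CLASS || cls == UNVOICED_TRANSITION_CLASS) {
        *tc_cnt = 0;                       /* reset on inactive / unvoiced frames       */
        return;
    }
    if (*tc_cnt == 0 && cls == ONSET_CLASS) {
        *tc_cnt = 1;                       /* first onset frame: encoded with GC        */
        *mode = GC_MODE;
    } else if (*tc_cnt == 1) {
        *tc_cnt = 2;                       /* frame after the onset: encoded with TC    */
        *mode = TC_MODE;
    } else {
        *tc_cnt = -1;                      /* wait for the next inactive frame          */
    }
}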

5.1.13.5 Modification of coding mode in special cases

In some special situations, the decision about coding mode is further modified. This is e.g. due to unsuitability of the selected coding mode at particular bitrate or due to signal characteristics which make the selected mode inappropriate. For example, the UC mode is supported only up to 9.6 kbps after which it is replaced by the GC mode. The reason is that for bitrates higher than 9.6 kbps the GC mode already has enough bits to fully represent the random content of an unvoiced signal. The UC mode is also replaced by the GC mode at 9.6 kbps if the counter of previous AC frames is bigger than 0. The counter of AC frames is initialized to the value of 200, incremented by 10 in every AC frame and decremented by 1 in other frames but not in IC frames. The counter is upper limited by 1000 and lower limited by 0.

At 32 and 64 kbps, only the GC and TC modes are employed, i.e. if the coding mode has been previously set to UC or VC, it is overwritten to GC. Finally, for certain low-level signals, the gain quantizer in the NB VC mode could go out of its dynamic range. Therefore, the coding mode is changed to the GC mode if the relative frame energy is below a threshold (in dB), but only at 8.0 kbps and lower bitrates.

The coding mode is also changed to the TC mode in case of mode switching. If, in the previous frame, 16 kHz ACELP core was used but the current frame uses 12.8 kHz ACELP core, it is better to prevent potential switching artefacts resulting from signal discontinuity and incorrect memory. This modification is done only for frames other than VC, i.e. if , where is the raw coding mode described in subclause 5.1.13.2. Further, the modification takes place only for active frames in case of DTX operation.

The coding mode is overridden to the TC mode if MDCT-based core was used in the last frame but the current frame is encoded with an LP-based core. Finally, the coding mode is changed to the TC mode if the EVS codec is operated in the DTX mode and if the last frame was a SID frame encoded with the FD_CNG technology.

5.1.13.6 Speech/music classification

Music signals, in general, are more complex than speech and conform less to any known LP-based model. Therefore, it is of interest to distinguish music signals (generic audio signals) from speech signals. Speech/music classification then allows using a different coding approach to such signals. This new approach has been called the Generic audio Signal Coding mode (GSC), or the Audio Coding (AC) mode.

The speech/music classification is done in two stages. The first stage of the speech/music classifier is based on the Gaussian Mixture Model (GMM) and performs the best statistically based discrimination of speech from generic audio. The second stage has been optimized directly for the GSC mode. In other words, the classification in the second stage is done in such a way that the selected frames are suitable for the AC mode. Each speech/music classifier stage yields its own binary decision, and which is either 1 (music) or 0 (speech or background noise). The speech decision and the background noise decision have been grouped together only for the purposes of the speech/music classification. The selection of the IC mode for inactive signals incl. background noise is done later in the codec and described in subclause 5.1.13.5.7.

The decisions of the first and the second stage are refined and corrected for some specific cases in the subsequent modules, described below. The final decision about the AC mode is done based on and but the two flags are also used for the selection of the coder technology which is described in subclause 5.1.16.

5.1.13.6.1 First stage of the speech/music classifier

The GMM model has been trained on a large database of speech and music signals covering several male and female speakers, multiple languages and various genres of instrumental and vocal music. The statistical model uses a vector of 12 unique features, all normalized to a unit interval and derived from the basic parameters that have been calculated in the pre-processing part of the encoder. There are three statistical models inside the GMM: speech, music and noise. The statistical model of the background noise has been added to improve the SAD algorithm described in subclause 5.1.12. Each statistical model is represented by a mixture of six normal (Gaussian) distributions, determined by their relative weight, mean and full covariance matrix. The speech/music classifier exploits the following characteristics of the input signal:

– OL pitch

– normalized correlation

– spectral envelope (LSPs)

– tonal stability

– signal non-stationarity

– LP residual error

– spectral difference

– spectral stationarity

Figure 15 : Schematic diagram of the first stage of the speech/music classifier

The OL pitch feature is calculated as the average of the three OL pitch estimates, i.e.

(317)

where are computed as in subclause 5.1.10.7. In onset/transition frames and in the TC frame after, it is better to use only the OL pitch estimate of the second analysis window, i.e. .

The normalized correlation feature used by the speech/music classifier is the same one as used in the unvoiced signal classification. See subclause 5.1.13.1.1. for the details of its computation. In onset/transition frames and in the TC frame after, it is better to use only the correlation value of the second analysis window, i.e. .

There are five LSF parameters used as features inside the first stage of the speech/music classifier. These are calculated as follows

(318)

Another feature used by the speech/music classifier is the correlation map which is calculated as part of the tonal stability measure in subclause 5.1.11.2.5. However, for the purposes of speech/music classification, it is not the long-term correlation map which is summed but rather the correlation map of the current frame. The reason is to limit the impact of past information on the speech/music decision in the current frame. That is

(319)

In case of NB signals, the value of is multiplied by 1.53.

Signal non-stationarity is also used in the speech/music classifier but its calculation is slightly different than in the case of background noise estimation described in subclause 5.1.11.2.1. Firstly, the current log-energy per band is defined as

, i=2,.,16 (320)

Then,

(321)

The LP residual log-energy ratio is calculated as

(322)

where the superscript [-1] denotes values from the previous frame. In case of NB signals, the statistical distribution of is significantly different than in case of WB signals. Thus, for NB signals .

For the last two features, the power spectrum must be normalized as follows:

, k=3,..,69 (323)

and difference spectrum calculated as follows:

, k=3,..,69 (324)

Then we calculate the spectral difference as the sum of in the log domain. That is

(325)

The spectral non-stationarity is calculated as the product of ratios between power spectrum and the difference spectrum. This is done as follows

(326)

5.1.13.6.2 Scaling of features in the first stage of the speech/music classifier

The feature vector FVi, i=0,..,11 is scaled in such a way that all its values are approximately in the range [0;1]. This is done as

i=0,..,11 (327)

where the scaling factors sfai and sfbi have been found on a large training database. The scaling factors are defined in the following table.

Table 17: Scaling factors for feature vector in the speech/music classifier

i     sfa (WB)    sfb (WB)    sfa (NB)    sfb (NB)
0     0.048       -0.0952     0.0041      0
1     1.0002      0           0.8572      0.1020
2     0.6226      -0.0695     0.6739      -0.1000
3     0.5497      -0.1265     0.6257      -0.1678
4     0.4963      -0.2230     0.5495      -0.2380
5     0.5049      -0.4103     0.5793      -0.4646
6     0.5069      -0.5717     0.2502      0
7     0.0041      0           0.0041      0
8     0.0022      -0.0029     0.0020      0
9     0.0630      1.0015      0.0630      1.0015
10    0.0684      0.9103      0.0598      0.8967
11    0.1159      -0.2931     0.0631      0
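The scaling of equation (327) amounts to one affine transform per feature. The sketch below applies the WB column of table 17 above; the array layout and function name are illustrative only.

#define N_FEATURES 12

static const double sfa_wb[N_FEATURES] = { 0.048, 1.0002, 0.6226, 0.5497, 0.4963, 0.5049,
                                           0.5069, 0.0041, 0.0022, 0.0630, 0.0684, 0.1159 };
static const double sfb_wb[N_FEATURES] = { -0.0952, 0.0, -0.0695, -0.1265, -0.2230, -0.4103,
                                           -0.5717, 0.0, -0.0029, 1.0015, 0.9103, -0.2931 };

void scale_features(const double *FV, double *FVs)
{
    for (int i = 0; i < N_FEATURES; i++)
        FVs[i] = sfa_wb[i] * FV[i] + sfb_wb[i];   /* FVs lies approximately in [0;1] */
}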

5.1.13.6.3 Log-probability and decision smoothing

The multivariate Gaussian probability distribution is defined as

(328)

where FV is the feature vector, µ is the vector of means and Σ is the variance matrix. As stated before, the dimension of the feature vector is k=12. The means and the variance matrix are found by the training process of the Gaussian Mixture Model (GMM). This training is done by means of the EM (Expectation-Maximization) algorithm . The speech/music classifier is trained with a mixture of 6 Gaussian distributions. The log-likelihood of each mixture is defined as

i=1,..,6 (329)

where wi is the weight of each mixture. The term is calculated in advance and stored in the form of a look-up table. The probability over the complete set of 6 Gaussians is then calculated in the following way:

(330)

and the log-likelihood over the complete set as

(331)
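The per-class evaluation of equations (328)-(331) can be sketched as below. Each mixture is assumed to be stored with its pre-computed inverse covariance, mean vector and the constant term log(w_i) - 0.5·log((2π)^k·det(Σ_i)), consistent with the remark that this term is "calculated in advance and stored in the form of a look-up table"; the data layout and names are assumptions.

#include <math.h>

#define K_FEAT 12
#define N_MIX  6

typedef struct {
    double mean[K_FEAT];
    double inv_cov[K_FEAT][K_FEAT];
    double log_const;                    /* log(w) - 0.5*log((2*pi)^K * det(cov)) */
} gauss_mix;

double gmm_log_likelihood(const double FV[K_FEAT], const gauss_mix mix[N_MIX])
{
    double sum = 0.0;
    for (int m = 0; m < N_MIX; m++) {
        double quad = 0.0;
        for (int i = 0; i < K_FEAT; i++) {
            double acc = 0.0;
            for (int j = 0; j < K_FEAT; j++)
                acc += mix[m].inv_cov[i][j] * (FV[j] - mix[m].mean[j]);
            quad += (FV[i] - mix[m].mean[i]) * acc;  /* (x-mu)' Sigma^-1 (x-mu) */
        }
        sum += exp(mix[m].log_const - 0.5 * quad);   /* weighted mixture probability */
    }
    return log(sum);                                 /* log-likelihood over the 6 mixtures */
}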

Since there are three trained classes in the GMM model (speech, music and noise), three log-likelihood values are obtained by the above estimation process: . Only and are used in the subsequent logic. is used in the SAD algorithm to improve detection accuracy. Therefore, in case of inactive signal when fLSAD_HE is 0

(332)

The difference of log-likelihood is calculated as

(333)

which can be directly interpreted as a speech/music decision without hangover. This decision has low dynamic range and fluctuates a lot around zero, especially for mixed signals. To improve the detection accuracy, the decision is smoothed by means of AR filtering

(334)

where the superscript [-1] denotes the previous frame and is the filtering factor which is adaptively set on a frame-by-frame basis. The filtering factor is in the range [0;1] and is based on two measures (weighting factors). The first weighting factor is related to the relative frame energy and the second weighting factor is designed to emphasize rapid negative changes of .

The energy-based weight is calculated as follows

(335)

where is relative frame energy. The result of the addition means that the weight has values close to 0.01 in low-energy segments and close to 1 in high energy, or more important, segments. Therefore, the smoothed decision follows the current decision more closely if the signal energy is relatively high and leads to past information being disregarded more readily. On the other hand, if the signal energy is low, the smoothed decision puts more emphasis on previous decisions rather than the current one. This logic is motivated by the observation that discrimination between speech and music is more difficult when the SNR of the signal is low.

The second weighting factor is designed to track sudden transitions from music to speech. This situation happens only in frames where and at the same time . In these frames

(336)

where the parameter is a quantitative measure of sudden falls, or drops, in the value of . This parameter is set to 0 in all frames that do not fulfil the previous condition. In the first frame when falls below 0, it is set equal to its negative value and it is incremented when continues to fall in consecutive frames. In the first frame when stops decreasing it is reset back to 0. Thus, is positive only during frames of falling and the bigger the fall the bigger its value. The weighting factor is then calculated as

(337)

The filtering factor is then calculated by from the product of both weights, i.e.

(338)
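The adaptive smoothing of equations (334)-(338) is illustrated by the sketch below. The exact forms of the energy-based weight and of the drop-tracking weight are not reproduced; both are assumed to lie between 0.01 and 1 and to be computed as described in the text, and the first-order recursion is the assumed filter form.

double smooth_sm_decision(double dlp,       /* current (unsmoothed) log-likelihood difference  */
                          double dlp_prev,  /* smoothed decision of the previous frame         */
                          double wrel,      /* energy weight, ~0.01 (low E) .. 1 (high E)      */
                          double wdrop)     /* weight emphasising sudden negative changes      */
{
    double alpha = wrel * wdrop;            /* adaptive filtering factor, kept in [0;1]        */
    if (alpha > 1.0) alpha = 1.0;
    if (alpha < 0.0) alpha = 0.0;
    /* a large alpha makes the smoothed decision follow the current frame more closely */
    return (1.0 - alpha) * dlp_prev + alpha * dlp;
}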

5.1.13.6.4 State machine and final speech/music decision

The state machine is an event-based decision system in which the state of the speech/music classifier is changed from INACTIVE to ACTIVE and vice versa. There are two intermediate states: ENTRY and INSTABLE. The classifier must always go through the ENTRY state in order to become ACTIVE. During the ENTRY period, there is no relevant past information that could be exploited by the additional hangover algorithm described later in this section. When the classifier is in the ACTIVE state but the energy of the input signal decreases to the point where it is almost equal to the estimated background noise energy level, the classifier is in the INSTABLE state. Finally, when the SAD goes to 0, the classifier is in the INACTIVE state. The following state-flow diagram shows the transitions between the states.

Figure 16 : State machine for the first stage of the speech/music classifier

The conditions for changing the states are described in the form of a decision tree in figure 17. The processing starts at the top-left corner and stops at the bottom-right corner. The counter of inactive states is initialized to 0 and the state variable is initialized to -8. The state variable stays within the range [-8;+8], where the value of -8 means the INACTIVE state and the value of +8 means the ACTIVE state. If the classifier is in the ENTRY state, and if the classifier is in the INSTABLE state.

If the speech/music classifier is in INACTIVE state, i.e. if then the smoothed decision is automatically set to 0, i.e. .

The final decision of the speech/music classifier is binary and it is characterized by the flag fSM. The flag is set to 0 if and the classifier is in INACTIVE STATE, i.e. when . If there is a transition from the ACTIVE state to the INACTIVE or INSTABLE state, characterized by , the flag retains its value from the previous frame. If the classifier is in ENTRY state, characterized by , the flag is set according to weighted average of past non-binary decisions. This is done as follows

(339)

where the weighting coefficients are given in the following table:

Table 18: Weighting coefficients for the ENTRY period of the speech/music classifier

      1      2      3      4      5      6      7      8
1     1      0      0      0      0      0      0      0
2     0.6    0.4    0      0      0      0      0      0
3     0.47   0.33   0.2    0      0      0      0      0
4     0.4    0.3    0.2    0.1    0      0      0      0
5     0.3    0.25   0.2    0.15   0.1    0      0      0
6     0.233  0.207  0.18   0.153  0.127  0.1    0      0
7     0.235  0.205  0.174  0.143  0.112  0.081  0.05   0
8     0.2    0.179  0.157  0.136  0.114  0.093  0.071  0.05

Figure 18 : Decision tree for transitions between INACTIVE and ACTIVE states of the speech/music classifier

The flag is set to 1 when > 2. If the classifier is in a stable ACTIVE state, the flag retains its value from the previous frame unless one of the following two situations happens. If but the decisions in the previous three frames were all “speech”, i.e. for i=-1,-2,-3, there is a transition from speech to music and is set to 1. If but the decision in the previous frame was “music”, i.e. , there is a transition from music to speech and is set to 0.

The speech/music decision obtained by the algorithm described so far will be denoted in the following text to distinguish it from the second stage of the speech/music classifier in which the decision will be denoted .

5.1.13.6.5 Improvement of the classification for mixed and music content

The speech/music decision obtained above is further refined with the goal of improving the classification rate on music and mixed content. A set of feature parameters are extracted from the input signal and buffered. Statistical analysis is performed on each feature parameter buffer and a binary speech/music decision is obtained using a tree-based classification. During the processing the value ‘1’ indicates music and the value ‘0’ indicates non-music. As a result of this refinement, the earlier speech/music decision may be adjusted from ‘0’ to ‘1’ if has a final value of ‘1’ in the situation that the and are not aligned with each other.

The feature parameters used to form the feature parameter buffers include a spectral energy fluctuation parameter, , a tilt parameter of the LP analysis residual energies, , a high-band spectral peakiness parameter, , a parameter of correlation map sum, , a voicing parameter, , and finally three tonal parameters , and . Since music is assumed to exist only during high-SNR active regions of the input signal, the classification refinement is applied only for active frames and when the long-term SNR is high. So, if the SAD flag indicates that the current frame is an inactive frame or the long-term SNR is below a threshold of 25, i.e. , then the classification refinement is terminated without executing fully. In the early termination case, the speech/music decision is kept unchanged and the two long-term speech/music decisions , , as will be described later in this subclause, are both initialized to 0.5.

Before computing the various feature parameters, percussive music is first detected. Percussive music is characterized by temporal spike-like signals. First the log maximum amplitude of the current frame is found as

(340)

where is the time-domain input frame. The difference between the log maximum amplitude and its moving average from the previous frame is calculated

(341)

where the superscript [-1] denotes the value from the previous frame. is updated at each frame after the calculation of , if both normalized pitch correlations of the current frame and , as defined in subclause 5.1.11.3.2, are greater than 0.9, as

(342)

where the value is the forgetting factor and is set to 0.75 for increasing updates () and 0.995 for decreasing updates (). The , the total frame energy calculated in subclause 5.1.5.2, the normalized pitch correlation defined in subclause 5.1.11.3.2 and the long-term active signal energy are used to identify the temporal spike-like signals. First, a certain energy relationship between several past frames is checked. If and and and and , where the superscript denotes the -th frame in the past, the energy envelope of a temporal spike-like signal is considered to be found. Then, if the voicing is low, that is if , the percussive music flag is set to 1, indicating the detection of a spike-like signal, if the normalized pitch correlations for the second half of the previous frame, the first half of the current frame and the second half of the current frame, , and , are all less than 0.75 and the is greater than 10, or simply if the long-term speech/music decision is greater than 0.8.

Besides the detection of percussive music, sound attacks are also detected using , , and , where denotes the of the previous frame. If and and and , a sound attack is detected and the attack flag is set to 3. The attack flag is decremented by 1 in each frame after the calculation and buffering of the spectral energy fluctuation parameter , which is calculated from the log energy spectrum of the current frame as follows. Firstly, all local peaks and valleys in the log spectrum , as calculated in equation (127), are identified. A value of is considered as a local peak if and . A value of is considered as a local valley if and . Besides, the first local valley is found as the with and , and the last local valley is found as the with and . For each local peak, its peak to valley distance is calculated as

(343)

where denotes the peak to valley distance of the -th local peak, denotes the log energy of the -th local peak and , denote the respect log energy of the local valleys adjacent to the -th local peak at the lower frequency side and the higher frequency side, denotes the number of local peaks. An array called peak to valley distance map is then obtained as

(344)

where denotes the peak to valley distance map, denotes the index (or the location) of the -th local peak in the log spectrum . The spectral energy fluctuation parameter is defined as the average energy deviation between the current frame spectrum and the spectrum two frames ago at locations of the local spectral peaks . The is computed as

(345)

where and denote respectively the log energy spectrum of the current frame and the log energy spectrum of the frame two frames ago, and denotes the number of local peaks. If = 0, is set to 5. The computed is stored into a buffer of 60 frames if there is no sound attack in the past 3 frames (including the current frame), that is if . Moreover, if the long-term speech/music decision is greater than 0.8, meaning a strong music signal in previous classifications, then the value of is upper limited to 20 before it is stored into the buffer. At every first active frame after an inactive segment (flagged by ), the buffer is altered so that all values in the buffer, excluding the one just calculated and stored for the current frame, are changed to negative values.
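An illustrative computation of the spectral energy fluctuation parameter is sketched below: local peaks of the current log energy spectrum are located as described above, and the average deviation from the spectrum two frames ago is taken at those peak positions. The deviation is taken here as an absolute difference, which is an assumption; the normative averaging of equation (345) is not reproduced exactly.

#include <math.h>

double spectral_energy_fluctuation(const double *E,      /* current log energy spectrum  */
                                   const double *E_m2,   /* spectrum two frames ago      */
                                   int len)
{
    double dev = 0.0;
    int n_peaks = 0;
    for (int k = 1; k < len - 1; k++) {
        if (E[k] > E[k - 1] && E[k] > E[k + 1]) {         /* local spectral peak          */
            dev += fabs(E[k] - E_m2[k]);
            n_peaks++;
        }
    }
    return (n_peaks == 0) ? 5.0 : dev / n_peaks;          /* set to 5 if no peaks found   */
}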

The effective portion of the buffer is determined in each frame after the calculation and buffering of the parameter. The effective portion is defined as the portion in the buffer which contains continuous non-negative values starting from the value of the latest frame . If percussive music is detected, that is if the percussive music flag is set to 1, each value in the effective portion of the buffer is initialized to 5.

The tilt parameter of the LP analysis residual energies is calculated as

(346)

where is the LP error energies computed by the Levinson-Durbin algorithm. The computed is stored into a buffer of 60 frames.

The high-band spectral peakiness parameter reflects an overall tonality of the current frame at its higher frequency band and is calculated from the peak to valley distance map as

(345)

The calculated is stored into a buffer of 60 frames.

The three tonal parameters , and are also calculated from the peak to valley distance map . denotes the first number of harmonics found from the spectrum of the current frame. is calculated as

(346)

denotes the second number of harmonics also found from the spectrum of the current frame. is defined more strictly than and is calculated as

(347)

denotes the number of harmonics found only at the low frequency band of the current frame’s spectrum and is calculated as

(348)

The calculated values of , and are stored into their respective buffers , and all of 60 frames.

The sum of correlation map as calculated by

(349)

is also stored into a buffer of 60 frames, where is the correlation map calculated in subclause 5.1.11.2.5.

The voicing parameter is defined as the difference of log-likelihood between speech class and music class as calculated in subclause 5.1.13.6.3. The is calculated as

(350)

where , are the log-likelihood of speech class and the log-likelihood of music class respectively. is stored into a buffer of 10 frames.

The speech/music decision is obtained through a tree-based classification. The is first initialized as a hysteresis of the long-term speech/music decision from the previous frame, i.e.

(351)

where the superscript [-1] denotes the value from the previous frame. Then, the can be altered through successive classifications. Let denotes the length of the effective portion in. Depending on the actual value of , different classification procedures are followed. If , insufficient data is considered in the feature parameter buffers. The classification is terminated and the initialized is used as the final . If , the respective mean values , and are calculated from , and over the effective portion and the variance , calculated over the effective portion from is also obtained. In addition, the number of positive values among the 6 latest values in is counted. The speech/music decision is then set to 1 if and any of the following conditions is fulfilled; or or or . Otherwise, if , the feature buffers are first analysed over the portion containing the latest 10 values. The mean values , and are calculated from, and over and for the same portion the variance is also calculated from .Besides, the mean value of, over a shorter portion of the latest 5 frames is also calculated. The is found as the number of positive values in. The speech/music decision is determined without the need to analyse any longer portion if strong speech or music characteristics are found within , that is, the and are both set to 1 if and and any of the following conditions is fulfilled: or or or . The and are both set to 0 if any of the following conditions is fulfilled: or or or . If no class is determined for over the values, the is determined iteratively over portions starting from until the whole effective portion is reached. For each iteration, the respective mean values, and are calculated from, and over the portion under analysis and for the same portion the variance is also calculated from . The mean value is calculated from over , and the number of positive values in ,, is also counted. The value of is set to 1 if and any of the following conditions is fulfilled: or or or. If through the above iteration procedure the is not set and if the effective portion reaches the maximum of 60 frames, a final speech/music discrimination is made from , and . The mean value of the , the sum value of the , and the sum value of the are calculated over the whole buffers. A low frequency tonal ratio is calculated as

(352)

The is set to 1 if or . Otherwise, if , the is set to 0.

If is greater than 30, then both long-term speech/music decisions and are updated at each frame with as

(353)

(354)

where the superscript [-1] denotes the value from the previous frame. If the total frame energy calculated in subclause 5.1.5.2 is greater than 1.5 and is less than 2 and the raw coding mode is either UNVOICED or INACTIVE, then an unvoiced counter initialized to 300 at the first frame is updated by

(355)

Otherwise, is incremented by 1. The value of is bounded to the range [0, 300]. The is further smoothed by AR filtering as

(356)

where is the smoothed , the superscript [-1] denotes the value from the previous frame. If is set to 1 in any previous stage, the flag is overridden by unless the long-term speech/music decision as calculated in equation (357) is close to speech and the smoothed unvoiced counter exhibits strong unvoiced characteristic, that is, the is set to 1 if and .

5.1.13.6.6 Second stage of the speech/music classifier

The second stage of the speech/music classifier has been designed and optimized for the GSC technology. Not all frames classified as music in the first stage can be encoded directly with the GSC technology due to its inherent limitations. Therefore, in the second stage of the speech/music classifier, a subset of frames that have been previously classified as “music”, i.e. for which , are reverted to speech and encoded with one of the CELP modes. The decision in the second stage of the speech/music classifier is denoted . The second stage is run only for WB, SWB and FB signals, not for NB. The reason for this limitation is purely due to the fact that GSC technology is only applied at bandwidths higher than NB.

The second stage of the speech/music classifier starts with signal stability estimation which is based on frame-to-frame difference of the total energy of the input signal. That is

(358)

Then, the mean energy difference is calculated as

(359)

i.e. over the period of the last 40 frames. Then, the statistical deviation of the delta-energy values around this mean is calculated as

(360)

i.e. over the period of the last 15 frames.

After signal stability estimation, correlation variance is calculated as follows. First, mean correlation is estimated over the period of the last 10 frames. This is done as

(361)

where and the superscript [-1] is used to denote past frames. Then, the correlation variance is defined as

(362)

In order to discriminate highly-correlated stable frames, long-term correlation is calculated as

(363)

The flag is set to 1 if and at the same time .
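The stability and correlation-variance measures of equations (358)-(362) can be sketched as below, assuming history buffers of the frame-to-frame energy differences and of the per-frame correlations are maintained elsewhere (index 0 being the most recent frame). Whether the deviation is a standard deviation or a mean absolute deviation is not reproduced here; a standard deviation is used as an assumption, and all names are ours.

#include <math.h>

static double mean_of(const double *buf, int n)
{
    double m = 0.0;
    for (int i = 0; i < n; i++) m += buf[i];
    return m / n;
}

/* dE_buf[0] is the most recent frame-to-frame total-energy difference. */
double energy_deviation(const double *dE_buf)
{
    double mean40 = mean_of(dE_buf, 40);      /* mean energy difference, last 40 frames */
    double dev = 0.0;
    for (int i = 0; i < 15; i++)              /* deviation around it, last 15 frames    */
        dev += (dE_buf[i] - mean40) * (dE_buf[i] - mean40);
    return sqrt(dev / 15.0);
}

/* corr_buf[0] is the correlation of the current frame. */
double correlation_variance(const double *corr_buf)
{
    double mean10 = mean_of(corr_buf, 10);    /* mean correlation, last 10 frames       */
    double var = 0.0;
    for (int i = 0; i < 10; i++)
        var += (corr_buf[i] - mean10) * (corr_buf[i] - mean10);
    return var / 10.0;
}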

In the next step, attacks are detected in the input signal. This is done by dividing the current frame of the input signal into 32 segments, where each segment has a length of 8 samples. Then, the energy is calculated in each segment as

(364)

The segment with the maximum energy is then found by

(365)

and this is the position of the candidate attack. In all active frames where and for which the coding mode was set to GC, the following logic is executed to eliminate false attacks, i.e. attacks that are not sufficiently strong. First, the mean energy in the first 3 sub-frames is calculated as

(366)

and the mean energy after the detected candidate attack is defined

(367)

and the ratio of these two energies is compared to a certain threshold. That is

(368)

Thus, the candidate attack position is set to 0 if the attack is not sufficiently strong. Further, if the FEC class of the last frame was VOICED CLASS and if then is also set to 0.

To further reduce the number of falsely detected attacks, the segment with maximum energy is compared to other segments. This comparison is done regardless of the selected coding mode and .

(369)

Thus, if the energy in any of the above defined segments, other than , is close to that of the candidate attack, the attack is eliminated by setting to 0.
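A non-normative sketch of the attack-candidate search described above is given below: the frame is split into 32 segments of 8 samples, the maximum-energy segment is the candidate, and weak candidates are discarded by comparing the mean energy after the candidate with the mean energy of the first three sub-frames. ATTACK_RATIO_THR is a placeholder; the normative threshold of equation (368) and the additional comparison of equation (369) are not reproduced.

#define FRAME_LEN   256
#define SEG         8
#define N_SEGMENTS  (FRAME_LEN / SEG)      /* 32 segments */
#define SUBFR_LEN   64
#define ATTACK_RATIO_THR  8.0              /* placeholder strength threshold */

int find_attack_position(const float *s)   /* returns candidate segment, 0 = no attack */
{
    double e[N_SEGMENTS];
    int i, n, imax = 0;

    for (i = 0; i < N_SEGMENTS; i++) {      /* energy of each 8-sample segment */
        e[i] = 0.0;
        for (n = 0; n < SEG; n++)
            e[i] += (double)s[i * SEG + n] * s[i * SEG + n];
        if (e[i] > e[imax]) imax = i;       /* candidate attack position */
    }

    double e_before = 0.0, e_after = 0.0;
    for (n = 0; n < 3 * SUBFR_LEN; n++)     /* mean energy of the first 3 sub-frames */
        e_before += (double)s[n] * s[n];
    e_before /= 3 * SUBFR_LEN;
    for (i = imax; i < N_SEGMENTS; i++)     /* mean energy from the candidate onwards */
        e_after += e[i];
    e_after /= (N_SEGMENTS - imax) * SEG;

    return (e_after > ATTACK_RATIO_THR * e_before) ? imax : 0;
}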

Initially the speech/music decision in the second stage is set equal to the speech/music decision from the first stage, i.e. . In case the decision is “music”, it could be reverted to “speech” in the following situations.

The decision is reverted from music to speech for highly correlated stable signals with higher pitch period. These signals are characterized by

(370)

Further, if the above condition is fulfilled and the selected coding mode was TC, it is changed to GC. This is to avoid any transition artefacts during stable harmonic signal.

In case there is an energetic event, characterized by and at the same time , it could mean that an attack has occurred in the input signal, and the following logic takes place. If, in this situation, the counter of frames from the last detected onset/transition , described in subclause 5.1.13.4, has been set to 1, the attack is confirmed and the decision is changed to speech, i.e. . Also, the coding mode is changed to TC. Otherwise, if there has been an attack found by the attack tracking algorithm described above, and the position of this attack is beyond the end of the third sub-frame, the decision is also changed to speech and the coding mode is changed to TC. That is

(371)

Furthermore, an attack flag is set to 1 if the detected attack is located after the first quarter of the first sub-frame, i.e. when . This flag is later used by the GSC technology. Finally, the attack flag is set to 1 in all active frames () that have been selected for GC coding and for which the decision in the first stage of the speech/music classifier was “speech”. However, it is restricted only to frames in which the attack is located in the fourth subframe. In this case, the coding mode is also changed to TC for better representation of the attack.

As previously described, if flag_spitch = 1, the VC mode is maintained and the AC mode decision is set to 0; that is,

if (flag_spitch == 1 AND sampling rate == 16 kHz AND bit rate < 13.2 kbps)

{

    AC mode decision = 0;

}

5.1.13.6.7 Context-based improvement of the classification for stable tonal signals

By using context-based improvement of the classification, an error in the classification in the previous stage can be corrected. If the current frame has been provisionally classified as “speech”, the classification result can be corrected to “music”, and vice versa. To determine a possible error in the current frame, the values of 8 consecutive frames including the current frame are considered for some features.

Figure 19 shows the multiple coding mode signal classification method. If the current frame has been provisionally classified as “speech” after the first- and second-stage classification, then the frame is encoded using the CELP-based coding. On the other hand, if the current frame is initially classified as “music” after the first- and second-stage classification, then the frame is further analysed for fine-classification of “speech” or “music” to select either the GSC-based coding or MDCT-based transform coding, respectively. The parameters used to perform the fine-classification in multiple coding mode selection include:

  • Tonality
  • Voicing
  • Modified correlation
  • Pitch gain, and
  • Pitch difference

The tonalities in the sub-bands of 0-1 kHz, 1-2 kHz, and 2-4 kHz are estimated as , , and , respectively, as follows:

where is the power spectrum. The maximum tonality, , is estimated as,

The voicing feature, , is the same one as used in the unvoiced signal classification. See equation (237) in subclause 5.1.13.1.1. for the details of its computation. The voicing feature from the first analysis window is used, i.e.

The modified correlation, , is the normalized correlation from the previous frame.

The pitch gain, , is the smoothed closed-loop pitch gain estimated from the previous frame, i.e.,

where is the ACB gain in each of the sub-frames from the previous frame.

The pitch deviation is estimated as the sum of pitch differences between the current frame open-loop pitch , and the open loop pitch in the previous three frames,

The features, , , and are smoothed to minimize spurious instantaneous variations as follows:

where and are 0.1 in active frames (i.e., SAD = 1), and 0.7 in background noise and inactive frames. Similarly, is 0.1 in active frames and 0.5 in inactive frames.

Figure 20 : Multiple coding mode signal classification

The following condition is evaluated to select the GSC or MDCT based coding,

A hangover logic is used to prevent frequent switching between coding modes of GSC and MDCT-based coding. A hangover period of 6 frames is used. The coding mode is further modified as per below.

Figure 21 shows the two independent state machines defined in the context-based classifier, SPEECH_STATE and MUSIC_STATE. Each state machine has two states. In each state, a hangover of 6 frames is used to prevent frequent transitions. If there is a change of decision within a given state, the hangover in each state is set to 6, and the hangover is then reduced by 1 for each subsequent frame. A state change can only occur once the hangover has been reduced to zero. The following six features are used in the context-based classifier (the superscript [-i] is used below to denote past frames).

The tonality in the region of 1~2 kHz is defined as

(372)

The tonality in the region of 2~4 kHz is defined as

(373)

The long-term tonality in the low band is defined as

(374)

The difference between the tonality in the 1~2 kHz band and the tonality in the 2~4 kHz band is defined as

(375)

The linear prediction error is defined as

(376)

where has been defined in equation (377).

The difference between the scaled voicing feature defined in equation (378) and the scaled correlation map feature defined in equation (379) is defined as

(380)

The following two independent state machines are used to correct errors in the previous stages of the speech/music classification. The two state machines are called SPEECH_STATE and MUSIC_STATE. There are also two hangover variables, denoted and , which are initialized to a value of 6 frames. The following four conditions are evaluated to determine the transition from one state to another.

Condition A is defined as

(381)

Condition B is then defined as

(382)

Condition C is defined as

(383)

and finally condition D is defined as

(384)

Figure 22: State machines for context-based speech/music correction

The decisions from the speech/music classifier, and are changed to 0 (“speech”) if was previously set to 1 (“music”) and if the context-based classifier is in SPEECH_STATE. Similarly, the decisions from the speech/music classifier, and are changed to 1 (“music”) if was previously set to 0 (“speech”) and if the context-based classifier is in MUSIC_STATE.
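For illustration only, the hangover-controlled state machines and the final decision override can be sketched in C as shown below. Conditions A to D of equations (381) to (384) are passed in as pre-evaluated flags, and their mapping to the individual state transitions, as well as all names, are assumptions and not the reference implementation.

typedef struct
{
    int speech_state;     /* SPEECH_STATE membership flag      */
    int music_state;      /* MUSIC_STATE membership flag       */
    int hang_sp;          /* SPEECH_STATE hangover, init. to 6 */
    int hang_mus;         /* MUSIC_STATE hangover, init. to 6  */
} ContextClassifier;

void context_update( ContextClassifier *st, int cond_A, int cond_B,
                     int cond_C, int cond_D )
{
    /* a state may only change once its hangover has counted down to zero */
    if( st->hang_sp > 0 )  st->hang_sp--;
    if( st->hang_mus > 0 ) st->hang_mus--;

    if( st->hang_sp == 0 )
    {
        if( !st->speech_state && cond_A )      { st->speech_state = 1; st->hang_sp = 6; }
        else if( st->speech_state && cond_B )  { st->speech_state = 0; st->hang_sp = 6; }
    }
    if( st->hang_mus == 0 )
    {
        if( !st->music_state && cond_C )       { st->music_state = 1; st->hang_mus = 6; }
        else if( st->music_state && cond_D )   { st->music_state = 0; st->hang_mus = 6; }
    }
}

/* override of the first- and second-stage decisions (0 = "speech", 1 = "music") */
void context_correct( const ContextClassifier *st, int prev_dec,
                      int *dec_1st, int *dec_2nd )
{
    if( prev_dec == 1 && st->speech_state ) { *dec_1st = 0; *dec_2nd = 0; }
    if( prev_dec == 0 && st->music_state )  { *dec_1st = 1; *dec_2nd = 1; }
}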

5.1.13.6.8 Detection of sparse spectral content

At 13.2 kbps, the coding of music signals benefits from combining the advantages of the MDCT and GSC technologies. For frames classified as music after the context-based improvement, the coding mode producing better quality is selected between MDCT and GSC, based on an analysis of the signal spectral sparseness and the linear prediction efficiency (depending on the input bandwidth).

For each active frame, the sum of the log energy spectrum is calculated for use in the spectral sparseness analysis.

(385)

Then the log energy spectrum is sorted in descending order of magnitude. The elements of the sorted log energy spectrum are accumulated one by one in descending order until the accumulated value exceeds 75% of the sum calculated in equation (385). The index (or position), , of the last element added to the accumulation can be regarded as a representation of the spectral sparseness of the frame and is stored in a sparseness buffer of 8 frames.
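A minimal sketch of this sparseness measure is given below. The function name, the 0-based indexing and the array handling are assumptions for illustration; only the sorting, the accumulation and the 75% threshold follow the text above.

#include <stdlib.h>

/* descending-order comparator for qsort() */
static int cmp_desc( const void *a, const void *b )
{
    float fa = *(const float *)a, fb = *(const float *)b;
    return ( fa < fb ) - ( fa > fb );
}

/* index at which the cumulative sorted log energy first exceeds 75% of the total */
int sparseness_index( const float *log_enr, int n_bins )
{
    float *sorted = (float *)malloc( n_bins * sizeof( float ) );
    float total = 0.0f, acc = 0.0f;
    int i, idx = n_bins - 1;

    for( i = 0; i < n_bins; i++ )
    {
        sorted[i] = log_enr[i];
        total += log_enr[i];
    }
    qsort( sorted, n_bins, sizeof( float ), cmp_desc );

    for( i = 0; i < n_bins; i++ )
    {
        acc += sorted[i];
        if( acc > 0.75f * total )       /* 75% of the total log energy */
        {
            idx = i;
            break;
        }
    }
    free( sorted );
    return idx;                         /* small index => sparse spectrum */
}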

If the input bandwidth is WB, some parameters dedicated to WB are calculated, including the mean of the sparseness buffer, the long-term smoothed sparseness, the high-band log energy sum, the high-band high sparseness flag, the low-band high sparseness flag, the linear prediction efficiency and the voicing metric. For other input bandwidths the above parameters are not calculated. The mean of the sparseness buffer is obtained as

(386)

Then the long-term smoothed sparseness is calculated as

(387)

where denotes the long-term smoothed sparseness in the previous frame and denotes the average of the four smallest values in the sparseness buffer . The reason for using is to reduce the possible negative impact on from interfering frames. If the current frame is the first active frame after a pause, the long-term smoothed sparseness is initialized to the value of and all elements of the sparseness buffer are also initialized to the value of . The high-band log energy sum is calculated over

(388)

To obtain the high-band high sparseness flag, the high-band log energy spectrum is first sorted in descending order. The ratio of the sum of the first 5 elements (or the 5 largest values) of the sorted high-band log energy spectrum to the high-band log energy sum is calculated

(389)

where is the sorted high-band log energy spectrum. The ratio can be regarded as a representation of the high-band spectral sparseness of the frame and is stored in a high-band sparseness buffer . The mean of the buffer is calculated. If the mean is greater than 0.2, the high-band high sparseness flag is set to 1, indicating a high sparseness of the high-band spectrum; otherwise it is set to 0. Similarly, to obtain the low-band high sparseness flag, the low-band log energy spectrum is sorted in descending order and the ratio of the sum of the first 5 elements (i.e. the 5 largest values) of the sorted low-band log energy spectrum to the low-band log energy sum is calculated

(390)

where is the sorted low-band log energy spectrum. The ratio can be regarded as a representation of the low-band sparseness of the frame. If the ratio is greater than 0.18, the low-band high sparseness flag is set to 1, indicating a high sparseness of the low-band spectrum; otherwise it is set to 0. The LP residual log-energy ratio of the current frame , as calculated in subclause 5.1.13.5.1 and shown again below,

(391)

is stored in an LP residual log-energy ratio buffer of 8 frames. The mean of the buffer , , is calculated and used to represent the short-term linear prediction efficiency of the current frame. The lower this value, the higher the short-term linear prediction efficiency. The scaled normalized correlation , as calculated in subclause 5.1.13.5.2, is stored in a voicing buffer of 8 frames. The mean of the buffer , , is calculated and used to represent the voicing metric of the current frame.
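For illustration only, the band high-sparseness flags described above can be sketched as follows. The function names, buffer handling and the guard against a non-positive sum are assumptions; only the top-5 ratio, the 8-frame averaging for the high band, and the thresholds 0.2 and 0.18 follow the text above.

#include <stdlib.h>
#include <string.h>

static int cmp_desc( const void *a, const void *b )
{
    float fa = *(const float *)a, fb = *(const float *)b;
    return ( fa < fb ) - ( fa > fb );
}

/* ratio of the 5 largest log-energy values to the band log-energy sum */
static float top5_ratio( const float *log_enr, int n_bins )
{
    float *sorted = (float *)malloc( n_bins * sizeof( float ) );
    float sum = 0.0f, top = 0.0f;
    int i;

    memcpy( sorted, log_enr, n_bins * sizeof( float ) );
    qsort( sorted, n_bins, sizeof( float ), cmp_desc );
    for( i = 0; i < n_bins; i++ )        sum += sorted[i];
    for( i = 0; i < 5 && i < n_bins; i++ ) top += sorted[i];
    free( sorted );
    return ( sum > 0.0f ) ? top / sum : 0.0f;
}

/* high-band flag: mean of an 8-frame ratio buffer compared against 0.2 */
int high_band_sparseness_flag( const float ratio_buf[8] )
{
    float mean = 0.0f;
    int i;
    for( i = 0; i < 8; i++ ) mean += ratio_buf[i];
    return ( mean / 8.0f ) > 0.2f;
}

/* low-band flag: current-frame ratio compared against 0.18 */
int low_band_sparseness_flag( const float *low_band_log_enr, int n_bins )
{
    return top5_ratio( low_band_log_enr, n_bins ) > 0.18f;
}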

The decision on which coding mode to use (MDCT or GSC) is made for each frame previously classified as music, that is, for each frame where is set to 1. GSC coding mode is selected by setting to 1 and changing to 0. GSC coding mode is selected for frames with an extremely non-sparse spectrum, that is, when is greater than 90. In this case, the GSC hangover flag is also set to 1, meaning that a soft hangover period will be applied. Otherwise, if is set to 1, the current frame is in a soft hangover period where the determination of an extremely non-sparse spectrum is slightly relaxed, that is, GSC coding mode is selected if is greater than 85. If, in the above case, is not greater than 85, GSC coding mode is still selected if of the current frame deviates from the average of its adjacent GSC-coded frames by less than 7; a maximum of 7 frames is used for the averaging. The selection between the MDCT coding mode and the GSC coding mode ends here if the input bandwidth is SWB. For WB input bandwidth, one more step is applied. In this case, GSC coding mode is also selected if none of the calculated sparseness measures exhibits strong sparseness characteristics and the linear prediction efficiency is assumed to be high. Specifically, GSC coding mode is selected if and and and and is set to 0 and , or if condition is not met but is not set to 1. In the above case, the GSC hangover flag is also set to 1. The flag is set to 0 if GSC coding mode is not selected through the whole procedure described above.
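A minimal sketch of the SWB part of this decision cascade is given below; the additional WB step depends on thresholds not reproduced here. The names (lt_sparse, gsc_hangover, etc.), the quantity compared in the deviation test, and the externally supplied average over adjacent GSC-coded frames are assumptions.

#include <math.h>

/* returns 1 if GSC coding mode is selected, 0 if MDCT-based coding is kept */
int select_gsc( float lt_sparse,              /* long-term smoothed sparseness            */
                int   gsc_hangover,           /* soft hangover flag from previous frames  */
                float frame_sparse,           /* sparseness of the current frame          */
                float avg_adjacent_gsc_sparse,/* average over up to 7 adjacent GSC frames */
                int   n_adjacent_gsc )        /* number of adjacent GSC-coded frames      */
{
    if( lt_sparse > 90.0f )                   /* extremely non-sparse spectrum            */
        return 1;

    if( gsc_hangover )                        /* soft hangover period: relaxed threshold  */
    {
        if( lt_sparse > 85.0f )
            return 1;

        if( n_adjacent_gsc > 0 &&
            fabsf( frame_sparse - avg_adjacent_gsc_sparse ) < 7.0f )
            return 1;
    }

    return 0;
}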

5.1.13.6.9 Decision about AC mode

The decisions in the first and second stages of the speech/music classifier, refined and corrected by the modules described so far, are used to determine the usage of the AC mode. As mentioned before, in the AC mode the GSC technology is used to encode the input signal. The decision about the AC mode is always made, but the GSC technology is used only at certain bitrates. This is described in subclause 5.1.16.

Before making the decision about the AC mode, the speech/music classification results are overridden for certain noisy speech signals. If the level of the background noise is higher than 12 dB, i.e. when , then . This is a protection against the mis-classification of active noisy speech signals.

For certain unvoiced SWB signals, the GSC technology is preferred over the UC or GC mode that would normally be selected. In order to override the selection of the coding mode, a flag, denoted , is set to 1 under the following condition

(392)

where is the raw coding mode, described in subclause 5.1.13.2.

The AC mode is selected if or if .
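For illustration only, the final AC-mode decision can be sketched as below. The names, and in particular the assumption that the noisy-speech override forces the "speech" decision, are illustrative and not taken from the reference source code.

/* returns 1 if the AC mode is selected */
int decide_ac_mode( int   sp_aud_decision,       /* speech/music decision (1 = "music")  */
                    float total_noise_dB,        /* background noise level               */
                    int   gsc_unvoiced_swb_flag )/* flag for unvoiced SWB signals        */
{
    if( total_noise_dB > 12.0f )
    {
        sp_aud_decision = 0;   /* assumed: force "speech" for active noisy speech */
    }

    return ( sp_aud_decision == 1 ) || ( gsc_unvoiced_swb_flag == 1 );
}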

5.1.13.6.10 Decision about IC mode

The IC mode has been designed and optimized for inactive signals, which consist essentially of background noise. Two encoding technologies are used for the encoding of these frames: the GSC and the AVQ. The GSC technology is used at bitrates below 32 kbps and the AVQ is used otherwise. The selection of the IC mode at bitrates below 32 kbps is conditioned by , whereas for higher bitrates the condition is changed to .

The TC mode and the AC mode are not used at 9.6, 16.4 and 24.4 kbps. Thus, at these bitrates, the coding mode is changed to the GC mode if it was previously set to the TC or AC mode. Furthermore, the selection of the IC mode at the previously mentioned bitrates is conditioned only by .
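A minimal sketch of this bitrate-dependent handling is given below; the enumeration values and function names are assumptions, and the flags conditioning the IC selection (which differ below and above 32 kbps) are not reproduced here.

typedef enum { GC, UC, VC, TC, AC, IC } CoderType;

/* TC and AC are not used at 9.6, 16.4 and 24.4 kbps; fall back to GC there */
CoderType adjust_mode_for_bitrate( CoderType mode, long bitrate )
{
    if( ( bitrate == 9600 || bitrate == 16400 || bitrate == 24400 ) &&
        ( mode == TC || mode == AC ) )
    {
        mode = GC;
    }
    return mode;
}

/* IC frames: GSC-based encoding below 32 kbps, AVQ-based encoding otherwise */
int ic_uses_gsc( long bitrate )
{
    return bitrate < 32000;
}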