5.1.14 Coder technology selection

Multiple coding technologies are employed within the EVS codec, based on one of the two generic principles for speech and audio coding: the LP-based (analysis-by-synthesis) approach and the transform-domain (MDCT) approach. There is no clearly defined borderline between the two approaches in the context of this codec. The LP-based coder is essentially based on the CELP technology, optimized and tuned specifically for each bitrate. The transform-domain approach is adopted by the HQ MDCT technology. There are also two hybrid schemes in which both approaches are combined: the GSC technology and the TCX technology. The selection of the coder technology depends on the actual bitrate, the bandwidth, the speech/music classification, the selected coding mode and other parameters. The following table shows the allocation of technologies based on bitrate, bandwidth and content.

Table 19: Allocation of coder technologies per bitrate, bandwidth and content

bitrate (kbps)  | 7.2     | 8       | 9.6 | 13.2            | 16.4        | 24.4        | 32          | 48  | 64
NB  speech      | ACELP   | ACELP   | ACELP | ACELP         | ACELP       | ACELP       | -           | -   | -
NB  audio       | HQ MDCT | HQ MDCT | TCX | TCX/HQ MDCT     | TCX/HQ MDCT | TCX         | -           | -   | -
NB  noise       | GSC     | GSC     | TCX | GSC             | TCX         | TCX         | -           | -   | -
WB  speech      | ACELP   | ACELP   | ACELP | ACELP         | ACELP       | ACELP       | ACELP       | TCX | ACELP
WB  audio       | GSC     | GSC     | TCX | GSC/TCX/HQ MDCT | TCX/HQ MDCT | TCX         | HQ MDCT     | TCX | HQ MDCT
WB  noise       | GSC     | GSC     | TCX | GSC             | TCX         | TCX         | ACELP       | TCX | ACELP
SWB speech      | -       | -       | -   | ACELP           | ACELP       | ACELP       | ACELP       | TCX | ACELP
SWB audio       | -       | -       | -   | GSC/TCX/HQ MDCT | TCX/HQ MDCT | TCX/HQ MDCT | TCX/HQ MDCT | TCX | HQ MDCT
SWB noise       | -       | -       | -   | GSC             | TCX         | TCX         | ACELP       | TCX | ACELP
FB  speech      | -       | -       | -   | -               | ACELP       | ACELP       | ACELP       | TCX | ACELP
FB  audio       | -       | -       | -   | -               | TCX         | TCX/HQ MDCT | TCX/HQ MDCT | TCX | HQ MDCT
FB  noise       | -       | -       | -   | -               | TCX         | TCX         | ACELP       | TCX | ACELP

The TCX technology is used for any content at bitrates higher than 64 kbps.

At 9.6kbps, 16.4kbps and 24.4kbps a specific technology selector is used to select either ACELP or an MDCT-based technology (HQ MDCT or TCX). This selector is described in clause 5.1.14.1.

At all other bitrates, the division into “speech”, “audio” and background “noise” is based on the decision of the SAD and on the decision of the speech/music classifier.

The decision between the TCX technology and the HQ MDCT technology is done adaptively on a frame-by-frame basis. There are two selectors, one for 13.2 and 16.4 kbps and the second for 24.4 and 32 kbps. There is no adaptive selection beyond these bitrates as shown in the above table. These two selectors are described in detail in the subclauses 5.1.14.2 and 5.1.14.3.
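For illustration, the allocation of Table 19 can be read as a simple lookup. The following C sketch is not part of the normative algorithm: the enumerations are hypothetical, only the WB rows of the table are reproduced, entries marked TECH_ADAPTIVE correspond to the frame-by-frame selectors of subclauses 5.1.14.1 to 5.1.14.3, and the fallback above 64 kbps follows the rule stated above.

#include <stdio.h>

/* Hypothetical content classes and coder technologies (illustrative names only). */
typedef enum { CONTENT_SPEECH, CONTENT_AUDIO, CONTENT_NOISE } content_t;
typedef enum { TECH_ACELP, TECH_GSC, TECH_TCX, TECH_HQ_MDCT, TECH_ADAPTIVE } tech_t;

static const float rates_kbps[9] = { 7.2f, 8.0f, 9.6f, 13.2f, 16.4f, 24.4f, 32.0f, 48.0f, 64.0f };

/* WB rows of Table 19; TECH_ADAPTIVE marks cells where the GSC/TCX/HQ MDCT
   choice is made per frame by the selectors of subclauses 5.1.14.1 to 5.1.14.3. */
static const tech_t wb_alloc[3][9] = {
    /* speech */ { TECH_ACELP, TECH_ACELP, TECH_ACELP, TECH_ACELP, TECH_ACELP,
                   TECH_ACELP, TECH_ACELP, TECH_TCX, TECH_ACELP },
    /* audio  */ { TECH_GSC, TECH_GSC, TECH_TCX, TECH_ADAPTIVE, TECH_ADAPTIVE,
                   TECH_TCX, TECH_HQ_MDCT, TECH_TCX, TECH_HQ_MDCT },
    /* noise  */ { TECH_GSC, TECH_GSC, TECH_TCX, TECH_GSC, TECH_TCX,
                   TECH_TCX, TECH_ACELP, TECH_TCX, TECH_ACELP }
};

static tech_t select_wb_technology(float bitrate_kbps, content_t content)
{
    for (int i = 0; i < 9; i++) {
        if (rates_kbps[i] == bitrate_kbps) {
            return wb_alloc[content][i];
        }
    }
    return TECH_TCX; /* above 64 kbps, TCX is used for any content */
}

int main(void)
{
    printf("%d\n", (int)select_wb_technology(13.2f, CONTENT_AUDIO)); /* TECH_ADAPTIVE */
    return 0;
}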

5.1.14.1 ACELP/MDCT-based technology selection at 9.6kbps, 16.4 and 24.4 kbps

At 9.6kbps, 16.4kbps and 24.4kbps the decision to choose either ACELP or an MDCT-based technology is not based on the decision of the speech/music classifier as it is done for other bitrates, but on a specific technology selector described below.

The technology selector is based on two estimates of the segmental SNR, one estimate corresponding to the transform-based technology (described in subclause 5.1.14.1.1), another estimate corresponding to the ACELP technology (described in subclause 5.1.14.1.2). Based on these two estimates and on a hysteresis mechanism, a decision is taken (described in subclause 5.1.14.1.3).

5.1.14.1.1 Segmental SNR estimation of the MDCT-based technology

The segmental SNR estimation of the TCX technology is based on a simplified TCX encoder. The input audio signal is first filtered using an LTP filter, then windowed and transformed using an MDCT; the MDCT spectrum is then shaped using the weighted LPC, a global gain is estimated, and finally the segmental SNR is derived from the global gain. All these steps are described in detail in the following clauses.

5.1.14.1.1.1 Long term prediction (LTP) filtering

The LTP filter parameters (pitch lag and gain) are first estimated. The LTP parameters are not only used to filter the audio input signal when estimating the segmental SNR of the transform-based technology; they are also encoded into the bitstream in case the TCX coding mode is selected, such that the TCX LTP postfilter described in subclause 6.9.2.2 can use them. Note that the LTP filter parameter estimation is also performed at 48kbps, 96kbps and 128kbps even though the parameters are not used to filter the audio input signal in this case.

A pitch lag with fractional sample resolution is determined, using the open-loop pitch lag and an interpolated autocorrelation. The LTP pitch lag has a minimum value of , a maximum value of and a fractional pitch resolution . Additionally, two thresholds and are used. If the pitch lag is less than , the full fractional precision is used. If the pitch lag is greater than , no fractional lag is used. For pitch lags in between, half of the fractional precision is used. These parameters depend on the bitrate and are given in the table below.

Table 20: LTP parameters vs bitrate

Bitrate | Bandwidth | LTP sampling rate | LTP frame length | pitch resolution | min. pitch lag | max. pitch lag | threshold 1 | threshold 2
9.6 kbps | NB, WB, SWB | 12.8 kHz | 256 | 4 | 29 | 231 | 154 | 121
16.4-24.4 kbps | NB | 12.8 kHz | 256 | 4 | 29 | 231 | 154 | 121
16.4-24.4 kbps | WB, SWB, FB | 16 kHz | 320 | 6 | 36 | 289 | 165 | 36
48 kbps | WB, SWB, FB | 25.6 kHz | 512 | 4 | 58 | 463 | 164 | 58
96-128 kbps | WB, SWB, FB | 32 kHz | 640 | 6 | 72 | 577 | 75 | 72

First the parameter is initialized depending on the fractional pitch resolution:

. (391)

Then the search range for the pitch lag is determined as follows:

. (392)

For the search range, autocorrelation of the weighted input signal (including the look-ahead part) is computed (note that at 48kbps, 96kbps and 128kbps, the weighted input signal is not available, so the non-weighted input signal is used instead), extended by 4 additional samples in both directions required for subsequent interpolation filtering:

. (393)

Within the search range, the index and value of the maximum correlation are determined:

. (394)

The maximum correlation value is normalized as follows:

. (395)

The fractional precision of the transmitted pitch lag is determined by the initial pitch lag , the maximum fractional resolution , and the thresholds and .

. (396)

For determining the fractional pitch lag, the autocorrelation is interpolated around the maximum value by FIR filtering:

. (397)

. (398)

. (399)

The integer and fractional parts of the refined pitch lag ( and ) are then determined by searching the maximum of the interpolated correlation:

. (400)

For transmission in the bitstream, the pitch lag is encoded to an integer index (that can be encoded with 9 bits) as follows:

. (401)

The decision whether LTP is activated is taken according to the following condition:

. (402)

where the temporal flatness measure and the maximum energy change are computed as described in clause 5.1.8.

If LTP is activated, the predicted signal is computed from the input signal (including the lookahead part) by interpolating the past input signal using a polyphase FIR filter. The polyphase index of the filter is determined by the fractional pitch lag:

. (403)

. (404)

. (405)

The LTP gain is computed from the input and predicted signals:

. (406)
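As an illustration, a least-squares gain of this kind can be computed as in the following C sketch; the exact expression and value range of equation (406) are not reproduced here, so the clipping to [0, 1] is an assumption.

#include <stddef.h>

/* Illustrative least-squares LTP gain between the input signal x and the
   predicted signal p over one frame of n samples.  The clipping range is an
   assumption; the normative expression is equation (406). */
float ltp_gain(const float *x, const float *p, size_t n)
{
    float num = 0.0f, den = 0.0f;
    for (size_t i = 0; i < n; i++) {
        num += x[i] * p[i];
        den += p[i] * p[i];
    }
    if (den <= 0.0f) {
        return 0.0f;
    }
    float g = num / den;
    if (g < 0.0f) g = 0.0f;   /* a negative quantized gain deactivates LTP anyway */
    if (g > 1.0f) g = 1.0f;
    return g;
}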

For transmission in the bitstream, the gain is quantized to an integer index (that can be encoded with 2 bits)

. (407)

The quantized gain is computed as:

. (408)

If the quantized gain is less than zero, LTP is deactivated:

. (409)

If LTP is not active, the LTP parameters are set as follows:

. (410)

The LTP filtered signal is then computed, except at 48kbps, 96kbps and 128kbps. The LTP filtered signal is computed by multiplying the predicted signal with the LTP gain and subtracting it from the input signal. To smooth parameter changes, a zero input response is added for a 5ms transition period. If LTP was not active in the previous frame, a linear fade-in is applied to the gain over a 5ms transition period.

If LTP was active in the previous frame, the zero input response is computed:

. (411)

. (412)

The zero input response is then computed by LP synthesis filtering with zero input, and applying a linear fade-out to the second half of the transition region:

. (413)

. (414)

. (415)

where the LP coefficients are obtained by converting the mid-frame LSP vector of the current frame using the algorithm described in subclause 5.1.9.7. Finally the LTP filtered signal is computed:

. (416)
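The gain application and fade-in described above can be sketched as follows in C; the zero-input-response smoothing of equations (411) to (416) is omitted from this sketch, and the transition length (5 ms at the LTP sampling rate) is passed as a parameter.

#include <stddef.h>

/* Illustrative LTP filtering: the predicted signal p, scaled by the LTP gain,
   is subtracted from the input x.  If LTP was inactive in the previous frame,
   the gain is faded in linearly over the first trans_len samples (5 ms).
   The zero-input-response smoothing used when LTP was already active is not
   shown here. */
void ltp_filter(const float *x, const float *p, float *y, size_t frame_len,
                float gain, size_t trans_len, int prev_ltp_active)
{
    for (size_t i = 0; i < frame_len; i++) {
        float g = gain;
        if (!prev_ltp_active && i < trans_len) {
            g = gain * (float)i / (float)trans_len;   /* linear fade-in */
        }
        y[i] = x[i] - g * p[i];
    }
}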

5.1.14.1.1.2 Windowing and MDCT

The LTP filtered signal is windowed using a sine-based window whose shape depends on the previous mode. If the past frame was encoded with a MDCT-based coding mode, the window is defined as

, for . (417)

, for . (418)

, for . (419)

, for . (420)

, for . (421)

If the past frame was encoded with the ACELP coding mode, the window is defined as

, for . (422)

, for . (423)

, for . (424)

, for . (425)

with , at 12.8kHz, and , at 16kHz. The total length of the window is (40ms) when the past frame was encoded with a MDCT-based coding mode and (50ms) when the past frame was encoded with the ACELP coding mode.

The windowed LTP-filtered signal is transformed with a MDCT using time domain aliasing (TDA) and a discrete cosine transform (DCT) IV as described in subclause 5.3.2.2, producing the MDCT coefficients with , is when the past frame was encoded with a MDCT-based coding mode and is when the past frame was encoded with the ACELP coding mode.

5.1.14.1.1.3 MDCT spectrum shaping

The mid-frame LSP vector of the current frame is converted into LP filter coefficients using the algorithm described in clause 5.1.9.7. The LP filter coefficients are then weighted as described in clause 5.1.10.1, producing weighted LP filter coefficients with at 12.8kHz and at 16kHz. The weighted LP filter coefficients are then transformed into the frequency domain as described in subclause 5.3.3.2.3.2. The obtained LPC gains are finally applied to the MDCT coefficients as described in subclause 5.3.3.2.3.3, producing the LPC shaped MDCT coefficients .

When the encoded bandwidth is NB, the MDCT coefficients corresponding to the frequencies above 4kHz are set to zero.

5.1.14.1.1.4 Global gain estimation

A global gain is estimated similarly to the first step described in subclause 5.3.3.2.8.1.1. The energy of each block of 4 coefficients is first computed:

. (426)

A bisection search is performed with a final resolution of 0.125dB:

Initialization: Set fac = offset = 128 and target = 500 if NB, target = 850 otherwise.

Iteration: Do the following block of operations 10 times:

1- fac = fac/2

2- offset = offset - fac

3- compute ener from the block energies and the current offset

4- if (ener > target) then offset = offset + fac

If offset <= 32, then offset = -128.

The gain is then computed from the final value of offset.
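A C sketch of this bisection search is given below. The accumulation of ener from the block energies (step 3, whose equation is not reproduced above) and the final mapping from offset to the gain are assumptions in this sketch.

/* Illustrative bisection search for the global-gain offset.  e_blk[] holds
   the energies of the blocks of 4 MDCT coefficients (equation (426)).  The
   rule used here for accumulating ener (blocks above the current offset
   contribute their excess) is an assumption. */
float estimate_gain_offset(const float *e_blk, int n_blk, int is_nb)
{
    float fac = 128.0f, offset = 128.0f;
    const float target = is_nb ? 500.0f : 850.0f;

    for (int iter = 0; iter < 10; iter++) {        /* final resolution 0.125 dB */
        fac *= 0.5f;
        offset -= fac;
        float ener = 0.0f;
        for (int k = 0; k < n_blk; k++) {          /* assumed accumulation rule */
            if (e_blk[k] > offset) {
                ener += e_blk[k] - offset;
            }
        }
        if (ener > target) {
            offset += fac;
        }
    }
    if (offset <= 32.0f) {
        offset = -128.0f;
    }
    return offset;   /* the global gain is then derived from this offset */
}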

5.1.14.1.1.5 Segmental SNR estimation of the MDCT-based technology

The estimated TCX SNR in one subframe is given by

. (427)

Finally, the estimated segmental SNR of the whole encoded TCX frame is obtained by converting the per-subframe SNRs into dB and averaging them over all subframes.
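This per-subframe-to-segmental conversion, which is also used for the ACELP estimate in subclause 5.1.14.1.2.3, can be sketched as:

#include <math.h>

/* Converts per-subframe SNR estimates (linear) to dB and averages them over
   all subframes of the frame. */
float segmental_snr_db(const float *snr, int n_subfr)
{
    float acc = 0.0f;
    for (int i = 0; i < n_subfr; i++) {
        acc += 10.0f * log10f(snr[i]);
    }
    return acc / (float)n_subfr;
}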

5.1.14.1.2 Segmental SNR estimation of the ACELP technology

The segmental SNR estimation of the ACELP technology is based on the estimated SNR of the adaptive-codebook and the estimated SNR of the innovative-codebook. This is described in detail in the following clauses.

5.1.14.1.2.1 SNR estimation of the adaptive-codebook

An integer pitch-lag per subframe is derived from the refined open-loop pitch lags (see clause 5.1.10.9).
When the sampling-rate is 12.8kHz, the number of subframes is four, and the integer pitch lags are simply equal to the refined open-loop pitch lags rounded to the nearest integer.
When the sampling-rate is 16kHz, the number of subframes is five. The refined open-loop pitch lags are first scaled by a factor of 1.25, then rounded to the nearest integer, and finally the four obtained integer pitch lags are mapped to the five subframes. The first integer pitch-lag is mapped to the first subframe, the second integer pitch-lag is mapped to the second subframe, the third integer pitch-lag is mapped to the third and fourth subframes, and the fourth integer pitch-lag is mapped to the fifth subframe.
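The mapping of the four refined open-loop pitch lags to per-subframe integer lags can be sketched as follows (function and parameter names are illustrative):

#include <math.h>

/* Derives per-subframe integer pitch lags from the four refined open-loop
   pitch lags.  At 12.8 kHz (4 subframes) the lags are rounded; at 16 kHz
   (5 subframes) they are scaled by 1.25, rounded, and the third lag is used
   for both the third and the fourth subframe. */
void map_pitch_lags(const float ol_lag[4], int sr_is_16k, int subfr_lag[5])
{
    if (!sr_is_16k) {
        for (int i = 0; i < 4; i++) {
            subfr_lag[i] = (int)lroundf(ol_lag[i]);
        }
    } else {
        int lag[4];
        for (int i = 0; i < 4; i++) {
            lag[i] = (int)lroundf(1.25f * ol_lag[i]);
        }
        subfr_lag[0] = lag[0];
        subfr_lag[1] = lag[1];
        subfr_lag[2] = lag[2];
        subfr_lag[3] = lag[2];
        subfr_lag[4] = lag[3];
    }
}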

A gain is then computed for each subframe

. (428)

The estimated SNR of the adaptive-codebook is then computed for each subframe

. (429)

5.1.14.1.2.2 SNR estimation of the innovative-codebook

The estimated SNR of the innovative-codebook is assumed to be a constant, which depends on the encoded bandwidth and on the bitrate: one value is used at 9.6kbps NB, another at 9.6kbps WB, and a third at 16.4 and 24.4kbps WB and SWB.

5.1.14.1.2.3 Segmental SNR estimation of ACELP

The estimated SNR of one ACELP encoded subframe is then computed by combining the adaptive-codebook SNR and the innovative-codebook SNR.

. (430)

Finally, the estimated segmental SNR of the whole encoded ACELP frame is obtained by converting the per-subframe SNRs into dB and averaging them over all subframes.

5.1.14.1.3 Hysteresis and final decision

The ACELP technology is selected if

. (431)

otherwise the MDCT-based technology is selected.

A hysteresis term is added to the decision in order to avoid switching back and forth too often between the two coding technologies. It is computed as described below (its value is 0 by default). Further, in the 12.8 kHz core (i.e., 9.6 kbps and 13.2 kbps), the dssnr is updated as shown in equation (433a).

. (432)

. (433)

. (433a)

where is described in clause 5.1.14.1.1.4, and are described in clause 5.1.13.6, and is described in 5.1.11.2.1.

. (434)

with is the temporal flatness measure described in clause 5.1.8, is a stability factor described in subclause 6.1.1.3.2 but using the unquantized LSF parameters estimated at 12.8kHz,is the number of consecutive previous ACELP frames (if the previous frame was not ACELP, ), is the long-term SNR as described in clause 5.1.12, is the SAD decision as described in clause 5.1.12, and indicates whether DTX is enabled or not.
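The overall decision can be sketched as below; since the inequality of equation (431) is not reproduced here, the exact placement and sign of the hysteresis offset in the comparison is an assumption.

/* Illustrative final decision between the ACELP and MDCT-based technologies:
   the two segmental SNR estimates (in dB) are compared, biased by the
   hysteresis offset dsnr (0 by default) so that the selection does not
   toggle between technologies too often. */
typedef enum { CODER_ACELP, CODER_MDCT_BASED } coder_t;

coder_t select_acelp_or_mdct(float snr_acelp_db, float snr_mdct_db, float dsnr)
{
    if (snr_acelp_db + dsnr > snr_mdct_db) {   /* assumed form of equation (431) */
        return CODER_ACELP;
    }
    return CODER_MDCT_BASED;
}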

5.1.14.2 TCX/HQ MDCT technology selection at 13.2 and 16.4 kbps

The selection between TCX and HQ MDCT (Low Rate HQ) technology at 13.2 kbps (NB, WB and SWB) and 16.4 kbps (WB and SWB) is done on a frame-by-frame basis and is based on the following measures.

– Voicing measures

– Spectral noise floor

– SAD decision

– High-band energy

– High-band sparseness (with hysteresis)

The boundaries of the frequency bands used for the TCX/HQ MDCT technology selection are set according to the following table.

Table 21: Boundaries of frequency bands for TCX/HQ MDCT (Low Rate HQ) selection

Bandwidth | Low band CLDFB | High band CLDFB | Low band FFT | High band FFT
NB | 8 | 10 | /4 | *5/16
WB | 12 | 20 | *3/8 | /2
SWB | 16 | 40 | /2 for sparseness, *3/8 otherwise | /2

The voicing measure is defined as the average of the pitch gains of the former half-frame and of the latter half-frame defined in (81),

. (432)

Sparseness measure is defined as

, (433)

where is the number of bins within the low band that satisfy the following condition:

, (434)

where is the average energy of all spectral bands.

The high energy measure is defined in terms of the CLDFB energy as

. (435)
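For illustration, the voicing and sparseness measures can be sketched as below; the bin condition and the normalization of the sparseness measure are assumptions, since equations (433) and (434) are not reproduced here, and the threshold factor thr is a placeholder.

#include <stddef.h>

/* Illustrative voicing measure: average of the two half-frame pitch gains. */
float voicing_measure(float pitch_gain_first_half, float pitch_gain_second_half)
{
    return 0.5f * (pitch_gain_first_half + pitch_gain_second_half);
}

/* Illustrative sparseness measure: fraction of low-band bins whose energy
   exceeds a multiple of the average band energy.  Both the condition and the
   normalization are assumptions. */
float sparseness_measure(const float *bin_energy, size_t n_low_band_bins,
                         float avg_band_energy, float thr)
{
    size_t count = 0;
    for (size_t i = 0; i < n_low_band_bins; i++) {
        if (bin_energy[i] > thr * avg_band_energy) {
            count++;
        }
    }
    return (float)count / (float)n_low_band_bins;
}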

The flag indicating the sparseness of the high band is set to TRUE when

, (436)

where is the number of FFT bins within and which satisfy

. (437)

Otherwise, it is set to FALSE.

The flag indicating the sparseness of the high band with hysteresis is set to TRUE when

. (438)

Otherwise, it is set to FALSE.

Additionally, it is set to TRUE when the following is satisfied:

. (439)

is the average energy computed only over the local minima of the spectrum. With the notation of subclause 5.1.11.2.5, it is defined as:

. (440)

The correlation map sum is defined in subclause 5.1.11.2.5.

The indication of possible switching is set to TRUE when the previous core was not transform coding, or when the following conditions are satisfied:

, (444)

where and are and in the previous frame. Note that is an integer from -1 to 2, while the others are all Boolean.

The indication of preference for TCX is set to TRUE when the following conditions are satisfied:

(445)

The indication of preference for HQ MDCT is set to TRUE when the following conditions are satisfied:

, (441)

where transient_frame is the output of the time-domain transient detector (see 5.1.8). For 16.4 kbps, is set to FALSE and to TRUE when transient_frame is detected.

Based on the above definitions and the thresholds listed in the table below, switching between HQ MDCT and TCX is carried out as follows. Switching between HQ and TCX can only occur when is TRUE. In this case, TCX is used if is TRUE; otherwise HQ is used if is TRUE. In any other case, the same kind of transform coding is applied as in the previous frame. If the previous frame was not coded by transform coding, HQ is used at the low rate (13.2 kbps) and TCX at the high rate (16.4 kbps).

In case the input signal is noisy speech (noisy_speech_flag == TRUE && vadflag == FALSE), the transition from TCX to HQ is prohibited at 16.4 kbps.

is reset to 0 if is FALSE; otherwise it is incremented by one (with a maximum allowed value of 2).

and are reset to FALSE and -1, respectively, upon encoder initialization or when a non-transform-coded frame is encountered.
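The frame-wise switching logic described above can be summarized by the following C sketch (type and flag names are illustrative; noisy_speech stands for noisy_speech_flag == TRUE && vadflag == FALSE):

#include <stdbool.h>

typedef enum { XFORM_TCX, XFORM_HQ_MDCT } xform_t;

/* Illustrative TCX / HQ MDCT selection at 13.2 and 16.4 kbps.
   prev_transform_coded indicates whether the previous frame used transform
   coding; prev is its technology in that case. */
xform_t select_tcx_or_hq(bool switching_allowed, bool prefer_tcx, bool prefer_hq,
                         bool prev_transform_coded, xform_t prev,
                         bool is_16k4, bool noisy_speech)
{
    if (!prev_transform_coded) {
        /* previous frame was not transform coded: HQ at 13.2 kbps, TCX at 16.4 kbps */
        return is_16k4 ? XFORM_TCX : XFORM_HQ_MDCT;
    }
    if (switching_allowed) {
        if (prefer_tcx) {
            return XFORM_TCX;
        }
        if (prefer_hq) {
            /* at 16.4 kbps the TCX -> HQ transition is prohibited for noisy speech */
            if (is_16k4 && noisy_speech && prev == XFORM_TCX) {
                return XFORM_TCX;
            }
            return XFORM_HQ_MDCT;
        }
    }
    return prev;   /* otherwise keep the same kind of transform coding */
}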

Table 22: List of thresholds used in TCX/HQ MDCT (Low Rate HQ) selection

Parameter | Meaning | 13.2 kbps | 16.4 kbps
SIG_LO_LEVEL_THR | Low level signal | 22.5 | 23.5
SIG_HI_LEVEL_THR | High level signal | 28.0 | 19.0
COR_THR | Correlation | 80.0 | 62.5
VOICING_THR | Voicing | 0.6 | 0.4
SPARSENESS_THR | Sparseness | 0.65 | 0.4
HI_ENER_LO_THR | High energy low limit | 9.5 | 12.5
HYST_FAC | Hysteresis control | 0.8 | 0.8
MDCT_SW_SIG_LINE_THR | Significant spectrum | 2.85 | 2.85
MDCT_SW_SIG_PEAK_THR | Significant peak | 36.0 | 36.0

5.1.14.3 TCX/HQ MDCT technology selection at 24.4 and 32 kbps

The decision between using the TCX technology or the HQ MDCT (high rate HQ) technology at 24.4 kbps and 32 kbps for SWB signals is based on the average energy values and peak-to-average ratios of different sub-bands. The average energy values and peak-to-average ratios are calculated from the CLDFB band energy analysis , the spectral analysis and the bit-rate.

First, the average energies of the three CLDFB sub-bands 0~3.2kHz, 3.2~6.4kHz and 6.4~9.6kHz are calculated according to

(442)

Second, the spectral peak and spectral average of the FFT sub-bands 1~2.6kHz and 4.8~6.4kHz are calculated according to

(443)

At 24.4kbps, the CLDFB sub-band (4.8~9.6kHz) average energy and the CLDFB sub-band (400Hz~3.2kHz) average energy are also calculated according to

(444)

The peak energy and average energy of the CLDFB sub-band (8~10kHz) are also calculated according to

(445)

To identify the MDCT coding mode, three conditions are evaluated:

Condition I:

(446)

Condition II:

(447)

Condition III:

(448)

The primary classifier decision at 24.4kbps is formed according to

(449)

At 32kbps, further spectral analysis is needed. First, a noise-floor envelope and a peak envelope are calculated as

(455)

and

(456)

respectively, where the smoothing factors and depend on the instantaneous magnitude spectrum

(450)

(451)

The noise-floor energy and the peak envelope energy are formed by averaging the noise-floor and peak envelopes, respectively. That is,

(452)

(453)

Spectral peaks are identified in two steps. First, all bins for which holds true are marked as peak candidates. Second, for each sequence of consecutive peak candidates, the largest spectral magnitude is kept as the peak representative for that sequence. The peak sparseness measure is formed by averaging the distances between the peak representatives, with if fewer than 2 peaks are identified. Two decision variables are formed

(454)
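The peak picking and the peak sparseness measure can be sketched as follows; the candidate condition (magnitude above the peak envelope) and the fallback value used when fewer than 2 peaks are found are assumptions in this sketch.

#include <stddef.h>

/* Illustrative peak sparseness: bins whose magnitude exceeds the peak
   envelope are peak candidates (assumed condition), each run of consecutive
   candidates is represented by its largest bin, and the measure is the
   average distance between consecutive representatives.  'fallback' is a
   placeholder for the value used when fewer than 2 peaks are identified. */
float peak_sparseness(const float *mag, const float *peak_env, size_t n, float fallback)
{
    size_t n_peaks = 0, prev_peak = 0, dist_sum = 0, best = 0;
    int in_run = 0;

    for (size_t k = 0; k <= n; k++) {
        int is_cand = (k < n) && (mag[k] > peak_env[k]);
        if (is_cand) {
            if (!in_run || mag[k] > mag[best]) {
                best = k;                 /* track the maximum of the current run */
            }
            in_run = 1;
        } else if (in_run) {
            if (n_peaks > 0) {
                dist_sum += best - prev_peak;
            }
            prev_peak = best;             /* close the run, keep its representative */
            n_peaks++;
            in_run = 0;
        }
    }
    return (n_peaks < 2) ? fallback : (float)dist_sum / (float)(n_peaks - 1);
}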

The peak energy and average energy of the CLDFB sub-band at 10~12 kHz are calculated according to

(455)

Three conditions are then checked.

Condition I:

(456)

Condition II:

(457)

Condition III:

(465)

The primary classifier decision at 32kbps is formed according to

(466)

To increase the classifier stability for both 24.4kbps and 32kbps, the primary classifier decision is low-pass filtered from frame to frame.

(467)

Finally, hysteresis is applied such that the classifier decision from the previous frame is only changed if the decision passes the switching range

(468)

If none of these conditions is met, the previous classifier decision is kept and the buffers are updated as follows

(469)
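A C sketch of the decision smoothing and hysteresis is given below; the smoothing factor and the switching thresholds are placeholders, since the constants of equations (467) and (468) are not reproduced here.

/* Illustrative low-pass filtering of the primary classifier decision with
   hysteresis on the switching.  alpha, thr_up and thr_down are placeholders. */
typedef struct {
    float dec_lp;    /* low-pass filtered primary decision */
    int   use_hq;    /* 1 = HQ MDCT, 0 = TCX (illustrative convention) */
} mdct_classifier_t;

int classify_with_hysteresis(mdct_classifier_t *st, float primary_decision,
                             float alpha, float thr_up, float thr_down)
{
    /* low-pass filter the primary decision from frame to frame */
    st->dec_lp = alpha * st->dec_lp + (1.0f - alpha) * primary_decision;

    /* change the previous decision only when the smoothed value passes the
       switching range */
    if (st->use_hq == 0 && st->dec_lp > thr_up) {
        st->use_hq = 1;
    } else if (st->use_hq == 1 && st->dec_lp < thr_down) {
        st->use_hq = 0;
    }
    return st->use_hq;
}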

5.1.14.4 TD/Multi-mode FD BWE technology selection at 13.2 kbps and 32 kbps

The input WB or SWB signal is divided into a low band signal and a high band signal (wideband input) or a super higher band signal (super wideband input). Firstly, the low band signal is classified based on its characteristics and coded by either the LP-based approach or the transform-domain approach.

The selection between the TD BWE and multi-mode FD BWE technologies for the super higher band signal or the high band signal at 13.2 kbps (WB and SWB) and 32 kbps (SWB) is performed based on the characteristics of the input signal and the coding modes of the low band signal. Except for MDCT mode, if the input signal is classified as a music signal, the high band signal or the super higher band signal is encoded by multi-mode FD BWE; if the input signal is classified as a speech signal, the high band signal or the super higher band signal is encoded by TD BWE. In the case that the low band signal is classified as IC mode, the high band signal or the super higher band signal is also encoded by multi-mode FD BWE.

If the decision in the first stage of the speech/music classifier indicates that the input signal is a music signal, or the decision in the first stage and the decision in the second stage of the speech/music classifier so indicate, or the low band signal is classified as IC mode, the high band or the super higher band signal is encoded by multi-mode FD BWE; otherwise, the high band or super higher band signal is encoded by TD BWE. It is noted that when the super wideband noisy speech flag is set, the super higher band is encoded by TD BWE. The same TD/multi-mode FD BWE technology selection is used for FB inputs.
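The selection described in this subclause can be summarized by the following C sketch (flag names are illustrative; is_music stands for the relevant speech/music classifier decisions):

#include <stdbool.h>

typedef enum { BWE_TD, BWE_FD_MULTI_MODE } bwe_t;

/* Illustrative TD / multi-mode FD BWE selection for the high band or super
   higher band at 13.2 kbps (WB, SWB) and 32 kbps (SWB). */
bwe_t select_bwe(bool is_music, bool lowband_ic_mode, bool swb_noisy_speech)
{
    if (swb_noisy_speech) {
        return BWE_TD;              /* super wideband noisy speech: TD BWE */
    }
    if (is_music || lowband_ic_mode) {
        return BWE_FD_MULTI_MODE;   /* music content or IC low band: FD BWE */
    }
    return BWE_TD;                  /* speech content: TD BWE */
}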