4 Outline description

3GPP TS 06.60 Enhanced full rate speech transcoding

The present document is structured as follows.

Subclause 4.1 contains a functional description of the audio parts including the A/D and D/A functions. Subclause 4.2 describes the conversion between 13‑bit uniform and 8‑bit A‑law or μ‑law (PCS 1900) samples. Subclauses 4.3 and 4.4 present a simplified description of the principles of the GSM EFR encoding and decoding process respectively. In subclause 4.5, the sequence and subjective importance of encoded parameters are given.

Clause 5 presents the functional description of the GSM EFR encoding, whereas clause 6 describes the decoding procedures. Clause 7 describes variables, constants and tables of the C‑code of the GSM EFR codec.

4.1 Functional description of audio parts

The analogue‑to‑digital and digital‑to‑analogue conversion will in principle comprise the following elements:

1) analogue to uniform digital PCM:

‑ microphone;

‑ input level adjustment device;

‑ input anti‑aliasing filter;

‑ sample‑hold device sampling at 8 kHz;

‑ analogue‑to‑uniform digital conversion to 13‑bit representation.

The uniform format shall be represented in two’s complement.

2) uniform digital PCM to analogue:

‑ conversion from 13‑bit/8 kHz uniform PCM to analogue;

‑ a hold device;

‑ reconstruction filter including x/sin( x ) correction;

‑ output level adjustment device;

‑ earphone or loudspeaker.

In the terminal equipment, the A/D function may be achieved either:

‑ by direct conversion to 13‑bit uniform PCM format;

‑ or by conversion to 8‑bit/A‑law or μ‑law (PCS 1900) companded format, based on a standard A‑law or μ‑law (PCS 1900) codec/filter according to ITU‑T Recommendations G.711 [8] and G.714, followed by the 8‑bit to 13‑bit conversion as specified in clause 4.2.1.

For the D/A operation, the inverse operations take place.

In the latter case it should be noted that the specifications in ITU‑T G.714 (superseded by G.712) are concerned with PCM equipment located in the central parts of the network. When used in the terminal equipment, the present document does not on its own ensure sufficient out‑of‑band attenuation. The specification of out‑of‑band signals is defined in GSM 03.50 [2] in clause 2.

4.2 Preparation of speech samples

The encoder is fed with data comprising samples with a resolution of 13 bits left justified in a 16‑bit word. The three least significant bits are set to ‘0’. The decoder outputs data in the same format. Outside the speech codec further processing must be applied if the traffic data occurs in a different representation.
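The word format described above can be illustrated with a minimal sketch. The function names `to_codec_word` and `from_codec_word` are hypothetical and are not taken from the C‑code of the codec; the sketch only assumes a signed 13‑bit sample in two's complement.

```python
def to_codec_word(sample13: int) -> int:
    """Left-justify a signed 13-bit sample in a 16-bit two's-complement word."""
    if not -4096 <= sample13 <= 4095:
        raise ValueError("sample does not fit in 13 bits")
    # Shifting by three places leaves the three least significant bits at 0.
    return (sample13 << 3) & 0xFFFF


def from_codec_word(word16: int) -> int:
    """Recover the signed 13-bit sample from a 16-bit codec word."""
    if not 0 <= word16 <= 0xFFFF:
        raise ValueError("not a 16-bit word")
    if word16 & 0x0007:
        raise ValueError("the three least significant bits must be 0")
    # Interpret the word as two's complement before shifting back.
    signed = word16 - 0x10000 if word16 & 0x8000 else word16
    return signed >> 3
```

For example, the sample value 1 is carried as the word 0x0008 and the sample value -1 as 0xFFF8.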

4.2.1 PCM format conversion

The conversion between 8‑bit A‑law or μ‑law (PCS 1900) compressed data and linear data with 13‑bit resolution at the speech encoder input shall be as defined in ITU‑T Rec. G.711 [8].

ITU‑T Recommendation G.711 [8] specifies the A‑law or μ‑law (PCS 1900) to linear conversion and vice versa by providing table entries. Examples of how to perform the conversion by fixed‑point arithmetic can be found in ITU‑T Recommendation G.726 [9]. Subclause 4.2.1 of G.726 [9] describes A‑law and μ‑law (PCS 1900) to linear expansion and subclause 4.2.7 of G.726 [9] provides a solution for linear to A‑law and μ‑law (PCS 1900) compression.
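As an illustration only, the A‑law expansion can be sketched with integer arithmetic as follows. This is a widely used segment-based formulation and not the normative table of G.711 [8]; the function name `alaw_to_linear` is hypothetical, and the μ‑law case is not covered. The result is already left justified in a 16‑bit range, so every output value is a multiple of 8.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a linear value, left justified in 16 bits."""
    if not 0 <= code <= 0xFF:
        raise ValueError("A-law code must be an 8-bit value")
    code ^= 0x55                      # undo the even-bit inversion
    mantissa = code & 0x0F            # 4-bit position inside the segment
    segment = (code & 0x70) >> 4      # 3-bit segment number
    magnitude = mantissa << 4
    if segment == 0:
        magnitude += 8                # half-step offset in the lowest segment
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law a set sign bit denotes a positive sample.
    return magnitude if code & 0x80 else -magnitude
```

Under these assumptions the smallest magnitudes are +8 and -8, and the largest are +32256 and -32256, which correspond to 1 and 4032 in 13‑bit resolution.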

4.3 Principles of the GSM enhanced full rate speech encoder

The codec is based on the code‑excited linear predictive (CELP) coding model. A 10th order linear prediction (LP), or short‑term, synthesis filter is used which is given by:

$$H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{m} \hat{a}_i z^{-i}} \qquad (1)$$

where $\hat{a}_i$, $i = 1, \ldots, m$, are the (quantified) linear prediction (LP) parameters, and $m = 10$ is the predictor order. The long‑term, or pitch, synthesis filter is given by:

$$\frac{1}{B(z)} = \frac{1}{1 - g_p z^{-T}} \qquad (2)$$

where $T$ is the pitch delay and $g_p$ is the pitch gain. The pitch synthesis filter is implemented using the so‑called adaptive codebook approach.
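The two synthesis filters can be written as difference equations. The following is a minimal floating-point sketch, assuming the sign convention Â(z) = 1 + Σ â_i z^(-i) for the short‑term filter, the form 1/(1 - g_p z^(-T)) for the pitch filter, an integer pitch delay and zero initial filter states. The function names are hypothetical; the codec itself uses fixed-point arithmetic and the adaptive codebook approach rather than a direct recursive filter.

```python
def lp_synthesis(excitation, a_hat):
    """All-pole filter 1/A^(z): y[n] = u[n] - sum_i a_hat[i-1] * y[n-i]."""
    out = []
    for n, u in enumerate(excitation):
        acc = u
        for i, a in enumerate(a_hat, start=1):
            if n - i >= 0:            # zero initial state before n = 0
                acc -= a * out[n - i]
        out.append(acc)
    return out


def pitch_synthesis(innovation, pitch_delay, pitch_gain):
    """Filter 1/(1 - g_p z^-T): y[n] = c[n] + g_p * y[n - T], integer T only."""
    if pitch_delay < 1:
        raise ValueError("pitch delay must be a positive integer")
    out = []
    for n, c in enumerate(innovation):
        past = out[n - pitch_delay] if n >= pitch_delay else 0.0
        out.append(c + pitch_gain * past)
    return out
```

With a single coefficient â_1 = -0.5, a unit impulse yields the decaying response 1, 0.5, 0.25, ... from the short‑term filter; with T = 2 and g_p = 0.5 the pitch filter repeats the impulse every two samples at half the amplitude.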

The CELP speech synthesis model is shown in figure 2. In this model, the excitation signal at the input of the short‑term LP synthesis filter is constructed by adding two excitation vectors from adaptive and fixed (innovative) codebooks. The speech is synthesized by feeding the two properly chosen vectors from these codebooks through the short‑term synthesis filter. The optimum excitation sequence in a codebook is chosen using an analysis‑by‑synthesis search procedure in which the error between the original and synthesized speech is minimized according to a perceptually weighted distortion measure.

The perceptual weighting filter used in the analysis‑by‑synthesis search technique is given by:

$$W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} \qquad (3)$$

where $A(z)$ is the unquantized LP filter and $0 < \gamma_2 < \gamma_1 \le 1$ are the perceptual weighting factors. The values $\gamma_1 = 0.9$ and $\gamma_2 = 0.6$ are used. The weighting filter uses the unquantized LP parameters while the formant synthesis filter uses the quantified ones.
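A minimal floating-point sketch of the weighting filter W(z) = A(z/γ1)/A(z/γ2) is given below, assuming A(z) = 1 + Σ a_i z^(-i), default factors γ1 = 0.9 and γ2 = 0.6, and zero initial filter states. Replacing z by z/γ scales the i-th coefficient by γ^i. The function names are hypothetical and the sketch does not reproduce the fixed-point arithmetic of the codec.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i becomes a_i * gamma**i for i = 1..m."""
    return [coef * gamma ** i for i, coef in enumerate(a, start=1)]


def perceptual_weighting(speech, a, gamma1=0.9, gamma2=0.6):
    """Filter speech through A(z/gamma1) / A(z/gamma2) with zero initial state."""
    num = bandwidth_expand(a, gamma1)   # zeros of W(z)
    den = bandwidth_expand(a, gamma2)   # poles of W(z)
    out = []
    for n, s in enumerate(speech):
        acc = s
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc += num[i - 1] * speech[n - i] - den[i - 1] * out[n - i]
        out.append(acc)
    return out
```

When both factors are equal the numerator and denominator cancel and the filter leaves the signal unchanged, which is a convenient sanity check.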

The coder operates on speech frames of 20 ms corresponding to 160 samples at the sampling frequency of 8 000 samples/s. For every 160 speech samples, the speech signal is analysed to extract the parameters of the CELP model (LP filter coefficients, adaptive and fixed codebooks’ indices and gains). These parameters are encoded and transmitted. At the decoder, these parameters are decoded and speech is synthesized by filtering the reconstructed excitation signal through the LP synthesis filter.

The signal flow at the encoder is shown in figure 3. LP analysis is performed twice per frame. The two sets of LP parameters are converted to line spectrum pairs (LSP) and jointly quantified using split matrix quantization (SMQ) with 38 bits. The speech frame is divided into 4 subframes of 5 ms each (40 samples). The adaptive and fixed codebook parameters are transmitted every subframe. The two sets of quantified and unquantized LP filters are used for the second and fourth subframes while in the first and third subframes interpolated LP filters are used (both quantified and unquantized). An open‑loop pitch lag is estimated twice per frame (every 10 ms) based on the perceptually weighted speech signal.
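The framing described above can be sketched as follows. The constant and function names are hypothetical; the sketch only illustrates the arithmetic of 20 ms frames and 5 ms subframes at 8 000 samples/s, and it assumes that an incomplete trailing frame is simply dropped.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
SUBFRAMES_PER_FRAME = 4

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000        # samples per frame
SUBFRAME_LEN = FRAME_LEN // SUBFRAMES_PER_FRAME      # samples per subframe


def split_frames(samples):
    """Cut a sample sequence into complete frames; a partial tail is dropped."""
    count = len(samples) // FRAME_LEN
    return [samples[k * FRAME_LEN:(k + 1) * FRAME_LEN] for k in range(count)]


def split_subframes(frame):
    """Cut one complete frame into its subframes."""
    if len(frame) != FRAME_LEN:
        raise ValueError("expected a complete frame")
    return [frame[k * SUBFRAME_LEN:(k + 1) * SUBFRAME_LEN]
            for k in range(SUBFRAMES_PER_FRAME)]
```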

Then the following operations are repeated for each subframe:

‑ The target signal $x(n)$ is computed by filtering the LP residual through the weighted synthesis filter $W(z)H(z)$ with the initial states of the filters having been updated by filtering the error between LP residual and excitation (this is equivalent to the common approach of subtracting the zero input response of the weighted synthesis filter from the weighted speech signal).

‑ The impulse response, $h(n)$, of the weighted synthesis filter is computed.

‑ Closed‑loop pitch analysis is then performed (to find the pitch lag and gain), using the target $x(n)$ and impulse response $h(n)$, by searching around the open‑loop pitch lag. Fractional pitch with 1/6th of a sample resolution is used. The pitch lag is encoded with 9 bits in the first and third subframes and relatively encoded with 6 bits in the second and fourth subframes.

‑ The target signal $x(n)$ is updated by removing the adaptive codebook contribution (filtered adaptive codevector), and this new target, $x_2(n)$, is used in the fixed algebraic codebook search (to find the optimum innovation). An algebraic codebook with 35 bits is used for the innovative excitation.

‑ The gains of the adaptive and fixed codebook are scalar quantified with 4 and 5 bits respectively (with moving average (MA) prediction applied to the fixed codebook gain).

‑ Finally, the filter memories are updated (using the determined excitation signal) for finding the target signal in the next subframe.
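The target update between the adaptive and the fixed codebook search can be illustrated with a minimal sketch. It assumes that the adaptive codevector is filtered by convolution with the impulse response of the weighted synthesis filter, truncated to the subframe length, and that the pitch gain is the unconstrained least-squares gain. Any bounding and quantization of the gain in the normative procedure is deliberately omitted, and all function names are hypothetical.

```python
def convolve_truncated(vec, h):
    """y[n] = sum_{k=0..n} vec[k] * h[n-k], truncated to len(vec) samples."""
    if len(h) < len(vec):
        raise ValueError("impulse response shorter than the subframe")
    return [sum(vec[k] * h[n - k] for k in range(n + 1))
            for n in range(len(vec))]


def least_squares_gain(target, filtered):
    """Gain g minimising sum (target[n] - g * filtered[n])**2."""
    energy = sum(y * y for y in filtered)
    if energy == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(target, filtered)) / energy


def update_target(target, filtered, gain):
    """Remove the scaled, filtered adaptive codevector from the target."""
    return [x - gain * y for x, y in zip(target, filtered)]
```

If the target happens to be an exact multiple of the filtered adaptive codevector, the gain equals that multiple and the updated target is zero, leaving nothing for the fixed codebook to model.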

The bit allocation of the codec is shown in table 1. In each 20 ms speech frame, 244 bits are produced, corresponding to a bit rate of 12.2 kbit/s. More detailed bit allocation is available in table 6. Note that the most significant bits (MSB) are always sent first.

Table 1: Bit allocation of the 12.2 kbit/s coding algorithm for 20 ms frame

| Parameter | 1st & 3rd subframes | 2nd & 4th subframes | Total per frame |
|---|---|---|---|
| 2 LSP sets | | | 38 |
| Pitch delay | 9 | 6 | 30 |
| Pitch gain | 4 | 4 | 16 |
| Algebraic code | 35 | 35 | 140 |
| Codebook gain | 5 | 5 | 20 |
| Total | | | 244 |
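The totals of table 1 can be cross-checked with a short sketch. The dictionary layout and the names are hypothetical; the bit counts are those of table 1.

```python
LSP_BITS_PER_FRAME = 38

# Bits per subframe, in the order 1st, 2nd, 3rd, 4th subframe.
BITS_PER_SUBFRAME = {
    "pitch_delay": (9, 6, 9, 6),
    "pitch_gain": (4, 4, 4, 4),
    "algebraic_code": (35, 35, 35, 35),
    "codebook_gain": (5, 5, 5, 5),
}


def bits_per_frame():
    """Total number of bits produced for one 20 ms speech frame."""
    return LSP_BITS_PER_FRAME + sum(sum(v) for v in BITS_PER_SUBFRAME.values())


def bit_rate_kbit_per_s(bits, frame_ms=20):
    """Bits per millisecond, which equals the bit rate in kbit/s."""
    return bits / frame_ms
```

The per-frame sums are 30, 16, 140 and 20 bits for the four subframe parameters, which together with the 38 LSP bits give 244 bits per frame, that is 12.2 kbit/s.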

4.4 Principles of the GSM enhanced full rate speech decoder

The signal flow at the decoder is shown in figure 4. At the decoder, the transmitted indices are extracted from the received bitstream. The indices are decoded to obtain the coder parameters at each transmission frame. These parameters are the two LSP vectors, the 4 fractional pitch lags, the 4 innovative codevectors, and the 4 sets of pitch and innovative gains. The LSP vectors are converted to the LP filter coefficients and interpolated to obtain LP filters at each subframe. Then, at each 40‑sample subframe:

‑ the excitation is constructed by adding the adaptive and innovative codevectors scaled by their respective gains;

‑ the speech is reconstructed by filtering the excitation through the LP synthesis filter.

Finally, the reconstructed speech signal is passed through an adaptive postfilter.
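The per-subframe decoder steps can be sketched as follows. This is a simplified floating-point illustration under several assumptions: the pitch lag is an integer (the 1/6 sample interpolation is omitted), the adaptive codevector is a delayed copy of the past excitation that is extended periodically when the lag is shorter than the subframe, the LP filter uses the convention Â(z) = 1 + Σ â_i z^(-i), and the adaptive postfilter is left out. All names are hypothetical.

```python
def adaptive_codevector(past_excitation, lag, length):
    """Delayed copy of the excitation, repeated periodically if lag < length."""
    if not 1 <= lag <= len(past_excitation):
        raise ValueError("lag must lie within the excitation history")
    vec = []
    for n in range(length):
        idx = n - lag
        # Negative idx reaches back into the history, otherwise reuse vec.
        vec.append(past_excitation[idx] if idx < 0 else vec[idx])
    return vec


def decode_subframe(past_excitation, lag, pitch_gain, code, code_gain,
                    a_hat, synth_memory):
    """Return (speech, excitation, new synthesis memory) for one subframe."""
    if len(synth_memory) < len(a_hat):
        raise ValueError("synthesis memory shorter than the filter order")
    adaptive = adaptive_codevector(past_excitation, lag, len(code))
    excitation = [pitch_gain * v + code_gain * c
                  for v, c in zip(adaptive, code)]
    history = list(synth_memory)          # previous outputs, most recent last
    speech = []
    for u in excitation:
        acc = u - sum(a * history[-i] for i, a in enumerate(a_hat, start=1))
        history.append(acc)
        speech.append(acc)
    return speech, excitation, history[len(history) - len(a_hat):]
```

A caller would append the returned excitation to its excitation history and pass the returned memory to the next subframe, so that both the adaptive codebook and the synthesis filter carry their state across subframe boundaries.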

4.5 Sequence and subjective importance of encoded parameters

The encoder will produce the output information in a unique sequence and format, and the decoder must receive the same information in the same way. In table 6, the sequence of output bits s1 to s244 and the bit allocation for each parameter is shown.

The different parameters of the encoded speech and their individual bits have unequal importance with respect to subjective quality. Before being submitted to the channel encoding function the bits have to be rearranged in the sequence of importance as given in table 6 in GSM 05.03 [3].
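The rearrangement is a fixed permutation of the 244 encoder output bits. The sketch below shows only the mechanism; the ordering table must be taken from GSM 05.03 [3], and the function names and the `order` argument are hypothetical. Here `order[k]` is assumed to hold the 1-based index j of the encoder bit s_j that is placed at position k of the reordered sequence.

```python
def reorder_bits(bits, order):
    """Place encoder bit s_j (1-based j = order[k]) at output position k."""
    if sorted(order) != list(range(1, len(bits) + 1)):
        raise ValueError("order must be a permutation of 1..len(bits)")
    return [bits[j - 1] for j in order]


def restore_bits(reordered, order):
    """Inverse of reorder_bits: recover the encoder sequence s_1..s_N."""
    if sorted(order) != list(range(1, len(reordered) + 1)):
        raise ValueError("order must be a permutation of 1..len(reordered)")
    bits = [0] * len(order)
    for k, j in enumerate(order):
        bits[j - 1] = reordered[k]
    return bits
```

Applying `restore_bits` after `reorder_bits` with the same table returns the original sequence, which is what the receiving side relies on before parameter decoding.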