4 Outline description

3GPP TS 06.90: Adaptive Multi-Rate speech transcoding

The present document is structured as follows:

Section 4.1 contains a functional description of the audio parts including the A/D and D/A functions. Section 4.2 describes the conversion between 13‑bit uniform and 8‑bit A‑law or μ‑law samples. Sections 4.3 and 4.4 present a simplified description of the principles of the AMR codec encoding and decoding process respectively. In subclause 4.5, the sequence and subjective importance of encoded parameters are given.

Section 5 presents the functional description of the AMR codec encoding, whereas clause 6 describes the decoding procedures. In section 7, the detailed bit allocation of the AMR codec is tabulated.

4.1 Functional description of audio parts

The analogue‑to‑digital and digital‑to‑analogue conversion will in principle comprise the following elements:

1) Analogue to uniform digital PCM

‑ microphone;

‑ input level adjustment device;

‑ input anti‑aliasing filter;

‑ sample‑hold device sampling at 8 kHz;

‑ analogue‑to‑uniform digital conversion to 13‑bit representation.

The uniform format shall be represented in two’s complement.

2) Uniform digital PCM to analogue

‑ conversion from 13‑bit/8 kHz uniform PCM to analogue;

‑ a hold device;

‑ reconstruction filter including x/sin( x ) correction;

‑ output level adjustment device;

‑ earphone or loudspeaker.

In the terminal equipment, the A/D function may be achieved either

‑ by direct conversion to 13‑bit uniform PCM format;

‑ or by conversion to 8‑bit A‑law or μ‑law companded format, based on a standard A‑law or μ‑law codec/filter according to ITU‑T Recommendations G.711 [8] and G.714, followed by the 8‑bit to 13‑bit conversion as specified in subclause 4.2.1.

For the D/A operation, the inverse operations take place.

In the latter case it should be noted that the specifications in ITU‑T G.714 (superseded by G.712) are concerned with PCM equipment located in the central parts of the network. When the codec is used in the terminal equipment, the present document does not on its own ensure sufficient out‑of‑band attenuation; the specification of out‑of‑band signals is defined in clause 2 of GSM 03.50 [2].

4.2 Preparation of speech samples

The encoder is fed with data comprising samples with a resolution of 13 bits left justified in a 16‑bit word. The three least significant bits are set to ‘0’. The decoder outputs data in the same format. If the traffic data occurs in a different representation, further processing must be applied outside the speech codec.
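For illustration, the packing described above can be sketched as follows (the helper names are hypothetical, not part of the specification):

```python
def to_codec_format(sample13):
    """Left-justify a 13-bit two's-complement sample in a 16-bit word.

    The three least significant bits of the result are zero, as
    required above.  `sample13` must lie in [-4096, 4095].
    """
    assert -4096 <= sample13 <= 4095
    return (sample13 << 3) & 0xFFFF   # 16-bit two's-complement pattern

def from_codec_format(word16):
    """Recover the 13-bit sample (sign-extended) from a 16-bit word."""
    value = word16 & 0xFFF8           # mask the three zero LSBs
    if value & 0x8000:                # sign bit set -> negative
        value -= 0x10000
    return value >> 3
```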

4.2.1 PCM format conversion

The conversion between 8‑bit A‑law or μ‑law compressed data and linear data with 13‑bit resolution at the speech encoder input shall be as defined in ITU‑T Rec. G.711 [8].

ITU‑T Rec. G.711 [8] specifies the A‑law or μ‑law to linear conversion and vice versa by providing table entries. Examples of how to perform the conversion by fixed‑point arithmetic can be found in ITU‑T Rec. G.726 [9]. Section 4.2.1 of G.726 [9] describes A‑law or μ‑law to linear expansion and subclause 4.2.8 of G.726 [9] provides a solution for linear to A‑law or μ‑law compression.
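For illustration, a common fixed‑point rendering of the A‑law expansion is sketched below (in the style of the classic public g711.c reference code); the normative mapping is the table in ITU‑T G.711 [8], and the function name is hypothetical. Conveniently, the usual 16‑bit‑scaled output equals the 13‑bit linear value left‑justified by three bits, i.e. exactly the codec input format of subclause 4.2:

```python
def alaw_expand(code):
    """Expand one 8-bit A-law code word to a linear sample.

    A sketch of the usual fixed-point method, not the normative G.711
    table.  The result is the 13-bit linear value left-justified in
    16 bits (i.e. multiplied by 8).
    """
    code ^= 0x55                      # undo the even-bit inversion
    sign = code & 0x80                # sign bit: 1 means positive
    exponent = (code >> 4) & 0x07     # segment number
    mantissa = code & 0x0F
    magnitude = (mantissa << 4) + 8   # mid-riser offset within the segment
    if exponent >= 1:
        magnitude += 0x100
    if exponent > 1:
        magnitude <<= exponent - 1
    return magnitude if sign else -magnitude
```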

4.3 Principles of the GSM adaptive multi-rate speech encoder

The AMR codec uses eight source codecs with bit-rates of 12.2, 10.2, 7.95, 7.40, 6.70, 5.90, 5.15 and 4.75 kbit/s.

The codec is based on the code‑excited linear predictive (CELP) coding model. A 10th order linear prediction (LP), or short‑term, synthesis filter is used which is given by:

H(z) = \frac{1}{\hat{A}(z)} = \frac{1}{1 + \sum_{i=1}^{m} \hat{a}_i z^{-i}} ,  (1)

where \hat{a}_i, i = 1, \dots, m, are the (quantized) linear prediction (LP) parameters, and m = 10 is the predictor order. The long‑term, or pitch, synthesis filter is given by:

\frac{1}{B(z)} = \frac{1}{1 - g_p z^{-T}} ,  (2)

where T is the pitch delay and g_p is the pitch gain. The pitch synthesis filter is implemented using the so‑called adaptive codebook approach.

The CELP speech synthesis model is shown in figure 2. In this model, the excitation signal at the input of the short‑term LP synthesis filter is constructed by adding two excitation vectors from adaptive and fixed (innovative) codebooks. The speech is synthesized by feeding the two properly chosen vectors from these codebooks through the short‑term synthesis filter. The optimum excitation sequence in a codebook is chosen using an analysis‑by‑synthesis search procedure in which the error between the original and synthesized speech is minimized according to a perceptually weighted distortion measure.
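For illustration, the synthesis model above can be sketched as follows (a toy rendering, not the bit‑exact codec routines; the function names are illustrative):

```python
def celp_excitation(adaptive, fixed, g_p, g_c):
    """Sum of the gain-scaled adaptive and fixed (innovative) codevectors."""
    return [g_p * v + g_c * c for v, c in zip(adaptive, fixed)]

def lp_synthesis(excitation, a, memory):
    """Filter an excitation through the short-term synthesis filter 1/A(z).

    `a` holds the LP coefficients a_1..a_m of
    A(z) = 1 + a_1 z^-1 + ... + a_m z^-m, and `memory` holds the last m
    output samples, most recent first.
    """
    out = []
    mem = list(memory)
    for x in excitation:
        y = x - sum(ai * mi for ai, mi in zip(a, mem))
        out.append(y)
        mem = [y] + mem[:-1]      # shift the filter memory
    return out
```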

The perceptual weighting filter used in the analysis‑by‑synthesis search technique is given by:

W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} ,  (3)

where A(z) is the unquantized LP filter and \gamma_1, \gamma_2 are the perceptual weighting factors. The values \gamma_1 = 0.9 (for the 12.2 and 10.2 kbit/s modes) or \gamma_1 = 0.94 (for all other modes) and \gamma_2 = 0.6 are used. The weighting filter uses the unquantized LP parameters.
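Since replacing z by z/γ simply scales the i‑th LP coefficient by γ^i, the weighting filter can be built from two bandwidth‑expanded coefficient sets; a minimal sketch (the helper name is illustrative):

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma) given those of A(z).

    If A(z) = 1 + a_1 z^-1 + ... + a_m z^-m, then A(z/gamma) has
    coefficients a_i * gamma**i.  The perceptual weighting filter
    W(z) = A(z/gamma1) / A(z/gamma2) is then the cascade of the
    expanded numerator (zeros) and denominator (poles) sets.
    """
    return [ai * gamma ** (i + 1) for i, ai in enumerate(a)]

# Mode-dependent factors from the text above:
GAMMA1_HIGH_RATES = 0.9   # 12.2 and 10.2 kbit/s modes
GAMMA1_LOW_RATES  = 0.94  # all other modes
GAMMA2            = 0.6
```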

The coder operates on speech frames of 20 ms corresponding to 160 samples at the sampling frequency of 8 000 samples/s. For each 160‑sample frame, the speech signal is analysed to extract the parameters of the CELP model (LP filter coefficients, adaptive and fixed codebook indices and gains). These parameters are encoded and transmitted. At the decoder, these parameters are decoded and speech is synthesized by filtering the reconstructed excitation signal through the LP synthesis filter.

The signal flow at the encoder is shown in figure 3. LP analysis is performed twice per frame for the 12.2 kbit/s mode and once for the other modes. For the 12.2 kbit/s mode, the two sets of LP parameters are converted to line spectrum pairs (LSP) and jointly quantized using split matrix quantization (SMQ) with 38 bits. For the other modes, the single set of LP parameters is converted to line spectrum pairs (LSP) and vector quantized using split vector quantization (SVQ). The speech frame is divided into 4 subframes of 5 ms each (40 samples). The adaptive and fixed codebook parameters are transmitted every subframe. The quantized and unquantized LP parameters or their interpolated versions are used depending on the subframe. An open‑loop pitch lag is estimated in every other subframe (except for the 5.15 and 4.75 kbit/s modes for which it is done once per frame) based on the perceptually weighted speech signal.

Then the following operations are repeated for each subframe:

The target signal x(n) is computed by filtering the LP residual through the weighted synthesis filter with the initial states of the filters having been updated by filtering the error between LP residual and excitation (this is equivalent to the common approach of subtracting the zero input response of the weighted synthesis filter from the weighted speech signal).
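The parenthesized equivalence follows from filter linearity: the output of a filter started with non‑zero memory equals its zero‑state response plus its zero‑input response. A toy one‑pole demonstration (illustrative only, not codec code):

```python
def filter_one_pole(x, a, state):
    """y[n] = x[n] - a*y[n-1], with y[-1] = state; returns (output, final state)."""
    y, prev = [], state
    for xn in x:
        prev = xn - a * prev
        y.append(prev)
    return y, prev

x, a, s0 = [1.0, -0.5, 0.25, 2.0], 0.7, 0.3

full, _ = filter_one_pole(x, a, s0)               # filtering with filter memory
zir, _  = filter_one_pole([0.0] * len(x), a, s0)  # zero-input response
zsr, _  = filter_one_pole(x, a, 0.0)              # zero-state response

# By linearity, the two decompositions agree sample by sample:
assert all(abs(f - (zi + zs)) < 1e-12 for f, zi, zs in zip(full, zir, zsr))
```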

The impulse response, h(n), of the weighted synthesis filter is computed.

Closed‑loop pitch analysis is then performed (to find the pitch lag and gain), using the target x(n) and impulse response h(n), by searching around the open‑loop pitch lag. Fractional pitch with 1/6th or 1/3rd of a sample resolution (depending on the mode) is used.

The target signal is updated by removing the adaptive codebook contribution (filtered adaptive codevector), and this new target, x2(n), is used in the fixed algebraic codebook search (to find the optimum innovation).

The gains of the adaptive and fixed codebook are scalar quantized with 4 and 5 bits respectively or vector quantized with 6‑7 bits (with moving average (MA) prediction applied to the fixed codebook gain).

Finally, the filter memories are updated (using the determined excitation signal) for finding the target signal in the next subframe.

The bit allocation of the AMR codec modes is shown in table 1. In each 20 ms speech frame, 95, 103, 118, 134, 148, 159, 204 or 244 bits are produced, corresponding to a bit-rate of 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2 or 12.2 kbit/s. More detailed bit allocation among the codec parameters is given in tables 9a-9h. Note that the most significant bits (MSB) are always sent first.
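These figures are mutually consistent: each frame lasts 20 ms, so the bit‑rate in bit/s is the bits‑per‑frame count multiplied by 50. A quick check:

```python
FRAME_MS = 20  # frame duration in milliseconds

# Bits per 20 ms frame for each mode (kbit/s -> bits), from table 1:
BITS_PER_FRAME = {4.75: 95, 5.15: 103, 5.90: 118, 6.70: 134,
                  7.40: 148, 7.95: 159, 10.2: 204, 12.2: 244}

for rate_kbps, bits in BITS_PER_FRAME.items():
    bit_rate = bits * 1000 // FRAME_MS          # bits per second
    assert bit_rate == round(rate_kbps * 1000)  # matches the mode's name
```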

Table 1: Bit allocation of the AMR coding algorithm for 20 ms frame

| Mode                   | Parameter      | 1st subframe | 2nd subframe | 3rd subframe | 4th subframe | Total per frame |
|------------------------|----------------|--------------|--------------|--------------|--------------|-----------------|
| 12.2 kbit/s (GSM EFR)  | 2 LSP sets     |              |              |              |              | 38              |
|                        | Pitch delay    | 9            | 6            | 9            | 6            | 30              |
|                        | Pitch gain     | 4            | 4            | 4            | 4            | 16              |
|                        | Algebraic code | 35           | 35           | 35           | 35           | 140             |
|                        | Codebook gain  | 5            | 5            | 5            | 5            | 20              |
|                        | Total          |              |              |              |              | 244             |
| 10.2 kbit/s            | LSP set        |              |              |              |              | 26              |
|                        | Pitch delay    | 8            | 5            | 8            | 5            | 26              |
|                        | Algebraic code | 31           | 31           | 31           | 31           | 124             |
|                        | Gains          | 7            | 7            | 7            | 7            | 28              |
|                        | Total          |              |              |              |              | 204             |
| 7.95 kbit/s            | LSP sets       |              |              |              |              | 27              |
|                        | Pitch delay    | 8            | 6            | 8            | 6            | 28              |
|                        | Pitch gain     | 4            | 4            | 4            | 4            | 16              |
|                        | Algebraic code | 17           | 17           | 17           | 17           | 68              |
|                        | Codebook gain  | 5            | 5            | 5            | 5            | 20              |
|                        | Total          |              |              |              |              | 159             |
| 7.40 kbit/s (DAMPS EFR)| LSP set        |              |              |              |              | 26              |
|                        | Pitch delay    | 8            | 5            | 8            | 5            | 26              |
|                        | Algebraic code | 17           | 17           | 17           | 17           | 68              |
|                        | Gains          | 7            | 7            | 7            | 7            | 28              |
|                        | Total          |              |              |              |              | 148             |
| 6.70 kbit/s            | LSP set        |              |              |              |              | 26              |
|                        | Pitch delay    | 8            | 4            | 8            | 4            | 24              |
|                        | Algebraic code | 14           | 14           | 14           | 14           | 56              |
|                        | Gains          | 7            | 7            | 7            | 7            | 28              |
|                        | Total          |              |              |              |              | 134             |
| 5.90 kbit/s            | LSP set        |              |              |              |              | 26              |
|                        | Pitch delay    | 8            | 4            | 8            | 4            | 24              |
|                        | Algebraic code | 11           | 11           | 11           | 11           | 44              |
|                        | Gains          | 6            | 6            | 6            | 6            | 24              |
|                        | Total          |              |              |              |              | 118             |
| 5.15 kbit/s            | LSP set        |              |              |              |              | 23              |
|                        | Pitch delay    | 8            | 4            | 4            | 4            | 20              |
|                        | Algebraic code | 9            | 9            | 9            | 9            | 36              |
|                        | Gains          | 6            | 6            | 6            | 6            | 24              |
|                        | Total          |              |              |              |              | 103             |
| 4.75 kbit/s            | LSP set        |              |              |              |              | 23              |
|                        | Pitch delay    | 8            | 4            | 4            | 4            | 20              |
|                        | Algebraic code | 9            | 9            | 9            | 9            | 36              |
|                        | Gains          | 8            |              | 8            |              | 16              |
|                        | Total          |              |              |              |              | 95              |

4.4 Principles of the GSM adaptive multi-rate speech decoder

The signal flow at the decoder is shown in figure 4. At the decoder, based on the chosen mode, the transmitted indices are extracted from the received bitstream. The indices are decoded to obtain the coder parameters at each transmission frame. These parameters are the LSP vectors, the fractional pitch lags, the innovative codevectors, and the pitch and innovative gains. The LSP vectors are converted to the LP filter coefficients and interpolated to obtain LP filters at each subframe. Then, at each 40-sample subframe:

‑ the excitation is constructed by adding the adaptive and innovative codevectors scaled by their respective gains;

‑ the speech is reconstructed by filtering the excitation through the LP synthesis filter.

Finally, the reconstructed speech signal is passed through an adaptive postfilter.
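For illustration, the adaptive codevector used when constructing the excitation is the past excitation delayed by the decoded pitch lag; a minimal sketch with an integer lag (the codec itself uses fractional lags and interpolation, and the helper name is hypothetical), assuming the common convention that for lags shorter than the subframe the available samples are repeated:

```python
def adaptive_codevector(exc_history, lag, subframe_len=40):
    """Adaptive codebook vector: the past excitation delayed by `lag` samples.

    Toy integer-lag sketch.  For n < lag, take samples from the
    excitation history; for n >= lag, repeat the already-built samples.
    """
    v = []
    for n in range(subframe_len):
        v.append(exc_history[-lag + n] if n < lag else v[n - lag])
    return v
```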

4.5 Sequence and subjective importance of encoded parameters

The encoder will produce the output information in a unique sequence and format, and the decoder must receive the same information in the same way. Tables 9a‑9h show the sequence of output bits and the bit allocation for each parameter.

The different parameters of the encoded speech and their individual bits have unequal importance with respect to subjective quality. Before being submitted to the channel encoding function, the bits have to be rearranged in the sequence of importance given in GSM 05.03 [3].