4 General description of the coder

3GPP TS 26.445, Release 15: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description

4.1 Introduction

The present document is a detailed description of the signal processing algorithms of the Enhanced Voice Services (EVS) coder. It explains the detailed mapping from 20 ms input blocks of audio samples in 16-bit uniform PCM format to encoded blocks of bits, and from encoded blocks of bits to output blocks of reconstructed audio samples. Four sampling rates are supported: 8 000, 16 000, 32 000 and 48 000 samples/s. The bit rate of the encoded bit stream may be 5.9, 7.2, 8.0, 9.6, 13.2, 16.4, 24.4, 32.0, 48.0, 64.0, 96.0 or 128.0 kbit/s. An AMR-WB interoperable mode is also provided, which operates at encoded bit rates of 6.6, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05 or 23.85 kbit/s.
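As an illustration of the frame arithmetic implied above, the following minimal sketch converts the listed sampling rates and bit rates into samples and bits per 20 ms frame. The function and constant names are hypothetical and are not taken from the reference code in [3] or [43]. Note that 5.9 kbit/s is an average rate (see clause 4.4.1.4), so the corresponding figure is an average number of bits per frame.

```python
# Illustrative sketch only; names are assumptions, not reference-code identifiers.
FRAME_MS = 20  # frame length stated in clause 4.1

SAMPLING_RATES_HZ = [8000, 16000, 32000, 48000]
EVS_PRIMARY_BPS = [5900, 7200, 8000, 9600, 13200, 16400,
                   24400, 32000, 48000, 64000, 96000, 128000]
AMR_WB_IO_BPS = [6600, 8850, 12650, 14250, 15850,
                 18250, 19850, 23050, 23850]


def samples_per_frame(rate_hz: int) -> int:
    """Number of PCM samples in one 20 ms input or output block."""
    return rate_hz * FRAME_MS // 1000


def bits_per_frame(bit_rate_bps: int) -> int:
    """Number of encoded bits carried by one 20 ms frame at the given rate."""
    return bit_rate_bps * FRAME_MS // 1000


if __name__ == "__main__":
    for rate in SAMPLING_RATES_HZ:
        print(rate, "Hz:", samples_per_frame(rate), "samples/frame")
    for bps in EVS_PRIMARY_BPS + AMR_WB_IO_BPS:
        print(bps, "bit/s:", bits_per_frame(bps), "bits/frame")
```

For example, a 13.2 kbit/s frame carries 264 bits and a 48 000 samples/s input block holds 960 samples.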

The procedure of this document is mandatory for implementation in all network entities and User Equipment (UE) supporting the EVS coder.

The present document does not describe the ANSI C code of this procedure. For a description of the reference fixed-point ANSI C code specification, see [3]; for a description of the reference floating-point ANSI C code specification, see [43].

In the case of discrepancy between the procedure described in the present document and its ANSI C code specification contained in [3], the procedure defined by [3] prevails. In the case of discrepancy between the procedure described in the present document and its ANSI C code specification contained in [43], the procedure defined by [43] prevails.

4.2 Input/output sampling rate

The encoder can accept fullband (FB), superwideband (SWB), wideband (WB) or narrowband (NB) signals sampled at 48, 32, 16 or 8 kHz. Similarly, the decoder output can be FB, SWB, WB or NB, sampled at 48, 32, 16 or 8 kHz.

4.3 Codec delay

The input signal is processed using 20 ms frames. The codec delay depends on the output sampling rate. For WB, SWB and FB output (i.e. output sampling rate > 8 kHz) the overall algorithmic delay is 32 ms. It consists of one 20 ms frame, 0.9375 ms delay of input resampling filters on the encoder-side, 8.75 ms for the encoder look-ahead, and 2.3125 ms delay of time-domain bandwidth extension on the decoder-side. For 8 kHz decoder output the decoder delay is reduced to 1.25 ms needed for resampling using a complex low-delay filterbank, resulting in 30.9375 ms overall algorithmic delay.
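The delay budget above can be checked with a short sketch that sums the stated components. The constant and function names are hypothetical; only the millisecond values come from the text. Exact fractions are used so that the sums are not affected by rounding.

```python
# Illustrative sketch only; names are assumptions, values are from clause 4.3.
from fractions import Fraction

FRAME_MS = Fraction(20)                 # one 20 ms frame
ENC_RESAMPLING_MS = Fraction(15, 16)    # 0.9375 ms, encoder input resampling filters
ENC_LOOKAHEAD_MS = Fraction(35, 4)      # 8.75 ms, encoder look-ahead
DEC_TD_BWE_MS = Fraction(37, 16)        # 2.3125 ms, decoder time-domain bandwidth extension
DEC_NB_RESAMPLING_MS = Fraction(5, 4)   # 1.25 ms, decoder low-delay filterbank resampling


def algorithmic_delay_ms(output_rate_hz: int) -> Fraction:
    """Overall algorithmic delay in ms for a given decoder output sampling rate."""
    # For 8 kHz output the decoder-side delay is the resampling delay;
    # for higher output rates it is the time-domain bandwidth extension delay.
    if output_rate_hz == 8000:
        decoder_ms = DEC_NB_RESAMPLING_MS
    else:
        decoder_ms = DEC_TD_BWE_MS
    return FRAME_MS + ENC_RESAMPLING_MS + ENC_LOOKAHEAD_MS + decoder_ms


if __name__ == "__main__":
    for rate in (8000, 16000, 32000, 48000):
        print(rate, "Hz output:", float(algorithmic_delay_ms(rate)), "ms")
```

The sums reproduce the figures in the text: 32 ms for output rates above 8 kHz and 30.9375 ms for 8 kHz output.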

4.4 Coder overview

The EVS codec employs a hybrid coding scheme combining linear predictive (LP) coding techniques based upon ACELP (Algebraic Code Excited Linear Prediction), predominantly for speech signals, with a transform coding method, for generic content, as well as inactive signal coding in conjunction with VAD/DTX/CNG (Voice Activity Detection/Discontinuous Transmission/Comfort Noise Generation) operation. The EVS codec is capable of switching between these different coding modes without artefacts.

The EVS codec supports 5.9 kbps narrowband and wideband variable bit rate (VBR) operation based upon the ACELP coding paradigm, which also provides the AMR-WB interoperable encoding and decoding. In addition to perceptually optimized waveform matching, the codec utilizes parametric representations of certain frequency ranges. These parametric representations constitute coded bandwidth extensions or noise filling strategies.

The decoder generates the signal parameters represented by the indices transmitted in the bit-stream. For the bandwidth extension and noise fill regions, estimates from the coded regions are used in addition to the decoded parametric data to generate the signals for these frequency regions.

The description of the coding algorithm of this Specification is made in terms of bit-exact, fixed-point mathematical operations. The ANSI C code described in [3], which constitutes an integral part of this Specification, reflects this bit-exact, fixed-point descriptive approach. The mathematical descriptions of the encoder (clause 5) and decoder (clause 6) can be implemented in different ways, possibly leading to a codec implementation not complying with this Specification. Therefore, the algorithm description of the ANSI C code of [3] shall take precedence over the mathematical descriptions of clauses 5 and 6 whenever discrepancies are found. A non-exhaustive set of test signals that can be used with the ANSI C code is described in [4].

4.4.1 Encoder overview

Figure 1 represents a high-level overview of the encoder. The signal resampling block corrects mismatches between the sampling frequency and the signal bandwidth parameter, which is specified on the command line or through a file containing a bandwidth switching profile as explained in TS 26.442 and TS 26.443. If the signal bandwidth is lower than half the input sampling frequency, the signal is decimated to the lowest sampling rate out of the set (8, 16, 32 kHz) that is larger than twice the signal bandwidth.
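The rate selection rule in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the reference resampler: the function name is hypothetical, and the bandwidth values used in the example are assumed figures chosen only to exercise the rule.

```python
# Illustrative sketch only; names and example bandwidths are assumptions.
CANDIDATE_RATES_HZ = (8000, 16000, 32000)  # the set named in clause 4.4.1


def resampled_rate_hz(input_rate_hz: int, bandwidth_hz: int) -> int:
    """Pick the sampling rate the signal is decimated to.

    If the bandwidth is not lower than half the input sampling frequency,
    no decimation applies and the input rate is kept.
    """
    if 2 * bandwidth_hz >= input_rate_hz:
        return input_rate_hz
    # Lowest candidate rate that is larger than twice the signal bandwidth.
    for rate in CANDIDATE_RATES_HZ:
        if rate > 2 * bandwidth_hz:
            return rate
    # No candidate is high enough; keep the input rate.
    return input_rate_hz


if __name__ == "__main__":
    # Hypothetical bandwidths in Hz, for illustration only.
    print(resampled_rate_hz(48000, 7000))   # 16000
    print(resampled_rate_hz(48000, 3400))   # 8000
    print(resampled_rate_hz(32000, 14000))  # 32000
```

With a 48 kHz input and an assumed 7 kHz signal bandwidth, for instance, the lowest rate exceeding 14 kHz is 16 kHz.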

Figure 2: Encoder overview

The signal analysis determines which of three possible encoder strategies to employ: LP based coding (ACELP), frequency domain encoding and inactive coding (CNG). In some operational modes the signal analysis step includes a closed loop decision to determine which encoding method will result in the lowest distortion. Further parameters derived in the signal analysis aid the operation of these coding blocks and some of the analysis parameters, such as the coding strategy to be employed, are encoded into the bit-stream. In each of the coding blocks the signal analysis is further refined to obtain parameters relevant for the particular coding block.

The signal analysis and subsequent decision of the coding mode are performed independently for each 20 ms frame, and switching between different modes is possible on a frame-by-frame basis. At a switching instance, parameters are exchanged between the coding modes to ensure that the switching is as seamless as possible, and closed-loop methods are sometimes employed in this case. In addition, switching between different bandwidths and/or bit rates (including both the EVS Primary mode and the EVS AMR-WB IO mode) is possible on frame boundaries.

The signal analysis and all other blocks have full access to the command line parameters, such as bit rate, sampling rate, signal bandwidth and DTX activation, as signalling information.

4.4.1.1 Linear Prediction Based Operation

The input signal is split into high frequency band and low frequency band paths, where the cut-off frequency between these two bands is determined from the operational mode (bandwidth and bit rate) of the codec.

The linear-prediction coefficient estimation is performed for every 20 ms frame. Within a frame, several interpolation points are established depending upon the bit rate, and the optimum interpolation is transmitted to the decoder. The linear-prediction residual is further analysed and quantized using different quantization schemes dependent upon the nature of the residual. For the 5.9 kbps VBR operation, additional low-rate coding modes at rates conforming to the design constraints are employed.

The high-frequency portion of the signal is represented with several different parametric representations. The parameters used for this representation vary as a function of the bit-rate and the residual quantization strategy. The transmitted parameters include some or all of spectral envelope, energy information and temporal evolution information.

The LP based core can be configured so that both the linear prediction coefficients and the residual quantization are interoperable with the AMR-WB decoder. For this purpose, the configuration of the LP coefficient estimation, the parametric HF representation and the residual quantization is similar to that of AMR-WB. For the AMR-WB interoperable operation modes, codebooks identical to those of the AMR-WB quantizers are used.

Figure 3: Linear prediction based operation including parametric HF representation

4.4.1.2 Frequency Domain Operation

For the frequency domain coding the encoding block can be envisaged as being separated into a control layer and a signal processing layer. The control layer performs signal analysis to derive several control and configuration parameters for the signal processing layer. The time-to-frequency transformation is based on the Modified Discrete Cosine Transform (MDCT) and provides adaptive time-frequency resolution. The control layer derives measures of the time distribution of the signal energy in a frame and controls the transform.

The MDCT coefficients are quantized using a variety of direct and parametric representations depending upon bit rate, signal type and operating mode.

Figure 4: Frequency domain encoder

4.4.1.3 Inactive Signal coding

When the codec is operated with DTX enabled, the signal classifier depicted in Figure 1 selects the discontinuous transmission (DTX) mode for frames that are determined to consist of background noise. For these frames, a low-rate parametric representation of the signal is transmitted no more frequently than every 8 frames (160 ms).

The low-rate parametric representation is used in the decoder for comfort noise generation (CNG). It includes parameters describing the frequency envelope of the background signal and energy parameters describing the overall energy and its time evolution.

4.4.1.4 Source Controlled VBR Coding

VBR coding describes a method that assigns a different number of bits to each speech frame in the coded domain depending on the characteristics of the input speech signal [20] [21]. This method is often also called source-controlled coding. Typically, a source-controlled coder encodes speech at different bit rates depending on how the current frame is classified, e.g., voiced, unvoiced, transient, or silence. Note that DTX operation can be combined with VBR coders in the same way as with Fixed Rate (FR) coders; the VBR operation is related to active speech segments.

The VBR solution provides narrowband and wideband coding using the bit rates 2.8, 7.2 and 8.0 kbps and produces an average bit rate at 5.9 kbps.
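The relation between the per-frame rates and the average rate can be sketched as a weighted mean over active frames. The frame counts below are a purely hypothetical mix chosen for illustration; they are not measurements and are not taken from this Specification, and the function name is an assumption.

```python
# Illustrative sketch only; the frame mix below is hypothetical.
from typing import Dict


def average_bit_rate_bps(frame_counts: Dict[int, int]) -> float:
    """Average bit rate over equal-length frames.

    frame_counts maps a per-frame bit rate (bit/s) to the number of
    active frames coded at that rate.
    """
    total_frames = sum(frame_counts.values())
    if total_frames == 0:
        raise ValueError("at least one frame is required")
    weighted = sum(rate * count for rate, count in frame_counts.items())
    return weighted / total_frames


if __name__ == "__main__":
    # Hypothetical mix of 100 active frames across the three VBR rates.
    mix = {2800: 30, 7200: 60, 8000: 10}
    print(average_bit_rate_bps(mix), "bit/s")
```

Because all frames have the same 20 ms duration, the average rate depends only on the proportion of frames coded at each rate.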

Due to its finer bit allocation, VBR coding offers better speech quality than Fixed Rate (FR) coding at the same average active bit rate. The benefits of VBR can be exploited if the transmission network supports the transmission of speech frames (packets) of variable size, such as in LTE and UMTS networks.

4.4.2 Decoder overview

The decoder receives all quantized parameters and generates a synthesized signal. Thus, for the majority of encoder operations, it performs the inverse of the quantized-value-to-index operations.

For the AMR-WB interoperable operation the index lookup is performed using the AMR-WB codebooks and the decoder is configured to generate an improved synthesized signal from the AMR-WB bitstream.

4.4.2.1 Parametric Signal Representation Decoding (Bandwidth Extension)

In addition to the generation of signal components specifically represented by the transmitted indices, the decoder performs estimates of signal regions where the transmitted signal representation is incomplete, i.e. for the parametric signal representations and noise fill as well as blind bandwidth extension in some cases.

4.4.2.2 Frame loss concealment

The EVS codec includes frame loss concealment algorithms [21]. For all coding modes, an extrapolation algorithm is in place that estimates the signal in a lost frame. For the LP based core, this estimation operates on the last received residual and LP coefficients. For the frequency domain core, in some cases the last received MDCT coefficients are extrapolated; in addition, the resulting time domain signal is guaranteed to give a smooth time evolution from the last received frame into the missing frames.

Once the frame loss is recovered, i.e., the first good frame is received, the codec memory is updated and frame boundary mismatches towards the last lost frame are minimized.

For situations of sustained frame loss the signal is either faded to background noise or its energy is reduced and finally muted when no reasonable extrapolation can be assumed.

The EVS codec also includes the “channel aware” mode, which may be employed for improved performance under packet loss conditions in a VoIP system. In the channel aware mode, partial copies (secondary frames) of the current speech frame are piggybacked on future speech frames (primary frames), without any increase in the total bit rate for the primary and secondary frames. If the current frame is lost, then its partial copy can be retrieved by polling the de-jitter buffer to enable faster and improved recovery from the packet loss.

4.4.3 DTX/CNG operation

The codec is equipped with a signal activity detection (SAD) algorithm for classifying each input frame as active or inactive. It supports a discontinuous transmission (DTX) operation in which the comfort noise generation (CNG) module is used to approximate and update the statistics of the background noise at a variable bit rate. By default, the CNG update interval is fixed at 8 frames. However, the CNG update rate can also be set to another fixed value or to a variable rate by means of a command line parameter; when the transmission rate during inactive signal periods is variable, it depends on the estimated level of the background noise.
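The fixed-interval CNG update can be sketched as a simple schedule over per-frame activity decisions. This is a deliberately simplified illustration: it assumes that the first inactive frame of each inactive run carries an update, and it ignores any hangover logic and the variable-rate option. The function name and these assumptions are not taken from this Specification.

```python
# Illustrative sketch only; scheduling details are simplifying assumptions.
from typing import List, Optional


def cng_update_frames(inactive: List[bool], interval: int = 8) -> List[int]:
    """Return indices of frames that carry a CNG parameter update.

    inactive[i] is True when frame i was classified as inactive.
    Within an inactive run, updates are spaced `interval` frames apart.
    """
    updates: List[int] = []
    frames_since_update: Optional[int] = None  # None while the signal is active
    for index, is_inactive in enumerate(inactive):
        if not is_inactive:
            frames_since_update = None
            continue
        if frames_since_update is None or frames_since_update >= interval:
            updates.append(index)
            frames_since_update = 1
        else:
            frames_since_update += 1
    return updates


if __name__ == "__main__":
    # Three active frames followed by ten inactive frames.
    flags = [False] * 3 + [True] * 10
    print(cng_update_frames(flags))
```

With the default interval of 8 frames, consecutive updates within one inactive run are 160 ms apart.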

4.4.3.1 Inactive Signal coding

When the codec is operated with DTX enabled, the signal classifier depicted in Figure 1 selects the discontinuous transmission (DTX) mode for frames that are determined to consist of background noise. For these frames, a low-rate parametric representation of the signal is transmitted no more frequently than every 8 frames (160 ms).

The low-rate parametric representation is used in the decoder for comfort noise generation (CNG). It includes parameters describing the frequency envelope of the background signal and energy parameters describing the overall energy and its time evolution.

4.4.4 AMR-WB-interoperable option

As mentioned previously, EVS can operate in a mode which is fully interoperable with the AMR-WB codec bitstream.

4.4.5 Channel-Aware Mode

EVS offers an error-robust channel-aware mode based on partial redundancy [21] at 13.2 kbps for both wideband and super-wideband audio bandwidths.

In a VoIP system, packets arrive at the decoder with random jitter in their arrival time. Packets may also arrive out of order at the decoder. Since the decoder expects to be fed a speech packet every 20 ms to output speech samples in periodic blocks, a de-jitter buffer is required to absorb the jitter in the packet arrival time. The channel aware mode combines the presence of a de-jitter buffer with partial redundancy coding of a current frame, which gets piggybacked onto a future frame. At the receiver, the de-jitter buffer is polled to check whether a partial redundant copy of the current lost frame is available in any of the future frames. If present, the partial redundant information is used to synthesize the lost frame, which offers significant quality improvements under low to high FER conditions. Source control is used to determine which frames of input can best be coded at a reduced frame rate (called primary frames) to accommodate the attachment of redundancy without altering the total packet size. In this way, the channel aware mode includes redundancy in a constant-bit-rate channel (13.2 kbps).
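The receiver-side polling described above can be sketched with a de-jitter buffer keyed by frame index. This is a minimal illustration under stated assumptions: the `Packet` layout, the function name and the `max_offset` search window are hypothetical, and payloads are represented as plain strings rather than coded bits.

```python
# Illustrative sketch only; data layout and names are assumptions.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Packet:
    seq: int                            # index of the primary frame in this packet
    primary: str                        # primary frame payload (placeholder)
    partial_seq: Optional[int] = None   # index of the earlier frame the partial copy describes
    partial: Optional[str] = None       # partial redundant copy payload (placeholder)


def frame_for_playout(buffer: Dict[int, Packet], seq: int,
                      max_offset: int = 7) -> Tuple[str, Optional[str]]:
    """Decide what the decoder is fed for frame `seq`.

    Returns ("primary", payload) if the packet arrived, ("partial", payload)
    if a future packet in the buffer carries a partial copy of the lost
    frame, and ("conceal", None) otherwise.
    """
    packet = buffer.get(seq)
    if packet is not None:
        return ("primary", packet.primary)
    # Frame is lost: poll the future frames held in the de-jitter buffer.
    for future_seq in range(seq + 1, seq + max_offset + 1):
        future = buffer.get(future_seq)
        if future is not None and future.partial_seq == seq:
            return ("partial", future.partial)
    return ("conceal", None)


if __name__ == "__main__":
    buffer = {0: Packet(0, "p0"), 3: Packet(3, "p3", partial_seq=1, partial="r1")}
    for n in range(3):
        print(n, frame_for_playout(buffer, n))
```

In this example frame 1 is lost but is recovered from the partial copy carried by packet 3, while frame 2 has no copy in the buffer and falls back to frame loss concealment.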

4.5 Organization of the rest of the Technical Standard

In clauses 5 and 6, detailed descriptions of the encoder and the decoder are given. Bit allocation is summarized in clause 7.