5.1.10 Open-loop pitch analysis

3GPP TS 26.445, Release 15: Codec for Enhanced Voice Services (EVS); Detailed algorithmic description

The Open-Loop (OL) pitch analysis calculates three estimates of the pitch lag for each frame. This is done in order to smooth the pitch evolution contour and to simplify the pitch analysis by confining the closed-loop pitch search to a small number of lags around the open-loop estimated lags.

The OL pitch estimation is based on a perceptually weighted, pre-emphasized input signal. To reduce the computational complexity of the search, the open-loop pitch analysis is performed on a signal decimated by two, i.e. sampled at 6.4 kHz. The decimated signal is obtained by filtering the weighted signal through a 5-tap (4th-order) FIR filter with coefficients {0.13, 0.23, 0.28, 0.23, 0.13} and then down-sampling the output by 2.
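As an informal illustration of the decimation step, the following minimal sketch filters a signal with the five coefficients quoted above and keeps every second output sample. The function name, the causal alignment of the filter and the zero initial filter state are assumptions of this sketch and are not taken from the specification.

```python
# Illustrative sketch: low-pass filter and decimate by 2 (12.8 kHz -> 6.4 kHz).
DECIM_TAPS = [0.13, 0.23, 0.28, 0.23, 0.13]  # coefficients quoted in the text


def decimate_by_two(signal, taps=DECIM_TAPS):
    """Return every second sample of the FIR-filtered signal.

    Samples before the start of `signal` are treated as zero and the filter
    is applied causally; both choices are assumptions of this sketch.
    """
    out = []
    for n in range(0, len(signal), 2):  # compute even-indexed outputs only
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out
```

Because the coefficients sum to one, a constant input is reproduced unchanged once the filter memory has filled.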

The OL pitch analysis is performed three times per frame to find three estimates of the pitch lag: two in the current frame and one in the look-ahead area. The first two estimates are based on 10-ms segments of the current frame. The third estimate corresponds to the look-ahead, and the length of this segment is 8.75 ms.

5.1.10.1 Perceptual weighting

Perceptual weighting is performed by filtering the pre-emphasized input signal through a perceptual weighting filter, derived from the LP filter coefficients. The traditional perceptual weighting filter has inherent limitations in modelling the formant structure and the required spectral tilt concurrently. The spectral tilt is pronounced in speech signals due to the wide dynamic range between low and high frequencies. This problem is eliminated by introducing the pre-emphasis filter (see subclause 5.1.4) at the input which enhances the high frequency content. The LP filter coefficients are found by means of LP analysis on the pre-emphasized signal. Subsequently, they are used to form the perceptual weighting filter. Its transfer function is the same as the LP filter transfer function but with the denominator having coefficients equal to the de-emphasis filter (inverse of the pre-emphasis filter). In this way, the weighting in formant regions is decoupled from the spectral tilt. Finally, the pre-emphasized signal is filtered through the perceptual filter to obtain a perceptually weighted signal, which is used further in the OL pitch analysis.

The perceptual weighting filter has the following form

$W(z) = A(z/\gamma_1)\, H_{de\text{-}emph}(z)$ (63)

where

$H_{de\text{-}emph}(z) = \dfrac{1}{1 - \beta_1 z^{-1}}$ (64)

and $\beta_1 = 0.68$ and $\gamma_1 = 0.92$. Since $A(z)$ is computed based on the pre-emphasized signal, the tilt of the filter $1/A(z/\gamma_1)$ is less pronounced compared to the case when $A(z)$ is computed based on the original signal. The de-emphasis is also performed on the output signal in the decoder. It can be shown that the quantization error spectrum is shaped by a filter having a transfer function $W^{-1}(z)\, H_{de\text{-}emph}(z)$. Thus, the spectrum of the quantization error is shaped by a filter whose transfer function is $1/A(z/\gamma_1)$, with $A(z)$ computed based on the pre-emphasized signal. The perceptual weighting is performed on a frame-by-frame basis while the LP filter coefficients are calculated on a sub-frame basis using the principle of LSP interpolation, described in subclause 5.1.9.6. For a sub-frame size $L$ = 64, the weighted speech is given by

$s_h(n) = s_{pre}(n) + \sum_{i=1}^{16} a_i \gamma_1^{i}\, s_{pre}(n-i) + 0.68\, s_h(n-1), \quad n = 0, \ldots, L-1$ (65)

where 0.68 is the pre-emphasis factor. Furthermore, for the open-loop pitch analysis, the computation is extended for a period of 8.75 ms using the look-ahead from the future frame. This is done using the filter coefficients of the 4th subframe in the present frame. Note that this extended weighted signal is used only in the OL pitch analysis of the present frame.
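The weighting of one subframe can be sketched as an FIR stage with bandwidth-expanded LP coefficients followed by a first-order recursive stage with factor 0.68, the pre-emphasis factor quoted above. This is a minimal sketch only: the function name, the memory arguments and the default weighting factor of 0.92 are assumptions of the example.

```python
# Illustrative sketch: y(n) = x(n) + sum_i a_i*g^i*x(n-i) + beta*y(n-1).
def perceptual_weighting(s_pre, a, gamma1=0.92, beta1=0.68,
                         fir_mem=None, iir_mem=0.0):
    """Weight one subframe of the pre-emphasized signal `s_pre`.

    a       -- LP coefficients [1, a_1, ..., a_M] of the subframe
    gamma1  -- bandwidth-expansion factor (assumed default)
    beta1   -- de-emphasis factor (0.68 is quoted in the text)
    fir_mem -- the M input samples preceding the subframe, oldest first
    iir_mem -- the last output sample of the previous subframe
    """
    order = len(a) - 1
    history = list(fir_mem) if fir_mem is not None else [0.0] * order
    buf = history + list(s_pre)
    a_weighted = [a[i] * gamma1 ** i for i in range(order + 1)]
    out = []
    prev = iir_mem
    for n in range(len(s_pre)):
        acc = 0.0
        for i in range(order + 1):
            acc += a_weighted[i] * buf[order + n - i]
        prev = acc + beta1 * prev  # first-order recursive (de-emphasis) stage
        out.append(prev)
    return out
```

In a frame loop, the caller would pass the last M input samples and the last output sample of the previous subframe as `fir_mem` and `iir_mem`, so that the filter state carries over between subframes.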

If the ACELP core is selected for WB, SWB or FB signals at bitrates higher than 13.2 kbps, its internal sampling rate is set to 16 kHz rather than 12.8 kHz. Nevertheless, the OL pitch analysis is done only at 12.8 kHz and the estimated OL pitch values are later resampled to 16 kHz. However, a perceptually weighted input signal sampled at 16 kHz is still needed in the search of the adaptive codebook. The perceptual weighting filter at 16 kHz has the following form

(66)

where and . Thus, for this case, the pre-emphasis is done as follows

(67)

The perceptual weighting at 25.6 kHz and 32 kHz is described later in this document.

5.1.10.2 Correlation function computation

The correlation function for each of the three segments (or half-frames) is obtained using correlation values computed over a first pitch delay range from 10 to 115 and over a second pitch delay range from 12 to 115 (both expressed in the domain decimated by 2). Each of the two delay ranges is divided into four sections: [10, 16], [17, 31], [32, 61] and [62, 115] for the first delay range and [12, 21], [22, 40], [41, 77] and [78, 115] for the second delay range, so that the two sets of four sections overlap. The first sections in the two sets, [10, 16] and [12, 21], are, however, used only under special circumstances to avoid quality degradation for pitch lags below the lowest pitch quantization limit. Due to this special use of the first sections, omitting the pitch lags 10 and 11 in the second set of sections presents no quality issues. In addition, the second set omits pitch lags between 17 and 20, when the first sections are not used. The first section is mainly used in speech segments with stable, short pitch lags and the above limits therefore have a negligible effect on the overall pitch search and quantization performance.

The autocorrelation function is first computed on a decimated weighted signal for each pitch lag value in both sets by

(68)

where the summation limit depends on the delay section, i.e.:

(69)

This ensures that, for a given delay value, at least one pitch cycle is included in the correlation computation. The autocorrelation window is aligned with the first sample of each of the two 10-ms segments of the current frame, where the autocorrelation can thus be calculated directly according to equation (68). To maximize the usage of the look-ahead segment, the autocorrelation window in the third segment is aligned with the last available sample. In this final segment the autocorrelation function of equation (68) is computed backwards, i.e., the sample indices are negative. Therefore, the computation as such is the same for all three segments; only the window alignment differs and the indexing of the signal is reversed in the last segment.
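The forward and backward window alignment can be illustrated with the following minimal sketch. It computes a plain sum of products between the signal and its delayed version, either forwards from the first sample of a segment or backwards from the last available sample. The function names and the convention that the caller supplies the window length for the section are assumptions of this sketch; the actual summation limits are those of equation (69).

```python
# Illustrative sketch: autocorrelation with forward or backward alignment.
def correlation(sig, anchor, lag, length, backward=False):
    """Sum of sig[n] * sig[n - lag] over `length` samples.

    Forward:  n = anchor, anchor + 1, ...  (anchor = first sample of segment)
    Backward: n = anchor, anchor - 1, ...  (anchor = last available sample)
    The caller must ensure that all indices stay inside `sig`.
    """
    step = -1 if backward else 1
    total = 0.0
    for j in range(length):
        n = anchor + step * j
        total += sig[n] * sig[n - lag]
    return total


def best_lag_in_section(sig, anchor, section, length, backward=False):
    """Return the lag in the inclusive range `section` with maximum correlation."""
    first, last = section
    return max(range(first, last + 1),
               key=lambda d: correlation(sig, anchor, d, length, backward))
```

The same routine serves all three segments; only `anchor` and `backward` differ for the look-ahead segment.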

5.1.10.3 Correlation reinforcement with past pitch values

The autocorrelation function is then weighted for both pitch delay ranges to emphasize the function for delays in the neighbourhood of pitch lag parameters determined in the previous frame (extrapolated pitch lags).

The weighting function is a triangular window of size 27, centred on the extrapolated pitch lag. It is given by

$w_{pn}(13+i) = w_{pn}(13-i) = 1 + \alpha_{pn}\,(1 - i/14), \quad i = 0, \ldots, 13$ (70)

where $\alpha_{pn}$ is a scaling factor based on the voicing measure from the previous frame (the normalized pitch correlation) and the pitch stability in the previous frame. During voiced segments with smooth pitch evolution, the scaling factor is increased in each frame by an amount proportional to $\bar{C}_{norm}$ and it is upper-limited to 0.7. $\bar{C}_{norm}$ is the average of the normalized correlation in the two half-frames of the previous frame and is given in equation (78). The scaling factor is reset to zero (no weighting) if $\bar{C}_{norm}$ is less than 0.4 or if the pitch lag evolution in the previous frame is unstable or if the relative frame energy of the previous frame is more than a certain threshold. The pitch instability is determined by testing the pitch coherence between consecutive half-frames. The pitch values of two consecutive half-frames are considered coherent if the following condition is satisfied:

$(max\_value < 1.4\; min\_value)$ AND $((max\_value - min\_value) < 14)$ (71)

where $max\_value$ and $min\_value$ denote the maximum and minimum of the two pitch values, respectively. The pitch evolution in the current frame is considered stable if pitch coherence is satisfied both between the second half-frame of the previous frame and the first half-frame of the current frame, and between the first and the second half-frames of the current frame.
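The coherence and stability tests can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and the two default thresholds (a ratio limit of 1.4 and a difference limit of 14 samples in the decimated domain) are assumptions of the example that should be checked against the coherence condition of the specification.

```python
# Illustrative sketch: pitch coherence between consecutive half-frames.
def pitch_coherent(lag_a, lag_b, max_ratio=1.4, max_diff=14):
    """True if the two lags are close both in ratio and in absolute difference.

    `max_ratio` and `max_diff` are assumed values, not normative ones.
    """
    hi, lo = max(lag_a, lag_b), min(lag_a, lag_b)
    return hi < max_ratio * lo and (hi - lo) < max_diff


def pitch_stable(prev_second, cur_first, cur_second):
    """Stable if both consecutive half-frame pairs are coherent."""
    return (pitch_coherent(prev_second, cur_first)
            and pitch_coherent(cur_first, cur_second))
```

For example, lags 40 and 42 are coherent, whereas 40 and 60 fail the ratio test and 100 and 115 fail the difference test.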

The extrapolated pitch lag in the first half-frame, $\tilde{d}^{[0]}$, is computed as the pitch lag from the second half-frame of the previous frame plus a pitch evolution factor $f_{evol}$, computed from the previous frame (described in subclause 5.1.10.7). The extrapolated pitch lag in the second half-frame, $\tilde{d}^{[1]}$, is computed as the pitch lag from the second half-frame of the previous frame plus twice the pitch evolution factor. The extrapolated pitch lag in the look-ahead, $\tilde{d}^{[2]}$, is set equal to $\tilde{d}^{[1]}$. That is

$\tilde{d}^{[0]} = d^{[-1]} + f_{evol}$
$\tilde{d}^{[1]} = d^{[-1]} + 2 f_{evol}$
$\tilde{d}^{[2]} = \tilde{d}^{[1]}$ (72)

where $d^{[-1]}$ is the pitch lag in the second half-frame of the previous frame. The pitch evolution factor is obtained by averaging the pitch differences of consecutive half-frames that are determined as coherent (according to the coherence rule described above).

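The autocorrelation function weighted around an extrapolated pitch lag $\tilde{d}$ is given by

$C^{w}(\tilde{d}+i) = C(\tilde{d}+i)\, w_{pn}(13+i), \quad i = -13, \ldots, 13$ (73)

where $C(d)$ is the autocorrelation function of equation (68) and $w_{pn}(\cdot)$ is the triangular window of equation (70).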
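The extrapolation of the pitch lags and the reinforcement around them can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the window is assumed to taper linearly as 1 + alpha * (1 - |i|/14) over 27 points, and the extrapolated lag is assumed to have been rounded to an integer before weighting.

```python
# Illustrative sketch: extrapolated lags and triangular reinforcement.
def extrapolated_lags(prev_lag, evolution):
    """Lags for the first half-frame, second half-frame and look-ahead."""
    first = prev_lag + evolution
    second = prev_lag + 2 * evolution
    return first, second, second  # look-ahead reuses the second value


def triangular_window(alpha, half_width=13):
    """Window of size 2*half_width + 1; alpha = 0 gives all ones (no weighting)."""
    size = 2 * half_width + 1
    return [1.0 + alpha * (1.0 - abs(i - half_width) / (half_width + 1))
            for i in range(size)]


def weight_correlation(corr, centre, alpha, half_width=13):
    """Scale `corr` (dict: integer lag -> correlation) around `centre`."""
    win = triangular_window(alpha, half_width)
    out = dict(corr)
    for i in range(-half_width, half_width + 1):
        lag = centre + i
        if lag in out:
            out[lag] = corr[lag] * win[half_width + i]
    return out
```

Lags further than 13 samples from the extrapolated lag are left unweighted, so candidates far from the previous pitch contour are not penalized, only not reinforced.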

5.1.10.4 Normalized correlation computation

After weighting the correlation function with the triangular window of equation (70) centred at the extrapolated pitch lag, the maxima of the weighted correlation function in each of the four sections (three sections, if the first section is not used) are determined. This is performed for both pitch delay ranges. Note that the first section is used only during high-pitched segments. For signals other than narrowband signals, this means that the open-loop pitch period of the second half-frame of the previous frame is lower than or equal to 34. For narrowband signals, the open-loop pitch period of the second half-frame of the previous frame needs to be lower than or equal to 24 and the scaling factor has to be higher than or equal to 0.1. It is further noted that the scaling factor is set to 0 if the previous frame was an unvoiced or a transition frame and the signal has a bandwidth higher than narrowband. In the following, the special case of three sections will not be explicitly dealt with if it arises directly from the text. The pitch delays that yield the maximum of the weighted correlation function will be denoted as $d_k$, where k = 0,1,2,3 denotes each of the four sections. Then, the original (raw) correlation function at these pitch delays (pitch lags) is normalized as

$C_{norm}(d_k) = \dfrac{C(d_k)}{\sqrt{\sum_{n} s_{hd}^{2}(n)\, \sum_{n} s_{hd}^{2}(n-d_k)}}$ (74)

where $s_{hd}(n)$ denotes the decimated weighted signal and the sums run over the same window as in equation (68).

The same normalization is applied also to the weighted correlation function, $C^{w}(d_k)$, which yields $C^{w}_{norm}(d_k)$. It is noted that the window is aligned at the first sample of the corresponding half-frame for the two half-frames of the current frame and at the last sample of the look-ahead for the look-ahead segment, where the calculation is performed backwards in order to exploit the full look-ahead as well as possible.

At this point, four candidate pitch lags, $d_k$, k = 0,1,2,3, have been determined for each of the three segments (two in the current frame and one in the look-ahead) in each of the two pitch delay ranges. In correspondence with these candidate pitch lags, normalized correlations (both weighted and raw) have been calculated. All remaining processing is performed using only these selected values, greatly reducing the overall complexity.
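The normalization of the raw correlation at a candidate lag can be sketched as the cross-correlation divided by the geometric mean of the energies of the two windows. This is the usual form of a normalized correlation and is assumed here for illustration; the function name and the zero-energy guard are assumptions of this sketch.

```python
# Illustrative sketch: normalized correlation at one candidate lag.
import math


def normalized_correlation(sig, anchor, lag, length, backward=False):
    """Correlation of sig[n] and sig[n - lag], normalized to [-1, 1].

    The window starts at `anchor` and runs forwards, or backwards when
    `backward` is True (as for the look-ahead segment).
    """
    step = -1 if backward else 1
    cross = energy_cur = energy_del = 0.0
    for j in range(length):
        n = anchor + step * j
        cross += sig[n] * sig[n - lag]
        energy_cur += sig[n] ** 2
        energy_del += sig[n - lag] ** 2
    denom = math.sqrt(energy_cur * energy_del)
    return cross / denom if denom > 0.0 else 0.0  # guard against silence
```

A perfectly periodic signal yields a value close to 1 at its period and close to -1 at half its period when the waveform is antisymmetric, as for a sinusoid.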

5.1.10.5 Correlation reinforcement with pitch lag multiples

In order to avoid selecting pitch multiples within each pitch delay range, the weighted normalized correlation in a lower pitch delay section is further emphasized if one of its multiples is in the neighbourhood of the pitch lag in a higher section. That is,

where , is a voicing factor (normalized pitch correlation) from the previous frame, and is the pitch value from the second half-frame of the previous frame. In addition, when the first section is searched and the pitch multiple of the shortest-section candidate lag is larger than 20 samples, the following reinforcement is performed:

Further, , is given by the voicing factor of the second half-frame in the previous frame if the normalized correlation in the second half-frame was stronger than in the first half-frame, or otherwise by the mean value of these two normalized correlations. In this way, if a pitch period multiple is found in a higher section, the maximum weighted correlation in the lower section is emphasized by a factor of 1.17. However, if the pitch lag in section 3 is a multiple of the pitch lag in section 2 and at the same time the pitch lag in section 2 is a multiple of the pitch lag in section 1, the maximum weighted correlation in section 1 is emphasized twice. This correlation reinforcement is, however, not applied at each section when the previous frame voicing factor, , is less than 0.6 and the pitch value is less than 0.4 times the previous pitch value (i.e., the pitch value does not appear to be a halved value of the previous frame pitch or larger). In this way, the emphasis of the correlation value is allowed only during clear voicing conditions or when the value can be considered to belong to the past pitch contour.

The correlation reinforcement with pitch lag multiples is independent in each of the two pitch delay ranges.

It can be seen that the "neighbourhood" is larger when the multiplication factor k is higher. This is to take into account an increasing uncertainty of the pitch period (the pitch length is estimated roughly with integer precision at a 6400 Hz sampling frequency). For the look-ahead part, the first line of the condition above relating to the highest pitch lags is modified as follows:

Note that the first section is not considered in the correlation reinforcement procedure described here, i.e., the maximum normalized correlation in the first section is never emphasized.

5.1.10.6 Initial pitch lag determination and reinforcement based on pitch coherence with other half-frames

An initial set of pitch lags is determined by searching for the maximum weighted normalized correlation in the four sections in each of the three segments or half-frames. This is done independently for both pitch delay ranges. The initial set of pitch lags is given by

(75)

where the superscript denotes the first, the second or the third (look‑ahead) half-frame.

To find the right pitch value, another level of weighting is performed on the weighted normalized correlation function, , in each section of each half-frame in each pitch delay range. This weighting is based on pitch coherence of the initial set of pitch lags,, with pitch lags, , ji, i.e., those from the other half-frames. The weighting is further reinforced with pitch lags selected from the complementary pitch delay range, denoted as , , ji. Further, the weighting favours section‑wise stability, where a stronger weighting is applied for coherent pitch values that are from the same section of the same set as the initial pitch lag, and a slightly weaker weighting is applied for coherent pitch values that are from a different section and/or a different pitch delay range than the initial pitch lag. That is, if the initial pitch lag in a half‑frameis coherent with a pitch lag of section k in half-frame , then the corresponding weighted normalized correlation of section in half-frameis further emphasized by weighting it by the value , if the initial pitch lag is also from section in the same pitch delay range, or by , if the initial pitch lag is not from section k in the same pitch delay range. The variable is the absolute difference between the two analysed pitch lags and the two weighting factors are defined as

where is upper-bounded by 0.4, is upper-bounded by 0.25 and is the raw normalized correlation (similar to the weighted normalized correlation, defined in equation (76)). Finally, is a noise correction factor added to the normalized correlation in order to compensate for its decrease in the presence of the background noise. It is defined as

(77)

where is the total background noise energy, calculated as described in subclause 5.1.11.1.

The procedure described in this subclause further helps to avoid selecting pitch multiples and to ensure pitch continuity in adjacent half-frames.

5.1.10.7 Pitch lag determination and parameter update

Finally, the pitch lags in each of the three half-frames are determined. They are selected by searching for the maximum of the weighted normalized correlations corresponding to each of the four sections across both pitch delay ranges. In case of VBR operation, the normalized correlations are searched in addition to the weighted normalized correlations for a secondary evaluation. When the normalized correlation of the secondary candidate lag is very high (at least 0.9) and the lag can be considered a halved value of the corresponding candidate identified by searching the weighted normalized correlation (i.e. it lies between 0.4 and 0.6 times that candidate), the secondary pitch lag candidate is selected instead of the first one.
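The selection rule, including the secondary evaluation used in VBR operation, can be sketched as follows. This is a minimal sketch: the function name and the tuple layout of the candidates are hypothetical, and treating the 0.9, 0.4 and 0.6 bounds as inclusive is an assumption of the example.

```python
# Illustrative sketch: final lag selection with the VBR halved-lag check.
def select_lag(candidates, vbr=False):
    """Pick a pitch lag from (lag, weighted_corr, raw_corr) tuples.

    The primary choice maximizes the weighted normalized correlation.
    In VBR operation, the lag maximizing the raw normalized correlation
    replaces it when its correlation is at least 0.9 and the lag lies
    between 0.4 and 0.6 times the primary lag.
    """
    primary = max(candidates, key=lambda c: c[1])
    if not vbr:
        return primary[0]
    secondary = max(candidates, key=lambda c: c[2])
    is_halved = 0.4 * primary[0] <= secondary[0] <= 0.6 * primary[0]
    if secondary[2] >= 0.9 and is_halved:
        return secondary[0]
    return primary[0]
```

With candidates `(80, 0.95, 0.85)` and `(40, 0.90, 0.93)`, the function returns 80 without VBR and 40 with VBR, since 40 is a halved value of 80 with a raw correlation above 0.9.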

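In total, six or eight values are thus considered in each segment or half-frame depending on whether section 0 is searched. After determining the pitch lags, the parameters needed for the next frame pitch search are updated. The average normalized correlation is updated by:

$\bar{C}_{norm} = 0.5\,\big(C_{norm}(d^{[0]}) + C_{norm}(d^{[1]})\big)$ (78)

where $d^{[0]}$ and $d^{[1]}$ are the pitch lags selected in the first and the second half-frames of the current frame.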

Finally, the pitch evolution factor to be used in computing the extrapolated pitch lags in the next frame is updated. The pitch evolution factor is given by averaging the pitch differences of the consecutive half-frames that are determined as coherent. If is the pitch lag in the second half of the previous frame, then pitch evolution is given by

(79)

Since the search is performed on the decimated weighted signal, the determined pitch lags, $d^{[0]}$, $d^{[1]}$ and $d^{[2]}$, are multiplied by 2 to obtain the open-loop pitch lags for the three half-frames. That is

$T_{op}^{[i]} = 2\, d^{[i]}, \quad i = 0, 1, 2$ (80)

In the following text, the following notation is used for the normalized correlations corresponding to the final pitch lags:

(81)

5.1.10.8 Correction of very short and stable open-loop pitch estimates

Usually, music harmonic signals or singing voice signals have short pitch lags and they are more stationary than normal speech signals. It is extremely important to have the correct and precise short pitch lags as incorrect pitch lags may have a serious impact upon the quality.

The very short pitch range is defined from to at the sampling frequency kHz. As the pitch candidate is so short, pitch detection of using time domain only or frequency domain only solution may not be reliable. In order to reliably detect short pitch value, three conditions may need to be checked: (1) in frequency domain, the energy from 0 Hz to Hz must be relatively low enough; (2) in time domain, the maximum short pitch correlation in the pitch range from to must be relatively high enough compared to the maximum pitch correlation in the pitch range from to ; (3) the absolute value of the maximum normalized short pitch correlation must be high enough. These three conditions are more important; other conditions may be added such as Voice Activity Detection and Voiced Classification.

Suppose notes the average normalized pitch correlation value of the four subframes in the current frame:

(82)

are the four normalized pitch correlations calculated for each subframe; for each subframe, the best pitch candidate is found in the pitch range from to . The smoothed pitch correlation from previous frame to current frame is

(83)

Before the real short pitch is decided, two pre-decision conditions are checked first : (a) check if the harmonic peak is sharp enough, which is indicated by the flag . It is used to decide if the initial open-loop pitch is correct or not; (b) check if the maximum energy in the frequency region [0, FMIN] is low enough, which is indicated by the flag .

(a) Determine base pitch frequency according to the initial open-loop pitch

(84)

Then, based on the amplitude spectrum of input signal in frequency domain, determine the decision parameters which are used to confirm whether the pitch related to the base pitch frequency is accurate. The decision parameters include energy spectrum difference, average energy spectrum and the ratio of energy spectrum difference and average energy spectrum.

Compute the energy spectrum difference and the average energy spectrum of the frequency bins around base pitch frequency

(85)

(86)

Compute the weighted and smoothed energy spectrum difference and average energy spectrum

(87)

(88)

where and are weighted and smoothed energy spectrum difference and average energy spectrum of the frequency bins around the base pitch frequency.

Compute the ratio of energy spectrum difference and average energy spectrum

(89)

Based on the decision parameters calculated above, confirm whether the initial open-loop pitch is accurate.

The harmonic sharpness flag is determined as follows:

(90)

If the above conditions are not satisfied, remains unchanged.

(b) Assume that the maximum energy in the frequency region (Hz) is (dB) , the maximum energy in the frequency region (Hz) is (dB), the relative energy ratio between and is given by

(91)

This energy ratio is weighted by multiplying an average normalized pitch correlation value ,

(92)

Before using the parameter to detect the lack of low frequency energy, it is smoothed in order to reduce the uncertainty,

(93)

where the is the low frequency smoothed energy ratio. If then a lack of low frequency energy has been detected (otherwise not detected ). is determined by the following procedure,

(94)

If the above conditions are not satisfied, remains unchanged.

An initial very short pitch candidate is found by searching a maximum normalized pitch correlation from to ,

(95)

If notes the current short pitch correlation,

(96)

The smoothed short pitch correlation from previous frame to current frame is

(97)

By using all the available parameters, the final very short pitch lag is decided with the following procedure,

(98)

wherein is a flag which forces the codec to select the time domain CELP coding algorithm for short pitch signal even if the frequency domain coding algorithm and AUDIO class is previously selected; is a flag which forces the coder to select VOICED class for short pitch signal.

5.1.10.9 Fractional open-loop pitch estimate for each subframe

The OL pitch is further refined by maximizing the normalized correlation function with a fractional resolution around the open-loop pitch lag values of the first and the second half-frames (in the 12.8-kHz sampling domain). The fractional open-loop pitch lag is computed four times per frame, i.e., for each subframe of 64 samples. This is similar to the closed-loop pitch search, described later in this specification. The maximum normalized correlation corresponding to the best fractional open-loop pitch lag is then used in the classification of VC frames (see subclause 5.1.13.2). The fractional open-loop pitch search is performed by first maximizing an autocorrelation function of the perceptually weighted speech for integer lags in an interval around the open-loop pitch lag, where the lag of the first half-frame is used for the search in the first and the second subframes, and the lag of the second half-frame for the third and fourth subframes. The autocorrelation function is similar to equation (99) except that perceptually weighted speech at 12.8 kHz sampling rate is used, i.e.,

(100)

In the above equation, corresponds to the first sample in each subframe.

Let be the integer lag maximizing . The fractional open-loop pitch search is then performed by interpolating the correlation function and searching for its maximum in the interval . The interpolation is performed with a 1/4 sample resolution using an FIR filter – a Hamming windowed sinc function truncated at 17. The filter has its cut-off frequency (–3 dB) at 5062 Hz and –6 dB at 5760 Hz in the 12.8 kHz domain. This means the interpolation filter exhibits a low-pass frequency response. Note that the negative fractions are not searched if coincides with the lower end of the searched interval, i.e., if.
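The fractional refinement can be sketched as a windowed-sinc interpolation of the correlation values around the best integer lag, evaluated on a grid of quarter-sample offsets. This is a minimal sketch under stated assumptions: the function names are hypothetical, the search range of plus or minus 3/4 sample and the short filter half-length of 4 are chosen for illustration only, and the filter is not the normative one.

```python
# Illustrative sketch: 1/4-sample refinement with a Hamming-windowed sinc.
import math


def windowed_sinc(t, half_length):
    """Hamming-windowed sinc at offset t (in samples); zero outside the window."""
    if abs(t) >= half_length:
        return 0.0
    sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
    hamming = 0.54 + 0.46 * math.cos(math.pi * t / half_length)
    return sinc * hamming


def interpolate(corr, lag, frac, half_length=4):
    """Interpolate `corr` (dict: integer lag -> value) at position lag + frac."""
    position = lag + frac
    total = 0.0
    for k in range(lag - half_length, lag + half_length + 1):
        total += corr.get(k, 0.0) * windowed_sinc(position - k, half_length)
    return total


def best_fractional_lag(corr, lag, allow_negative=True, half_length=4):
    """Return (lag, frac) maximizing the interpolated correlation.

    Negative fractions are skipped when `allow_negative` is False, e.g.
    when `lag` coincides with the lower end of the searched interval.
    """
    start = -3 if allow_negative else 0
    fracs = [j / 4.0 for j in range(start, 4)]
    best = max(fracs, key=lambda f: interpolate(corr, lag, f, half_length))
    return lag, best
```

At a fraction of zero the interpolator reproduces the integer-lag value up to rounding, because the sinc function vanishes at all other integer offsets.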

Once the best fractional pitch lag,, is found, the maximum normalized correlation is computed similarly to equation (101), i.e.,

(102)

Note that the last section (long pitch periods) in both pitch delay ranges is not searched for the look ahead part. Instead, the normalized correlation values and the corresponding pitch lags are obtained from the last section search of the second half-frame. The reason is that the summation limit in the last section is much larger than the available look ahead and also the computational complexity is reduced.

The fractional OL pitch estimation as described above is performed only for bitrates lower than or equal to 24.4 kbps. For higher bitrates, the VC mode is not supported and consequently, there is no reason to estimate pitch with fractional resolution. Therefore, at higher bitrates, the fractional open-loop pitch lag of each subframe is simply set to the open-loop pitch lag of half-frame i, where i=0 for the first and the second subframes and i=1 for the third and the fourth subframes.