2 Functional description

3GPP TS 06.32: Voice Activity Detection (VAD)

The purpose of this clause is to give the reader an understanding of the principles of operation of the VAD, whereas the detailed description is given in clause 3. In case of discrepancy between the two descriptions, the detailed description of clause 3 shall prevail.

In the following clauses of clause 2, a Pascal programming type of notation has been used to describe the algorithm.

2.1 Overview and principles of operation

The function of the VAD is to distinguish between noise with speech present and noise without speech present. The main difficulty in detecting speech in a mobile environment is the very low speech/noise ratio that is often encountered. The accuracy of the VAD is improved by filtering the signal to increase the speech/noise ratio before the decision is made.

For a mobile environment, the worst speech/noise ratios are encountered in moving vehicles. It has been found that the noise is relatively stationary for quite long periods in a mobile environment. It is therefore possible to use an adaptive filter with coefficients obtained during noise, to remove much of the vehicle noise.

The VAD is basically an energy detector. The energy of the filtered signal is compared with a threshold; speech is indicated whenever the threshold is exceeded.

The noise encountered in mobile environments may be constantly changing in level. The spectrum of the noise can also change, and varies greatly over different vehicles. Because of these changes the VAD threshold and adaptive filter coefficients must be constantly adapted. To give reliable detection the threshold must be sufficiently above the noise level to avoid noise being identified as speech but not so far above it that low level parts of speech are identified as noise. The threshold and the adaptive filter coefficients are only updated when speech is not present. It is, of course, potentially dangerous for a VAD to update these values on the basis of its own decision. This adaptation therefore only occurs when the signal seems stationary in the frequency domain but does not have the pitch component inherent in voiced speech. A tone detector is also used to prevent adaptation during information tones.

A further mechanism is used to ensure that low level noise (which is often not stationary over long periods) is not detected as speech. Here, an additional fixed threshold is used.

A VAD hangover period is used to eliminate mid‑burst clipping of low level speech. Hangover is only added to speech‑bursts which exceed a certain duration to avoid extending noise spikes.

2.2 Algorithm description

The block diagram of the VAD algorithm is shown in figure 2.1. The individual blocks are described in the following clauses. ACF, N and sof are calculated in the speech encoder.

Figure 2.1: Functional block diagram of the VAD

The global variables shown in the block diagram are described as follows:

‑ ACF are auto‑correlation coefficients which are calculated in the speech encoder defined in GSM 06.10 (clause 3.1.4, see also clause A.1). The inputs to the speech encoder are 16 bit 2’s complement numbers, as described in GSM 06.10, clause 4.2.0;

‑ av0 and av1 are averaged ACF vectors;

‑ rav1 are autocorrelated predictor values obtained from av1;

‑ rvad are the autocorrelated predictor values of the adaptive filter;

‑ N is the long term predictor lag value which is obtained every sub‑segment in the speech coder defined in GSM 06.10;

‑ ptch indicates whether the signal has a steady periodic component;

‑ sof is the offset compensated signal frame obtained in the speech coder defined in GSM 06.10;

‑ pvad is the energy in the current frame of the input signal after filtering;

‑ thvad is an adaptive threshold;

‑ stat indicates spectral stationarity;

‑ vvad indicates the VAD decision before hangover is added;

‑ vad is the final VAD decision with hangover included.

2.2.1 Adaptive filtering and energy computation

Pvad is computed as follows:

Pvad = rvad[0]*ACF[0] + 2 * SUM(i=1..8) rvad[i]*ACF[i]

This corresponds to performing an 8th order block filtering on the input samples to the speech encoder, after zero offset compensation and pre‑emphasis. This is explained in clause A.1.
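The energy computation can be sketched as follows. This is a minimal illustration, assuming that Pvad takes the usual autocorrelation-domain form rvad[0]*ACF[0] + 2*SUM(i=1..8) rvad[i]*ACF[i]; the function name filtered_energy and the constant RVAD_INIT are hypothetical, and the fixed-point scaling of clause 3 is not modelled.

```python
def filtered_energy(rvad, acf):
    """Energy of the adaptively filtered frame, computed from lags 0..8.

    Assumed form: rvad[0]*acf[0] + 2 * sum(rvad[i]*acf[i] for i = 1..8).
    """
    if len(rvad) != 9 or len(acf) != 9:
        raise ValueError("expected 9 values (lags 0..8)")
    return rvad[0] * acf[0] + 2 * sum(r * c for r, c in zip(rvad[1:], acf[1:]))


# Initial rvad values taken from table 2.5 (rvad[3] to rvad[8] are all 0).
RVAD_INIT = [6, -4, 1, 0, 0, 0, 0, 0, 0]

pvad = filtered_energy(RVAD_INIT, [10, 5, 2, 0, 0, 0, 0, 0, 0])
```

Working in the autocorrelation domain avoids filtering the 160 samples of the frame directly; only nine products per frame are needed.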

2.2.2 ACF averaging

Spectral characteristics of the input signal have to be obtained using blocks that are larger than one 20 ms frame. This is done by averaging the auto‑correlation values for several consecutive frames. This averaging is given by the following equations:

av0{n}[i] = SUM(frame=0..frames-1) ACF{n-frame}[i] ; i = 0..8

av1{n}[i] = av0{n-frames}[i] ; i = 0..8

where n represents the current frame, n‑1 represents the previous frame, and so on. The values of constants are given in table 2.1.

Table 2.1: Constants and variables for ACF averaging

Constant   Value   Variable                        Initial value
frames     4       previous ACF's, av0 and av1     All set to 0
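The averaging can be sketched with two short histories. This is an illustration only: it assumes that av0 is the sum of the ACF vectors of the last `frames` frames and that av1 is the av0 vector from `frames` frames earlier. The class name AcfAverager and its method are hypothetical.

```python
from collections import deque

FRAMES = 4  # constant "frames" from table 2.1


class AcfAverager:
    """Keeps the ACF and av0 histories needed to produce av0 and av1."""

    def __init__(self, frames=FRAMES, lags=9):
        zero = [0] * lags
        # Previous ACF vectors, all set to 0 initially.
        self.acf_hist = deque([zero] * frames, maxlen=frames)
        # frames + 1 entries, so that av0 of frame n - frames is still held.
        self.av0_hist = deque([zero] * (frames + 1), maxlen=frames + 1)

    def update(self, acf):
        """Feed the ACF of the current frame; return (av0, av1)."""
        self.acf_hist.append(list(acf))
        av0 = [sum(column) for column in zip(*self.acf_hist)]
        self.av0_hist.append(av0)
        av1 = self.av0_hist[0]  # assumed: av0 delayed by `frames` frames
        return av0, av1


averager = AcfAverager()
for k in range(1, 9):
    av0, av1 = averager.update([k] * 9)
```

After the eighth frame of this example, av0 covers frames 5 to 8 and av1 covers frames 1 to 4, so the two vectors never share an input frame.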

2.2.3 Predictor values computation

The filter predictor values aav1 are obtained from the auto‑correlation values av1 according to the equation:

a = R^(-1) * p

where R is the 8 x 8 symmetric matrix:

R = | av1[0], av1[1], av1[2], av1[3], av1[4], av1[5], av1[6], av1[7] |
    | av1[1], av1[0], av1[1], av1[2], av1[3], av1[4], av1[5], av1[6] |
    | av1[2], av1[1], av1[0], av1[1], av1[2], av1[3], av1[4], av1[5] |
    | av1[3], av1[2], av1[1], av1[0], av1[1], av1[2], av1[3], av1[4] |
    | av1[4], av1[3], av1[2], av1[1], av1[0], av1[1], av1[2], av1[3] |
    | av1[5], av1[4], av1[3], av1[2], av1[1], av1[0], av1[1], av1[2] |
    | av1[6], av1[5], av1[4], av1[3], av1[2], av1[1], av1[0], av1[1] |
    | av1[7], av1[6], av1[5], av1[4], av1[3], av1[2], av1[1], av1[0] |

and p and a are the column vectors:

p = | av1[1] |     a = | aav1[1] |
    | av1[2] |         | aav1[2] |
    | av1[3] |         | aav1[3] |
    | av1[4] |         | aav1[4] |
    | av1[5] |         | aav1[5] |
    | av1[6] |         | aav1[6] |
    | av1[7] |         | aav1[7] |
    | av1[8] |         | aav1[8] |

aav1[0] = -1

av1 is used in preference to av0 as av0 may contain speech.

The autocorrelated predictor values rav1 are then obtained:

rav1[i] = SUM(k=0..8-i) aav1[k]*aav1[k+i] ; i = 0..8
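The computation of this clause can be sketched in floating point as below. The Levinson-Durbin recursion is used here only as a convenient way of solving the system R a = p; it is not necessarily the procedure of clause 3. The autocorrelation is assumed to be rav1[i] = SUM(k=0..8-i) aav1[k]*aav1[k+i]. The function names and the handling of av1[0] <= 0 are hypothetical.

```python
def predictor_values(av1, order=8):
    """Solve R a = p for a[1..order]; return aav1[0..order] with aav1[0] = -1."""
    if av1[0] <= 0:
        # Assumption: with no signal energy, use a filter with no prediction.
        return [-1.0] + [0.0] * order
    a = [0.0] * (order + 1)
    err = float(av1[0])  # prediction error energy of the order-0 predictor
    for m in range(1, order + 1):
        acc = av1[m] - sum(a[k] * av1[m - k] for k in range(1, m))
        k_m = acc / err
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= 1.0 - k_m * k_m
        if err <= 0.0:
            break  # perfectly predictable input: stop at this order
    a[0] = -1.0
    return a


def autocorrelate(aav1):
    """Assumed form: rav1[i] = sum of aav1[k] * aav1[k + i] for k = 0..8-i."""
    n = len(aav1)
    return [sum(aav1[k] * aav1[k + i] for k in range(n - i)) for i in range(n)]


# Example: autocorrelation of a first order process, av1[k] = 0.5 ** k.
aav1 = predictor_values([0.5 ** k for k in range(9)])
rav1 = autocorrelate(aav1)
```

For this input only aav1[1] is non-zero (0.5), so rav1 reduces to 1.25, -0.5 and zeros.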

2.2.4 Spectral comparison

The spectra represented by the autocorrelated predictor values rav1 and the averaged auto‑correlation values av0 are compared using the distortion measure dm defined below. This measure is used to produce a Boolean value stat every 20 ms, as given by these equations:

dm = ( rav1[0]*av0[0] + 2 * SUM(i=1..8) rav1[i]*av0[i] ) / av0[0]

difference = |dm ‑ lastdm|

lastdm = dm

stat = difference < thresh

The values of constants and initial values are given in table 2.2.

Table 2.2: Constants and variables for spectral comparison

Constant

Value

Variable

Initial value

thresh

0.05

lastdm

0
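The stationarity test can be sketched as below. The sketch assumes that dm is the energy of av0 after filtering with rav1, normalized by av0[0]; the class name SpectralComparator and the treatment of av0[0] = 0 are hypothetical.

```python
THRESH = 0.05  # constant "thresh" from table 2.2


class SpectralComparator:
    """Produces the Boolean stat once per frame."""

    def __init__(self):
        self.lastdm = 0.0  # initial value from table 2.2

    def update(self, rav1, av0):
        if av0[0] == 0:
            dm = 0.0  # assumption: avoid dividing by zero on an all-zero input
        else:
            cross = sum(r * c for r, c in zip(rav1[1:], av0[1:]))
            dm = (rav1[0] * av0[0] + 2 * cross) / av0[0]
        difference = abs(dm - self.lastdm)
        self.lastdm = dm
        return difference < THRESH


comparator = SpectralComparator()
rav1 = [1.25, -0.5, 0, 0, 0, 0, 0, 0, 0]
av0 = [0.5 ** k for k in range(9)]
stat_first = comparator.update(rav1, av0)   # dm jumps from 0 to 0.75
stat_second = comparator.update(rav1, av0)  # dm unchanged
```

The first frame is reported as non-stationary because lastdm starts at 0; an unchanged spectrum in the next frame gives stat = true.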

2.2.5 Periodicity detection

The frequency spectrum of mobile noise is relatively stationary over quite long periods. The Inverse Filter Autocorrelated Predictor coefficients of the adaptive filter rvad are only updated when this stationarity is detected. Vowel sounds also have this stationarity, but they can be excluded by detecting their periodicity using the long term predictor lag values (Nj), which are obtained every sub‑segment from the speech codec defined in GSM 06.10. Consecutive lag values are compared. Cases in which one lag value is a factor of the other are catered for; cases in which both lag values merely have a common factor are not. The latter case is not important for speech input, but this method of periodicity detection may fail for some sine waves. The Boolean variable ptch is updated every 20 ms and is true when periodicity is detected. It is calculated according to the following equation:

ptch = oldlagcount + veryoldlagcount >= nthresh

The following operations are done after the VAD decision and when the current LTP lag values (N0 .. N3) are available; this reduces the delay of the VAD decision. (N{‑1} = N3 of the previous segment.)

lagcount = 0

for j = 0 to 3 do
begin
    smallag = maximum(Nj,N{j‑1}) mod minimum(Nj,N{j‑1})
    if minimum(smallag,minimum(Nj,N{j‑1})‑smallag) < lthresh
        then increment(lagcount)
end

veryoldlagcount = oldlagcount

oldlagcount = lagcount

The values of constants and initial values are given in table 2.3.

Table 2.3: Constants and variables for periodicity detection

Constant   Value   Variable          Initial value
lthresh    2       oldlagcount       0
nthresh    4       veryoldlagcount   0
                   N3                40
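The periodicity detection above translates directly into the following sketch. The class name PeriodicityDetector and its method names are hypothetical; the lag values are assumed to be positive integers, as delivered by the GSM 06.10 codec.

```python
LTHRESH = 2  # table 2.3
NTHRESH = 4  # table 2.3


class PeriodicityDetector:
    def __init__(self):
        self.oldlagcount = 0
        self.veryoldlagcount = 0
        self.prev_lag = 40  # initial value of N3 from table 2.3

    def ptch(self):
        """Periodicity flag, based on the counts of the two previous frames."""
        return self.oldlagcount + self.veryoldlagcount >= NTHRESH

    def update(self, lags):
        """Run after the VAD decision, with the four lags N0..N3 of the frame."""
        lagcount = 0
        prev = self.prev_lag
        for lag in lags:
            larger, smaller = max(lag, prev), min(lag, prev)
            smallag = larger % smaller
            # A small remainder, or one close to the smaller lag, means the
            # larger lag is nearly a multiple of the smaller one.
            if min(smallag, smaller - smallag) < LTHRESH:
                lagcount += 1
            prev = lag
        self.veryoldlagcount = self.oldlagcount
        self.oldlagcount = lagcount
        self.prev_lag = prev


detector = PeriodicityDetector()
detector.update([60, 60, 60, 60])
detector.update([60, 120, 60, 61])
periodic = detector.ptch()
```

In this example the first frame contributes three matching pairs (the comparison of 60 with the initial lag 40 fails) and the second contributes four, so ptch becomes true.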

2.2.6 Information tone detection

The tone flag is only evaluated in the downlink VAD. In the uplink VAD, tone detection is not performed and tone = false.

Computation of the tone flag is complex. It is therefore evaluated after the processing of the current speech encoder frame. In this way transmission of the speech or SID frame is not delayed.

Information tones and environmental noise can be classified by inspecting the short term prediction gain, information tones resulting in higher prediction gains than environmental noise. Tones can therefore be detected by comparing the prediction gain to a fixed threshold. By limiting the prediction gain calculation to a fourth order analysis, information signals consisting of one or two tones can be detected whilst minimizing the prediction gain for environmental noise.

The prediction gain decision is implemented by comparing the normalized prediction error with a threshold. This measure is used to evaluate the Boolean variable tone every 20 ms. The signal is classified as a tone if the prediction error is smaller than the threshold predth. This is equivalent to a prediction gain threshold of 13,5 dB.

Mobile noise can contain very strong resonances at low frequencies, resulting in a high prediction gain. A further test is therefore made to determine the pole frequency of a second order analysis of the signal frame. The signal is classified as noise if the frequency of the pole is less than 385 Hz. The pole frequency calculation is described in clause A.4.

The algorithm for detecting information tones is as follows:

tone = false

den = a[1]*a[1]

num = 4*a[2] ‑ a[1]*a[1]

if ( num <= 0 )
    return

if (( a[1] < 0 ) AND ( num / den < freqth ))
    return

prederr = MULT(i=1..4) (1 ‑ rc[i]*rc[i])

if ( prederr < predth )
    tone = true

return

The values of the constants are given in table 2.4. The coefficients a[1..2] are transversal filter coefficients calculated from rc[1..2]. The calculation of the reflection coefficients rc[1..4] is described below.

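The offset compensated signal frame sof[0..159] is multiplied by the Hanning window to give the windowed frame sofh[0..159]:

sofh[i] = sof[i] * hanning[i] ; i = 0..159

where

hanning[i] = 0.5 * (1 ‑ cos(2*pi*i/159)) ; i = 0..159

The auto‑correlation acfh[0..4] of the windowed signal frame is then calculated:

acfh[k] = SUM(i=k..159) sofh[i‑k]*sofh[i] ; k = 0..4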

rc[1..4] are then calculated from acfh[0..4] using the Schur recursion described in the RPE‑LTP codec.

Table 2.4: Constants for information tone detection

Constant   Value
freqth     0,0973
predth     0,0158

NOTE: Reflection coefficients are available in the RPE‑LTP codec. However, they are calculated after pre‑emphasis using a rectangular window and do not give good tone detection results.
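The decision part of the tone detection can be sketched as below. The sketch takes the filter coefficients a[1], a[2] and the reflection coefficients rc[1..4] as inputs; the windowing, the auto-correlation and the Schur recursion are not reproduced. The function name is_tone and its argument layout are hypothetical.

```python
FREQTH = 0.0973  # table 2.4
PREDTH = 0.0158  # table 2.4


def is_tone(a1, a2, rc):
    """Return the tone flag from a[1], a[2] and the four values rc[1..4]."""
    den = a1 * a1
    num = 4 * a2 - a1 * a1
    if num <= 0:
        return False
    # den is only used when a1 < 0, so it cannot be zero in the division.
    if a1 < 0 and num / den < FREQTH:
        return False  # low frequency pole: classified as noise
    prederr = 1.0
    for r in rc[:4]:
        prederr *= 1.0 - r * r  # normalized prediction error of order 4
    return prederr < PREDTH


tone = is_tone(1.0, 1.0, [0.99, 0.99, 0.0, 0.0])
```

The pole frequency test is made first, so a strong low frequency resonance is rejected before the prediction error is evaluated.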

2.2.7 Threshold adaptation

A check is made every 20 ms to determine whether the VAD decision threshold (thvad) should be changed. This adaptation is carried out according to the flowchart shown in figure 2.2. The constants used are given in table 2.5.

Adaptation takes place in two different situations: firstly whenever ACF[0] is very low and secondly whenever there is a very high probability that speech and information tones are not present.

In the first case, the threshold is adapted if the energy of the input signal is less than pth. The threshold is set to plev without carrying out any further tests because at these very low levels the effect of the signal quantization makes it impossible to obtain reliable results from these tests.

In the second case, the decision threshold (thvad) and the adaptive filter coefficients (rvad) are only updated with the rav1 values when there is a very high probability that speech and information tones are not present. Adaptation occurs if the following conditions are met over a number (adp) of signal frames:

‑ stationarity is detected in the frequency domain;

‑ the signal does not contain a periodic component;

‑ information tones are not present.

The step‑size by which the threshold is adapted is not constant but a proportion of the current value (determined by constants dec and inc). The adaptation begins by experimentally multiplying the threshold by a factor of (1‑1/dec). If the new threshold is now higher than or equal to Pvad times fac then the threshold needed to be decreased and it is left at this new lower level. If, on the other hand, the new threshold level is less than Pvad times fac then the threshold either needed to be increased or kept constant. In this case it is set to Pvad times fac unless this would mean multiplying it by more than a factor of (1+1/inc) (in which case it is multiplied by a factor of (1+1/inc)). The threshold is never allowed to be greater than Pvad+margin.

Table 2.5: Constants and variables for threshold adaptation

Constant   Value        Variable             Initial value
pth        300 000      adaptcount           0
plev       800 000      thvad                1 000 000
fac        3.0          rvad[0]              6
adp        8            rvad[1]              ‑4
inc        16           rvad[2]              1
dec        32           rvad[3] to rvad[8]   All 0
margin     80 000 000

Figure 2.2: Flow diagram for threshold adaptation
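The adaptation described in this clause can be sketched as below. This is an interpretation of the text, not a transcription of the flowchart. In particular, the handling of adaptcount (reset when a condition fails, adaptation once more than adp consecutive frames qualify, saturation at adp + 1) and the application of the (1+1/inc) limit to the already decreased threshold are assumptions. The class name ThresholdAdapter is hypothetical and floating point is used throughout.

```python
PTH, PLEV, FAC = 300_000, 800_000, 3.0   # table 2.5
ADP, INC, DEC = 8, 16, 32                # table 2.5
MARGIN = 80_000_000                      # table 2.5


class ThresholdAdapter:
    def __init__(self):
        self.adaptcount = 0
        self.thvad = 1_000_000.0
        self.rvad = [6, -4, 1, 0, 0, 0, 0, 0, 0]

    def update(self, acf0, pvad, rav1, stat, ptch, tone):
        # First case: very low input energy, set the threshold directly.
        if acf0 < PTH:
            self.thvad = float(PLEV)
            return
        # Second case: stationary, no periodic component, no tone.
        if not (stat and not ptch and not tone):
            self.adaptcount = 0  # assumed reset
            return
        self.adaptcount += 1
        if self.adaptcount <= ADP:
            return
        self.adaptcount = ADP + 1  # assumed saturation of the counter
        th = self.thvad * (1 - 1 / DEC)       # trial decrease
        if th < pvad * FAC:
            # Needs increasing or holding: step up, limited to pvad * fac.
            th = min(th * (1 + 1 / INC), pvad * FAC)
        self.thvad = min(th, pvad + MARGIN)   # never above pvad + margin
        self.rvad = list(rav1)                # filter coefficients follow


adapter = ThresholdAdapter()
adapter.update(acf0=100, pvad=0, rav1=[1] * 9, stat=False, ptch=False, tone=False)
for _ in range(9):
    adapter.update(1_000_000, 100_000, [1] * 9, stat=True, ptch=False, tone=False)
```

In this example the low energy frame sets thvad to plev, and the ninth qualifying frame then lowers it by the factor (1-1/dec) to 775 000 and copies rav1 into rvad.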

2.2.8 VAD decision

Prior to hangover the VAD decision condition is:

vvad = pvad > thvad

2.2.9 VAD hangover addition

VAD hangover is only added to bursts of speech greater than or equal to burstconst blocks. The Boolean variable vad indicates the decision of the VAD with hangover included. The values of the constants are given in table 2.6. The hangover algorithm is as follows:

if vvad then increment(burstcount) else burstcount = 0

if burstcount >= burstconst then
begin
    hangcount = hangconst;
    burstcount = burstconst
end

vad = vvad or (hangcount >= 0)

if hangcount >= 0 then decrement(hangcount)

Table 2.6: Constants and variables for VAD hangover addition

Constant     Value   Variable     Initial value
burstconst   3       burstcount   0
hangconst    5       hangcount    ‑1
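The hangover algorithm can be followed frame by frame with the sketch below, which mirrors the pseudocode of this clause. The class name Hangover is hypothetical.

```python
BURSTCONST = 3  # table 2.6
HANGCONST = 5   # table 2.6


class Hangover:
    def __init__(self):
        self.burstcount = 0
        self.hangcount = -1

    def update(self, vvad):
        """Take the decision before hangover; return the final vad flag."""
        if vvad:
            self.burstcount += 1
        else:
            self.burstcount = 0
        if self.burstcount >= BURSTCONST:
            # Burst long enough: (re)arm the hangover counter.
            self.hangcount = HANGCONST
            self.burstcount = BURSTCONST
        vad = vvad or self.hangcount >= 0
        if self.hangcount >= 0:
            self.hangcount -= 1
        return vad


hangover = Hangover()
decisions = [hangover.update(v) for v in [True] * 3 + [False] * 8]
```

With these constants a burst of three speech frames is followed by five frames in which vad stays true; a burst of only two frames receives no hangover.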