6 Computational description overview


The computational details necessary for the fixed point implementation of the speech transcoding and DTX functions are given in the form of an American National Standards Institute (ANSI) C program contained in GSM 06.06 [5]. This clause provides an overview of the modules which describe the computation of the VAD algorithm.

6.1 VAD modules

The computational description of the VAD is divided into three ANSI C modules. These modules are:

‑ vad_reset;

‑ vad_algorithm;

‑ periodicity_update.

The vad_reset module sets the VAD variables to their initial values.

The vad_algorithm module is divided into nine sub‑modules which correspond to the blocks of figure 1 in the high level description of the VAD algorithm. The vad_algorithm module can be called as soon as the acf[0..8] and rc[1..4] variables are known. This means that the VAD computation can take place after the Autocorrelation Fixed point LAttice Technique (AFLAT) routine in the speech encoder (GSM 06.20 [2]). The vad_algorithm module also requires the value of the ptch variable calculated in the previous frame.

The ptch variable is calculated by the periodicity_update module from the lags[1..4] variable. The individual lag values are calculated for each subframe in the LTP routine of the speech encoder (GSM 06.20 [2]). The periodicity_update module is called after the current 20 ms signal frame has been encoded.
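
The call order described above can be illustrated with the following informal sketch. It is not taken from GSM 06.06 [5]: the state structure, the function signatures, the process_frame helper, the call trace and the rule used to set ptch are all hypothetical placeholders, chosen only to show when each module runs within a 20 ms frame.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical state for this sketch; the real VAD variables are in GSM 06.06. */
typedef struct {
    int16_t ptch;       /* periodicity value computed in the previous frame */
    int16_t ptch_seen;  /* ptch value that vad_algorithm last worked with */
    char    trace[32];  /* call log, for illustration only */
} vad_state_t;

static void trace_call(vad_state_t *st, char c)
{
    size_t n = strlen(st->trace);
    if (n + 1 < sizeof st->trace) {
        st->trace[n] = c;
        st->trace[n + 1] = '\0';
    }
}

/* Stub: sets the VAD variables to their initial values. */
static void vad_reset(vad_state_t *st)
{
    memset(st, 0, sizeof *st);
    trace_call(st, 'R');
}

/* Stub: needs acf[0..8], rc[1..4] and the ptch value of the previous frame. */
static int vad_algorithm(vad_state_t *st, const int32_t acf[9], const int16_t rc[4])
{
    (void)acf;
    (void)rc;
    st->ptch_seen = st->ptch;
    trace_call(st, 'A');
    return 1;  /* placeholder VAD decision */
}

/* Stub: derives ptch for the next frame from the four subframe lags. */
static void periodicity_update(vad_state_t *st, const int16_t lags[4])
{
    st->ptch = (int16_t)(lags[0] == lags[1]);  /* placeholder rule, NOT the real one */
    trace_call(st, 'P');
}

/* One 20 ms frame: VAD after AFLAT, periodicity update after encoding. */
static int process_frame(vad_state_t *st, const int32_t acf[9],
                         const int16_t rc[4], const int16_t lags[4])
{
    int vad = vad_algorithm(st, acf, rc);  /* acf and rc are known after AFLAT */
    /* ... the rest of the speech encoder, including the LTP routine that
       produces lags[1..4], would run here ... */
    periodicity_update(st, lags);          /* after the frame has been encoded */
    return vad;
}
```

In this sketch, vad_reset is called once, and process_frame is then called for every frame, so that each call of vad_algorithm sees the ptch value left behind by the preceding frame.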

6.2 Pseudo‑floating point arithmetic

All the arithmetic operations follow the precision and format used in the computational description of the speech codec in GSM 06.06 [5]. To increase the precision within the fixed point implementation, a pseudo‑floating point representation of some variables is used. This applies to the following variables (and related constants) of the VAD algorithm:

pvad: Energy of filtered signal;

thvad: Threshold of the VAD decision;

acf0: Energy of the input signal.

For the representation of these variables, two 16‑bit integers are needed:

‑ one for the exponent (e_pvad, e_thvad, e_acf0);

‑ one for the mantissa (m_pvad, m_thvad, m_acf0).

The value e_pvad is the exponent of the lowest power of 2 that is greater than or equal to the actual value of pvad, and the m_pvad value is an integer which is always greater than or equal to 16384 (normalized mantissa). The pvad value is therefore equal to:

$$\mathrm{pvad} = 2^{\mathrm{e\_pvad}} \cdot \frac{\mathrm{m\_pvad}}{32768}$$

This scheme provides a large dynamic range for the pvad value and always keeps a precision of 16 bits. All the comparisons are easy to make by comparing the exponents of two variables, and the VAD algorithm needs only one pseudo floating point addition and multiplication. All the computations related to the pseudo‑floating point variables require simple 16 or 32‑bit arithmetic operations defined in the detailed description of the speech codec.
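
The representation can be illustrated with the following informal sketch. It assumes that a pseudo-floating point value equals 2^exponent * mantissa / 32768, which is consistent with a normalized mantissa in the range 16384 to 32767. The type pfloat_t and the three helper functions are hypothetical names introduced here; they are not part of GSM 06.06 [5], and the sketch uses C99 fixed-width types and double precision only for readability.

```c
#include <stdint.h>

/* Hypothetical container for one pseudo-floating point variable. */
typedef struct {
    int16_t e;  /* exponent */
    int16_t m;  /* normalized mantissa, 16384..32767 */
} pfloat_t;

/* Assumed interpretation: value = 2^e * m / 32768. */
static double pfloat_to_double(pfloat_t x)
{
    double v = (double)x.m / 32768.0;
    int k;
    for (k = 0; k < x.e; k++) v *= 2.0;  /* positive exponent */
    for (k = 0; k > x.e; k--) v /= 2.0;  /* negative exponent */
    return v;
}

/* Normalize a 32-bit unsigned value; low-order bits are truncated. */
static pfloat_t pfloat_from_u32(uint32_t v)
{
    pfloat_t x = {0, 0};
    int e = 0;
    if (v == 0) return x;  /* zero has no normalized form in this sketch */
    while (e < 32 && (v >> e) != 0) e++;  /* now 2^(e-1) <= v < 2^e */
    x.e = (int16_t)e;
    x.m = (int16_t)(e >= 15 ? v >> (e - 15) : v << (15 - e));
    return x;
}

/* Compare two normalized positive values: exponents first, then mantissas.
   Returns -1, 0 or 1. */
static int pfloat_cmp(pfloat_t a, pfloat_t b)
{
    if (a.e != b.e) return a.e > b.e ? 1 : -1;
    if (a.m != b.m) return a.m > b.m ? 1 : -1;
    return 0;
}
```

Under these assumptions, pfloat_from_u32(210000) yields an exponent of 18 and a mantissa of 26250. Because the mantissa is always normalized, a difference in exponents settles a comparison without looking at the mantissas, which is why pfloat_cmp only inspects them when the exponents are equal.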

Some constants are also represented in this pseudo‑floating point format, and symbolic names (in capital letters) are used for their exponents and mantissas. Table 8 lists these constants with their symbolic names and numerical values.

Table 8: List of floating point constants

Constant    Exponent         Mantissa
pth         E_PTH = 18       M_PTH = 26250
margin      E_MARGIN = 27    M_MARGIN = 27343
plev        E_PLEV = 20      M_PLEV = 17500
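
As an informal cross-check of table 8, the constants can be evaluated under the assumption that a pseudo-floating point value equals 2^exponent * mantissa / 32768 (an interpretation that is consistent with the normalized mantissa range of 16384 to 32767 described in this subclause):

```latex
\begin{align*}
\mathrm{pth}    &= 2^{18} \cdot \tfrac{26250}{32768} = 8 \cdot 26250 = 210000 \\
\mathrm{margin} &= 2^{27} \cdot \tfrac{27343}{32768} = 4096 \cdot 27343 = 111996928 \approx 1.12 \times 10^{8} \\
\mathrm{plev}   &= 2^{20} \cdot \tfrac{17500}{32768} = 32 \cdot 17500 = 560000
\end{align*}
```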

Annex A (informative):
VAD performance

In the optimization of a VAD, a trade‑off has to be made between speech clipping, which reduces the subjective performance of the system, and the mean channel activity factor. The benefit of DTX is increased as the activity factor is reduced. However, in general, a reduction of the activity factor will be associated with a greater risk of audible speech clipping.

In the optimization process, emphasis has been placed on avoiding unnecessary speech clipping. However, it has been found that a VAD with virtually no audible clipping would result in a high activity and little DTX advantage. The VAD specified in the present document introduces audible and possibly objectionable clipping in certain cases, mainly for low input levels and low signal to noise ratios.

An indication of the mean channel activity in DTX mode is given in table A.1. The figure quoted is the average calculated over a large number of conversations covering factors such as different talkers, noise characteristics and locations. It should be noted that the actual activity of a particular talker in a specific conversation may vary considerably from the figure given in the table. This is due to both talker behaviour and the level dependency of the VAD (the channel activity has been found to decrease by about 0.5% per dB of level reduction). However, as mentioned above, a decreased speech input level increases the risk of objectionable clipping.

Table A.1: Mean channel activity factor in DTX mode

Channel activity factor

60%

Annex B (informative):
Simplified block filtering operation

Consider an 8th order transversal filter with filter coefficients a[0..8], through which a signal is being passed, the output of the filter being:

$$s'_n = -\sum_{i=0}^{8} a[i]\, s[n-i] \tag{1}$$

If we apply block filtering over 20 ms frames, then this equation becomes:

$$s'_n = -\sum_{i=0}^{\min(8,n)} a[i]\, s[n-i], \qquad n = 0..167 \tag{2}$$

If the energy of the filtered signal is then obtained for every 20 ms frame, the equation for this is:

$$\mathrm{pvad} = \sum_{n=0}^{167} \left( -\sum_{i=0}^{\min(8,n)} a[i]\, s[n-i] \right)^{2}, \qquad 0 \le n-i \le 159 \tag{3}$$

We know that:

$$\mathrm{acf}[i] = \sum_{n=i}^{159} s[n]\, s[n-i], \qquad i = 0..8, \quad 0 \le n-i \le 159 \tag{4}$$

If equation (3) is expanded and acf[0..8] are substituted for s[n] then we arrive at the equations:

$$\mathrm{pvad} = r[0]\, \mathrm{acf}[0] + 2 \sum_{i=1}^{8} r[i]\, \mathrm{acf}[i] \tag{5}$$

Where:

$$r[i] = \sum_{k=0}^{8-i} a[k]\, a[k+i], \qquad i = 0..8 \tag{6}$$
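
The equivalence between the direct block filtering of equation (3) and the simplified form of equations (5) and (6) can be checked numerically with the following informal sketch. It uses double precision instead of the fixed point arithmetic of the codec, and the function names and the FRAME_LEN and ORDER macros are hypothetical names introduced here. Samples outside the range 0..159 are skipped, as required by the condition attached to equation (3).

```c
#define FRAME_LEN 160  /* samples in one 20 ms frame */
#define ORDER     8    /* order of the transversal filter */

/* Equation (4): acf[i] = sum over n = i..159 of s[n]*s[n-i]. */
static void autocorr(const double s[FRAME_LEN], double acf[ORDER + 1])
{
    int i, n;
    for (i = 0; i <= ORDER; i++) {
        acf[i] = 0.0;
        for (n = i; n < FRAME_LEN; n++)
            acf[i] += s[n] * s[n - i];
    }
}

/* Equations (5) and (6): energy from acf[0..8] and the autocorrelation
   r[0..8] of the filter coefficients. */
static double pvad_from_acf(const double a[ORDER + 1], const double acf[ORDER + 1])
{
    double pvad = 0.0;
    int i, k;
    for (i = 0; i <= ORDER; i++) {
        double r = 0.0;
        for (k = 0; k <= ORDER - i; k++)
            r += a[k] * a[k + i];
        pvad += (i == 0 ? 1.0 : 2.0) * r * acf[i];
    }
    return pvad;
}

/* Equation (3): filter the frame directly and sum the squared outputs. */
static double pvad_direct(const double a[ORDER + 1], const double s[FRAME_LEN])
{
    double pvad = 0.0;
    int n, i;
    for (n = 0; n < FRAME_LEN + ORDER; n++) {
        double y = 0.0;
        int imax = n < ORDER ? n : ORDER;
        for (i = 0; i <= imax; i++)
            if (n - i < FRAME_LEN)  /* enforce 0 <= n-i <= 159 */
                y -= a[i] * s[n - i];
        pvad += y * y;
    }
    return pvad;
}
```

The direct form needs about 9 multiplications for each of the 168 output samples, whereas the simplified form needs only the 9 values acf[0..8], which the speech encoder has already computed, plus the short sums of equation (6).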

Annex C (informative):
Pole frequency calculation

This annex describes the algorithm used to determine whether the pole frequency for a second order analysis of the signal frame is less than 385 Hz.

The filter coefficients for a second order synthesis filter are calculated from the first two unquantized reflection coefficients rc[1..2] obtained from the speech encoder. If the filter coefficients a[0..2] are defined such that the synthesis filter response is given by:

$$H(z) = \frac{1}{a[0] + a[1]\, z^{-1} + a[2]\, z^{-2}} \tag{1}$$

Then the positions of the poles in the Z‑plane are given by the solutions to the following quadratic:

$$a[0]\, z^{2} + a[1]\, z + a[2] = 0, \qquad a[0] = 1 \tag{2}$$

The positions of the poles, z, are therefore:

$$z = \mathrm{re} \pm j \sqrt{\mathrm{im}}, \qquad j^{2} = -1 \tag{3}$$

where:

$$\mathrm{re} = -\frac{a[1]}{2} \tag{4}$$

$$\mathrm{im} = \frac{4\, a[2] - a[1]^{2}}{4} \tag{5}$$

If im is negative, the poles lie on the real axis of the Z‑plane; the signal is not a tone and the algorithm terminates. If re is negative, the poles lie in the left hand side of the Z‑plane, so the frequency is greater than 2000 Hz and the prediction error test can be performed.

If im is positive and re is positive then the poles are complex and lie in the right hand side of the Z‑plane and the frequency in Hz is related to re and im by the expression:

$$\mathrm{freq} = \arctan\!\left( \frac{\sqrt{\mathrm{im}}}{\mathrm{re}} \right) \cdot \frac{4000}{\pi} \tag{6}$$

Having ensured that both im and re are positive the test for a pole frequency less than 385 Hz can be derived by substituting equations 4 and 5 into equation 6 and re‑arranging:

$$\frac{4\, a[2] - a[1]^{2}}{a[1]^{2}} < \tan^{2}\!\left( \frac{\pi \cdot 385}{4000} \right) \tag{7}$$

or

$$\frac{4\, a[2] - a[1]^{2}}{a[1]^{2}} < 0.0973 \tag{8}$$

If this test is true, the signal is not a tone and the algorithm terminates. Otherwise, the prediction error test is performed.
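
The decision sequence of this annex can be summarized by the following informal sketch. It works in double precision rather than in the fixed point arithmetic of the codec, takes a[1] and a[2] as inputs (the conversion from rc[1..2] is not shown), and uses hypothetical names for the function, the result type and the constant. Multiplying both sides of equation (8) by a[1] squared, instead of dividing, is a choice made here to avoid a division by zero when a[1] is 0.

```c
/* Hypothetical result type for this sketch. */
typedef enum {
    POLE_NOT_TONE,           /* signal is not a tone, the algorithm terminates */
    POLE_RUN_PRED_ERR_TEST   /* continue with the prediction error test */
} pole_result_t;

#define TAN2_385HZ 0.0973    /* tan^2(pi*385/4000), from equation (8) */

static pole_result_t pole_frequency_test(double a1, double a2)
{
    double den = a1 * a1;         /* equals 4 * re^2 */
    double num = 4.0 * a2 - den;  /* equals 4 * im   */

    if (num < 0.0)                /* im negative: poles on the real axis */
        return POLE_NOT_TONE;
    if (a1 > 0.0)                 /* re negative: frequency above 2000 Hz */
        return POLE_RUN_PRED_ERR_TEST;
    if (num < TAN2_385HZ * den)   /* equation (8): frequency below 385 Hz */
        return POLE_NOT_TONE;
    return POLE_RUN_PRED_ERR_TEST;
}
```

For example, a[1] = -1.2 and a[2] = 0.81 give a ratio of 1.25 in equation (8), which is not below 0.0973, so the prediction error test is performed. With a[1] = -1.8 and a[2] = 0.82 the ratio is about 0.012, so the signal is classified as not a tone.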

Annex D (informative):
Change Request History

Change history

SMG No.   TDoc. No.   CR. No.   Section affected   New version   Subject/Comments
SMG#15                                             4.1.1         ETSI Publication
SMG#20                                             5.0.1         Release 1996 version
SMG#27                                             6.0.0         Release 1997 version
SMG#29                                             7.0.0         Release 1998 version
                                                   7.0.1         Version update to 7.0.1 for Publication
SMG#31                                             8.0.0         Release 1999 version
                                                   8.0.1         Update to Version 8.0.1 for Publication