13 Interaction with TFO

3GPP TS 06.77: Minimum Performance Requirements for Noise Suppresser Application to the AMR Speech Encoder

No interaction.

Annex A (informative): Method for generating Objective Performance Measures

This annex presents an objective methodology for characterising the performance of noise suppression (NS) methods. Two objective measures are specified to be used for characterising NS solutions complying with the AMR/NS specification.

A.1 Notations

The following notations are used in this document:

  • The operator AMR() corresponds to applying the AMR speech encoder and decoder on the input.
  • The operator NR() corresponds to applying the NS algorithm, and the AMR speech encoder and decoder on the input.
  • The clean speech signals are referred to as s_i, i = 1 to I.
  • The noise signals are referred to as n_j, j = 1 to J.
  • The noisy speech test signals are referred to as d_ij = β_ij(SNR) n_j + s_i, i = 1 to I, j = 1 to J, where d_ij is built by adding s_i and n_j with a pre-specified SNR as presented below.
  • The processed signals are referred to as y_ij = NR(d_ij).
  • The reference signal in the calculations shall be either the noisy speech test signal d_ij itself or d_ij processed by the AMR speech codec without NS processing. The latter signal will be referred to as c_ij = AMR(d_ij), i = 1 to I, j = 1 to J. The relevant reference signal will be indicated in the formulation of each objective measure below.
  • The notation Log() indicates the decimal logarithm.
  • β_ij(SNR) is the scaling factor to be applied to the background noise signal n_j in order to have a ratio SNR (in dB) between the clean speech signal s_i and n_j. The scaling of the input speech and noise signals is to be carried out according to the following procedure:
  1. The clean speech material is scaled to a desired dBov level with the ITU-T Recommendation P.56 speech voltmeter, one file at a time, each file including a sequence of one to four utterances from one speaker.
  2. A silence period of 2 s is inserted at the beginning of each of the resulting files to make up augmented clean speech files.
  3. Within each noise type and level, a noise sequence is selected for every speech utterance file, each with the same length as the corresponding speech file, and each noise sequence is stored in a separate file.
  4. Each of the noise sequences is scaled to a dBov level leading to the SNR condition corresponding to the β_ij(SNR) value in each of the test cases, by applying the RMS level based scaling according to Recommendation P.56.
  • The frames containing active speech are determined with reference to the ITU-T Recommendation P.56 active speech level measurement. This determination is tied to the classification of the frames into the speech power classes, which is explained below.
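The relation between the scaling factor and the target SNR can be illustrated with a minimal sketch. It assumes samples normalised to a full scale of 1.0 and, as a simplification, uses a plain RMS level for the speech where the procedure above calls for the P.56 active speech level, so the resulting SNR differs for speech files containing pauses. The function names are hypothetical and are not taken from the specification.

```python
import math

def rms_level_db(x):
    """RMS level of a non-silent sample sequence, in dB re full scale 1.0."""
    power = sum(v * v for v in x) / len(x)
    return 10.0 * math.log10(power)

def noise_scale_factor(speech, noise, snr_db):
    """Amplitude factor for the noise so that
    (speech level) - (scaled noise level) equals snr_db.
    Assumption: plain RMS stands in for the P.56 active speech level."""
    gain_db = rms_level_db(speech) - rms_level_db(noise) - snr_db
    return 10.0 ** (gain_db / 20.0)

def mix(speech, noise, snr_db):
    """Build d(n) = beta * n(n) + s(n) for equally long speech and noise."""
    beta = noise_scale_factor(speech, noise, snr_db)
    return [beta * n + s for s, n in zip(speech, noise)]
```

With a P.56 implementation available, only `rms_level_db(speech)` would need to be replaced by the measured active speech level; the rest of the arithmetic is unchanged.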

A.2 Test material

The test material should comprise at least the following:

  • Clean speech utterance sequences: 6 utterances from each of 4 speakers (2 male and 2 female), totalling 24 utterances
  • Noise sequences:
      • car interior noise, 120 km/h, fairly constant power level
      • street noise, slowly varying power level

Special care should be taken to ensure that the original samples fulfill the following requirements:

  • the clean speech signals have a relatively constant average power level within each sample, where 'sample' refers to a file containing one or more utterances
  • the noise signals are short-time stationary, with no rapid changes in the power level and no speech-like components

The test signals should cover the following background noise and SNR conditions:

  • car noise at 3 dB, 6 dB, 9 dB, 12 dB and 15 dB
  • street noise at 6 dB, 9 dB, 12 dB, 15 dB and 18 dB

A feasible subset of these conditions giving a practically useful indication of the achieved performance would be:

  • car noise at 6 dB and 15 dB
  • street noise at 9 dB and 18 dB

The samples should be digitally filtered by the MSIN filter before NS and speech coding processing, so that they are representative of the frequency response of a real cellular system.

A.3 Objective measures for characterization of NS algorithm effect

Assessment of SNR improvement level. The SNR improvement measure, SNRI, measures the SNR improvement achieved by the NS algorithm. It is calculated separately in three groups of frames, containing active speech of high, medium and low power respectively. These categories characterise the effect of the NS processing on speech and make it possible to distinguish the effect on strong, medium and weak speech. In addition to the three class-specific SNR improvement values, an aggregate measure is formed from them. A frame length of 80 samples is used, since it has been found the most effective for describing the changes in the signal caused by NS processing.

The calculation is presented here for the high power speech class:

For each background noise condition j

  For each speaker i

    Construct a noisy input signal d_ij as follows:

      d_ij(n) = β_ij · n_j(n) + s_i(n)

    where β_ij depends on the SNR condition according to the procedure described above

    c_ij = AMR(d_ij)

    y_ij = NR(d_ij)

(1)

where ksph and Ksph are the index and the total number of frames containing high power speech;

knse and Knse are the corresponding index and total number of noise-only frames;

the constant in equation (1) should be set to 10^-5;

SNRI_mij is calculated correspondingly for medium power frames;

SNRI_lij is calculated correspondingly for low power frames.

(2)

(3)

(4)

In addition, measures for the SNR improvement in the high, medium and low power speech classes (SNRI_h, SNRI_m, SNRI_l, respectively) shall be recorded based on the following formulae:

(5)

(6)

(7)

It is also informative to record separately the noise-type-specific SNR improvement measures, namely SNRI_hj, SNRI_mj, SNRI_lj and SNRIj, for each j.

To determine which frames belong to the high, medium and low power classes of active speech and which represent pauses in the speech activity (noise only), the active speech level sp_lvl (in dB) of the noise-free speech s_i(n) is first determined according to ITU-T Recommendation P.56. Thereafter, the frames are classified into the four classes as follows. Let us first define four number sequences, one for each of the high, medium and low power speech classes and one for the noise-only class. All four sequences are initialized to an empty sequence:

(8)

Then, the frame power is calculated in each signal frame k:

(9)

We shall then classify each frame according to the frame power as follows:

if

else if

(10)

else if

else if

where the constant appearing in the expressions above shall have a value that, on the dB scale, lies below sp_lvl + th_nl; a value of 10^-7 should be used if sp_lvl = -26 dBov and th_nl = -34 dB, as proposed below;

th_h, th_m and th_l are pre-determined lower threshold power levels for classifying the speech frames into the high, medium and low power classes, respectively. In the following, these threshold values are called power class threshold values;

the length function returns the number of elements in a number sequence.
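As a check of the suggested value, and assuming that the constant is a power expressed relative to the overload point (so that 0 dBov corresponds to a power of 1), the bound sp_lvl + th_nl for the proposed values works out as follows:

```latex
\mathrm{sp\_lvl} + \mathrm{th\_nl} = -26\ \mathrm{dBov} + (-34\ \mathrm{dB}) = -60\ \mathrm{dBov}
\quad\Longrightarrow\quad 10^{-60/10} = 10^{-6},
\qquad
10\,\log_{10}\!\left(10^{-7}\right) = -70\ \mathrm{dBov} < -60\ \mathrm{dBov}.
```

Under that assumption, the value 10^-7 lies 10 dB below the lower bound of the noise-only class, which is consistent with the requirement stated above.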

The following notes on the formulation of the frame classification are made:

  • The lower bound for the power of the noise-only class of frames is motivated by a desire to restrict the analysis to noise frames that lie within or close to the speech activity, hence excluding long pauses from the analysis. This makes the analysis concentrate on the effects encountered during speech activity.
  • In poor SNR conditions, the noise power level may turn out to be higher than the lower bound of some of the speech power classes. However, even in this case, the measured effect on the low power portions of speech may be informative. Another way of formulating the measure would be to make the power thresholds dependent on the noise level. This would, however, restrict the comparability of the SNR improvement figures of the different classes over experiments with different background noise content.

The scaling of the clean speech material should be chosen so that the dynamic range of the 16-bit arithmetic is used efficiently but no waveform clipping is produced. Typically, a normalisation to an active speech level of -26 dBov is preferable. In such a case, the following values should be used for the power class thresholds:

th_h = -1 dB

th_m = -10 dB

th_l = -16 dB

th_nh = -19 dB

th_nl = -34 dB

(11)
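The frame classification described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the specification's equations (9) and (10): the frame power is taken as the mean squared sample value with samples normalised to a full scale of 1.0, the constant 10^-7 is applied as a floor on that power, each speech class is bounded below by its threshold relative to sp_lvl, and the noise-only class covers the band from th_nl up to (but excluding) th_nh. The active speech level `sp_lvl` is assumed to have been measured beforehand according to P.56. All function and variable names are hypothetical.

```python
import math

FRAME_LEN = 80  # samples per frame, as stated in A.3

# Power class thresholds in dB relative to the active speech level, from (11).
TH_H, TH_M, TH_L, TH_NH, TH_NL = -1.0, -10.0, -16.0, -19.0, -34.0

def frame_powers_db(s, floor=1e-7):
    """Mean power of each complete frame in dB re full scale 1.0.
    Assumption: the floor constant is applied to the linear power."""
    powers = []
    for start in range(0, len(s) - FRAME_LEN + 1, FRAME_LEN):
        frame = s[start:start + FRAME_LEN]
        p = sum(v * v for v in frame) / FRAME_LEN
        powers.append(10.0 * math.log10(max(p, floor)))
    return powers

def classify_frames(s, sp_lvl=-26.0):
    """Return the frame indices of each class for clean speech s.
    Frames between th_l and th_nh, or below th_nl, stay unclassified."""
    classes = {"high": [], "medium": [], "low": [], "noise": []}
    for k, p in enumerate(frame_powers_db(s)):
        rel = p - sp_lvl  # frame power relative to the active speech level
        if rel >= TH_H:
            classes["high"].append(k)
        elif rel >= TH_M:
            classes["medium"].append(k)
        elif rel >= TH_L:
            classes["low"].append(k)
        elif TH_NL <= rel < TH_NH:
            classes["noise"].append(k)
    return classes
```

Appending each index to a list plays the role of the four number sequences that start out empty; the length of each list gives the frame count of the class.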

Assessment of noise power level reduction. The noise power level reduction (NPLR) measure relates to the capability of the NS method to attenuate the background noise level.

The NPLR measure is calculated as follows:

For each background noise condition j

  For each speaker i

    Construct a noisy input signal d_ij as follows:

      d_ij(n) = β_ij · n_j(n) + s_i(n)

    where β_ij depends on the SNR condition according to the procedure described above

    c_ij = AMR(d_ij)

    y_ij = NR(d_ij)

(12)

where the constant in equation (12) should be set to 10^-5;

knse and Knse are the index and the total number of noise-only frames

(13)

(14)

Furthermore, it is informative to record separately the noise-type-specific NPLR measures, NPLRj, for each j.
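To make the idea of a noise power level reduction concrete, the sketch below compares the mean energy of the noise-only frames in the NS-processed signal y_ij with that in the reference signal c_ij. It is only one plausible reading and is not a transcription of equation (12): the choice of c_ij as the reference, the placement of the 10^-5 constant as a floor on both terms, and the sample scaling are all assumptions, and the names are hypothetical.

```python
import math

FRAME_LEN = 80  # samples per frame, as stated in A.3
XI = 1e-5       # constant value given in the text; its role as a floor is assumed

def mean_frame_energy(x, frame_indices):
    """Average over the given frames of the sum of squared samples."""
    total = 0.0
    for k in frame_indices:
        frame = x[k * FRAME_LEN:(k + 1) * FRAME_LEN]
        total += sum(v * v for v in frame)
    return total / len(frame_indices)

def nplr_db(y, c, noise_frames):
    """Assumed form: level of y relative to c in noise-only frames, in dB.
    Negative values mean the NS processing attenuated the noise."""
    num = max(XI, mean_frame_energy(y, noise_frames))
    den = max(XI, mean_frame_energy(c, noise_frames))
    return 10.0 * math.log10(num / den)
```

Averaging such per-signal values over speakers and noise conditions would then give the aggregate figures, again subject to the actual definitions in equations (13) and (14).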

Comparison of SNRI and NPLR. A comparison of the SNRI and NPLR measures can give an indication of possible speech distortion produced by the tested NS method. If the NPLR measure takes clearly higher absolute values than SNRI, it can be expected that the NS candidate distorts the speech. This relation, however, should always be verified through a comparison with subjective test results.

Annex B (normative): Methodology for Measuring Subjective SNR Improvement for CCR Experiments

The purpose of experiment 3 is to evaluate the performance of the NS algorithm in background noise conditions at two different bit rates (5.9 kbit/s and 12.2 kbit/s). For these experiments three types of noise have been selected: car noise, street noise and babble noise. For each type of noise, two different nominal SNR levels have been set:

Noise type    SNR [dB]
Car           6, 15
Street        9, 18
Babble        9, 18

For each sub-experiment and for each type of noise, three ideal NS reference conditions will be processed (+3, +6 and +9 dB). The exception is that for the higher SNRs (15 dB for car noise and 18 dB for street and babble noise) only two ideal NS reference conditions will be tested (+3 and +6 dB):

Ideal SNR improvement

  • SNR sub-exp. +3 dB
  • SNR sub-exp. +6 dB
  • SNR sub-exp. +9 dB

Each ideal NS will be compared during the sub-experiment with the speech+noise signals mixed at the nominal SNR levels. This leads to a total of 5 CCR reference results per sub-experiment, corresponding to 3 (2 for the higher SNRs) SNR improvement levels. By connecting adjacent points with straight lines, we obtain a graph giving the correspondence between CCR scores and perceived SNR improvement (cf. figure B.1).

Finally, the perceived SNR improvement for an AMR-NS candidate is obtained from the CCR versus SNR improvement graph, as illustrated in figure B.1.

Figure B.1. Example of CCR versus SNR improvement graph
O: ideal NS score, *:AMR-NS candidate score.
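The graph lookup can be sketched as a piecewise-linear interpolation between the ideal NS reference points. This is a minimal illustration: the function name and the data layout are hypothetical, and the sketch assumes that the CCR score rises or falls steadily from one reference point to the next. If the reference scores are not monotonic, the first matching segment is used.

```python
def perceived_snr_improvement(ccr_score, reference_points):
    """Read the SNR improvement (dB) for a candidate's CCR score off the
    straight lines joining adjacent ideal NS reference points.

    reference_points: list of (snr_improvement_db, ccr_score) pairs.
    Returns None if the score lies outside the range of the references.
    """
    pts = sorted(reference_points)  # order by SNR improvement
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        lo, hi = min(y0, y1), max(y0, y1)
        if y0 != y1 and lo <= ccr_score <= hi:
            # Linear interpolation along the segment joining the two points.
            return x0 + (ccr_score - y0) * (x1 - x0) / (y1 - y0)
    return None
```

For example, with purely illustrative reference scores of 0.4, 0.9 and 1.3 at +3, +6 and +9 dB, a candidate CCR score halfway between the first two references maps to an improvement halfway between +3 and +6 dB. Scores outside the reference range are not extrapolated.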

Annex C (normative): Test Plan for Checking Conformance to Requirements

Document History:

Issue 0.1     16 Mar 00     First Issue, derived from the AMR/NS Selection Test Plan version 2.2 (Tdoc. SMG11/S4 356/99 R3)

Issue 0.2     28 Mar 00     D. Pascal (France Télécom R&D): Text of Experiment 1 + Annex A.1; various editorial modifications

Issue 0.3     31 Mar 00     S. Aftelak (Motorola): Insertion of section on CCR tests + Annex A.3

Issue 0.4     23 May 00     S. Aftelak (Motorola): Mainly additions/changes to CCR tests

Issue 0.5     19 June 00    D. Pascal (France Télécom R&D): Statistical analysis for Experiment 1 (PC test)

Issue 0.6     05 Sept 00    Experimental table for CCR Experiment 4

Issue 0.7     19 Oct 00     A. Eriksson (Ericsson): Addition of ACR tests

Issue 0.8     25 Oct 00     [SQ/Osaka]:
                              • Note added to Section 8.11 on instructions
                              • Added placeholder for statistical analysis section for Exp. 4
                              • Added modified ACR instructions in Annex A

Issue 0.9     27 Oct 00     A. Eriksson (Ericsson):
                              • Editorial errors in Exp. 2 and Exp. 3 corrected
                              • Experiment 4 changed to ACR test

Issue 0.10    16 Jan 01     A. Eriksson (Ericsson):
                              • Experiment 2 extended with odd level and DTX conditions
                              • Experiment 4 reverted to previous CCR experiment

Issue 0.11    18 Jan 01     D. Pascal (France Télécom R&D): Various editorial corrections and details. Statistical analysis for CCR experiments 3 and 4 is not correct and should be modified.

Issue 0.12    22 Jan 01     S. Aftelak (Motorola): Editorial changes, and changes to reflect the fact that the experimenter is responsible for providing (and reporting) the processing tables and randomizations used.

Issue 2.0.0   22 Jan 01     Agreed at S4#15 Plenary meeting.