C9. Experiments 3a & 3b: Performances in Background Noise Conditions (Mod-CCR)
3GPP TS 06.77: Minimum Performance Requirements for Noise Suppresser Application to the AMR Speech Encoder
C9.1 Introduction
These experiments are designed to test Requirements in the associated Section in the Recommended Minimum Performance Requirements Specification ([1], TS GSM 06.77). Specifically, the AMR with noise suppression should, in a certain number of conditions, be preferred to the AMR without noise suppression in a background noise environment and should provide a reasonable level of SNR improvement. Experiment 3a examines the performance of the noise suppression with the half-rate codec, while Experiment 3b examines the noise suppression with the full rate codec. Both experiments will use the Modified Comparison Category Rating (Mod-CCR, Note 1) method with a seven-point rating scale. Listeners will judge the relative quality of samples processed through the codec with noise suppression, compared to those without the noise suppression applied (example instructions for listeners are given in Annex A.3). The samples will have background noise of various types and levels mixed into the source speech before processing through the codec.
The factors for each of the two experiments are presented in Table 9.1.
Factor | Expt 3a | Expt 3b |
codec | AMR 5.9 kb/s | AMR 12.2 kb/s |
noise types | car (6 and 15 dB), street (9 and 18 dB), babble (9 and 18 dB) | car (6 and 15 dB), street (9 and 18 dB), babble (9 and 18 dB) |
Table 9.1: Factors for Experiments 3a and 3b
Note 1:
The standard Comparison Category Rating (CCR) method, which is described in Annex E of Rec. P.800, is similar to the Degradation Category Rating method (DCR, Annex D). Annex E states: "Listeners are presented with a pair of speech samples on each trial. In the DCR procedure, a reference (unprocessed) sample is presented first, followed by the same speech sample, which has been processed by some technique. In the CCR procedure, the order of the processed and unprocessed samples is chosen at random for each trial. Listeners use the seven-point CCR scale to judge the quality of the second sample relative to that of the first. The DCR and the CCR methods are particularly useful for assessing the performance of telecommunications systems when the input has been corrupted by background noise. However, an advantage of the CCR method over the DCR procedure is the possibility to assess speech processing that either degrades or improves the quality of the speech."
Here we are using a different application of the standard CCR method. The modified CCR method uses processed reference samples (but without noise suppression applied) whereas the standard CCR method uses unprocessed reference samples.
C9.2 Test Factors and Conditions
Three types of background noise will be used, at two different SNRs:
- A car noise that is stationary both in level and in spectrum.
- A street noise that is non-stationary in level, but fairly stationary in spectrum.
- A babble noise that is fairly stationary in level, but non-stationary in spectrum.
The noise samples will be those utilised during the AMR Noise Suppression Selection Phase.
The codec is held constant for each experiment, with two SNR classes (‘SNR’ and ‘SNR+9dB’) per experiment. All of the noise types are used in each experiment. The noise samples will have been recorded in scenarios representative of the respective SNR value for each noise.
The factors and conditions to be used in Experiments 3a and 3b are presented in Table 9.2. The expanded set of test conditions is given in Section 9.12.
Main Codec Conditions | # | Notes |
Noise Suppresser Candidates | 1 | |
Codec | 1 | AMR |
Codec Modes (HR/FR) | HR FR | 5.9 kbit/s rate for Experiment 3a; 12.2 kbit/s rate for Experiment 3b |
BERs | 0 | Clear channel, no transmission errors |
Input level | 1 | nominal: -26dB relative to OVL |
Acoustic Background Noise | 3 | car, street, and babble noise |
Background noise SNRs | 2 | low, high for each (see Table 9.1) |
Input Characteristic | 1 | GSM transmit filtered |
Codec references | # | Notes |
All Experiments | 1 | the same AMR rate w/o NS |
Other references | # | Notes |
Direct | nominal level, GSM transmit filtered | |
MNRU, Exp 3a and 3b | nominal level, GSM transmit filtered, Q= 12, Q= 4 | |
Ideal noise suppression simulation | ||
Common Conditions | # | Notes |
GSM Channel | 0 | NO channel model |
Number of talkers | 4 | 2 male + 2 female primary talkers |
Number of speech samples | 28 | 7 Sentence-pairs/primary talker (6 for Test, 1 for Practice) |
Listening Level | 1 | -15dBPa (79dB SPL) at ERP |
Listeners | 24 | Naive Listeners |
Randomizations | 6 | 6 groups of 4 listeners |
Rating Scale | 1 | CCR Instructions |
Replications | 1 | Original Presentation Only |
Table 9.2: Factors and conditions for Experiments 3a and 3b
C9.3 Preliminary Conditions
The following 16 preliminary test conditions are recommended for presentation before proceeding to the test samples. The samples shall be presented in the random order given in Table 9.3.
Cond. | Presentation order | Noise | SNR (dB) | Reference | Processed: Ideal NS | Processed: Codec | Speech Sample Number |
P1 | 9 | Car | 6 | Direct | – | Direct | M1S07 |
P2 | 5 | Car | 15 | AMR@x | – | AMR@x | F1S07 |
P3 | 12 | Car | 6 | MNRU-12 | – | MNRU-16 | M2S07 |
P4 | 13 | Car | 15 | MNRU-12 | – | Direct | F2S07 |
P5 | 2 | Street | 9 | AMR@x | – | AMR@x | M1S07 |
P6 | 4 | Street | 18 | MNRU-12 | – | MNRU-16 | F1S07 |
P7 | 8 | Street | 18 | MNRU-12 | – | Direct | M2S07 |
P8 | 16 | Babble | 9 | AMR@x | – | AMR@x | F2S07 |
P9 | 7 | Babble | 9 | MNRU-12 | – | MNRU-16 | M1S07 |
P10 | 1 | Babble | 18 | MNRU-12 | – | Direct | F1S07 |
P11 | 11 | Car | 6 | AMR@x | 4 | AMR@x | M2S07 |
P12 | 3 | Car | 15 | AMR@x | 10 | AMR@x | F2S07 |
P13 | 15 | Street | 18 | AMR@x | 4 | AMR@x | M1S07 |
P14 | 6 | Street | 9 | AMR@x | 10 | AMR@x | F1S07 |
P15 | 10 | Babble | 9 | AMR@x | 4 | AMR@x | M2S07 |
P16 | 14 | Babble | 18 | AMR@x | 10 | AMR@x | F2S07 |
Table 9.3: List of preliminary conditions
C9.4 Speech Material
The source speech material shall be as defined in Section 6.3 and will consist of the material used during the AMR Noise Suppression Selection phase. Each sample consists of two sentences. Only primary talkers are needed. For the four talkers, the following source material should be prepared:
- Seven samples for each talker: six for the test samples and one for the preliminaries.
- Each sample to be eight seconds long.
- Unique sentence-pairs in each sample (i.e., none repeated across the talkers).
To reduce any speech material effect, the samples for each talker must be unique. For these experiments, these unique stimuli are not balanced across all conditions, candidates and subject groups. The same sample numbers for each talker are used for common conditions within a subject group and changed across subject groups (these sample numbers are arbitrarily assigned to samples). For a given language, the same speech material must be used for the two experiments 3a and 3b. The noise material and its mix with the speech material should be as defined in Section 6.8 and Section 6.3.7 respectively.
C9.5 Experimental Design
The design is based on a restricted randomization philosophy using six different randomizations, each of which is used with a group of four of the 24 listeners. This means that up to four subjects can perform the experiment simultaneously.
Each listener will hear all of the conditions four times, once with speech from each of the four talkers. Over the experiment as a whole, each of the conditions will be paired with six different samples from each of the four talkers. Each of the six groups of subjects will hear different combinations of source material and condition.
C9.6 Processing
Every condition is processed with each of the six samples of each of the four primary talkers. The actual samples to be used for each condition, within each subject group, are presented in Section 9.12, Test Conditions.
C9.7 Randomizations
The test shall be completed using the randomizations provided by the experimenter. There shall be six randomizations for the sub-experiments, one for each subject group. The same randomizations shall be used for the two experiments (3a and 3b). Each one will therefore be used by four of the 24 subjects. Each randomization is balanced across four blocks of 48 stimuli to eliminate long sequences of similar conditions or identical talkers. The sequences shall provide for alternating male-female talkers. Use of these randomizations will allow presentation order to be used as a factor in a global analysis, should that be necessary. The randomization shall be constrained to a randomized block design, which controls practice and fatigue effects that may occur over the course of a test session.
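The alternating male-female constraint described above can be illustrated with a short Python sketch. It is not the experimenter's randomization procedure: the function name `alternating_block`, the `(condition, talker)` tuple layout, and the seed are all hypothetical, and the sketch enforces only the talker-gender alternation within a single block, not the avoidance of long runs of similar conditions.

```python
import random


def alternating_block(stimuli, seed):
    """Return a shuffled copy of one block in which male and female talkers alternate.

    stimuli: list of (condition, talker) tuples, talker labels such as "M1" or "F2"
             (hypothetical layout, assumed for this sketch only).
    seed:    integer seed, so each subject group can reuse its own fixed ordering.
    """
    rng = random.Random(seed)
    males = [s for s in stimuli if s[1].startswith("M")]
    females = [s for s in stimuli if s[1].startswith("F")]
    if len(males) != len(females):
        raise ValueError("block must contain equal numbers of male and female stimuli")

    # Shuffle each gender independently, then interleave the two lists.
    rng.shuffle(males)
    rng.shuffle(females)
    first, second = (males, females) if rng.random() < 0.5 else (females, males)

    order = []
    for a, b in zip(first, second):
        order.extend([a, b])
    return order


# Example: six conditions, each paired with the four primary talkers.
block = [(cond, talker) for cond in range(1, 7) for talker in ("M1", "F1", "M2", "F2")]
ordering = alternating_block(block, seed=1)
```

Using one fixed seed per subject group would keep the ordering reproducible, so the same ordering can be reused for Experiments 3a and 3b.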
C9.8 Duration of the CCR Experiments 3a and 3b
Each trial consists of an eight-second reference sample, an eight-second test sample, and a five-second voting time, totaling 21 seconds. For each of the two experiments, the 16 preliminary conditions take 16 x 21 seconds, or 5.6 minutes, as an introductory block. Each presentation set within an experiment consists of 52 conditions (A/B+B/A) x 4 talkers x 21 seconds or 70 minutes, presented as eight 8.75-minute blocks of 25 stimuli, for 75.6 minutes testing time / subject group / experiment. The total testing time for each experiment will be 7 hours and 34 minutes, if four listeners are tested at one time.
To reduce the effects of subject fatigue, each 8.75 minute block should be separated by short comfort breaks.
Note that the above calculations do not include the time needed to give the subjects their instructions, or time taken for comfort breaks.
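As a check on the timing figures, the arithmetic can be reproduced from the stated block structure (16 preliminary trials, then eight blocks of 25 stimuli, for each of six subject groups). The variable names below are illustrative only, and the calculation covers playback and voting time, not instructions or breaks.

```python
# Trial timing, in seconds, as stated in the text.
REFERENCE_S = 8
TEST_S = 8
VOTING_S = 5
trial_s = REFERENCE_S + TEST_S + VOTING_S           # 21 s per trial

PRELIMINARY_TRIALS = 16
BLOCKS = 8
STIMULI_PER_BLOCK = 25
SUBJECT_GROUPS = 6                                   # four listeners tested at a time

prelim_s = PRELIMINARY_TRIALS * trial_s              # introductory block
block_s = STIMULI_PER_BLOCK * trial_s                # one test block
session_s = prelim_s + BLOCKS * block_s              # one subject group, one experiment
total_s = SUBJECT_GROUPS * session_s                 # all groups, one experiment

prelim_min = prelim_s / 60                           # 5.6 minutes
block_min = block_s / 60                             # 8.75 minutes
session_min = session_s / 60                         # 75.6 minutes
hours, minutes = divmod(round(total_s / 60), 60)     # 7 hours 34 minutes (rounded)

print(f"trial: {trial_s} s, preliminaries: {prelim_min} min, block: {block_min} min")
print(f"per group: {session_min} min, per experiment: {hours} h {minutes} min")
```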
C9.9 Votes Per Condition
In each of the two experiments, 24 listeners rate every condition with four talkers in each of two presentation orders (A/B and B/A), giving:
(24 subjects x 4 talkers x 2 presentations) = 192 votes per condition
From past experience with CCR tests, this is the minimum number of votes per condition needed to give enough statistical certainty to differentiate the performance of one candidate process from another candidate process over the conditions and against the references.
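The vote count follows directly from the design factors. The short sketch below restates the calculation, with illustrative constant names, and also shows how the 24 listeners split across the six randomization groups.

```python
LISTENERS = 24
TALKERS = 4
PRESENTATION_ORDERS = 2      # A/B and B/A
RANDOMIZATION_GROUPS = 6

votes_per_condition = LISTENERS * TALKERS * PRESENTATION_ORDERS
listeners_per_group = LISTENERS // RANDOMIZATION_GROUPS
votes_per_condition_per_group = listeners_per_group * TALKERS * PRESENTATION_ORDERS

print(votes_per_condition)            # 192
print(listeners_per_group)            # 4
print(votes_per_condition_per_group)  # 32
```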
C9.10 Test Procedure
Factors important for the experimental environment are specified in Sections 6.4, 6.5, and 6.6. As specified in Section 9.8, comfort breaks should be provided to reduce the effects of subject fatigue.
C9.11 Opinion Scale
The question asked of the subject is based on the CCR Listening Quality Comparison Scale. The listening subjects will judge the quality of the second sample relative to the quality of the first sample. The subjects will listen to each pair of samples and, after these have been played, they will be asked to give their comparative opinion. Annex A contains an example of the instructions for the subjects in English. Changes to the instructions may be needed to specify the method of data collection being used (button-press, paper & pencil, etc.).
C9.12 Test Conditions for Experiments 3a and 3b
Cond. | Noise | SNR (dB) | Reference | Processed: Ideal NS | Processed: Codec | Speech sample number | |
1 | Car | 6 | AMR@x | – | AMR@x | 4 5 6 1 2 3 | |
2 | Street | 9 | AMR@x | – | AMR@x | 4 5 6 1 2 3 | |
3 | Babble | 9 | AMR@x | – | AMR@x | 4 5 6 1 2 3 | |
4 | Car | 6 | MNRU-16 | – | MNRU-12 | 4 – – 1 – – | |
5 | Car | 6 | Direct | – | MNRU-12 | 4 – – 1 – – | |
4′ | Street | 9 | MNRU-16 | – | MNRU-12 | – 5 – – 2 – | |
5′ | Street | 9 | Direct | – | MNRU-12 | – 5 – – 2 – | |
4” | Babble | 9 | MNRU-16 | – | MNRU-12 | – – 6 – – 3 | |
5” | Babble | 9 | Direct | – | MNRU-12 | – – 6 – – 3 | |
6 | Car | 6 | AMR@x | 3 | AMR@x | 1 2 3 4 5 6 | |
7 | Car | 6 | AMR@x | 6 | AMR@x | 1 2 3 4 5 6 | |
8 | Car | 6 | AMR@x | 9 | AMR@x | 1 2 3 4 5 6 | |
9 | Street | 9 | AMR@x | 3 | AMR@x | 2 3 4 5 6 1 | |
10 | Street | 9 | AMR@x | 6 | AMR@x | 2 3 4 5 6 1 | |
11 | Street | 9 | AMR@x | 9 | AMR@x | 2 3 4 5 6 1 | |
12 | Babble | 9 | AMR@x | 3 | AMR@x | 3 4 5 6 1 2 | |
13 | Babble | 9 | AMR@x | 6 | AMR@x | 3 4 5 6 1 2 | |
14 | Babble | 9 | AMR@x | 9 | AMR@x | 3 4 5 6 1 2 | |
15 | Car | 6 | AMR@x | – | AMR/NS1@x | 1 2 3 4 5 6 | |
16 | Street | 9 | AMR@x | – | AMR/NS1@x | 2 3 4 5 6 1 | |
17 | Babble | 9 | AMR@x | – | AMR/NS1@x | 3 4 5 6 1 2 | |
18 | Car | 15 | AMR@x | 3 | AMR@x | 1 2 3 4 5 6 | |
19 | Car | 15 | AMR@x | 6 | AMR@x | 1 2 3 4 5 6 | |
20 | Street | 18 | AMR@x | 3 | AMR@x | 2 3 4 5 6 1 | |
21 | Street | 18 | AMR@x | 6 | AMR@x | 2 3 4 5 6 1 | |
22 | Babble | 18 | AMR@x | 3 | AMR@x | 3 4 5 6 1 2 | |
23 | Babble | 18 | AMR@x | 6 | AMR@x | 3 4 5 6 1 2 | |
24 | Car | 15 | AMR@x | – | AMR/NS1@x | 1 2 3 4 5 6 | |
25 | Street | 18 | AMR@x | – | AMR/NS1@x | 2 3 4 5 6 1 | |
26 | Babble | 18 | AMR@x | – | AMR/NS1@x | 3 4 5 6 1 2 | |
27-52 | Reversed order of the reference and processed speech samples in cond. 1-26 | ||||||
Notes:
- ‘Multiple’ conditions “4s” and “5s” (e.g. 4 and 4′) are presented only to a subset of listeners (e.g. to the first and the fourth randomisation groups).
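The speech sample numbers in the table follow a cyclic rotation, which the sketch below reproduces. It assumes that the six entries in the speech sample number column correspond, in order, to subject groups 1 to 6; that reading, the function name `rotated_row`, and the offset table are assumptions made for illustration. The offsets match conditions 6 to 26 (car 0, street 1, babble 2), while conditions 1 to 3 correspond to an offset of 3.

```python
NUM_GROUPS = 6

# Rotation offset per noise type, as read from conditions 6 to 26 of the table.
NOISE_OFFSET = {"car": 0, "street": 1, "babble": 2}


def rotated_row(offset, num_groups=NUM_GROUPS):
    """Sample number used by each subject group (1..num_groups) for a given offset."""
    return [(group - 1 + offset) % num_groups + 1 for group in range(1, num_groups + 1)]


for noise, offset in NOISE_OFFSET.items():
    print(noise, rotated_row(offset))
```

With this rotation, every subject group hears a different sample number for a given talker and noise type, and each sample is used once across the six groups.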
C9.13 Statistical Analysis
The statistics to be reported from this CCR test are the averaged CMOS scores ($\bar{Y}_i$) and the standard deviations ($S_i$) for all the conditions.
Additionally, the requirement in [1, Section 6.1.4] should be checked using hypothesis tests for conditions 15-17 and 24-26, to determine with 95 % confidence whether the mean CMOS score is greater than zero (the NS performance is preferred) or greater than or equal to zero (the NS performance is equivalent).
The hypothesis test should be performed using a one-tailed t-test. The NS algorithm has failed the requirement at level “preferred” for a test condition $i$ if
$$\bar{Y}_i \le \delta_i$$
where
$$\delta_i = t_{0.05,\,N_i-1}\,\frac{S_i}{\sqrt{N_i}}$$
and the subscript $i$ denotes the test condition, $N_i$ is the number of votes, and $t_{0.05,\,N_i-1}$ is the inverse of the Student’s t-distribution (one-tailed critical value) with $N_i-1$ degrees of freedom and probability 0.05.
Similarly, the NS algorithm has failed the requirement at level “equal” if
$$\bar{Y}_i \le -\delta_i$$
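The one-tailed test can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not a normative procedure: the failure rules are taken to be mean CMOS <= delta for "preferred" and mean CMOS <= -delta for "equal", where delta is the one-tailed 5 % critical value of Student's t times S / sqrt(N), and S is the sample standard deviation (N - 1 in the denominator). The function names are hypothetical, and the critical value is obtained numerically (Simpson's rule plus bisection) rather than from tables.

```python
import math
from statistics import mean, stdev


def t_cdf(x, df, steps=2000):
    """Student's t CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    norm = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)

    def pdf(u):
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_upper_critical(alpha, df):
    """Value t such that P(T > t) = alpha, found by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > alpha:
            lo = mid          # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2


def evaluate_condition(votes, alpha=0.05):
    """Apply the assumed failure rules to the CCR votes of one test condition."""
    n = len(votes)
    y_bar = mean(votes)
    s = stdev(votes)                                  # sample standard deviation
    delta = t_upper_critical(alpha, n - 1) * s / math.sqrt(n)
    return {
        "mean": y_bar,
        "std": s,
        "delta": delta,
        "fails_preferred": y_bar <= delta,            # not significantly above zero
        "fails_equal": y_bar <= -delta,               # significantly below zero
    }


# Hypothetical votes for one condition (192 votes on the -3..+3 CCR scale).
example = evaluate_condition([1] * 96 + [0] * 96)
```

Votes from B/A presentations would need their sign inverted before pooling so that a positive score always favours the noise-suppressed sample; that step is assumed to have been done before calling `evaluate_condition`.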