C7. Experiment 1: Degradation in Clean Speech (Pair Comparison Test)
3GPP TS 06.77: Minimum Performance Requirements for Noise Suppresser Application to the AMR Speech Encoder
C7.1 Introduction
This PC (Paired-Comparison) experiment was prepared to test the ‘No degradation in clean speech’ requirement in the Recommended Minimum Performance Requirements specification ([1], TS GSM 06.77). This PC experiment will be run for the whole set of bit rates of the base vocoder, in single and tandem connection.
The test methodology is direct, paired, forced-choice comparison (i.e. an A versus B test method with forced choice). The question that we are trying to answer with this test is not “What is the rank order of several coders?” but rather “Does the quality of the coder with noise suppression (+NS) meet or exceed the quality of the coder without NS for a given condition?” The direct comparison A/B test methodology can answer this question by considering the proportion (or percentage) of the measures where the candidate was preferred over the standard. Each individual judgement is a binary decision. A rank order approach could be taken, as noted in the Handbook of Telephonometry [3] regarding Paired Comparisons, but the Handbook notes: "In the scaling modulus is included the common standard deviation, which is, however, unknown and so does not permit calculating confidence limits for the scale positions obtained."
For the A/B experiment proposed here, 24 subjects each make two independent measures (A/B and B/A) of the preference for the candidate coder over the standard coder, for each of four talkers (two male and two female) in each condition, with one repeat. The effective N is therefore 24 x 2 x 4 x 2 = 384. In order to accommodate the repeat measure, single sentence samples will be used. This provides the additional benefit of directly adjacent A/B comparisons during presentation. The repeat measure will be made using a unique second sentence.
C7.2 Test Factors and Conditions
The PC test will be run for the following basic vocoder conditions:
- Bit rates of 4.75 kbit/s, 5.15 kbit/s, 5.9 kbit/s, 6.7 kbit/s, 7.4 kbit/s, 7.95 kbit/s, 10.2 kbit/s and 12.2 kbit/s.
- Single codec.
This results in a single PC experiment with clean source speech and no channel impairments. The speech material used in this experiment consists of 4 s samples (single sentence).
The following table (Table 7.1) shows the testing factors to be used in this experiment. Due to the limited number of conditions tested within this experiment, it is possible to design a more balanced test structure and introduce some dummy conditions where the perceived difference in quality within the pairs of stimuli should be obvious for the subjects. A list of test conditions is given in Table 7.3.
Main Codec Conditions | # | Notes |
Noise Suppresser Candidate | 1 | |
Codec | 1 | AMR |
Codec Modes (FR/HR) | HR FR | All 8 AMR modes |
BERs | 0 | Clear channel, no transmission errors |
Input level | 1 | nominal: -26dB relative to OVL |
Acoustic Background Noise | 0 | None |
Tandeming | 0 | No tandeming condition |
Input Characteristic | 1 | GSM Filtered |
Codec references | # | Notes |
Test vocoders | 1 | AMR with NS |
Reference vocoder | 8 | AMR at 12.2, 10.2, 7.95, 7.4, 6.7, 5.9, 5.15 & 4.75 |
Other references | # | Notes |
Direct | | Nominal level, GSM Filtered |
MNRU | 2 | Q = 5 dB & 20 dB, other Q values in preliminaries |
Ideal Noise Suppression | 0 | None |
Common Conditions | # | Notes |
GSM Channel | 0 | NO channel model |
Number of talkers | 4 | 2 male + 2 female |
Number of speech samples | 52 | 12/talker + 1 practice/talker |
Sentences/sample | 1 | Single sentence stimuli |
Listening Level | 1 | -15dBPa (79dB SPL) at ERP |
Listeners | 24 | Naive Listeners |
Randomizations | 6 | 6 groups of 4 listeners |
Rating Scale | 1 | PC Instructions |
Replications | 2 | Original Presentation + repeat w/ 2nd sentence |
Table 7.1: Factors and conditions for Experiment 1
C7.3 Preliminary Conditions
The following 16 preliminary test conditions are recommended.
Cond. | Presentation order | Reference Codec | Trans-codings | Processed Codec | Trans-codings | Talker and Sample Number |
P1 | 5 | Direct | – | MNRU-20 | – | F1S13 |
P2 | 1 | MNRU-18 | – | MNRU-22 | – | M1S13 |
P3 | 3 | MNRU-19 | – | MNRU-21 | – | F2S13 |
P4 | 7 | AMR-12.2 | 1 | AMR-12.2 | 1 | M2S13 |
P5 | 6 | AMR-12.2 | 1 | AMR-5.9 | 1 | F1S13 |
P6 | 2 | AMR-5.9 | 1 | AMR-5.9 | 1 | M1S13 |
P7 | 4 | AMR-4.75 | 1 | AMR-7.95 | 1 | F2S13 |
P8 | 8 | MNRU-5 | – | MNRU-20 | – | M2S13 |
P9 | 14 | MNRU-20 | – | Direct | – | F1S13 |
P10 | 10 | MNRU-22 | – | MNRU-18 | – | M1S13 |
P11 | 12 | MNRU-21 | – | MNRU-19 | – | F2S13 |
P12 | 16 | AMR-12.2 | 1 | AMR-12.2 | 1 | M2S13 |
P13 | 13 | AMR-5.9 | 1 | AMR-12.2 | 1 | F1S13 |
P14 | 9 | AMR-5.9 | 1 | AMR-5.9 | 1 | M1S13 |
P15 | 11 | AMR-7.95 | 1 | AMR-4.75 | 1 | F2S13 |
P16 | 15 | MNRU-20 | – | MNRU-5 | – | M2S13 |
Table 7.2: List of preliminary conditions for Experiment 1
C7.4 Speech Material
Single sentences are used. For the 4 talkers (2 male and 2 female) there are:
- 13 stimuli per talker, each stimulus 4 s long and containing 1 sentence.
- 12 unique sentences per talker for the test, plus one for practice.
To reduce the speech material effect, each talker's samples must be unique. For this experiment, the unique samples are not balanced across all conditions, candidates and subject groups. The same sample numbers for each talker are used for common conditions within a subject group and changed across subject groups.
C7.5 Experimental Design
The design is based on a restricted randomization philosophy using 6 different randomizations, each one covered by a group of 4 of the 24 subjects. This means that up to 4 subjects can perform the experiment simultaneously.
Each subject will hear each of the conditions 16 times, i.e. four times with speech from each of the four talkers. Each of the two stimuli for a talker will be presented in both the A/B and B/A order. Over the experiment as a whole, each of the conditions will be paired with twelve different samples from each of the four talkers. Each of the six groups of subjects will hear different combinations of source material and condition.
C7.6 Processing
Every condition has to be processed for each of the twelve stimuli of each of the four talkers. The actual samples used for each condition by each subject group are presented in Section 7.13 Test Conditions.
C7.7 Randomizations
Separate randomizations for each of the six subject groups shall be provided to reduce order effects and to minimize differences between the laboratories. There shall be six randomizations for the experiment, one for each subject group. Each one will therefore be used by four of the 24 subjects.
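As an illustration only, the six presentation orders could be generated as sketched below. The restriction applied here (every consecutive block of four trials contains one trial per talker) is an assumption made for the example; the text above does not state which restriction is used. The talker labels follow Table 7.2, and the seeds and function name are hypothetical.

```python
import random

TALKERS = ["F1", "F2", "M1", "M2"]   # talker labels as used in Table 7.2
CONDITIONS = list(range(1, 81))      # conditions 1-80 of Table 7.3
GROUPS = 6                           # one randomization per subject group


def randomization(seed):
    """Return one presentation order as a list of (condition, talker) trials.

    Assumed restriction (not from the specification): each consecutive
    block of four trials uses each of the four talkers exactly once.
    """
    rng = random.Random(seed)        # seeded so the order is reproducible
    per_talker = {}
    for talker in TALKERS:
        order = CONDITIONS[:]
        rng.shuffle(order)           # independent condition order per talker
        per_talker[talker] = order
    trials = []
    for i in range(len(CONDITIONS)):
        block = [(per_talker[t][i], t) for t in TALKERS]
        rng.shuffle(block)           # random talker order inside the block
        trials.extend(block)
    return trials


# One order per subject group; each is shared by the 4 listeners of that group.
orders = {group: randomization(seed=group) for group in range(1, GROUPS + 1)}
print(len(orders), len(orders[1]))
```

Each order contains all 80 x 4 = 320 trials exactly once, which matches the 64-minute presentation set at 12 seconds per trial.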
C7.8 Duration of the PC Experiment
Each trial consists of a 4 s reference sample, a 4 s speech sample and 4 s of voting time, or 12 seconds in total. For this experiment there are 16 preliminary conditions x 12 seconds, or 3.2 minutes, for an introductory block. The presentation set for the experiment consists of 40 conditions (A/B + B/A) x 2 repeats x 4 talkers x 12 seconds, or 64 minutes. The experiment is presented as the 16 preliminary conditions followed by the test itself, divided into several sessions, i.e. 67.2 minutes of testing time per subject group. The 6 groups of 4 subjects require 7 hours and 30 minutes of total testing time for the experiment (6 x approx. 1 h 15 min).
To reduce the effects of subject fatigue, sessions should be separated by short comfort breaks.
Note that the above calculations do not include the time needed to give the subjects their instructions, or for comfort breaks.
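The duration figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section; the constant and function names are illustrative.

```python
# Timing figures taken from this section; names are illustrative only.
TRIAL_SECONDS = 4 + 4 + 4     # reference + speech sample + voting time
PRELIMINARY_TRIALS = 16
CONDITION_PAIRS = 40          # 20 conditions, each in A/B and B/A order
REPEATS = 2
TALKERS = 4
SUBJECT_GROUPS = 6


def minutes(trials):
    """Listening time in minutes for a given number of trials."""
    return trials * TRIAL_SECONDS / 60


prelim_min = minutes(PRELIMINARY_TRIALS)
main_min = minutes(CONDITION_PAIRS * REPEATS * TALKERS)
total_min = prelim_min + main_min
all_groups_hours = SUBJECT_GROUPS * total_min / 60

print(f"preliminary block: {prelim_min:.1f} min")
print(f"main test:         {main_min:.1f} min")
print(f"per subject group: {total_min:.1f} min")
print(f"all groups:        {all_groups_hours:.2f} h (listening time only)")
```

The listening time for all six groups comes to about 6.7 hours, so the 7 hours 30 minutes quoted above leaves roughly 8 minutes per group beyond the pure listening time.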
C7.9 Votes Per Condition
Every condition will have 24 subjects vote on four stimuli from each of four talkers, giving:
(24 subjects x 4 talkers x 4 Presentations) = 384 votes per condition
From past experience of PC tests, this is the minimum number of votes per condition needed to give enough statistical certainty to differentiate the performance of one candidate process from another candidate process over the conditions and against the references.
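The vote count can be broken down as follows. This is a small sketch using the counts given in Table 7.1; the variable names are illustrative.

```python
# Counts taken from the text and Table 7.1; names are illustrative only.
SUBJECTS = 24
TALKERS = 4
ORDERS = 2          # A/B and B/A
REPLICATIONS = 2    # original presentation + repeat with a second sentence

presentations_per_talker = ORDERS * REPLICATIONS
votes_per_subject = TALKERS * presentations_per_talker
votes_per_condition = SUBJECTS * votes_per_subject

print(presentations_per_talker, votes_per_subject, votes_per_condition)
```

This reproduces the 4 presentations per talker, the 16 presentations of each condition heard by every subject, and the 384 votes per condition.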
C7.10 Test Procedure
Factors important for the experimental environment are specified in sections 6.4, 6.5, and 6.6. As specified in section 7.8, comfort breaks should be provided to reduce the effects of subject fatigue.
C7.11 Opinion Scale
The question asked of the subject is according to the Paired-Comparison binary scale. The specific wording is designed to evaluate the relative quality of the test sample in relation to the reference sample. In order to minimise presentation bias, the samples will be presented in both the A/B and B/A directions within the experiment. The subjects will listen to each pair of samples, and after presentation is completed, they will be asked to give their opinion. Annex A.1 contains an example of the instructions for the subjects in English.
C7.12 Statistical Analysis
The statistics to be reported for this pair-comparison experiment [4] are the proportion $P$ of subjects preferring the test stimulus over the reference stimulus (as defined in Table 2) for a total of $N$ votes per condition, the standard deviation $s$:
$$s = \sqrt{\frac{P(1-P)}{N}}$$ (Eq.1)
and the upper and lower confidence limits, as calculated by:
$$CL = P \pm z_{1-\alpha/2}\, s$$ (Eq.2)
where $z_{1-\alpha/2}$ is the standardized score for a normal distribution cutting off the lower $1-\alpha/2$ proportion of cases.
Additionally, a hypothesis to test is whether the preference for the noise reduction-enabled AMR codec is statistically different from the ideal proportion $P_0 = 0.5$, i.e. that the AMR with noise suppression is equally preferred to AMR without noise suppression (for quiet background). In other words,
$$H_0: P = P_0 \quad \text{against} \quad H_1: P \neq P_0$$
The null hypothesis $H_0$ is tested using a z test where:
$$z = \frac{P - P_0}{\sqrt{P_0(1-P_0)/N}}$$ (Eq.3)
Hence, the null hypothesis is rejected if
$$|z| > z_{1-\alpha/2}$$
or accepted if:
$$P_0 - z_{1-\alpha/2}\sqrt{\frac{P_0(1-P_0)}{N}} \le P \le P_0 + z_{1-\alpha/2}\sqrt{\frac{P_0(1-P_0)}{N}}$$ (Eq.4)
For a 95% confidence level, Equations 2 and 4 are reduced to ($z_{1-\alpha/2} = 1.96$, $N = 384$):
$$CL = P \pm 0.100\sqrt{P(1-P)}$$ (Eq.5)
$$0.45 \le P \le 0.55$$ (Eq.6)
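A minimal sketch of this analysis is given below. It assumes the usual normal-approximation formulas for a proportion (standard deviation sqrt(P(1-P)/N), two-sided limits, and a z test against the ideal proportion 0.5). The function name and the vote counts in the usage lines are hypothetical and are not results from the experiment.

```python
from math import sqrt
from statistics import NormalDist


def pc_statistics(preferred, n, confidence=0.95, p0=0.5):
    """Summarise one pair-comparison condition.

    preferred: number of votes preferring the test stimulus (hypothetical input)
    n:         total number of votes for the condition (384 in this experiment)
    """
    p = preferred / n
    s = sqrt(p * (1 - p) / n)                        # standard deviation of P
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower, upper = p - z_crit * s, p + z_crit * s    # confidence limits
    z = (p - p0) / sqrt(p0 * (1 - p0) / n)           # z test against p0
    return {
        "P": p,
        "s": s,
        "lower": lower,
        "upper": upper,
        "z": z,
        "reject_H0": abs(z) > z_crit,                # two-sided decision
    }


# Hypothetical vote counts, for illustration only.
near_even = pc_statistics(preferred=200, n=384)
clear_preference = pc_statistics(preferred=230, n=384)
print(near_even["reject_H0"], clear_preference["reject_H0"])
```

With N = 384 and a 95% confidence level, any observed proportion between about 0.45 and 0.55 leaves the null hypothesis of equal preference in place, which is why 200 preferences out of 384 is not a significant difference while 230 is.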
C7.13 Test Conditions for Experiment 1
Cond. | Reference Codec | Processed Codec | Trans-codings | Speech sample number (6 sequences) |
1 | AMR@12.2 | AMR@12.2 | 1 | 2 3 4 5 6 1 |
2 | AMR@10.2 | AMR@10.2 | 1 | 3 4 5 6 1 2 |
3 | AMR@7.95 | AMR@7.95 | 1 | 1 2 3 4 5 6 |
4 | AMR@7.4 | AMR@7.4 | 1 | 4 5 6 1 2 3 |
5 | AMR@6.7 | AMR@6.7 | 1 | 5 6 1 2 3 4 |
6 | AMR@5.9 | AMR@5.9 | 1 | 6 1 2 3 4 5 |
7 | AMR@5.15 | AMR@5.15 | 1 | 2 3 4 5 6 1 |
8 | AMR@4.75 | AMR@4.75 | 1 | 3 4 5 6 1 2 |
9 | AMR@12.2 | AMR@5.9 | 1 | 1 2 3 4 5 6 |
10 | AMR@4.75 | AMR@7.95 | 1 | 4 5 6 1 2 3 |
11 | DIRECT | MNRU Q= 20 dB | 1 | 5 6 1 2 3 4 |
12 | MNRU Q= 5 dB | MNRU Q= 20 dB | 1 | 6 1 2 3 4 5 |
13 | AMR@12.2 | AMR/NS@12.2 | 1 | 2 3 4 5 6 1 |
14 | AMR@10.2 | AMR/NS@10.2 | 1 | 3 4 5 6 1 2 |
15 | AMR@7.95 | AMR/NS@7.95 | 1 | 1 2 3 4 5 6 |
16 | AMR@7.4 | AMR/NS@7.4 | 1 | 4 5 6 1 2 3 |
17 | AMR@6.7 | AMR/NS@6.7 | 1 | 5 6 1 2 3 4 |
18 | AMR@5.9 | AMR/NS@5.9 | 1 | 6 1 2 3 4 5 |
19 | AMR@5.15 | AMR/NS@5.15 | 1 | 1 2 3 4 5 6 |
20 | AMR@4.75 | AMR/NS@4.75 | 1 | 3 4 5 6 1 2 |
21 – 40 | Reversed order of the reference and processed speech samples in cond. 1 – 20 |
41 – 60 | Repeat of conditions 1 – 20 with Speech Sample Number +6 |
61 – 80 | Reversed order of the reference and processed speech samples in cond. 41 – 60 |
Table 7.3: Test conditions for Experiment 1
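The last three rows of Table 7.3 define conditions 21 to 80 by rule. The sketch below expands the 20 listed conditions into the full set of 80 under that reading. The class and function names are illustrative, and the six-entry sample sequences are generated as cyclic rotations from their first entry, which reproduces the rows as printed in the table.

```python
from dataclasses import dataclass

RATES = ["12.2", "10.2", "7.95", "7.4", "6.7", "5.9", "5.15", "4.75"]
# First sample number of each row 1-20, as printed in Table 7.3.
STARTS = [2, 3, 1, 4, 5, 6, 2, 3, 1, 4, 5, 6, 2, 3, 1, 4, 5, 6, 1, 3]


@dataclass(frozen=True)
class Condition:
    number: int
    first: str       # stimulus presented first (A)
    second: str      # stimulus presented second (B)
    samples: tuple   # speech sample number for each of the 6 subject groups


def rotation(start):
    """Cyclic sequence over 1..6 starting at `start`, e.g. 2 3 4 5 6 1."""
    return tuple((start - 1 + k) % 6 + 1 for k in range(6))


pairs = (
    [(f"AMR@{r}", f"AMR@{r}") for r in RATES]                 # cond. 1-8
    + [("AMR@12.2", "AMR@5.9"), ("AMR@4.75", "AMR@7.95")]     # cond. 9-10
    + [("DIRECT", "MNRU Q=20 dB"), ("MNRU Q=5 dB", "MNRU Q=20 dB")]  # 11-12
    + [(f"AMR@{r}", f"AMR/NS@{r}") for r in RATES]            # cond. 13-20
)
base = [
    Condition(i + 1, a, b, rotation(STARTS[i]))
    for i, (a, b) in enumerate(pairs)
]


def expand(base_conditions):
    """Build conditions 1-80 from conditions 1-20."""
    # 21-40: same samples, presentation order reversed.
    swapped = [
        Condition(c.number + 20, c.second, c.first, c.samples)
        for c in base_conditions
    ]
    first_half = base_conditions + swapped
    # 41-80: repeat of 1-40 using sample numbers +6 (the second sentence set).
    repeat = [
        Condition(c.number + 40, c.first, c.second,
                  tuple(s + 6 for s in c.samples))
        for c in first_half
    ]
    return first_half + repeat


conditions = expand(base)
print(len(conditions), conditions[72])
```

Condition 73, for example, is the reversed-order repeat of condition 13: AMR/NS@12.2 is presented first, AMR@12.2 second, with sample numbers 8 9 10 11 12 7.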