C6 Information relevant to all Experiments

3GPP TS 06.77: Minimum Performance Requirements for Noise Suppresser Application to the AMR Speech Encoder

C6.1 General Technical Notes

Any and all deviations from the specifications contained in this document and the Processing Functions document [2] must be documented and submitted to SMG11/S4 along with the experimental results.

C6.2 Codec Adaptation and Error Conditions

The philosophy of the AMR system is that it is capable of dynamically altering the ratio of speech and channel coding to maximize speech performance as channel conditions change. Each of the combinations of speech and channel coding rates is known as a mode.

However, for the purpose of the AMR Noise Suppresser tests, only fixed mode operation will be considered.

C6.3 Speech Material

All AMR-NS Experiments are subjective listening experiments using pre-recorded speech passed through the candidate algorithms and simulated impairment conditions prior to use in the experiments. Three types of speech sample are used in these experiments:

  • Single sentence samples, 4 seconds in length
  • Short samples; sentence pairs, 8 seconds in length.
  • Long samples; sentence quadruplets, 16 seconds in length.

The experiment investigating the equivalence of the candidate Noise Suppresser algorithms to the AMR algorithm without noise suppression in a quiet environment (PC experiment 1) will use the single sentence stimuli. The experiments investigating the possible introduction of artifacts and clipping by the candidate Noise Suppresser algorithms (ACR experiments 2a, 2b & 2c) will use the long 16-second samples. Experiment 2 includes conditions investigating level dependency, VAD and DTX. All other experiments will use the short 8-second samples.

For all original speech samples a 2s header will be added to accommodate the Initial Convergence Time of the Noise Suppresser algorithms. For all experiments this header should be removed at the end of the processing prior to being used in subjective listening tests.
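As an illustration, the header handling can be sketched as follows (the helper names are hypothetical; an 8 kHz, 16-bit narrow-band format held in NumPy arrays is assumed):

```python
import numpy as np

FS = 8000          # assumption: 8 kHz narrow-band sampling rate
HEADER_S = 2.0     # convergence header length required above

def add_header(speech, ambient):
    """Prepend a 2 s header so the Noise Suppresser can converge.

    The header is built from ambient recording-room noise rather than
    digital silence, to avoid noise contrast effects (section 6.3)."""
    n = int(HEADER_S * FS)
    header = np.resize(ambient, n)        # repeat the ambient clip to 2 s
    return np.concatenate([header, speech])

def strip_header(processed):
    """Remove the 2 s header after processing, before subjective use."""
    return processed[int(HEADER_S * FS):]

speech = np.zeros(4 * FS, dtype=np.int16)    # placeholder 4 s sample
ambient = np.ones(FS // 2, dtype=np.int16)   # placeholder ambient noise
padded = add_header(speech, ambient)
restored = strip_header(padded)
```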

Information for constructing these sentences is provided in the remainder of this subsection.

Unless stated otherwise in the individual plans, each source speech file will contain unique speech material (i.e. none of the sentences used in any given sample should be used in any other sample for the same, or any other talker within any sub experiment).

Pre-recorded source speech material may be purchased as described in Section 6.3.1. Preferably, however, the test house should provide its own source speech material, following the guidelines contained in Section 6.3.2.

To avoid noise contrast effects, any silence gaps and/or pauses added to the speech files to pad them out into the specified formats for the source speech samples described in sections 6.3.3, 6.3.4 and 6.3.5, should not be pure digital silence. Padding out should be done by adding the ambient noise present during the recording of the speech material between the sentences.

The information in sections 6.3.3, 6.3.4 and 6.3.5 should be used in the preparation of the material that the talkers will utter, as well as how the recorded material should be constructed.

C6.3.1 Availability of Pre-recorded Speech Material

A "Multi-lingual Speech Database for telephonometry 1994", on 4 CD-ROM disks, was available from NTT-AT, No. 7 Hakuei Building, 2-4-15 Naka-machi, Musashino-shi, 180 Japan (phone: +81 422 37 0823, fax: +81 422 60 4806).

In this database, the speech samples consist of pairs of short sentences with a total length of 8-10 seconds. Each sentence lasts approximately 2 to 3 seconds. Four male and four female native speakers are assigned to each of the 21 languages and 96 speech samples are available for each language. The sampling rate is 16 kHz. Active speech level (as defined in ITU-T Rec. P.56) of every speech sample is adjusted to -26dBovl.

Each CD consists of two different areas: audio and data. Speech samples in the audio area are digitized at 44.1 kHz with 16-bit word length linear PCM and can be played back on a commercial CD player. All speech samples in the data area are stored in a standardized 16-bit, 2's complement, low-byte-first (little-endian) format and can be retrieved by an ordinary PC-DOS system with a CD-ROM reader.
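Retrieving a sample from the data area can be sketched as below (this assumes the files are headerless raw PCM; `read_raw_pcm` is an illustrative name, and a temporary file stands in for a CD sample):

```python
import numpy as np
import os
import tempfile

def read_raw_pcm(path):
    """Read a headerless 16-bit, 2's-complement, little-endian PCM file,
    the format used in the data area of the database CDs."""
    return np.fromfile(path, dtype="<i2")    # "<i2" = little-endian int16

# Round-trip check: write known samples, read them back
samples = np.array([0, 1, -1, 32767, -32768], dtype="<i2")
with tempfile.NamedTemporaryFile(suffix=".raw", delete=False) as f:
    samples.tofile(f)
    tmp = f.name
decoded = read_raw_pcm(tmp)
os.remove(tmp)
```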

C6.3.2 Recording Your Own Speech Databases

All speech recordings should be made in acoustical and electrical environments complying with the requirements given in Annex B.1.1 of ITU-T Rec. P.800.

The recommended method is to record the speech with a linear microphone and a low-noise amplifier with flat frequency response, digitize the speech, and then flat filter and level equalize. To achieve optimum SNR, the microphone should be positioned 15 to 20 cm from the talker’s lips. A windscreen should be used if breath puffs from the talker are noticed.

The recordings should be made directly into a computer (A/D) or via a high quality recording system such as a DAT.

C6.3.3 Format for Single Sentence Speech Samples

Each source speech file will contain one sentence and will last nominally 4s. All source speech files within an experiment will be exactly the same length. This enhances the ability to recognize processing problems. An approximately 0.5-second period of silence precedes the sentence, and a similar period of silence follows it. The speech files are organized as in the example shown in Figure 6.3.1. The sentences will be simple meaningful sentences as described in Annex B1.4 of ITU-T Rec. P.800.

Figure 6.3.1: Example of Speech file structure for single sentences

It must be noted that the trailing silence of 0.5s after the end of the sentence in the file is of extreme importance, since there are (for some conditions) a series of FIR filters with a large number of coefficients. If the prescribed trailing silence is not present, there is a considerable risk that speech will be clipped at the end of the file.
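The need for the trailing silence can be illustrated numerically: an FIR filter with N coefficients extends its input by N-1 samples, and that tail must fit within the 0.5 s of padding. A sketch (the 481-tap filter and the 8 kHz rate are assumptions for illustration only):

```python
import numpy as np

FS = 8000                       # assumption: 8 kHz sampling rate
NTAPS = 481                     # hypothetical FIR length for illustration
taps = np.ones(NTAPS) / NTAPS   # simple averaging filter as a stand-in

# Put an impulse at the very last sample of a file with no trailing padding:
x = np.zeros(FS)
x[-1] = 1.0
y = np.convolve(x, taps)        # full convolution exposes the filter tail

tail = len(y) - len(x)          # samples spilling past the original file end
silence_budget = int(0.5 * FS)  # the prescribed 0.5 s of trailing silence
```

With these numbers the tail (480 samples) fits comfortably inside the 4000-sample silence budget; without the padding, those samples of speech energy would be lost.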

C6.3.4 Format for Short Speech Samples

Each source speech file will contain one pair of sentences and will last nominally 8 seconds, with a flexible time interval between the two sentences. All source speech files within an experiment will be exactly the same length. This enhances the ability to recognize processing problems. An approximately 0.5-second period of silence precedes the first sentence in the file, and a similar period of silence follows the second sentence in the file. The speech files are organized as in the example shown in Figure 6.3.2. The sentences will be simple meaningful sentences as described in Annex B1.4 of ITU-T Rec. P.800.

Figure 6.3.2: Example of speech file structure for short speech samples

It must be noted that the trailing silence of 0.5s after the end of the second sentence in the file is of extreme importance, since there are (for some conditions) a series of FIR filters with a large number of coefficients. If the prescribed trailing silence is not present, there is a considerable risk that speech will be clipped at the end of the file.

C6.3.5 Format for Long Speech Samples

Each sample will contain 4 different sentences and will last nominally 16 seconds, with a time interval between sentences as described in Annex B1.4 of ITU-T Rec. P.800. All source speech files within an experiment will be exactly the same length. An approximately 0.3 to 0.5-second period of silence precedes the first sentence in the file, and a similar period of silence follows the last sentence in the file. The speech files are organized as in the example shown in Figure 6.3.3. The sentences will be simple meaningful sentences as described in Annex B1.4 of ITU-T Rec. P.800. Active speech in each source speech file should be present for not less than 9 seconds and not more than 12s. (Note: this last requirement may be hard to meet for some speech databases. A typical English Harvard sentence is less than 2 seconds long, so four of these would total less than the required 9 seconds of active speech. A reasonable relaxation of this requirement should therefore be tolerated.)

Figure 6.3.3: Example of speech file structure for long speech samples

These samples could be built by concatenating two of the 8-second samples described in Section 6.3.4, providing that the constraint for the active speech described above is (reasonably) fulfilled.

C6.3.6 Processing of the Speech Files

All speech files will need to be pre-processed prior to being processed through the experimental conditions. This pre-processing ensures that the speech is at the correct level and has the correct input characteristic. Full details on the processing required are given in [2]. Speech levels will be measured with the P.56 algorithm and level adjusted with the gain/loss algorithm to the level required for each test condition as defined in the test plans for the individual experiments. Where the nominal level is specified, this level should be set to 26dB (±1dB) below digital overload (-26dBovl).
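The level adjustment can be sketched as follows. NOTE: this is a simplified illustration in which a plain r.m.s. measure stands in for the ITU-T P.56 active speech level meter (which additionally gates out silent intervals); the helper names are hypothetical:

```python
import numpy as np

TARGET_DBOV = -26.0      # nominal level specified by the test plan
FULL_SCALE = 32768.0     # 16-bit digital overload point

def level_dbov(x):
    """Signal level in dB relative to digital overload.

    NOTE: a plain r.m.s. stand-in for the ITU-T P.56 active speech
    level meter, which additionally gates out silent intervals."""
    rms = np.sqrt(np.mean(np.square(x / FULL_SCALE)))
    return 20.0 * np.log10(rms)

def normalize(x, target_dbov=TARGET_DBOV):
    """Apply a flat gain/loss so the measured level meets the target."""
    gain = 10.0 ** ((target_dbov - level_dbov(x)) / 20.0)
    return x * gain

tone = 1000.0 * np.sin(2 * np.pi * np.arange(8000) / 20.0)  # test signal
adjusted = normalize(tone)
```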

Some of the experiments require that the source speech material has background noise added. Details of the process to be followed are given in [2]. Noise levels will be measured with the r.m.s. computation algorithm and level adjusted with the gain/loss algorithm to the required level. The following procedure will be followed:

  1. The environmental noise will be Delta SM filtered to incorporate a near field microphone response.
  2. The environmental noises will be passed through the GSM send characteristic (see [2]).
  3. The noise levels will be adjusted using the r.m.s. measure to the mean level dictated by the test plans for the individual experiments. For each type of noise, six segments will be taken from the noise file. The segments will be numbered from N1 to N6.
  4. The source speech material will be passed through the GSM send characteristic [2] and normalized (level equalized to -26dB) using the speech level meter complying with Rec. P.56. This is the responsibility of the Host Laboratories.
  5. Finally, the noise will be digitally mixed with the normalized speech material. If the resulting signal amplitude exceeds the overload point of the A/D converter, it should be limited to the peak value and the clipping effect should be controlled by expert observation. The following mixing scheme details the combining of speech and noise samples for each speaker.

                            M1    M2    F1    F2
Speech sample 1             N1    N2    N3    N4
Speech sample 2             N2    N3    N4    N5
Speech sample 3             N3    N4    N5    N6
Speech sample 4             N4    N5    N6    N1
Speech sample 5             N5    N6    N1    N2
Speech sample 6             N6    N1    N2    N3
Speech sample 7 (practice)  N1    N3    N5    N2

Table 6.3.4: Speech vs. Noise samples mixing scheme
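The rotation in Table 6.3.4 can also be generated programmatically, which is convenient when scripting the mixing step; a sketch (assuming the six segments are indexed 1 to 6 and the practice sample uses the fixed pattern shown):

```python
def mixing_scheme():
    """Reproduce the noise-segment assignment of Table 6.3.4.

    Rows are speech samples 1-7 (sample 7 = practice); columns are the
    talkers M1, M2, F1, F2; entries are noise segments N1..N6."""
    talkers = ["M1", "M2", "F1", "F2"]
    # Samples 1-6 rotate the segment index by one per sample and per talker
    rows = [[f"N{(s + t) % 6 + 1}" for t in range(4)] for s in range(6)]
    rows.append(["N1", "N3", "N5", "N2"])   # practice sample is irregular
    return talkers, rows

talkers, rows = mixing_scheme()
```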

C6.4 Listening Environment

For all experiments, subjects should be seated in a quiet environment; 30dBA Hoth Spectrum (as defined by ITU-T Recommendation P.800, Annex A, section A.1.1.2.2.1 Room Noise, with Table A.1 and Figure A.1) measured at the head position of the subject. This will help ensure consistency between the different subjects in the same laboratory as well as across the different laboratories in which these experiments will be performed.

The following points should be adhered to:

  • Where the experiment design and the listening environment allows for multiple subjects in each listening session, the requirements stated above apply to each of the positions the subjects will occupy.
  • Where there are multiple simultaneous subjects, they should not be able to see the responses made by other subjects.
  • All test stimuli will be presented to the subjects over a telephone handset with Modified IRS receiving response (exclusive of the SRAEN filter). Any deviation shall be reported, e.g. use of one ear-piece in a headphone.
  • Subjects should be told not to discuss the experiment with subjects who are yet to participate.
  • Any test house performing multiple experiments must use different listening subjects for each experiment or sub-experiment.

C6.5 Experimental Procedure

Initially the experimenter should present and explain the experiment instructions to the subjects. Once the subjects have understood the instructions, they will first listen to and score the preliminary conditions. After the preliminaries have been completed, sufficient time should be allowed for answering questions from the subjects. Any questions about the procedure or the meaning of the instructions should be answered, but technical questions on matters such as the experimental methodology or the types of distortions being presented must not be answered until the subjects have completed the experiment.

C6.6 Preliminary Conditions

Preliminary conditions are included in the experiment to help acclimatize the subjects to the experimental procedure and to help reduce learning effects, by ensuring that the subjects hear a full range of the potential qualities at the start of the experiment. No suggestion should be made to the subjects that the preliminary samples include the best or worst in the range to be covered, or exhaust the range of conditions they can expect to hear.

C6.7 Reference Conditions

Four types of reference conditions are used in these experiments:

  • AMR without NS references: These are used to determine how the candidate Noise Suppresser algorithms perform relative to the AMR codec operating without noise suppression.
  • Direct unprocessed speech plus noise source material.
  • MNRU references: These are included as standard references of known and well understood performance and will allow the results to be expressed in terms of Equivalent Q as well as MOS for the ACR tests. MNRUs are also included in the CCR tests as references to estimate the test sensitivity and explore most of the CMOS range. For the CCR experiments, relative MNRU comparisons are used to estimate the test sensitivity. For example, an MNRU of 12 dB may be compared to an MNRU of 16 dB; if this difference is just above the significance level, it represents the test sensitivity.
  • Ideal noise suppression levels: These are represented by varying the SNR level between the speech and noise. These conditions are AMR processed. This is an attempt to define equivalent noise suppression levels.
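The "ideal noise suppression" reference conditions can be produced by scaling the noise relative to the speech before mixing. A simplified sketch (plain mean-square power is used in place of the P.56 / r.m.s. meters mandated elsewhere in this plan; the function name is illustrative):

```python
import numpy as np

def scale_noise_to_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`.

    NOTE: simplified sketch; powers are plain mean-square values rather
    than the P.56 / r.m.s. measures mandated elsewhere in this plan."""
    p_speech = np.mean(np.square(speech))
    p_noise = np.mean(np.square(noise))
    target = p_speech / (10.0 ** (snr_db / 10.0))
    return noise * np.sqrt(target / p_noise)

rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)   # placeholder signals
noise = rng.standard_normal(8000)
scaled = scale_noise_to_snr(speech, noise, 12.0)
snr = 10.0 * np.log10(np.mean(speech**2) / np.mean(scaled**2))
```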

For the tests involving background noise conditions, the MNRU references will use noisy speech (i.e. background noise will be used with the MNRU). The exact number of each of these types of reference in each experiment can be found in the experiment plans in sections 7 to 10.

C6.8 Noise Material

Most of the Noise Suppresser Selection Test Experiments require the addition of noise to the speech material. The following types of noise are identified in this test plan:

Car Noise: This represents stationary (static) background noise and will be typical of the noise experienced when inside a moving vehicle (car) at a constant speed.

Street Noise: This represents non-stationary (dynamic) noise and will be typical of noise which might be experienced by someone using a mobile on a city street.

Babble Noise: This represents non-stationary (dynamic) noise and will be typical of the background noise encountered in public places: restaurant, cafeteria, open offices.

Noise files, available free of charge from ARCON solely for the purposes of SMG11/S4 work, shall be used. Contact ETSI (Paolo Usai) for further information (Paolo.Usai@ETSI.FR).