
Subjective and objective assessment of sound quality:
solutions and applications
Carlos Herrero
HUT, Telecommunications Software and Multimedia Laboratory
[email protected]
Abstract
The aim of this paper is to review current research projects and recommendations
related to the subjective and objective assessment of sound quality. The paper
describes the problems and limitations of subjective testing, shows the results of
evaluating the ITU-R objective audio quality measurement method, and presents
different application domains and recent research in this field.
Table of Contents
1 Introduction
2 Subjective assessment of sound quality
2.1 Review of ITU-R recommendations related to subjective testing
2.1.1 Recommendation ITU-R BS.1116 – Small impairments
2.1.2 Recommendation ITU-R BS.1534 – Intermediate quality
2.2 Limitations and problems of subjective tests
3 Objective measurement of sound quality
3.1 Overview of ITU recommendations: PEAQ and PESQ
3.2 Evaluation of PEAQ
3.3 Application domains of objective measurement of sound quality
3.4 Beyond PEAQ and PESQ
4 Conclusions
1 INTRODUCTION
Standards and recommendations about sound quality assessment are needed to correctly
compare the performance of different audio systems and hardware. The main goal of this
paper is to introduce and to explain objective measurements and subjective assessments of
audio signals. The first part of the paper is dedicated to ITU-R recommendations related to
subjective testing. It also discusses their problems and limitations, which serves as a bridge to
the second part, where objective measurement methods are explained.
2 SUBJECTIVE ASSESSMENT OF SOUND QUALITY
The digital audio chain contains different stages and pieces of equipment: microphone, recording,
coding, transmission, decoding and loudspeakers. Linear and nonlinear errors accumulate in
the audio chain. For recording, coding, decoding and transmission systems the goal is that the
audio signal that comes out of the system should sound exactly like the input. The
representation of audio signals by Pulse Code Modulation (PCM) can be made arbitrarily
good simply by increasing the word-length, and the transmission and recording of PCM
signals can be made arbitrarily precise by using appropriate error correction. In these stages
of the audio chain there is always a design trade-off between computational complexity and
audio quality. However, the converters, located at the beginning and at the end of the process,
have analog limitations: quantizers are inherently non-linear, and sample rate conversion is
needed to create consumer versions (at a 44.1 kHz sample rate) from professional recordings
(48 kHz). Nowadays, the storage and transmission of music over the Internet depend
increasingly on lossy audio compression algorithms, which take advantage of the properties
of the human auditory system using psychoacoustic models. With a sufficiently high bit rate it is
possible to control the resulting coding distortions so that they are below the threshold of
hearing, but on many occasions those distortions can still be easily detected.
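As a rough, self-contained illustration of the word-length argument (not taken from the paper), the following Python sketch quantizes a sine tone with different word lengths and prints the resulting signal-to-noise ratio, which grows by roughly 6 dB per additional bit:

    import numpy as np

    def quantization_snr_db(signal, bits):
        # Uniform quantization to a signed grid with 2**(bits-1) steps per unit.
        levels = 2 ** (bits - 1)
        quantized = np.round(signal * levels) / levels
        noise = signal - quantized
        return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

    t = np.arange(48000) / 48000.0
    sine = 0.9 * np.sin(2.0 * np.pi * 1000.0 * t)   # 1 kHz test tone, one second
    for bits in (8, 16, 24):
        print(bits, "bits:", round(quantization_snr_db(sine, bits), 1), "dB")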
One of the best ways to compare coding algorithms, recording and transmission systems,
or microphones and loudspeakers, is by using standardized methods for the subjective
assessment of audio signals. Those methods have been historically defined by the
International Telecommunication Union (ITU), and different recommendations have been
proposed for different purposes, as can be seen in the following section.
2.1 Review of ITU-R recommendations related to subjective testing
The methods used for the subjective assessment of audio quality itself and of the
performance of audio systems depend somewhat on the intended purpose of the assessment.
Hence, some recommendations are used when audio signals are tested together with
pictures but, when this is not the case, two main situations can occur. If the impairments are
small, the ITU-R BS.1116 recommendation (ITU-R, 1997) is used, whereas for evaluating audio
signals of intermediate quality the recommendation ITU-R BS.1534 (ITU-R, 2001), also known
as MUSHRA, is the preferred method. These are the most commonly used recommendations
and they are described in detail in the following subsections.
2.1.1 Recommendation ITU-R BS.1116 – Methods for the subjective assessment of small
impairments in audio systems, including multichannel sound systems
This is the method intended for use in the assessment of systems that introduce small
impairments. These can be so small that rigorous control of the experimental conditions and
appropriate statistical analysis are needed to detect them. If the analyzed systems introduce
relatively large and easily detectable impairments, using the ITU-R BS.1116 recommendation
leads to an excessive expenditure of time and effort, and the results may also be less reliable
than those obtained by employing a simpler test method. This recommendation is a basic
reference for the other subjective assessment recommendations, which may contain additional
special conditions or relaxations of the ITU-R BS.1116 requirements. The result of a test
conducted according to the ITU-R BS.1116 recommendation is the basic audio quality of the
system under test.
During the test, the listener is free to listen to any of three audio sources, one of which is
known to be the reference signal. The other two sources may be either the test signal or the
reference signal again. Listeners must be extensively trained, and they are asked to rate those
two audio sources in relation to the known reference signal. One of the sources is the hidden
reference and should be indiscernible from the known reference, while the other may reveal
impairments. A continuous five-grade impairment scale is defined, where 5.0 means
imperceptible impairments, 4.0 perceptible but not annoying, 3.0 is given to slightly annoying
impairments, 2.0 to annoying and 1.0 to very annoying impairments. For the statistical analysis
the listener's ratings are transformed into a single value, called the subjective difference grade
(SDG), defined as the difference between the grade given to the signal under test and the grade
given to the reference signal. The SDG is 0 when the tested signal contains imperceptible
impairments or no impairments at all, while an SDG of -4 indicates that the test signal contains
very annoying impairments.
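As a minimal illustration of this definition (the grades below are invented for the example), the SDG is simply the difference between the two grades:

    def subjective_difference_grade(grade_test_signal, grade_reference):
        # Grades come from the continuous five-grade impairment scale (1.0 ... 5.0).
        # 0 means imperceptible impairments, -4 means very annoying impairments.
        return grade_test_signal - grade_reference

    print(round(subjective_difference_grade(3.2, 5.0), 1))   # prints -1.8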
2.1.2 Recommendation ITU-R BS.1534 – Method for the subjective assessment of
intermediate quality level of coding systems
A different recommendation is intended to cover the subjective assessment of the
intermediate quality level of coding systems. The advent of Internet multimedia has
stimulated the development of several advanced audio and video compression technologies.
To be used on the Internet, audio content is required to be coded at extremely low bit-rates
while preserving, to a large extent, the subjective quality of the original signal. Current
Internet audio codecs show a large variation in the audio quality achieved at different
bit-rates and for different audio signals.
Subjective listening tests using a number of qualified listeners and a selection of audio
sequences are still recognized as being the most reliable way of quality assessment. However,
the test method defined in ITU-R BS.1116, and explained before, is not suitable for assessing
such lower audio qualities; it is generally too sensitive, leading to a grouping of results at the
bottom of the continuous five-grade impairment scale. Thus, the EBU Project Group B/AIM
proposed a new test method, called MUSHRA (MUlti Stimulus test with Hidden Reference
and Anchors). The method was designed to give a reliable and repeatable measure of the
audio quality of intermediate-quality signals. The method was afterwards standardized by the
ITU-R and has been frequently used since then.
Whereas ITU-R BS.1116 uses a “double-blind triple-stimulus with hidden reference” test
method, MUSHRA is a “double-blind multi-stimulus” test method with hidden reference and
hidden anchors. The first method is adequate when the test signal presents only small
impairments and the assessor is asked to detect any perceptible annoyance caused by artifacts
and distortions in the signal. When the test signal contains large impairments, the assessor has
no difficulty detecting the artifacts but must also grade the relative annoyance of the various
artifacts, which is a more difficult task. The perceptual distance between the reference
and the test items is expected to be relatively large. Thus, if each system is only compared
with the reference, the differences between any two systems may be too small to discriminate
between them. Consequently, MUSHRA uses not only a high-quality reference but also a
direct paired comparison between different systems. The assessor can switch at will between
the reference signal and any of the systems under test. Because the assessors can directly
compare the impaired signals, they can relatively easily detect differences between the
impaired signals and can then grade them accordingly. This feature permits a high degree of
resolution in the grades given to the systems.
The grading scale used in the MUSHRA process is different from the one used in ITU-R
BS.1116; instead, it employs the scale traditionally used for the evaluation of picture quality,
the five-interval Continuous Quality Scale (CQS). The intervals are described from top to the
bottom as Excellent, Good, Fair, Poor and Bad. The listeners record their assessments of the
audio quality in a suitable form; for example, using sliders on an electronic display.
Figure 1. User interface for MUSHRA test. (Stoll, 2000)
Figure 1 shows the user interface which was used for MUSHRA tests during an evaluation
of Internet audio codecs by EBU B/AIM Project Group. The buttons represent the reference,
which is specially displayed on the top left, and all the signals under test, including the hidden
reference and two anchors, which are low-pass filtered versions (3.5 kHz and 7 kHz) of the
unprocessed signal. Under each button, with the exception of the button for the reference, a
slider is used to grade the quality of the test item according to the CQS. Sliders are typically 10
cm long or more, with an internal numerical representation in the range of 0 to 100. That is
important because the statistical analysis of the results obtained is perhaps one of the most
demanding tasks. The scores given by each listener are normalized and combined with the
other listeners' scores, and the calculation of those average scores results in the Mean
Subjective Score (MSS) for that signal. While the SDG values obtained with ITU-R BS.1116
vary from 0 (excellent) to -4 (very annoying), the MSS values vary from 0 to 100, where 0
corresponds to the bottom of the scale (bad quality).
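A hedged sketch of this averaging step is shown below; the per-listener normalization mentioned above is omitted here, and the slider scores are invented:

    import numpy as np

    # Rows: listeners; columns: test items (hidden reference, two anchors, three codecs).
    raw_scores = np.array([
        [100, 28, 57, 61, 74, 88],
        [ 95, 35, 52, 55, 70, 92],
        [100, 30, 55, 58, 68, 85],
    ], dtype=float)

    def mean_subjective_scores(scores):
        # Average each test item over all listeners (0 = bad, 100 = excellent).
        return scores.mean(axis=0)

    print(mean_subjective_scores(raw_scores))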
Figure 2 depicts one example of MUSHRA test results. There were six signals under test:
one hidden reference, which always gets the maximum score; two anchors, which get scores of
30 (3.5 kHz) and 55 (7 kHz); and, finally, three coded audio signals, which get better scores at
higher bit-rates.
Figure 2. AMR and AAC codecs compared with MUSHRA test. (Seppänen, 2004)
2.2 Limitations and problems of subjective tests
The use of the human being as an acoustic measuring device has many well-known
disadvantages, the most important of which are the variety and variability of listeners
(Rothauser, 1966). Moreover, in order to obtain reliable data, formal subjective tests should
be performed under optimal listening conditions using careful experimental procedures and a
sufficient number of expert listeners. Because of these constraints, many situations can arise
where such listening tests are impractical (Treurniet, 2000).
In most cases the result of a listening test is presented as a statement of the mean value of
the listeners' responses and of the variance of these responses. Even if these data are given, the
general significance of the result is still unknown. Questions like the following have at least
to be considered:
- What are the largest tolerable variances and the smallest number of listeners, so that the given mean value can really be representative of a larger community of people?
- Were the listeners selected randomly, or with regard to some special considerations like "students of HUT with normal hearing"?
- Was the group of listeners trained, and to what extent?
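As an illustration of the kind of statistics these questions refer to, the short sketch below (with invented grades) computes the mean grade and a 95% confidence interval for one test condition:

    import numpy as np
    from scipy import stats

    grades = np.array([3.8, 4.1, 3.5, 4.3, 3.9, 4.0, 3.6, 4.2])   # one grade per listener

    mean = grades.mean()
    half_width = stats.sem(grades) * stats.t.ppf(0.975, len(grades) - 1)
    print("mean = %.2f, 95%% CI = [%.2f, %.2f]" % (mean, mean - half_width, mean + half_width))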
Another fact to be taken into account is the ambiguity of the questioning, since the
interpretation of each subjective term is left to the listener. Besides that, the translation of the
questionnaire into different languages can cause the same test to yield different results for
each type of audience. Standard and reproducible subjective measurement procedures have
been defined, the MUSHRA and ITU-R BS.1116 recommendations; these minimize the risk of
audience-dependent results, although they are very expensive in terms of cost and time.
Because of the limitations and problems presented above, reliable methods for the
objective measurement of perceived audio quality are highly desirable. Some of them are
presented in the following section.
3 OBJECTIVE MEASUREMENT OF SOUND QUALITY
Devising a method for predicting an average subjective quality rating using only objective
measurements of audio signal characteristics is a significant challenge. It must include an
accurate model of psychoacoustic processes in order to predict the detectability of near-threshold stimuli in various audio contexts, and it must also include knowledge about
cognitive aspects of audio quality judgments.
Those methods were first devised for speech codecs and later applied to wide-bandwidth
signals. Several psychoacoustic models were proposed, for both narrow-band and wide-band
audio, and the emergence of various approaches emphasized the need for standardized
methods. First, in 1996, Recommendation ITU-T P.861 (ITU-T, 1996) was published,
describing an objective quality assessment algorithm for speech codecs (PSQM); its successor,
PESQ (Perceptual Evaluation of Speech Quality), was standardized as ITU-T P.862 in 2001.
Then, in 1998, an algorithm for the objective measurement of wide-band audio signals was
presented, Recommendation ITU-R BS.1387 (ITU-R, 1998), also called PEAQ (Perceptual
Evaluation of Audio Quality). Both measurement systems are described below, together with a
comparison of results obtained with subjective tests and objective measurements, which serves
to evaluate the validity of PEAQ. The section finishes with some recent progress and advances
in this field, research works that go beyond the PESQ and PEAQ methods, as well as remaining
challenges.
3.1 Overview of ITU recommendations related to objective measurement of sound
quality: PEAQ and PESQ
The PEAQ and PESQ methods were not built from scratch but by combining ideas from several
proposed methods (Thiede, 2000). For example, the road to the final standardization of PEAQ
was as follows. In 1994 the ITU-R initiated a process to identify and recommend a method
for the objective measurement of perceived audio quality. The first task was to create a
committee that would clarify the expected applications of such a method, examine the
performance of existing methods, and describe the method selected or, if existing methods
were found to be inadequate, the new method created to meet the performance requirements. A
call for proposals resulted in responses from seven model proponents, and their performances
were compared. Since no single model was significantly better than all of the others, the
original proponents collaborated to develop a new, improved model called PEAQ. Finally, two
versions of the method were developed. The Basic Version of PEAQ is intended to be fast
enough for real-time monitoring, whereas the Advanced Version requires more computational
power to achieve higher reliability.
A high-level representation of the PEAQ model is shown in Figure 3. In general, it
compares a signal that has been processed in some way with the corresponding original
signal. Concurrent frames of the original and processed signal are transformed into a
time-frequency representation by the psychoacoustic model. Then a task-specific model of
auditory cognition reduces these data to a number of model output variables (MOVs), and
finally those scalar values are mapped to the desired quality measurement.
The psychoacoustic model in the Basic Version uses a Discrete Fourier Transform (DFT)
to transform the signal into a time-frequency representation; however, the Advanced Version
uses both a DFT and a filter bank. The data from the DFT is mapped from the frequency scale
to a pitch scale, the psychoacoustic equivalent of frequency. For the filter bank, the frequency
to pitch mapping is implicitly taken into account by the bandwidths and spacing of the bandpass filters.
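The frequency-to-pitch warping can be illustrated with the classic Zwicker-Terhardt Bark approximation below; PEAQ's own mapping is the one specified in BS.1387, so this sketch only conveys the idea of warping DFT bins onto a perceptual pitch scale:

    import numpy as np

    def hz_to_bark(f_hz):
        # Zwicker-Terhardt approximation of the critical-band rate (Bark) scale.
        f = np.asarray(f_hz, dtype=float)
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    dft_bin_frequencies = np.linspace(0.0, 18000.0, 7)       # a few DFT bin centres in Hz
    print(np.round(hz_to_bark(dft_bin_frequencies), 2))      # corresponding pitch values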
Figure 3. High-level description of model. (Treurniet, 2000)
The psychoacoustic model of PEAQ produces two different representations of the input
signals. Those representations are compared by the cognitive model to calculate the MOV
values that summarize psychoacoustic activity over time. Important information for making
the quality measurement is derived from the differences between the frequency and pitch
domain representations of the reference and test signals. In the frequency domain, the spectral
bandwidths of both signals are measured and the harmonic structure in the error is
determined. In the pitch domain, error measures are derived from the excitation envelope
modulations, the excitation magnitudes, and the excitation derived from the error signal
calculated in the frequency domain.
The model variables, MOVs, are used by the model to predict the subjective quality rating
that would be assigned to the processed signal in a formal ITU-R BS.1116 based listening
test. This prediction of the SDG is called the objective difference grade (ODG), and it has
the same meaning as the SDG value, which was explained previously in section
2.1.1. The PEAQ quality measurement is based on eleven MOVs for the Basic Version, and
on five variables for the Advanced Version. The transformations from the MOVs to the ODG
were optimized using data from previously conducted listening tests (Thiede, 2000).
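In the recommendation this MOV-to-ODG mapping is realized as a small trained neural network with fixed coefficients; the sketch below, with made-up weights and an assumed output scaling, shows only the general shape of such a mapping, not the actual PEAQ coefficients:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def movs_to_odg(movs, w_hidden, b_hidden, w_out, b_out):
        # One hidden layer maps the MOV vector to a scalar distortion index,
        # which is then squashed to roughly the [-4, 0] ODG range.
        hidden = sigmoid(w_hidden @ movs + b_hidden)
        distortion_index = float(w_out @ hidden + b_out)
        return -4.0 * sigmoid(distortion_index)

    rng = np.random.default_rng(1)
    movs = rng.normal(size=11)                  # the Basic Version uses eleven MOVs
    w_hidden, b_hidden = rng.normal(size=(3, 11)), rng.normal(size=3)
    w_out, b_out = rng.normal(size=3), 0.0
    print(round(movs_to_odg(movs, w_hidden, b_hidden, w_out, b_out), 2))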
Similarly, the model for perceptual evaluation of speech quality, or PESQ, is based on an
integration of two previous models, the perceptual speech quality measure, known as PSQM,
and the perceptual analysis measurement system, PAMS. PESQ uses a psychophysical model
of the human hearing system, as well as a cognitive model. The quality score is based on the
average distance between the transforms of the reference and degraded signals. The quality
score that PESQ produces is a prediction of perceived listening quality based on the absolute
category rating method (ACR). In this method, listeners hear a number of degraded
recordings, and are prompted to vote on each one according to an opinion scale such as the
5-point listening quality (LQ) scale. On the LQ scale, 1 means bad quality, 2 poor, 3 fair, 4 good
and 5 excellent. The ACR method with the LQ opinion scale is the most commonly
used method in telecommunications assessment, and was the primary focus during
development of PESQ (Rix, 2000).
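A trivial illustration of the ACR/MOS idea (with invented votes):

    # Listeners vote on the 5-point listening-quality scale; the votes are
    # averaged into a mean opinion score (MOS).
    LQ_SCALE = {1: "bad", 2: "poor", 3: "fair", 4: "good", 5: "excellent"}

    def mean_opinion_score(votes):
        return sum(votes) / len(votes)

    print(round(mean_opinion_score([4, 3, 4, 5, 3, 4]), 2))   # prints 3.83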
3.2 Evaluation of PEAQ
In the previous section we saw how the PEAQ and PESQ methods work to produce
estimates of perceived audio or speech quality; this section discusses the validity of PEAQ
results. Different evaluation tests were performed to determine the performance of the seven
model proponents at the beginning of the PEAQ standardization process, comparing their
results with the data available from multiple subjective listening tests. In order to compare the
performance of the different models or model versions, a number of different criteria are
relevant (Thiede, 2000):
- Tolerance scheme. A tolerance scheme was designed to weight the deviations of the ODG values from the SDG values differently at the upper and lower ends of the impairment scale, because a difference of 0.5 grade does not have the same significance near the lower end of the quality scale as near the upper end. A tolerance region is created, which is related to the confidence intervals (CI) of the listening tests. The average distance from the ODGs outside the tolerance region to the boundaries is one criterion for evaluating measurement methods. As can be seen in Figure 4, errors need to be larger for lower-quality signals than for high-quality signals in order to have an effect on the average.

Figure 4. Tolerance region (minimum confidence interval = 0.25). (Thiede, 2000)

- Correlation. The correlation coefficient is often used to express the strength of the linear relationship of one variable with another. Further, the squared correlation coefficient is a measure of the variance in one variable accounted for by the variance in the other. Since a linear relationship is expected between the SDG and ODG variables, the correlation coefficient should be a useful criterion. However, the magnitude of the correlation can be affected drastically by the presence of a few extreme outliers, so this criterion should not be used in isolation.
- Absolute error score. The absolute error score (AES) was introduced to relate the accuracy of a model to the accuracy of the listening test. The AES value is calculated in a similar way to the correlation, but it also depends on the confidence interval, which is different for each SDG value. Again, the AES gives useful hints, but it should not be used in isolation to measure overall performance.
- Number of outliers. The number-of-outliers criterion is based on the premise that any prediction error exceeding the tolerance region boundaries is as severe as any other, independent of the absolute value of the error. This method consists of simply counting all occurrences of errors larger than the SDG confidence interval, where the limits for the allowed error margin are normally asymmetric. The correlation and number-of-outliers criteria are illustrated in the short sketch below.
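As a hedged numerical illustration of the correlation and number-of-outliers criteria (the values below are invented, not data from (Thiede, 2000), and the real tolerance scheme is more elaborate):

    import numpy as np

    sdg = np.array([-0.3, -1.1, -2.4, -3.2, -0.8])    # listening-test grades
    odg = np.array([-0.4, -0.9, -2.9, -3.0, -1.5])    # model predictions
    ci  = np.array([ 0.3,  0.4,  0.5,  0.4,  0.3])    # per-item confidence intervals

    correlation = np.corrcoef(sdg, odg)[0, 1]
    outliers = int(np.sum(np.abs(odg - sdg) > ci))    # errors exceeding the allowed margin
    print("correlation = %.2f, outliers = %d" % (correlation, outliers))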
Some of the previous criteria tell how much the algorithm fails, whereas others tell how
often the algorithm fails. The following figures show the relation between subjective quality
and the signal-to-noise ratio (SNR), and between the SDG and the output of both versions of
PEAQ (Thiede, 2000). The solid lines represent the tolerance region. Looking at the figures,
we can conclude that the Advanced Version of PEAQ is superior to the Basic Version, and that
the SNR is clearly not a viable measure of quality for audio signals.
Figure 5. Relation between SDG and SNR, Advanced and Basic PEAQ. (Thiede, 2000)
In the following example (Treurniet, 2000) we can see another evaluation experiment. In
this case the performance assessment was divided into two parts: comparison by audio items
and comparison by systems. First, 21 expert listeners evaluated the quality of eight audio
items processed by 17 systems, where a system is defined as a codec operating at a particular
bit rate (6 codecs were studied). By averaging over listeners, the subjective data set was reduced
to 136 mean SDGs. The performance of the objective measurement method (PEAQ) was
evaluated by predicting the mean subjective quality rating for each item-by-system condition.
Figure 6 shows the relationship between the mean SDG and the ODG for the 136 items, i.e.,
comparison by audio items. The linear correlation between these variables is 0.85, and the
slope of the regression line is 0.79.
Perfect correspondence between the objective measurements and the subjective quality
ratings was not achieved since not all of the data points fall on the diagonal. However, the
objective measurements agree reasonably well with the subjective quality grades. Some
noticeable outliers suggest that the accuracy of the objective measurement method may be
influenced by the nature of the audio material. An investigation of the most severe outliers
indicated that they are due to two codecs processing two audio items.
Figure 6. Correlation of mean-item ODG with SDG (r = 0.85). (Treurniet, 2000)
The overall subjective quality of a particular system was defined as the average of the
mean SDGs for the eight items processed by that system. The corresponding overall objective
quality measurement was obtained by averaging the ODGs for the same eight items. Figure 7
shows the relationship between the 17 overall mean SDGs and ODGs, i.e., comparison by
system. The linear correlation is 0.97 and the slope of the regression line is 0.95.
Figure 7. Correlation of system ODG and SDG (r = 0.97). (Treurniet, 2000)
The comparison by systems shows a much stronger correlation than the comparison of
grades for individual audio items. This can be understood as a consequence of averaging
over subsets of audio items. Figure 8 shows the difference between the overall mean SDG
and ODG for each of the systems. A positive value indicates that PEAQ underestimated the
quality rating for that system, whereas a negative value indicates the opposite situation. It can
be seen from this figure that the absolute value of this difference is always less than 0.5.
Another conclusion that can be drawn from this graph is, for example, that the overall
objective qualities of codecs U and Z are somewhat lower than their subjective qualities. Such
consistencies within codec families might be due to some unspecified types of distortion
generated by the coding algorithms that are resolved suboptimally by the PEAQ method.
Figure 8. Difference between SDG and ODG per system. (Treurniet, 2000)
This section finishes with a performance assessment (Schmidmer, 2005) where PEAQ is
compared not with the results of an ITU-R BS.1116 based listening test but with a MUSHRA
based test. The experiment is again related to audio coding; the seven audio codecs under test
were in this case the following:
- Microsoft Windows Media 4
- MPEG-4 AAC (Fraunhofer)
- MP3 (Fraunhofer)
- QuickTime 4, Music-Codec 2 (QDesign)
- Real Audio 5.0
- RealAudio G2
- MPEG-4 TwinVQ (Yamaha)
(Two bar charts, "48 kbps Stereo - DR" and "64 kbps Stereo - DR", plot the subjective and objective scores, on a 0-100 scale, for each of the codecs listed above.)
Figure 9. Differences between objective and subjective assessment of audio quality.
(Schmidmer, 2005)
3.3 Application domains of objective measurement of sound quality
The performance of microphones, recording and transmission equipment, and
loudspeakers has been improved over time by successive incremental refinements. Present
practice utilizes several kinds of measurements, dating from the very early days of audio, to
characterize the linear and nonlinear errors that accumulate in the audio chain. For some newer
processes, specifically low bit-rate audio codecs, the measurement of these traditional audio
parameters has never been strictly appropriate. Low bit-rate codecs introduce new kinds of
errors that the traditional measurements were not designed to detect, and in fact such systems
could even be designed to measure well even when they do not sound good. The objective
measurement recommendations presented in this section were developed to assess
automatically the degradation of audio quality at different stages of the audio chain. The
systems were designed to emulate the way human hearing distinguishes different sounds from
one another. Sometimes the measurement has to run in real time, whereas non-real-time
measurement is sufficient for other applications; this determines which version of PEAQ
should be used. Some of the possible application scenarios for objective measurement
techniques are listed below (Thiede, 2000):
- Assessment of implementations. Procedure for characterizing different
implementations of audio processing equipment, in many cases audio codecs.
- Perceptual quality lineup. Fast procedure that tests equipment or circuits before
putting them into service.
- On-line monitoring. Continuous process to monitor audio transmission in service.
- Equipment or connection status. Detailed analysis of a piece of equipment or a
circuit.
- Codec identification. Procedure to identify type and implementation of a particular
codec.
- Codec development. Procedure characterizing performance of a codec in as much
detail as possible.
- Network planning. Procedure to optimize cost and performance of a transmission
network under given constraints.
- Aid to subjective assessment. Tool for identifying critical material to include in a
subjective listening test.
As an example, we consider the results of an investigation (Benjamin, 2002) that used
PEAQ to measure the audio degradation caused by Sample Rate Conversion, Analog to
Digital Converters (ADC) and Digital to Analog Converters (DAC).
Sample Rate Conversion works by interpolating new samples to redefine the waveform
that was described by the original samples. The interpolation error is likely to be greater at
high frequencies than at low frequencies because high frequency waveforms change value
more between samples than do low frequency waveforms. In the study the input signal was a
20 kHz full-scale sine wave sampled at 44.1 kHz, which was sample rate converted up to 48
kHz. It was noticed that the sample rate conversion process introduced numerous artifacts in
the output signal. PEAQ cannot directly compare programs at different sample rates. For this
reason the experiments were performed by doing sample rate conversion in pairs. In principle,
the amount of distortion can be controlled by adjusting the length of the interpolation filter.
On the other hand, the time allowed to perform the conversion limits the length of the filter.
The material was converted twice, first from 44.1 kHz to 48 kHz, and then
back to 44.1 kHz again, using several prototype and commercial sample rate conversion
programs and devices. The twice-converted files were then assessed for quality using PEAQ,
with the original files acting as reference. Figure 10 shows the progressive degradation of audio
quality as the number of conversions is increased.
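A rough sketch of this round-trip experiment is given below; it uses scipy's polyphase resampler (48000/44100 = 160/147) and, since PEAQ itself is a separate tool, it only prints a plain relative error energy as a stand-in for the perceptual measurement:

    import numpy as np
    from scipy.signal import resample_poly

    fs = 44100
    t = np.arange(fs) / fs
    x = 0.5 * np.sin(2.0 * np.pi * 10000.0 * t)             # 10 kHz tone, one second

    up = resample_poly(x, 160, 147)                         # 44.1 kHz -> 48 kHz
    back = resample_poly(up, 147, 160)[:len(x)]             # 48 kHz -> back to 44.1 kHz

    relative_error = np.sum((x - back) ** 2) / np.sum(x ** 2)
    print("relative error energy after one round trip: %.2e" % relative_error)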
Figure 10. Audio quality after tandem sample rate conversions. (Benjamin, 2002)
The length of the interpolation filter does not seem to be a strong determinant of audio
quality until it is reduced to 33 or 17. The degradation associated with the longer interpolation
filters could be entirely due to cumulative round-off error.
The quality of both types of converters, ADCs and DACs, is the subject of intensive effort
and discussion. For example, the audiophile press claims that the PCM process and the
converters associated with the Compact Disc format are the cause of substantial audio
degradation. Several methods have been proposed for evaluating digital audio converters, e.g.,
looking at the spectrum of sine-wave stimuli, but they become more difficult to apply with
complex signals such as music or speech. PEAQ is obviously not able to assess impairments in
the analog domain, so in order to measure the impairments associated with the conversion
process it is necessary to measure the composite effect of the ADC and DAC. This has a clear
disadvantage: the process cannot directly distinguish whether the artifacts are due to the ADC,
to the DAC, or to both. Figure 11 shows how PEAQ was used to measure the audio quality of
DACs and ADCs.
Figure 11. Block diagram of DAC/ADC evaluation. (Benjamin, 2002)
The original program material was played back in real-time, either from a CD player, a
DVD-Audio player, or a computer hard disk, in all cases using digital output interfaces. The
digital output is sent to the DAC/ADC pair, and the twice-converted file is recorded at the
same time as the original file. Then the two files, representing the original 2-channel program
and the degraded program, are applied to the PEAQ process, which gives an ODG value. The
degraded file can be put through the conversion process any number of times to increase the
degradation caused by the conversion process. Even if the combination of DAC and ADC is
nearly transparent, some number of passes through the conversion process will cause audible
degradation. The programs chosen for the test were all from the collection of material
recommended by the EBU for subjective quality assessment. The harpsichord arpeggio and
castanets were chosen as representative of instruments
with extended high frequency spectral content. The results of the tests are shown in Figure 12.
Figure 12. Quality after repetitions of DAC/ADC conversion. (Benjamin, 2002)
The ODG shows a consistent decrease in quality as the number of conversions is
increased. After only one pair of conversions the degradation is very small (-0.09), but after
50 conversions the quality has dropped to about -1.5, a score between “Perceptible, but not
annoying” and “Slightly annoying”. The figure shows an abrupt decrease in quality after
conversion 16.
Based on his own experience, the author of the investigation concluded that PEAQ does a
very good job of predicting the audibility of errors, and that PEAQ can be used to measure
very small changes in audio quality, even smaller than can be detected in listening tests.
3.4 Beyond PEAQ and PESQ
The estimation of audio quality is becoming increasingly important, especially in
telecommunication applications, where Quality of Service is one of the key considerations.
Thus, at this moment there are many ongoing research projects and advances in this area, as
well as remaining challenges. The IEEE Signal Processing Society has recently published a
Call for Papers for a special issue of the IEEE Transactions on Speech and Audio Processing
that will focus on the objective quality assessment of speech and audio. Contributions will be
received until February 2006 and must be related to one or more of the following topics:
- Subjective basis for objective quality assessment
- Waveform models, based on waveforms of speech and audio
- Parametric models, based on telecommunication or broadcast network parameters
- Intrusive models
- Non-intrusive (single-ended or output-based) models
- Objective diagnosis of quality impairment
- Objective and subjective assessment of conversational quality
- Issues and applications relevant to real-world problems
The tentative publication date for this issue is January 2007, so anyone interested in the
topic should keep an eye on it.
Besides that, this section discusses some of the research papers that have been presented
more recently and that indicate the future directions and applications of objective measurement
of audio quality. The motivation for these investigations is usually that listening tests are
reliable but very expensive, time consuming and sometimes impractical. On the other hand,
existing objective quality assessment methods require either the original audio signal or a
complicated computational model, which makes some quality evaluation applications
impossible.
Libin Cai and Jiying Zhao, working at the University of Ottawa, proposed to use digital
audio watermarking to evaluate the quality of speech (Cai, 2005) and audio signals (Cai,
2004). As shown in Figure 13, in order to measure the audio quality the proposed scheme
only needs the quantization scale and the watermarking key that were used in the embedding
process.
Figure 13. Audio quality measurement based on watermarking. (Cai, 2004)
When a watermarked audio signal is distorted, the correct watermark detection rate
decreases accordingly. In an ideal measurement, the percentage of correct watermark
extraction should decrease in the same proportion for the same distortion in all watermarked
audio. However, different audio signals comprise different frequencies and amplitudes, and
hence have different robustness to the same distortion; it is difficult to measure audio quality
if a fixed quantization step is used. Thus, the authors employ an adaptive control method to
obtain an optimized quantization step for each audio signal, whereas with fixed quantization
steps the system produces the lowest percentage of correct watermark extraction. At the end
of the process the extracted watermark is compared with the original watermark to obtain the
Percentage of Correctly Extracted Watermark bits (PCEW). The signal was artificially
attacked with additive noise, Gaussian noise and low-pass filtering. The following figures show
the average PCEW values, which can be used to measure the audio quality.
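The following toy sketch illustrates only the general idea of quantization-based watermark embedding, extraction and the PCEW figure; it is not the actual scheme of Cai and Zhao, and the quantization step, host samples and noise level are invented:

    import numpy as np

    rng = np.random.default_rng(0)
    step = 0.02                                   # quantization step, shared as the "key"

    def embed(samples, bits):
        # Force the parity of each sample's quantization index to match the bit.
        q = np.round(samples / step)
        q += (q.astype(int) % 2) != bits
        return q * step

    def extract(samples):
        return np.round(samples / step).astype(int) % 2

    host = rng.uniform(-1.0, 1.0, 1000)           # stand-in for host audio samples
    bits = rng.integers(0, 2, 1000)               # watermark bits
    marked = embed(host, bits)

    degraded = marked + rng.normal(0.0, 0.004, 1000)          # simulated distortion
    pcew = 100.0 * np.mean(extract(degraded) == bits)
    print("PCEW = %.1f %%" % pcew)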
Figure 14. Effects of additive noise, Gaussian noise and low-pass filtering. (Cai, 2004)
The evaluation of this method is more clearly presented in the paper devoted to speech
(Cai, 2005), which reports correlation coefficients between PCEW and PESQ MOS. The
Absolute Residual Error (ARE) and the correlation coefficients are shown in Figure 15.
Figure 15. Accuracy of the watermarking-based assessment method. (Cai, 2005)
Rahul Vanam and Charles D. Creusere (Vanam, 2005) demonstrated that the Advanced
Version of PEAQ performs poorly when compared with the previously developed Energy
Equalization Approach (EEA) for evaluating the quality of low bit-rate scalable audio
(supported, for example, in the MPEG-4 standard). They also created a modified version of
PEAQ, adding an energy equalization parameter to the other MOVs, and the performance
improved significantly, even compared with EEA.
Scalable audio compression means that the system encodes audio data at a given bit-rate and
can decode it at bit-rates less than or equal to the original bit-rate. Objective quality
measurement of low bit-rate scalable audio using the Basic Version of PEAQ has been found
to be poor. The Advanced Version, which was tested during the investigation (Vanam,
2005), also performs poorly, and EEA is superior to it, as can be seen in Figure 16.
Figure 16. Evaluation of both versions of PEAQ and EEA for scalable audio codecs.
(Vanam, 2005)
The corresponding correlation coefficients are 0.365 for the Basic Version of PEAQ, 0.325 for
the Advanced Version, and 0.669 for the energy equalization approach. Those values are far
from acceptable, so a new modified version of PEAQ was proposed, in which the Advanced
Version is modified and an additional MOV is used to calculate the ODG values. The
correlation coefficient for the modified Advanced Version is found to be 0.8254, indicating
superior performance over EEA, as can be seen in Figure 17.
Figure 17. Evaluation of Basic PEAQ and EEA for scalable audio codecs. (Vanam, 2005)
While the previous research work is based on a more complex version of the PEAQ method,
the last investigation mentioned in this paper is based on a simplified version of an objective
speech quality algorithm. The goal of S. Voran (Voran, 1998) was to simplify the ITU-T P.861
algorithm, a predecessor of PESQ, while having minimal effect on its performance. The
modified algorithms reduced the number of floating point operations by 64% with only a 3.5%
decrease in average correlation to listener opinions.
There were six components of the algorithm that were removed or re-adjusted, creating six
different modified versions. Figure 18 shows which elements are under consideration in each
version and their complexity-performance trade-off.
Figure 18. Description of the six simplified versions of the algorithm and their performance.
(Voran, 1998)
According to this study, it appears that a portion of the algorithm's complexity does not
contribute much to the perceived speech quality estimation, at least for the seven subjective
tests considered in the study. Using the proposed simplifications, the algorithm may be a
candidate for inclusion in speech coders, where it might provide feedback to parameter
selection, excitation search, and bit-allocation algorithms to ensure that the highest possible
signal quality is obtained at the lowest possible bit-rate.
4 CONCLUSIONS
This paper has presented two ITU recommendations for the subjective assessment of
sound quality. By using them it is possible to compare the performance of different audio
systems and devices in a reliable manner. As we have seen, the ITU-R BS.1116
recommendation is very efficient for evaluating small impairments in audio signals, while the
ITU-R BS.1534 is intended for intermediate quality signals. Thus, the first can be applied, for
example, to compare the performance of analog-to-digital converters, and the latter to compare
audio codecs at low bit-rates.
Listening tests are very reliable but also very expensive, time consuming and, sometimes,
impractical. Because of that, recommendations for objective measurement of sound quality
have been proposed. The methods standardized by ITU are PEAQ, for wideband audio
signals, and PESQ, for speech signals. Objective methods try to imitate the way human
listeners perceive sounds, using psychoacoustic and cognitive models. The PEAQ algorithm,
for example, was created by combining seven proposed models. When the models were
evaluated, their performances were not significantly different, so the original proponents were
called upon to collaborate in developing a new, improved model.
Objective measurement algorithms are periodically evaluated by comparing their results with
those of subjective assessments and examining the correlation coefficients. The PEAQ and
PESQ algorithms seem to be highly reliable in many cases, but in some specific scenarios,
such as evaluating the quality of low bit-rate scalable audio, they perform very poorly. This is
one of the main motivations for further research in this field. Other investigations follow
different approaches, for instance trying to achieve similar accuracy with simplified versions
of PEAQ and PESQ, or even with alternative solutions such as the watermarking-based
methods we have seen. At this moment objective measurements cannot be considered totally
reliable and subjective assessments based on listening tests are still needed, but for many
applications objective methods offer sufficient accuracy, and in some cases they are more
practical.
REFERENCES
Benjamin, E. 2002. Evaluating digital audio artifacts with PEAQ. AES Convention Paper.
113th AES Convention, Los Angeles, CA, October 2002.
Cai, L. and Zhao, J. 2004. Audio quality measurement by using digital watermarking.
Proceedings of IEEE Canadian Conference on Electrical and Computer Engineering
(CCECE) 2004, Niagara Falls, Ontario, Canada, pp.1159-1162, May 2-5, 2004.
Cai, L. and Zhao, J. 2005. Speech Quality Evaluation: A New Application of Digital
Watermarking. Proceedings of 2005 IEEE Instrumentation and Measurement Technology
Conference, Ottawa, Ontario, Canada, pp.726-731, 17-19 May 2005.
ITU-R BS.1116, Methods for the subjective assessment of small impairments in audio
systems including multichannel sound systems. 1997.
ITU-R BS.1387, Method for objective measurement of perceived audio quality. 1998.
ITU-R BS.1534, Method for the subjective assessment of intermediate quality level of coding
systems. 2001.
ITU-T P.861, Objective quality measurement of telephone-band speech codecs. 1996.
Rix, A., Beerends, J., Hollier, M. and Hekstra, P. 2000. PESQ – the new ITU standard for
end-to-end speech quality assessment. AES Convention Paper. 109th AES Convention,
Los Angeles, CA, September 2000.
Rothauser, H. and Urbanek, G. 1966. Some problems in subjective testing. AES Convention
Paper. 31st AES Convention, New York, October 1966.
Seppänen, J. 2004. Mobile multimedia codecs and formats. Multimedia Seminar lecture, Fall
2004. Available at: http://www.tml.tkk.fi/Studies/T-111.550/
Schmidmer, C. 2005. Perceptual wideband audio quality assessments using PEAQ. 2nd
Workshop on Wideband Speech Quality. Mainz, Germany, June 2005.
Stoll, G. and Kozamernik, F. 2000. EBU listening tests on Internet audio codecs. EBU
Technical Review, June 2000
Thiede, T., Treurniet, William C., Bitto, R., Schmidmer, C., Sporer, T., Beerends, John G.,
Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K. and Feiten, B. 2000. PEAQ - The ITU
Standard for Objective Measurement of Perceived Audio Quality. Journal of the Audio
Engineering Society (AES), vol. 48, Number 1/2, Jan/Feb 2000.
Treurniet, William C. and Soulodre G. 2000. Evaluation of the ITU-R Objective Audio
Quality Measurement Method. Journal of the Audio Engineering Society (AES), vol. 48,
Number 3, March 2000.
Vanam, R. and Creusere, C. 2005. Evaluating low bitrate scalable audio quality using
advanced version of PEAQ and energy equalization approach. Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2005,
Vol. 3, pp.189-192, Philadelphia, PA, March 18-23, 2005.
Voran, S. 1998. A simplified version of the ITU algorithm for objective measurement of
speech codec quality. Proceedings of IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) 1998, Vol. 1, pp.537-540, Seattle, WA, May 12-15,
1998.