A Machine Learning Approach to Automatic Chord Extraction Matthew McVicar

A Machine Learning Approach to
Automatic Chord Extraction
Matthew McVicar
Department of Engineering Mathematics
University of Bristol
A dissertation submitted to the University of Bristol in accordance with the
requirements for award of the degree of Doctorate of Philosophy (PhD) in the
Faculty of Engineering
Word Count: 40,583
Abstract
In this thesis we introduce a machine learning based automatic chord recognition algorithm that achieves state of the art performance. This performance is realised by the introduction of a novel Dynamic Bayesian Network and chromagram feature vector, which concurrently recognises chords,
keys and bass note sequences on a set of songs by The Beatles, Queen and
Zweieck.
In the months prior to the completion of this thesis, a large number of
new, fully-labelled datasets have been released to the research community,
meaning that the generalisation potential of models may be tested. When
sufficient training examples are available, we find that our model achieves
similar performance on both the well-known and novel datasets and statistically significantly outperforms a baseline Hidden Markov Model.
Our system is also able to learn from partially-labelled data. This is investigated through the use of guitar chord sequences obtained from the web. In
test, we align these sequences to the audio, accounting for changes in key,
different interpretations, and missing structural information. We find that
this approach increases recognition accuracy from on a set of songs by the
rock group The Beatles. Another use for these sequences is in a training
scenario. Here we align over 1, 000 chord sequences to audio and use them
as an additional training source. These data are exploited using curriculum
learning, where we see an improvement from when testing on a set of 715
songs and evaluated on a complex chord alphabet.
Dedicated to my family
Acknowledgements
I would like to acknowledge the support, advice and guidance offered by
my supervisor, Tijl De Bie. I would also like to thank Yizhao Ni and Ra´
ul
Santos-Rodr´ıguez for their collaborations, proof-reading and friendship.
My work throughout this PhD was funded by the Bristol Centre for Complexity Sciences (BCCS) and the Engineering and Physical Sciences Research Council grant number EP/E501214/1. I am certain that the work
contained within this thesis would not have been possible without the interdisciplinary teaching year at the BCCS, and am extremely grateful for
the staff, students and centre director John Hogan for the opportunity to
be taught by and work amongst these lecturers and students over the last
four years. Special thanks are also due to the BCCS co-ordinator, Sophie
Benoit.
Much of this thesis has built on previously existing concepts, many of which
have generously been made available for research purposes. In particular,
this work would not have been possible without the chord annotations by
Christopher Harte and Matthias Mauch (MIREX dataset), Nocolas Dooley
and Travis Kaufman (USpop dataset), and students at the Centre for Interdisciplinary Research in Music Media and Technology, McGill University
(Billboard dataset). I am also grateful to Dan Ellis for making his tuning
and beat-tracking scripts available online, and I made extensive use of the
software Sonic Visualiser by Chris Cannam at the Centre for Digital Mu-
sic at the Queen Mary, University of London; thank you for keeping this
fantastic software free.
Further thanks are due to Peter Flach, Nello Cristianini, Matthias Mauch,
Elena Hensinger, Owen Rackham, Antoni Matyjaszkiewicz, Angela Onslow,
Tom Irving, Harriet Mills, Petros Mina, Matt Oates, Jonathan Potts, Adam
Sardar, Donata Wasiuk, all the BCCS students past and present, and my
family: Liz, Brian and George McVicar.
Declaration
I declare that the work in this dissertation was carried out in accordance
with the requirements of the University’s Regulations and Code of Practice
for Research Degree Programmes and that it has not been submitted for
any other academic award. Except where indicated by specific reference in
the text, the work is the candidate’s own work. Work done in collaboration
with, or with the assistance of, others, is indicated as such. Any views expressed in the dissertation are those of the author.
SIGNED: .....................................................
DATE: .......................
Contents
List of Figures
xi
List of Tables
xvii
1 Introduction
1
1.1
Music as a Complex System . . . . . . . . . . . . . . . . . . . . . . . . .
1
1.2
Task Description and Motivation . . . . . . . . . . . . . . . . . . . . . .
3
1.2.1
Task Description . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1.2.2
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3
1.3
Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
1.4
Contributions and thesis structure . . . . . . . . . . . . . . . . . . . . .
6
1.5
Relevant Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
1.6
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
2 Background
2.1
2.2
13
Chords and their Musical Function . . . . . . . . . . . . . . . . . . . . .
13
2.1.1
Defining Chords . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
2.1.2
Musical Keys and Chord Construction . . . . . . . . . . . . . . .
16
2.1.3
Chord Voicings . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
2.1.4
Chord Progressions . . . . . . . . . . . . . . . . . . . . . . . . . .
17
Literature Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
v
CONTENTS
2.3
2.4
2.5
2.6
Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
2.3.1
Early Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
2.3.2
Constant-Q Spectra . . . . . . . . . . . . . . . . . . . . . . . . .
26
2.3.3
Background Spectra and Consideration of Harmonics . . . . . . .
26
2.3.4
Tuning Compensation . . . . . . . . . . . . . . . . . . . . . . . .
28
2.3.5
Smoothing/Beat Synchronisation . . . . . . . . . . . . . . . . . .
28
2.3.6
Tonal Centroid Vectors . . . . . . . . . . . . . . . . . . . . . . .
29
2.3.7
Integration of Bass Information . . . . . . . . . . . . . . . . . . .
30
2.3.8
Non-Negative Least Squares Chroma (NNLS) . . . . . . . . . . .
30
Modelling Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
2.4.1
Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . .
32
2.4.2
Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . .
34
2.4.3
Incorporating Key Information . . . . . . . . . . . . . . . . . . .
35
2.4.4
Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . .
36
2.4.5
Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . .
37
2.4.6
Discriminative Models . . . . . . . . . . . . . . . . . . . . . . . .
38
2.4.7
Genre-Specific Models . . . . . . . . . . . . . . . . . . . . . . . .
38
2.4.8
Emission Probabilities . . . . . . . . . . . . . . . . . . . . . . . .
39
Model Training and Datasets . . . . . . . . . . . . . . . . . . . . . . . .
39
2.5.1
Expert Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . .
40
2.5.2
Learning from Fully-labelled Datasets . . . . . . . . . . . . . . .
41
2.5.3
Learning from Partially-labelled Datasets . . . . . . . . . . . . .
42
Evaluation Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
42
2.6.1
Relative Correct Overlap . . . . . . . . . . . . . . . . . . . . . .
42
2.6.2
Chord Detail . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
2.6.3
Cross-validation Schemes . . . . . . . . . . . . . . . . . . . . . .
44
2.6.4
The Music Information Retrieval Evaluation eXchange (MIREX)
45
vi
CONTENTS
2.7
The HMM for Chord Recognition . . . . . . . . . . . . . . . . . . . . . .
50
2.8
Conclusion
53
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 Chromagram Extraction
3.1
Motivation
55
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
55
The Definition of Loudness . . . . . . . . . . . . . . . . . . . . .
56
3.2
Preprocessing Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
3.3
Harmonic/Percussive Source Separation . . . . . . . . . . . . . . . . . .
58
3.4
Tuning Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
3.5
Constant Q Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
3.6
Sound Pressure Level Calculation . . . . . . . . . . . . . . . . . . . . . .
63
3.7
A-Weighting & Octave Summation . . . . . . . . . . . . . . . . . . . . .
64
3.8
Beat Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
3.9
Normalisation Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
3.10 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
3.11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
70
3.1.1
4 Dynamic Bayesian Network
4.1
4.2
4.3
73
Mathematical Framework . . . . . . . . . . . . . . . . . . . . . . . . . .
74
4.1.1
Mathematical Formulation
. . . . . . . . . . . . . . . . . . . . .
74
4.1.2
Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . .
76
4.1.3
Complexity Considerations . . . . . . . . . . . . . . . . . . . . .
77
Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
4.2.1
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . .
79
4.2.2
Chord Accuracies . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
4.2.3
Key Accuracies . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81
4.2.4
Bass Accuracies . . . . . . . . . . . . . . . . . . . . . . . . . . . .
82
Complex Chords and Evaluation Strategies . . . . . . . . . . . . . . . .
83
vii
CONTENTS
4.4
4.3.1
Increasing the chord alphabet . . . . . . . . . . . . . . . . . . . .
83
4.3.2
Evaluation Schemes . . . . . . . . . . . . . . . . . . . . . . . . .
84
4.3.3
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
88
5 Exploiting Additional Data
5.1
89
Training across different datasets . . . . . . . . . . . . . . . . . . . . . .
90
5.1.1
Data descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . .
91
5.1.2
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
93
5.2
Leave one out testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3
Learning Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4
5.5
5.6
5.3.1
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Chord Databases for use in testing . . . . . . . . . . . . . . . . . . . . . 105
5.4.1
Untimed Chord Sequences . . . . . . . . . . . . . . . . . . . . . . 105
5.4.2
Constrained Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4.3
Jump Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.4
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Chord Databases in Training . . . . . . . . . . . . . . . . . . . . . . . . 117
5.5.1
Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5.2
Alignment Quality Measure . . . . . . . . . . . . . . . . . . . . . 119
5.5.3
Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 120
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6 Conclusions
125
6.1
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.2
Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
viii
CONTENTS
References
135
A Songs used in Evaluation
151
B Relative chord durations
165
ix
CONTENTS
x
List of Figures
1.1
General approach to Automatic Chord Extraction. Features are extracted directly from audio that has been dissected into short time instances known as frames, and then labelled with the aid of training data
or expert knowledge to yield a prediction file. . . . . . . . . . . . . . . .
1.2
3
Graphical representation of the main processes in this thesis. Rectangles
indicate data sources, whereas rounded rectangles represent processes.
Processes and data with asterisks form the bases of certain chapters.
Chromagram Extraction is the basis for chapter 3, the main decoding
process (HPA decoding) is covered in chapter 4, whilst training is the
basis of chapter 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1
Section of a typical chord annotation, showing onset time (first column),
offset time (second column), and chord label (third column).
2.2
11
. . . . . .
18
A typical chromagram feature matrix, shown here for the opening to Let
It Be (Lennon/McCartney). Salience of pitch class p at time t is estimated by the intensity of (p, t)th entry of the chromagram, with lighter
colours in this plot indicating higher energy (see colour bar between
chromagram and annotation). The reference (ground truth) chord annotation is also shown above for comparison, where we have reduced the
chords to major and minor classes for simplicity. . . . . . . . . . . . . .
xi
25
LIST OF FIGURES
2.3
Constant-Q spectrum of a piano playing a single A4 note. Note that, as
well as the fundamental at f0 =A4, there are harmonics at one octave
(A5) and one octave plus a just perfect fifth (E5). Higher harmonics
exist but are outside the frequency range considered here. Notice also
the slight presence of a fast-decaying subharmonic at two octaves down,
A2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4
27
Smoothing techniques for chromagram features. In 2.4a, we see a standard chromagram feature. Figure 2.4b shows a median filter over 20
frames, 2.4c shows a beat-synchronised chromagram. . . . . . . . . . . .
2.5
29
Treble (2.5a) and Bass (2.5b) Chromagrams, with the bass feature taken
over a frequency range of 55 − 207 Hz in an attempt to capture inversions. 31
2.6
Regular (a) and NNLS (b) chromagram feature vectors. Note that the
NNLS chromagram is a beat-synchronised feature. . . . . . . . . . . . .
2.7
31
Template-based approach to the chord recognition task, showing chromagram feature vectors, reference chord annotation and bit mask of chord
templates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.8
33
Visualisation of a first order Hidden Markov Model (HMM) of length T.
Hidden states (chords) are shown as circular nodes, which emit observable states (rectangular nodes, chroma frames).
2.9
. . . . . . . . . . . . .
35
Two-chain HMM, here representing hidden nodes for Keys and Chords,
emitting Observed nodes. All possible hidden transitions are shown in
this figure, although these are rarely considered by researchers. . . . . .
36
2.10 Mathhias Mauch’s DBN. Hidden nodes Mi , Ki , Ci , Bi represent metric
position, key, chord and bass annotations, whilst observed nodes Cit and
Cib represent treble and bass chromagrams. . . . . . . . . . . . . . . . .
xii
37
LIST OF FIGURES
2.11 HMM parameters, trained using Maximum likelihood on the MIREX
dataset. Above, left: logarithm of initial distribution p∗ini . Above, right:
logarithm transition probabilities T∗ . Below, left: mean vectors for each
chord µ∗ . Below, right: covariance matrix Σ∗ for a C:maj chord. To
preserve clarity, parallel minors for each chord and accidentals follow to
the right and below. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1
54
Flowchart of feature extraction processes in this chapter. We begin with
raw audio, and finish with a chromagram feature matrix. Sections of
this chapter which describe each process are shown in the corresponding
boxes in this Figure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2
56
Equal loudness curves. Frequency in Hz increases logarithmically across
the horizontal axis, with Sound Pressure Level (dB SPL) on the vertical
axis. Each line shows the current standards as defined in the ISO standard (226:2003 revision [39]) at various loudness levels. Loudness levels
shown are at (top to bottom) 90, 70, 50, 30, 10 Phon, with the limit of
human hearing (0 Phon) shown in blue. . . . . . . . . . . . . . . . . . .
3.3
57
Illustration of Harmonic Percussive Source Separation algorithm. Three
spectra are shown. In Figure 3.3a, we show the spectrogram of a 30
second segment of ‘Hey Jude’ (Lennon-McCartney). Figures 3.3b and
3.3c show the resulting harmonic and percussive spectrograms after performing HPSS, respectively. . . . . . . . . . . . . . . . . . . . . . . . . .
3.4
60
Illustration of our tuning method, taken from [26]. This histogram shows
the tuning discrepancies found over the song “Hey Jude” (Lennon/McCartney),
which are binned into 5 cent bins. The estimated tuning is then found
by choosing the most populated tuning. . . . . . . . . . . . . . . . . . .
xiii
62
LIST OF FIGURES
3.5
Ground Truth extraction process. Given a ground truth annotation (top)
and set of beat locations (middle), we obtain the most prevalent chord
label between each beat to obtain beat-synchronous annotations. . . . .
66
3.6
Chromagram representations for the first 12 seconds of ‘Ticket to Ride’.
71
4.1
Model hierarchy for the Harmony Progression Analyser (HPA). Hidden
nodes (cicles) refer to chord (ci ), key (ki ) and bass note sequences (bi ).
Chords and bass notes emit treble (Xit ) and bass (Cib ) chromagrams,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2
74
Histograms of key accuracies of the Key-HMM (4.2a),Key-Bass-HMM
(4.2b) and HPA (4.2c) models. Accuracies shown are the averages over
100 repetitions of 3-fold cross-validation. . . . . . . . . . . . . . . . . . .
4.3
82
Testing Chord Precision and Note Precision from Table 4.4 for visual
comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
5.1
Section of a typical Billboard dataset entry before processing. . . . . . .
91
5.2
TRCO performances using an HMM trained and tested on all combination of datasets. Chord alphabet complexity increases in successive
graphs, with test groups increasing in clusters of bars. Training groups
follow the same ordering as the test data. . . . . . . . . . . . . . . . . .
5.3
97
Note Precision performances from Table 5.2 presented for visual comparison. Test sets follow the same order as the grouped training sets.
Abbreviations: Bill. = Billboard, C.K. = Carole King. . . . . . . . . . .
5.4
Comparative plots of HPA vs an HMM under various train/test scenarios
and chord alphabets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.5
98
99
Distributions of data from Table 5.3. The number of songs attaining
each decile is shown over each of the four alphabets. . . . . . . . . . . . 101
xiv
LIST OF FIGURES
5.6
Learning rate of HPA when using increasing amounts of the Billboard
dataset. Training size increases along the x axis, with either Note or
Chord Precision measured on the y axis. Error bars of width 1 standard
deviation across the randomisations are also shown.
5.7
. . . . . . . . . . . 104
Example e-chords chord and lyric annotation for “All You Need is Love”
(Lennon/McCartney), showing chord labels above lyrics. . . . . . . . . . 106
5.8
Example HMM topology for Figure 5.7. Shown here: (a) Alphabet
Constrained Viterbi (ACV), (b) Alphabet and Transition Constrained
Viterbi (ACV), (c) Untimed Chord Sequence Alignment (UCSA), (d)
Jump Alignment (JA).
5.9
. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Example application of Jump Alignment for the song presented in Figure
5.7. By allowing jumps from ends of lines to previous and future lines,
we allow an alignment that follows the solid path, then jumps back to
the beginning of the song to repeat the verse chords before continuing
to the chorus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.10 Results from Table 5.5, with UCSA omitted. Increasing amounts of
information from e-chords is used from left to right. Information used
is either simulated (ground truth, dotted line) or genuine (dashed and
solid lines). Performance is measured using Note Precision, and the
TRCO evaluation scheme is used throughout. . . . . . . . . . . . . . . . 117
xv
LIST OF FIGURES
5.11 Using aligned Untimed Chord Sequences as an additional training source.
The alignment quality threshold increases along the x–axis, with the
number of UCSs this corresponds to on the left y–axis. Baseline performance is shown as a grey, dashed line; performance using the additional
UCSs is shown as the solid black line, with performance being measure
in TRCO on the right y–axis. Experiments using random training sets of
equal size to the black line with error bars of width 1 standard deviation
are shown as a black dot–and–dashed line. . . . . . . . . . . . . . . . . . 121
B.1 Histograms of relative chord durations across the entire dataset of fullylabelled chord datasets used in this thesis (MIREX, USpop, Carole King,
Oasis, Billboard) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
xvi
List of Tables
2.1
Chronological summary of advances in automatic chord recognition from
audio, years 1999-2004. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2
Chronological Summary of advances in automatic chord recognition from
audio, years 2005-2006. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3
22
Chronological summary of advances in automatic chord recognition from
audio, years 2010-2011. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6
21
Chronological summary of advances in automatic chord recognition from
audio, 2009. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5
20
Chronological summary of advances in automatic chord recognition from
audio, years 2007-2008. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4
19
23
MIREX Systems from 2008-2009, sorted in each year by Total Relative Correct Overlap in the merged evaluation (confusing parallel major/minor chords not considered an error). The best-performing pretrained/expert systems are underlined, best train/test systems are in
boldface. Systems where no data is available are shown by a dash (-). .
2.7
46
MIREX Systems from 2010-2011, sorted in each year by Total Relative
Correct Overlap. The best-performing pretrained/expert systems are
underlined, best train/test systems are in boldface. For 2011, systems
which obtained less than 0.35 TRCO are omitted.
xvii
. . . . . . . . . . . .
47
LIST OF TABLES
3.1
Performance tests for different chromagram feature vectors, evaluated
using Average Relative Correct Overlap (ARCO) and Total Relative
Correct Overlap (TRCO). p−values for the Wilcoxon rank sum test on
successive features are also shown. . . . . . . . . . . . . . . . . . . . . .
4.1
68
Chord recognition performances using various crippled versions of HPA.
Performance is measured using Total Relative Correct Overlap (TRCO)
or Average Relative Correct Overlap (ARCO), and averaged over 100
repetitions of a 3-fold cross-validation experiment. Variances across these
repetitions are shown after each result, and the best results are shown
in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2
80
Bass note recognition performances in models that recognise bass notes.
Performance is measured either using Total Relative Correct Overlap
(TRCO) or Average Relative Correct Overlap (ARCO), and is averaged
over 100 repetitions of a 3–fold cross–validation experiment. Variances
across these repetitions are shown after each result, and best results in
each column are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3
82
Chord alphabets used for evaluation purposes. Abbreviations: MM =
Matthias Mauch, maj = major, min = minor, N = no chord, aug =
augmented, dim = diminished, sus2 = suspended 2nd, sus4 = suspended
4th, maj6 = major 6th, maj7 = major 7th, 7 = (dominant 7), min7 =
minor 7th, minmaj7 = minor, major 7th, hdim7 = half-diminished 7
(diminished triad, minor 7th). . . . . . . . . . . . . . . . . . . . . . . . .
4.4
5.1
83
HMM and HPA models under various evaluation schemes evaluated at
1, 000 Hz under TRCO. . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
Performances across different training groups using an HMM. . . . . . .
94
xviii
LIST OF TABLES
5.2
Performances across all training/testing groups and all alphabets using
HPA, evaluated using Note and Chord Precision. . . . . . . . . . . . . .
5.3
98
Leave-one-out testing on all data with key annotations (Billboard, MIREX
and Carole King) across four chord alphabets. Chord Precision and Note
Precision are shown in the first row, with the variance across test songs
shown in the second. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4
Pseudocode for the Jump Alignment algorithm. . . . . . . . . . . . . . . 114
5.5
Results using online chord annotations in testing. Amount of information
increases left to right, Note Precision is shown in the first 3 rows. p–
values using the Wilcoxon signed rank test for each result with respect
to that to the left of it are shown in rows 4–6. . . . . . . . . . . . . . . . 115
A.1 Oasis dataset, consisting of 5 chord annotations. . . . . . . . . . . . . . 151
A.2 Carole King dataset, consisting of 7 chord and key annotations. . . . . . 151
A.3 USpop dataset, consisting of 193 chord annotations. . . . . . . . . . . . 154
A.4 MIREX dataset, consisting of 217 chord and key annotations. . . . . . . 156
A.5 Billboard dataset, consisting of 522 chord and key annotations. . . . . . 163
xix
LIST OF TABLES
xx
List of Abbreviations
ACE
Automatic Chord Extraction (task)
ACV
Alphabet Constrained Viterbi
ARCO
Average Relative Correct Overlap
ATCV
Alphabet and Transition Constrained Viterbi
CD
Compact Disc
CL
Curriculum Learning
DBN
Dynamic Bayesian Network
EDS
Extractor Discovery System
FFT
Fast Fourier Transform
xxi
LIST OF ABBREVIATIONS
GTUCS
Ground Truth Untimed Chord Sequence
HMM
Hidden Markov Model
HPA
Harmony Progression Analyser
HPSS
Harmonic Percussive Source Separation
JA
Jump Alignment
MIDI
Musical Instrument Digital Interface
MIR
Music Information Retrieval
MIREX
Music Information Retrieval Evaluation Exchange
ML
Machine Learning
NNLS
Non Negative Least Squares
PCP
Pitch Class Profile
RCO
Relative Correct Overlap
xxii
LIST OF ABBREVIATIONS
SALAMI
Structural Analysis of Large Amounts of Music Information
SPL
Sound Pressure Level
STFT
Short Time Fourier Transform
SVM
Support Vector Machine
TRCO
Total Relative Correct Overlap
UCS
Untimed Chord Sequence
UCSA
Untimed Chord Sequence Alignment
WAV
Windows Wave audio format
xxiii
LIST OF ABBREVIATIONS
xxiv
1
Introduction
This chapter serves as an introduction to the thesis as a whole. We will begin with a
brief discussion of how the project relates to the field of complexity sciences in section
1.1, before stating the task description and motivating our work in section 1.2. From
these motivations we will formulate our objectives in section 1.3. The main contributions of the work are then presented alongside the thesis structure in section 1.4. We
present a list of publications relevant to this thesis in section 1.5 before concluding in
section 1.6.
1.1
Music as a Complex System
Definitions of a complex system vary, but common traits that a complex system exhibit
are1 :
1. It consist of many parts, out of whose interaction “emerges” behaviour not present
in the parts alone.
2. It is coupled to an environment with which it exchanges energy, information, or
other types of resources.
1
from http://bccs.bristol.ac.uk/research/complexsystems.html
1
1. INTRODUCTION
3. It exhibits both order and randomness – in its (spatial) structure or (temporal)
behaviour.
4. The system has memory and feedback and can adapt itself accordingly.
Music as a complex system has been considered by many authors [22, 23, 66, 105]
but is perhaps best summarised by Johnson, in his book Two’s Company, Three’s Complexity [41] when he states that music involves “a spontaneous interaction of collections
of objects (i.e., musicians)” and soloist patterns and motifs that are “interwoven with
original ideas in a truly complex way”.
Musical composition and performance is clearly an example of a complex system
as defined above. For example, melody, chord sequences and musical keys produce an
emergent harmonic structure which is not present in the isolated agents alone. Similarly,
live musicians often interact with their audiences, producing performances “...that arise
in an environment with audience feedback” [41], showing that energy and information
are shared between the system and its environment.
Addressing point 3, the most interesting and popular music falls somewhere between
order and randomness. For instance, signals which are entirely periodic (perfect sine
wave) or random (white noise) are uninteresting musically – signals which fall between
these two instances are where music is found. Finally, repetition is a key element of
music, with melodic, chordal and structural motifs appearing several times in a given
piece.
In most previous computational models of harmony, chords, keys and rhythm were
considered individual elements of music (with the exception of [62], see chapter 2), so
the original “complexity sciences” problem in this domain is a lack of understanding of
the interactions between these elements and a reductionist modelling methodology. To
counteract this, in this thesis we will investigate how an integrated model of chords,
keys, and basslines attempts to unravel the complexity of musical harmony. This will
2
1.2 Task Description and Motivation
be evidenced by the proposed model attaining recognition accuracies that exceed more
simplified approaches, which consider chords an isolated element of music instead of
part of a coherent complex system.
1.2
1.2.1
Task Description and Motivation
Task Description
Formally, Automatic Chord Extraction (ACE) is the task of assigning chord labels
and boundaries to a piece of musical audio, with minimal human involvement. The
process of automatic chord extraction is shown in Figure 1.1. A digital audio waveform
is passed into a feature extractor, which then assigns labels to time chunks known
as “frames”. Labelling of frames is conducted by either the expert knowledge of the
algorithm designers, or is extracted from training data for previously labelled songs.
The final output is a file with start times, end times and chord labels.
Audio Frames
Feature
Extraction
Training Data/
Expert
Knowledge
Decoding
0.000 0.175 N
0.175 1.852 C
1.852 3.454 G
3.454 4.720 A:min
4.720 5.126 A:min/b7
5.126 5.950 F:maj7
5.950 6.778 F:maj6
6.774 8.423 C
8.423 10.014 G
10.014 11.651 F
11.651 13.392 C
Figure 1.1: General approach to Automatic Chord Extraction. Features are extracted
directly from audio that has been dissected into short time instances known as frames, and
then labelled with the aid of training data or expert knowledge to yield a prediction file.
1.2.2
Motivation
The motivation for our work is three-fold: we wish to develop a fully automatic chord
recognition system for amateur musicians that is capable of being used in higher-level
3
1. INTRODUCTION
tasks1 and is based entirely on machine learning techniques. We detail these goals
below.
Automatic Transcription for Amateur Musicians
Chords and chord sequences are mid-level features of music that are typically used
by hobby musicians and professionals as robust representations of a piece for playing
by oneself or in a group. However, annotating the (time-stamped) chords to a song
is a time-consuming task, even for professionals, and typically requires two or more
annotators to resolve disagreements, as well as an annotation time of 3–5 times the
length of the audio, per annotator [13].
In addition to this, many amateur musicians, despite being competent players, lack
sufficient musical training to annotate chord sequences accurately. This is evidenced
by the prevalence of “tab” (tablature, a form of visual representation of popular music)
websites, with hundreds of thousands of tabs and millions of users [60]. However,
such websites are of limited use for Music Information Retrieval (MIR) by themselves
because they lack onset times, which means they cannot be used in higher-level tasks
(see below). With this in mind, the advantage of developing an automatic system is
clear: such a technique could be scaled to work, unaided, across the thousands of songs
in a typical user’s digital music library and could be used by amateur musicians as an
educational or rehearsal tool.
Chords in Higher-level tasks
In addition to use by professional and amateur musicians, chords and chord sequences
have been used by the Music Information Retrieval (MIR) research community in the
simultaneous estimate of beats [89] and musical keys [16], as well as in higher-level tasks
1
In this thesis, we describe low-level features as those extracted directly from the audio (duration,
zero-crossing rate etc.), mid-level features as those which require significant processing beyond this,
and high-level features as those which summarise an entire song. Tasks are defined as mid-level (for
instance) if they attempt to identify mid-level features.
4
1.3 Objectives
such as cover song identification [27], genre detection [91] and lyrics-to-audio alignment
[70]. Thus, advancement in automatic chord recognition will impact beyond the task
itself and lead to developments in some of the areas listed above.
A Machine Learning Approach
One may train a chord recognition system either by using expert knowledge or by making use of previously available training examples, known as “ground truth”, through
Machine Learning (ML). In the annual MIREX (Music Information Retrieval Evaluation eXchange) evaluations, both approaches to the task are very competitive at
present, with algorithms in both cases exceeding 80% accuracy (see Subsection 2.6.4).
In any recognition task where the total number of examples is sufficiently small, an
expert system will be able to perform well, as there will likely be less variance in the
data, and one may specify parameters which fit the data well. At the other extreme, in
cases of large and varied test data, it is impossible to specify the parameters necessary
to attain good performance - a problem known as the acquisition bottleneck [31].
However, if sufficient training data are available for a task, machine learning systems
may lead to higher generalisation potential than expert systems. This point is specifically important in the domain of chord estimation, since a large number of new ground
truths have been made available in recent months, which means that the generalisation
of a machine-learning system may be tested. The prospect of good generalisation of an
ML system to unseen data is the third motivating factor for this work.
1.3
Objectives
The objectives of this thesis echo the motivations discussed above. However, we must
first investigate the literature to define the state of the art and see which techniques
have been used by previous researchers in the field. Thus a thorough review of the
5
1. INTRODUCTION
literature is the first main objective of this thesis.
Once this has been conducted, we may address the second objective: developing a
system that performs at the state of the art (discussions of evaluation strategies are
postponed until Section 2.6). This will involve the construction of two main facets: the
development of a new chromagram feature vector for representing harmony, and the
decoding of these features into chord sequences via a new graphical model.
Finally, we will investigate and exploit one of the main advantages of deploying a
machine learning based chord recognition: it may be retrained on new data as it arises.
Thus, our final objective will be to evaluate how our proposed system performs when
trained on recently available training data and also test the generalisation of our model
to new datasets.
1.4
Contributions and thesis structure
The four main contributions of this thesis are:
• A thorough review of the literature of automatic chord estimation, including the
MIREX evaluations and major publications in the area.
• The development of a new chromagram feature representation which is based on
the human perception of loudness of sounds.
• A new Dynamic Bayesian Network (DBN) which concurrently recognises the
chords, keys and basslines of popular music which, in addition to the above,
attains state of the art performance on a known set of ground truths.
• Detailed train/test scenarios using all the current data available for researchers
in the field, with additional use of online chord databases for use in the training
and testing phase.
6
1.4 Contributions and thesis structure
These contributions are highlighted in the main chapters of this thesis. A graphical
representation of our main algorithm, highlighting the thesis structure, is shown in
Figure 1.2. We also provide brief summaries of the remaining chapters:
Chapter 2: Background
In this chapter, the relevant background information to the field is given. We begin
with some preliminary definitions and discussions of the function of chords in Western
Popular music. We then give a detailed account of the literature to date, with particular focus on feature extraction, modelling strategies, training schemes and evaluation
techniques.
Chapter 3: Chromagram Extraction
Feature extraction is the focus of this chapter. We outline the motivation for loudnessbased chromagrams, and then describe each stage of their calculation. We follow this
by conducting experiments to highlight the efficacy of these features on a trusted set
of 217 popular recordings for which the ground truth sequences are known.
Chapter 4: Dynamic Bayesian Network
This chapter is concerned with our decoding process: a Dynamic Bayesian Network
with hidden nodes that represents chords, keys and basslines/inversions, which we call
the Harmony Progression Analyser (HPA). We begin by formalising the mathematics of
the model and decoding process, before incrementally increasing the model complexity
from a simple Hidden Markov Model (HMM) to HPA, by adding hidden nodes and
transitions.
These models are evaluated in accordance with the MIREX evaluations and are
shown to attain state of the art performance on a set of 25 chord states representing
the 12 major chords, 12 minor chords, and a No Chord symbol for periods of silence,
7
1. INTRODUCTION
speaking or for other times when no chord can be assigned. We finish this chapter
by introducing a wider set of chord alphabets and discuss how one might deal with
evaluating ACE systems on such alphabets.
Chapter 5: Exploiting Additional Data
In previous chapters, we used a trusted set of ground truth chord annotations which
have been used numerous times in the annual MIREX evaluations. However, recently
a number of new annotations have been made public, offering a chance to retrain HPA
on a set of new labels. To this end, chapter 5 deals with training and testing on
these datasets to ascertain whether learning can be transferred between datasets, and
also investigates learning rates for HPA. We then move on to discuss how partially
labelled data may be used in either testing or training a machine learning based chord
estimation algorithm, where we introduce a new method for aligning chord sequences
to audio called jump alignment and additionally an evaluation scheme for estimating
the alignment quality.
Chapter 6: Conclusion
This final chapter summarises the main findings of the thesis and suggests areas where
future research might be advisable.
1.5
Relevant Publications
A selection of relevant publications is presented in this section. Although the author
has had publications outside the domain of automatic chord estimation, the papers
presented here are entirely in this domain and relevant to this thesis. These works
also tie in the main contributions of the thesis: journal paper 3 is an extension of the
literature review from chapter 2, journal paper 1 [81] forms the basis of chapters 3 and
8
1.5 Relevant Publications
4, whilst journal paper 2 [74] and conference paper 1 [73] form the basis of chapter 5.
Journal Papers
• Y. Ni, M. McVicar, R. Santos-Rodriguez. and T. De Bie. An end-to-end machine
learning system for harmonic analysis of music. IEEE Transactions on Audio,
Speech and Language Processing [81]
[81] is based on early work (not otherwise published) by the author on using keyinformation in chord recognition, which has guided the design of the structure the DBN
put forward in this paper. The structure of the DBN is also inspired by musicological
insights contributed by the thesis author. Early research by the author (not otherwise
published) on the use of the constant-Q transform for designing chroma features has
contributed to the design of the LBC feature introduced in this paper. All aspects of
the research were discussed in regular meetings involving all authors. The paper was
written predominantly by the first author, but all authors contributed original material.
• M. McVicar, Y. Ni, R. Santos-Rodriguez. and T. De Bie. Using Online Chord
Databases to Enhance Chord Recognition. Journal of New Music Research, Special
Issue on Music and Machine Learning [74]
The research into using alignment of untimed chord sequences for chord recognition was
initiated by Tijl De Bie and the thesis author. It first led to a workshop paper [72], and
[74] is an extension of this paper which includes also the Jump Alignment algorithm
which was developed by Yizhao Ni but discussed by all authors. The paper was written
collaboratively by all authors. The second author of [73] contributed insight and experiments which did not make it into the final version of the paper, with remainder being
composed and conducted by the first author. The paper was predominantly written by
the first author.
9
1. INTRODUCTION
• M. McVicar, Y. Ni, R. Santos-Rodriguez. and T. De Bie. Automatic Chord
Estimation from Audio: A Review of the State of the Art (submitted). IEEE
Transactions on Audio, Speech and Language Processing [75]
Finally, journal paper three was researched and written primarily by the first author,
with contributions from the third author concerning ACE software.
Conference Papers
1. M. McVicar, Y. Ni, R. Santos-Rodriguez and T. De Bie. Leveraging noisy online
databases for use in chord recognition. In Proceedings of the 12th International
Society for Music Information Retrieval (ISMIR), 2011 [73]
1.6
Conclusions
In this chapter, we discussed the motivation for our subject: automatic chord estimation. We also defined our main research objective: the development of a chord
recognition system based entirely on machine-learning techniques, which may take full
advantage of the newly released data sources that have become available. We went on
to list the main contributions to the field contained within this thesis, and how these
appear within the structure of the work. These contributions were also highlighted in
the main publications by the author.
10
1.6 Conclusions
Training audio
Chromagram
Extraction
(Chap 3)
Training
Chromagram
Test Audio
Fully
labelled
training data
Partially
labelled
training data
HPA training
(Chap. 5)
Chromagram
Extraction*
Testing
Chromagram
MLE parameters Partially
labelled test
data
HPA decoder
(Chap 4)
Predic-on Fully labelled test data Evaluation
scheme
Performance Figure 1.2: Graphical representation of the main processes in this thesis. Rectangles
indicate data sources, whereas rounded rectangles represent processes. Processes and data
with asterisks form the bases of certain chapters. Chromagram Extraction is the basis
for chapter 3, the main decoding process (HPA decoding) is covered in chapter 4, whilst
training is the basis of chapter 5.
11
1. INTRODUCTION
12
2
Background
This chapter is an introduction to the domain of automatic chord estimation. We begin
by describing chords and their function in musical theory in section 2.1. A chronological
account of the literature is given in section 2.2, which is discussed in detail in sections
2.3 - 2.6. We focus here on Feature extraction, Modelling strategies, Datasets and
Training, and finally Evaluation Techniques. Since their use is so ubiquitous in the field,
we devote section 2.7 to the Hidden Markov Model for automatic chord extraction. We
conclude the chapter in section 2.8.
2.1
Chords and their Musical Function
This section serves to introduce the theory behind our chosen subject: musical chords.
The definition and function of chords in musical theory is discussed, with particular
focus on Western Popular music, the genre on which our work will be conducted.
2.1.1
Defining Chords
Before discussing how chords are defined, we must first begin with the more fundamental
definitions of frequency and pitch. Musical instruments (including the voice) are able
13
2. BACKGROUND
to vibrate at a fixed number of oscillations per second, known as their fundamental
frequency, f0 measured in Hertz (Hz). Although frequencies higher (harmonics) and
lower (subharmonics) than f0 are produced simultaneously, we postpone the discussion
of this until section 2.3.
The word pitch, although colloquially similar to frequency, means something quite
different. Pitch is defined as the perceptual ordering of sounds on a frequency scale
[47]. Thus, pitch relates to how we are able to differentiate between lower and higher
fundamental frequencies. Pitch is approximately proportional to the logarithm of frequency, and in Western equal-temperament, the fundamental frequency f of a pitch is
defined as
f = fref 2n/12 ,
n = {. . . , −1, 0, 1, . . .},
(2.1)
where fref is a reference frequency, usually taken to be 440 Hz. The distance (interval)
between two adjacent pitches is known as a semitone, a tone being twice this distance.
Notice from Equation 2.1 that pitches 12 semitones apart have a frequency ratio of 2,
an interval known as an octave, which is a property captured in the notions of pitch
class and pitch height [112].
It has been noted that the human auditory system is able to distinguish pitch
classes, which refers to the value of n mod 12 in Equation 2.1, from pitch height,
n
which describes the value of b 12
c, (b·c represents the floor function) [101]. This means
that, for example, we hear two frequencies an octave apart as the same note. This
phenomenon is known as octave equivalence and has been exploited by researchers in
the design of chromagram features (see section 2.3).
Pitches are often described using modern musical notation to avoid the use of irrational frequency numbers. This is a combination of letters (pitch class) and numbers
(pitch height), where we define A4 = 440 Hz and higher pitches as coming from the
14
2.1 Chords and their Musical Function
pitch class set
PC = {C, C], D, D], E, F, F ], G, G] A, A], B}
(2.2)
until we reach B4, when we loop round to C5 (analogously for lower pitches). In
this discussion and throughout this thesis we will assume equivalence between sharps
and flats, i.e. G]4 = A[4. We now turn our attention to collections of pitches played
together, which is intuitively the notion of a chord.
The word chord has many potential characterisations and there is no universally
agreed upon definition. For example, Merriam-Webster’s dictionary of English usage
[76] claims:
Definition 1. Everyone agrees that chord is used for a group of musical tones,
whilst Krolyi [42] is more specific, stating:
Definition 2. Two or more notes sounding simultaneously are known as a chord.
Note here the concept of pitches being played simultaneously. Note also that it is
not specified that the notes come from one particular voice, so that a chord may be
played by a collection of instruments. Such music is known as Polyphonic (conversely
Monophonic). The Harvard Dictionary of music [93] defines a chord more strictly as a
collection of three or more notes:
Definition 3. Three or more pitches sounded simultaneously or functioning as if
sounded simultaneously.
Here the definition stretches to allow notes played in succession to be a chord - a concept
known as an arpeggio. In this thesis, we define a chord to be a collection of 3 or more
notes played simultaneously. Note however that there will be times when we will need
to be more flexible when dealing with, for instance, pre-made ground truth datasets
such as those by Harte et al. [36]. In cases when datasets such as these contradict our
definition we will map them to a suitable chord to our best knowledge. For instance,
15
2. BACKGROUND
the aforementioned dataset contains examples such as A:(1,3), meaning an A and C]
note played simultaneously, which we will map to a C:maj chord. We now turn our
attention to how chords function within the theory of musical harmony.
2.1.2
Musical Keys and Chord Construction
In popular music, chords are not chosen randomly as collections of pitch classes. Instead, a key is used to define a suitable library of pitch classes and chords. The most
canonical example of a collection of pitch classes is the major scale, which, given a root
(starting note) is defined as the set of intervals Tone-Tone-Semitone-Tone-Tone-ToneSemitone. For instance, the key of C Major contains the pitch classes
C Major = {C, D, E, F, G, A, B}.
(2.3)
For each of these pitch classes we may define a chord. By far the most common
chord types are triads, consisting of three notes. For instance, we may take a chord
root (a pitch class) and add to it a third (two notes up in the key) and a fifth (4 notes)
to create a triad. Doing this for the example case of C Major gives us the following
triads:
{[C, E, G], [D, F, A], [E, G, B], [F, A, C], [G, B, D], [A, C, E], [B, D, F ]}.
(2.4)
Inspecting the intervals in these chords, we see three classes emerge - one in which
we have four semitones followed by three (those with roots C, F, G), one where there are
three semitones followed by four (roots D, E, A) and finally three following three (root
B). These chord types are known as major,minor and diminished triads respectively.
Thus we may define the chords in C Major to be C:maj, D:min, E:min, F:maj, G:maj,
A:min, and B:dim, where we have adopted Chris Harte’s suggested chord notation [36].
There are many other possible chord types other than these, some of which will be
16
2.1 Chords and their Musical Function
considered in our model (see section 4.3).
We have presented the work here as chords being constructed from a key, although
one may conversely consider a collection of chords as defining a key. This thorny issue
was considered by Raphael [95] and a potential solution in modelling terms offered by
some authors [16, 57] by estimating the chords and keys simultaneously (see subsection
2.4 for more details on this strategy). Keys may also change throughout a piece, and
thus the associated chords in a piece may change (a process known as modulation).
This has been modelled by some authors, leading to an improvement in recognition
accuracy of chords [65].
2.1.3
Chord Voicings
On any instrument with a tonal range of over one octave, one has a choice as to which
order to play the notes in a given chord. For instance, C:maj = {C, E, G} can be
played as (C, E, G), (E, G, C) or (G, C, E). These are known as the root position, first
inversion and second inversion of a C Major chord respectively.
When constructing 12–dimensional chromagram vectors (see section 2.3), this poses
a problem: how are we to distinguish between inversions in recognition, or evaluation?
These issues will be dealt with in sections 2.4 and 2.6.
2.1.4
Chord Progressions
Chords are rarely considered in isolation and as such music composers generally collate
chords into a time series. A collection of chords played in sequence is known as a
chord progression, a typical example of which is shown in Figure 2.1, where we have
adopted Chris Harte’s suggested syntax for representing chords, where for the most
part chord symbols are represented as rootnote:chordtype/inversion, with some
shorthand notation for major chords (no chord type) and root inversion (no inversion)
[36].
17
2. BACKGROUND
0.000000 2.612267 N
2.612267 11.459070 E
11.459070 12.921927 A
12.921927 17.443474 E
17.443474 20.410362 B
20.410362 21.908049 E
21.908049 23.370907 E:7/3
23.370907 24.856984 A
...
Figure 2.1: Section of a typical chord annotation, showing onset time (first column),
offset time (second column), and chord label (third column).
Certain chord transitions are more common than others, a fact that has been exploited by authors of expert systems in order to produce more musically meaningful
chord predictions [4, 65].
This concludes our discussion of the musical theory of chords. We now turn our
attention to a thorough review of the literature of automatic chord estimation.
2.2
Literature Summary
A concise chronological review of the associated literature is shown in Tables 2.1 to 2.5.
The following sections deal in detail with the key advancements made by researchers
in the domain.
18
Fujishima, T.
1999
Su, B. et al.
2001
19
Sheh, A. and
Ellis, D.
Yoshioka, T. et al.
2003
2004
Pauws, S.
Raphael, C.
2002
Bartsch, M.A. and
Wakefield, G.H.
Nawab, S.H. et al.
Bello, J.P. et al.
2000
Wakefield, G.H.
Author(s)
Year
Automatic Chord Transcription with Concurrent
Recognition of Chord Symbols and Boundaries [118]
Musical Key Extraction from Audio [90]
Chord Segmentation and Recognition using EM-Trained
Hidden Markov Models [99]
Automatic Transcription of Piano Music [94]
Multi-timbre Chord Classification using Wavelet
Transform and Self-Organized Neural Networks [106]
Identification of Musical Chords using Constant-Q
spectra [79]
To Catch a Chorus: Using Chroma-based Representations
for Thumbnailing [3]
Techniques for Automatic Music Transcription [5]
Realtime Chord Recognition of Musical Sound: a System
Using Common Lisp Music [33]
Mathematical Representation of Joint Time-chroma
Distributions [112]
Title (Reference)
Simultaneous boundary/label
detection
Removal of background spectrum
and processing of harmonics
HMM for chord recognition,
Gaussian emission probabilities,
training from labelled data
HMM for melody extraction
Chroma features for audio
structural segmentation
Use of Wavelets, Self-OrganisingMap
Use of Constant-Q Spectrum
Use of autocorrelation function for
pitch tracking
PCP vector, template matching,
smoothing
Mathematical foundation of
chromagrams feature vectors
Key Contribution(s)
Table 2.1: Chronological summary of advances in automatic chord recognition from audio, years 1999-2004.
2.2 Literature Summary
2006
Bello, J.P. and Pickens, J.
2005
20
Harte, C. et al.
Lee, K.
G´
omez, E. and Herrera, P.
Harte, C. et al.
Burgoyne, J.A. and Saul, L.K.
Shenoy, A. and Wang, Y.
Cabral, G. et al.
Harte, C.A. and Sandler, M.
Author(s)
Year
A Robust Mid-Level Representation for
Harmonic Content in Music Signals [4]
Automatic Chord Identification using a
Quantised Chromagram [38]
Automatic X Traditional Descriptor Extraction:
the Case of Chord Recognition [15]
Key, Chord, and Rhythm Tracking of Popular
Music Recordings [100]
Learning Harmonic Relationships in Digital
Audio with Dirichlet-based Hidden Markov
Models [11]
Symbolic Representation of Musical chords:
A Proposed syntax for Text Annotations [36]
The Song Remains the Same: Identifying versions
Transposed by Key Versions of the Same Piece
using Tonal Descriptors [34]
Automatic Chord Recognition from Audio using
Enhanced Pitch Class Profile [54]
Detecting Harmonic Change in Musical Audio [37]
Title (Reference)
Removal of harmonics to match
PCP templates
Tonal centroid feature
Textual notation of chords,
Beatles dataset
Cover-song identification using
chroma vectors
Dirichlet emission probability
model
Beat-synchronous chroma,
expert parameter knowledge
36-bin chromagram tuning
algorithm
Use of Extractor Discovery
System
Expert key knowledge
Key Contribution(s)
Table 2.2: Chronological Summary of advances in automatic chord recognition from audio, years 2005-2006.
2. BACKGROUND
2008
Catteau, B. et al.
2007
21
Mauch, M. et al.
Lee, K.
Papadopoulos, H and
Peeters, G.
Varewyck, M. et al.
Sumi, K. et al.
Lee, K. and Slaney, M.
Papadopoulos, H. and
Peeters, G.
Zenz, V. and Rauber, A.
Burgoyne, J.A. et al.
Author(s)
Year
Automatic Chord Recognition based on Probabilistic
Integration of Chord Transition and bass Pitch
Estimation [107]
Simultaneous Estimation of Chord Progression and
Downbeats from an Audio File [88]
A Novel Chroma Representation of Polyphonic Music
Based on Multiple Pitch Tracking Techniques [111]
A System for Automatic Chord Transcription from Audio
Using Genre-Specific Hidden Markov Models [55]
A Discrete Mixture Model for Chord Labelling [63]
Large-Scale study of Chord Estimation Algorithms
Based on Chroma Representation and HMM [87]
Automatic Chord Detection Incorporating Beat
and Key Detection [119]
A Unified System for Chord Transcription and Key
Extraction using Hidden Markov Models [56]
A Probabilistic Framework for Tonal Key and Chord
Recognition [16]
A Cross-Validated Study of Modelling Strategies for
Automatic Chord Recognition in Audio [12]
Title (Reference)
Bass chromagram
Simultaneous beat/chord
estimation
Simultaneous background
spectra & harmonic removal
Genre-specific HMMs
Integration of bass pitch
information
Rigorous framework for joint
key/chord estimation
Cross-validation on Beatles
data, Conditional Random
Fields
Comparative study of expert
vs. trained systems
Combined key, beat and chord
model
Key-specific HMMs,
tonal centroid in key detection
Key Contribution(s)
Table 2.3: Chronological summary of advances in automatic chord recognition from audio, years 2007-2008.
2.2 Literature Summary
Scholz, R. et al.
2009
22
Khadkevich, M. and
Omologo, M.
Noland, K. and
Sandler, M.
Mauch, M. et al.
Reed, J.T. et al.
Weller, A. et al.
Weil, J. et al.
Cho, T. and
Bello, J.P.
Oudre, L. et al.
Author(s)
Year
Robust Modelling of Musical Chord Sequences using
Probabilistic N −Grams [98]
Real-time Implementation of HMM-based Chord
Estimation in Musical Audio [19]
Template-Based Chord Recognition: Influence of the
Chord Types [86]
Automatic Generation of Lead Sheets from Polyphonic
Music Signals [114]
Structured Prediction Models for Chord Transcription
of Music Audio [115]
Minimum Classification Error Training to Improve
Isolated Chord Recognition [96]
Using Musical Structure to Enhance Automatic Chord
Transcription [68]
Use of Hidden Markov Models and Factored Language
Models for Automatic Chord Recognition [45]
Influences of Signal Processing, Tone Profiles, and Chord
Progressions on a Model for Estimating the Musical Key
from Audio [83]
Title (Reference)
In-depth study on integrated chord
and key dependencies
Real-time chord recognition
system
Comparison of template distance
metrics and smoothing techniques
Polyphonic extraction of lead
sheets
SVMstruct, incorporating future
frame information
Harmonic and Percussive Source
Separation (HPSS)
Structural segmentation as an
additional information source
Factored language model
n−gram language model
Key Contribution(s)
Table 2.4: Chronological summary of advances in automatic chord recognition from audio, 2009.
2. BACKGROUND
2011
Mauch, M.
2010
23
Macrae, R. and
Dixon, S.
Cho, T. and
Bello, J.P.
Yoshii, K. and
Goto, M.
Jiang, N. et al.
Burgoyne, J.A. et al.
Mauch, M. et al.
Konz, V. et al.
Cho, T. et al.
Ueda, Y. et al.
Author(s)
Year
Billboard Hot 100 dataset of
chord annotations
Comparison of modern
chromagram types
Web-based chord labels
Recurrence plot for
smoothing
Infinity-gram language model
A Feature Smoothing Method for Chord Recognition
Using Recurrence Plots [20]
A Vocabulary-Free Infinity-Gram Model for
Non-parametric Bayesian Chord Progression Analysis [117]
HPSS with additional
post-processing
Comparison of pre and postfiltering techniques
and models
Visualisation of evaluation
techniques
Chord sequences in lyrics
alignment
DBN model, NNLS chroma
Key Contribution(s)
An Expert Ground Truth Set for Audio Chord
Recognition and Music Analysis [13]
Analysing Chroma Feature Types for Automated
Chord Recognition [40]
Guitar Tab Mining, Analysis and Ranking [60]
A Multi-perspective Evaluation Framework for Chord
Recognition [49]
Lyrics-to-audio Alignment and Phrase-level Segmentation
using Incomplete Internet-style Chord Annotations [69]
Automatic Chord Transcription from Audio using
Computational Models of Musical Context [62]
HMM-based approach for Automatic Chord Detection
using Refined Acoustic Features [109]
Exploring Common Variations in State of the Art Chord
Recognition Systems [21]
Title (Reference)
Table 2.5: Chronological summary of advances in automatic chord recognition from audio, years 2010-2011.
2.2 Literature Summary
2. BACKGROUND
2.3
Feature Extraction
The dominant feature used in automatic chord recognition is the chromagram. We
give a detailed account of the signal processing techniques associated with this feature
vector in this section.
2.3.1
Early Work
The first mention of chromagram feature vectors to our knowledge was by Shepard
[101], where it was noticed that two dimensions, (tone height and chroma) were useful
in explaining how the human auditory system functions. The word chroma is used
to describe pitch class, whereas tone height refers to the octave information. Early
methods of chord prediction were based on polyphonic note transcription [1, 17, 43, 61],
although it was Fujishima [33] who first considered automatic chord recognition as a
task unto itself. His Pitch Class Profile (PCP) feature involved taking a Discrete
Fourier Transform of a segment of the input audio, and from this calculating the power
evolution over a set of frequency bands. Frequencies which were close to each pitch
class (C, C ] , . . . , B) were then collected and collapsed to form a 12–dimensional PCP
vector for each time frame.
For a given input signal, the PCP at each time instance was then compared to
a series of chord templates using either nearest neighbour or weighted sum distance.
Audio input was monophonic piano music and an adventurous 27 chord types were used
as an alphabet. Results approached 94%, measured as the total number of correctly
identified frames divided by total number of frames.
The move from the heuristic PCP vectors to the mathematically-defined chromagram was first rigorously treated by Wakefield [112], who showed that a chromagram
is invariant to octave translation, suggested a method for its calculation and also noted
that chromagrams could be useful for visualisation purposes, demonstrated by an ex-
24
2.3 Feature Extraction
Figure 2.2: A typical chromagram feature matrix, shown here for the opening to Let It
Be (Lennon/McCartney). Salience of pitch class p at time t is estimated by the intensity of
(p, t)th entry of the chromagram, with lighter colours in this plot indicating higher energy
(see colour bar between chromagram and annotation). The reference (ground truth) chord
annotation is also shown above for comparison, where we have reduced the chords to major
and minor classes for simplicity.
ample of a solo female voice.
An alternative solution to the pitch tracking problem was proposed by Bello et
al. [5], who suggested using the autocorrelation of the signal to determine pitch class.
25
2. BACKGROUND
Audio used in this paper was a polyphonic, mono-timbral re-synthesis from a digital
score, and an attempt at full transcription of the original was attempted.
An investigation into polyphonic transcription was attempted by Su, B. and Jeng,
S.K. [106], where they suggested using wavelets as audio features, achieving impressive
results on a recording of the 4th movement of Beethoven’s 5th symphony.
2.3.2
Constant-Q Spectra
One of the drawbacks of a Fourier-transform analysis of a signal is that it uses a fixed
window resolution. This means that one must make a trade-off between the frequency
and time resolution. In practice this means that with short windows, one risks being
unable to detect frequencies with long wavelength, whilst with a long window, a poor
time resolution is obtained.
A solution to this is to use a frequency-dependent window length, an idea first
implemented for music in [10]. In terms of the chord recognition task, it was used
in [79], and has become very popular in recent years [4, 68, 118]. The mathematical
details of the constant-Q transform will be discussed in later sections.
2.3.3
Background Spectra and Consideration of Harmonics
Background
When considering a polyphonic musical excerpt, it is clear that not all of the signals
will be beneficial in the understanding of harmony. Some authors [90] have defined this
as the background spectrum, and attempted to remove it in order to enhance the clarity
of their features.
One such background spectrum could be considered the percussive elements of the
music, when working in harmony-related tasks. An attempt to remove this spectrum
was introduced in [84] and used to increase chord recognition performance in [96]. It
is assumed that the percussive elements of a spectrum (drums etc.) occupy a wide
26
2.3 Feature Extraction
frequency range but are narrow in the time domain, and harmony (melody, chords,
bassline) conversely. The spectrum is assumed to be a simple sum of percussive and
harmonic material and can be diffused into two constituent spectra, from which the
harmony can be used for chordal analysis.
This process is known as Harmonic Percussive Source Separation (HPSS) and is
shown in [96] and [109] to improve chord recognition significantly. The latter study
also showed that employing post-processing techniques on the chroma including Fourier
transform of chroma vector and increasing the number of states in the HMM by up to
3 offered improvements in recognition rates.
Harmonics
It is known that musical instruments emit not only pure tones f0 , but a series of harmonics at higher frequencies, and subharmonics at lower frequencies. Such harmonics
can easily confuse feature extraction techniques, and some authors have attempted to
remove them in the feature extraction process [54, 65, 87, 90]. An illustrative example
of (sub)harmonics is shown in Figure 2.3.
Figure 2.3: Constant-Q spectrum of a piano playing a single A4 note. Note that, as well
as the fundamental at f0 =A4, there are harmonics at one octave (A5) and one octave
plus a just perfect fifth (E5). Higher harmonics exist but are outside the frequency range
considered here. Notice also the slight presence of a fast-decaying subharmonic at two
octaves down, A2.
27
2. BACKGROUND
A method of removing the background spectra and harmonics simultaneously was
proposed in [111], based on multiple pitch tracking. They note that their new features matched chord profiles better than unprocessed chromagrams, a technique which
was also employed by [65]. An alternative to processing the spectrum is to introduce
harmonics into the modelling strategy, a concept we will discuss in section 2.4.
2.3.4
Tuning Compensation
In 2003, Sheh and Ellis [99] identified that some popular music tracks are not tuned to
standard pitch A4 = 440 Hz, meaning that for these songs, chromagram features may
misrepresent the salient pitch classes. To counteract this, they constructed finer-grained
chromagram feature vectors of 24, instead of 12, dimensions, allowing for flexibility in
the tuning of the piece. Harte [38] introduced a tuning algorithm which computed a
chromagram feature matrix over a finer granularity of 3 frequency bands per semitone,
and searched for the sub-band which contained the most energy. This was chosen as
the tuning of the piece and the actual saliences inferred by interpolation. This method
also used by Bello and Pickens [4] and in Harte’s own work [37].
2.3.5
Smoothing/Beat Synchronisation
It was noticed by Fujishima [33] that using instantaneous chroma features did not
provide musically meaningful predictions, owing to transients meaning predicted chords
were changing too frequently. As an initial solution, some smoothing of the PCP
vectors was introduced. This heuristic was repeated by other authors using templatebased chord recognition systems (see section 2.4), including [52]. In [4], the concept
of exploiting the fact that chords are relatively stable between beats [35] was used to
create beat-synchronous chromagrams, where the time resolution is reduced to that of
the main pulse. This method was shown to be superior in terms of recognition rate,
but also had the advantage that the overall computation cost is also reduced, owing to
28
2.3 Feature Extraction
the total number of frames typically being reduced. Examples of smoothing techniques
are shown in Figure 2.4.
(a) No Smoothing
(b) Median smoothing
(c) Beat-synchronisation
Figure 2.4: Smoothing techniques for chromagram features. In 2.4a, we see a standard
chromagram feature. Figure 2.4b shows a median filter over 20 frames, 2.4c shows a beatsynchronised chromagram.
Popular methods of smoothing chroma features are to take the mean [4] or median
[65] salience of each of the pitch classes between beats. In more recent work, [20]
recurrence plots were used within similar segments and shown to be superior to beat
synchronisation or mean/median filtering.
Papadopoulus and Peeters [88] noted that a simultaneous estimate of beats led to
an improvement in chords and vice-versa, supporting an argument that an integrated
model of harmony and rhythm may offer improved performance in both tasks. A
comparative study of post-processing techniques was conducted in [21], where they
also compared different pre-filtering and modelling techniques.
2.3.6
Tonal Centroid Vectors
An interesting departure from traditional chromagrams was presented in [37], notably
a transform of the chromagram known as the Tonal Centroid feature. This feature
is based on the idea that close harmonic relationships such as perfect fifths and ma-
29
2. BACKGROUND
jor/minor thirds have large Euclidean distance in a chromagram representation of pitch,
and that a feature which places these pitches closer together may offer superior performance. To this end, the authors suggest mapping the 12 pitch classes onto a 6–
dimensional hypertorus which corresponds closely to Chew’s spiral array model [18].
This feature vector has also been explored by different authors also for key recognition
[55, 56].
2.3.7
Integration of Bass Information
It was first discussed in [107] that considering low bass frequencies as distinct from midrange and higher frequency tones could be beneficial in the task of chord recognition.
Within this work they estimated bass pitches from audio and add a bass probability
into an existing hypothesis-search-based method [118] and discovered an increase in
recognition rate of, on average, 7.9 percentage points when including bass information.
Bass frequencies of 55 − 220 Hz were also considered in [63], although this time
by calculating a distinct bass chromagram over this frequency range. Such a bass
chromagram has the advantage of being able to identify inversions of chords, which
we will discuss in chapter 4. A typical bass chromagram is shown, along with the
corresponding treble chromagram, in Figure 2.5.
2.3.8
Non-Negative Least Squares Chroma (NNLS)
In an attempt to produce feature vectors which closely match chord templates, Mauch
[62] proposed the generation of Non-Negative Least Squares (NNLS) chromagrams,
where it is assumed that the frequency spectrum Y is represented by a linear combination of note profiles from a dictionary matrix E, multiplied by an activation vector
x ≥ 0, Y ∼ Ex.
Then, given a dictionary (a set of chord templates with induced harmonics whose
amplitudes decrease in an arithmetic series [64]), it is required to find x which min-
30
2.3 Feature Extraction
(a) Treble Chromagram
(b) Bass Chromagram
Figure 2.5: Treble (2.5a) and Bass (2.5b) Chromagrams, with the bass feature taken over
a frequency range of 55 − 207 Hz in an attempt to capture inversions.
imises ||Y − Ex||. This is known as a non-negative least squares problem [53] and can
be solved uniquely in the case when E has full rank and more rows than columns.
Within [64] NNLS chroma are shown to achieve an improvement of 6 percentage points
over the then state of the art system by the same authors. An example of an NNLS
chroma is shown in Figure 2.6, showing the low background spectrum level.
(a) Treble Chromagram
(b) NNLS Chromagram
Figure 2.6: Regular (a) and NNLS (b) chromagram feature vectors. Note that the NNLS
chromagram is a beat-synchronised feature.
31
2. BACKGROUND
A comparative study of modern chromagram types was also conducted in [40],
and later developed into a toolkit for research purposes [78]. We have seen many
techniques for chromagram computation in this Section. Some of these (constant-Q,
tuning, beat-synchronisation, bass chromagrams) will be used in the design of our
features (see Chapter 3, whilst others (Tonal centroid vectors) will not. The author
decided against using tonal centroid vectors as they are low-dimensional and therefore
suited to situations with less training data, and also less easily interpreted than a
chromagram representation.
2.4
Modelling Strategies
In this section, we review the next major problem in the domain of chord recognition:
assigning labels to chromagram (or related feature) frames. We begin with a discussion
of simple pattern-matching techniques.
2.4.1
Template Matching
Template matching involves comparing feature vectors against the known distribution
of notes in a chord. Typically, a 12–dimensional chromagram is compared to a binary
vector containing ones where a trial chord has notes present. For example, the template
for a C:major chord would be [1 0 0 0 1 0 0 1 0 0 0 0]. Each frame of the chromagram is
compared to a set of templates, and the template with minimal distance to the chroma
is output as the label for this frame (see Figure 2.7).
This technique was first proposed by Fujishima, where he used either the nearest
neighbour template or a weighted sum [33] as a distance metric between templates and
chroma frames. Similarly, this technique was used by Cabral and collaborators [15]
who compared it to the Extractor Discovery System (EDS) software to classify chords
in Bossa Nova songs.
32
2.4 Modelling Strategies
Figure 2.7: Template-based approach to the chord recognition task, showing chromagram
feature vectors, reference chord annotation and bit mask of chord templates.
An alternative approach to template matching was proposed in [106],where they
used a self-organising map, trained using expert knowledge. Although their system
perfectly recognised the input signal’s chord sequence, it is possible that the system
is overfitted as it was measured on just one song instance. A more recent example
of a template-based method is presented in [86], where they compared three distance
33
2. BACKGROUND
measures and two post-processing smoothing types and found that Kullback-Leibler
divergence [52] and median filtering offered an improvement over the then state of the
art. Further examples of template-based chord recognition systems can be found in
[85].
2.4.2
Hidden Markov Models
Individual pattern matching techniques such as template matching fail to model the
continuous nature of chord sequences. This can be combated by either using smoothing
methods as seen in 2.3 or by including duration in the underlying model. One of the
most common ways of incorporating smoothness in the model is to use a Hidden Markov
Model (HMM, defined formally in Section 2.7).
An HMM models a time-varying process where one witnesses a sequence of observed
variables coming from a corresponding sequence of hidden nodes, and can be used to
formalize a probability distribution jointly for the chromagram feature vectors and the
chord annotations of a song. In this model, the chords are modelled as a first-order
Markovian process. Furthermore, given a chord, the feature vector in the corresponding
time window is assumed to be independent of all other variables in the model. The
chords are commonly referred to as the hidden variables and the chromagram feature
vectors as the observed variables, as the chords are typically unknown and to be inferred
from the given chromagram feature vectors in the chord recognition task. See Figure
2.8 for a visual representation of an HMM.
Arrows in Figure 2.8 refer to the inherent conditional probabilities of the HMM
architecture. Horizontal arrows represent the probability of one chord following another
(the transition probabilities), vertical arrows the probability of a chord emitting a
particular chromagram (the emission probabilities). Learning these probabilities may
either be done using expert knowledge or using labelled training data.
Although HMMs are very common in the domain of speech recognition [92], we
34
2.4 Modelling Strategies
H1
H2
HT-1
HT
O1
O2
OT-1
OT
Figure 2.8: Visualisation of a first order Hidden Markov Model (HMM) of length T. Hidden states (chords) are shown as circular nodes, which emit observable states (rectangular
nodes, chroma frames).
found the first example of an HMM in the domain of transcription to be [61], where
the task was to transcribe piano notation directly from audio. In terms of chord recognition, the first example can be seen in the work by Sheh and Ellis [99], where HMMs
and the Expectation-Maximisation algorithm [77] are used to train a model for chord
boundary prediction and labelling. Although initial results were quite poor (maximum
recognition rate of 26.4%), this work inspired the subsequently dominant use of the
HMM architecture in the chord recognition task.
A real-time adaptation of the HMM architecture was proposed by Cho and Bello
[19], where they found that with a relatively small lag of 20 frames (less than 1 second),
performance is less than 1% worse than an HMM with access to the entire signal. The
idea of real-time analysis was also explored in [104], where they employ a simpler,
template-based approach.
2.4.3
Incorporating Key Information
Simultaneous estimation of chords and keys can be obtained by including an additional
hidden chain into an HMM architecture. An example of this can be seen in Figure 2.9.
The two-chain HMM clearly has many more conditional probabilities than the simpler
HMM, owing to the inclusion of a key chain. This is an issue for both expert systems
35
2. BACKGROUND
K1
K2
KT-1
KT
C1
C2
CT-1
CT
C1
C2
CT-1
CT
Figure 2.9: Two-chain HMM, here representing hidden nodes for Keys and Chords,
emitting Observed nodes. All possible hidden transitions are shown in this figure, although
these are rarely considered by researchers.
and train/test systems, since there may be insufficient knowledge or training data to
accurately estimate these distributions. As such, most authors disregard the diagonal
transitions in Figure 2.9 [65, 100? ].
2.4.4
Dynamic Bayesian Networks
A leap forward in modelling strategies came in 2010 with the introduction of Matthias
Mauch’s 2-Slice Dynamic Bayesian Network model (the two slices referring to the initial
distribution of states and the iterative slice) [62, 65], shown in Figure 2.10.
This complex model has hidden nodes representing metric position, musical key,
chord, and bass note, as well as observed treble and bass chromagrams. Dependencies
between chords and treble chromagrams are as in a standard HMM, but with additional
emissions from bass nodes to lower-range chromagrams, and interplay between metric
position, keys and chords. This model was shown to be extremely effective in the audio
chord estimation task in the MIREX evaluation, setting the cutting-edge performance
of 80.22% chord overlap ratio (see MIREX evaluations in Table 2.7).
36
2.4 Modelling Strategies
M1
M2
MT-1
MT
K1
K2
KT-1
KT
C1
C2
CT-1
CT
B1
B2
BT-1
BT
Cb1
Ct 1
Cb2
CbT-1
Ct 2
CtT-1
CbT
Ct T
Figure 2.10: Mathhias Mauch’s DBN. Hidden nodes Mi , Ki , Ci , Bi represent metric position, key, chord and bass annotations, whilst observed nodes Cit and Cib represent treble
and bass chromagrams.
2.4.5
Language Models
A language model for chord recognition was proposed by Scholz and collaborators [98],
based on earlier work [67, 110]. In particular, they suggest that the typical first-order
Markov assumption is insufficient to model music, and instead suggest the use of higherorder statistics such as n-gram models, for n > 2. They found that n−gram models offer
lower perplexities than HMMs (suggesting superior generalisation), but that results
were sensitive to the type of smoothing used, and that high memory complexity was
also an issue.
This idea was further expanded by the authors of [45], where an improvement of
around 2% was seen by using a factored language model, and further in [117] where
chord idioms similar to [67] are discovered as frequent n−grams, although here they
use an infinity-gram model where a specification of n is not required.
37
2. BACKGROUND
2.4.6
Discriminative Models
The authors of [12] suggest that the commonly-used Hidden Markov Model is not appropriate for use in the chord recognition task, preferring instead the use of a Conditional
Random Field (CRF), a type of discriminative model (as opposed to a generative model
such as an HMM).
During decoding, an HMM seeks to maximise the overall joint over the chords
and feature vectors P (X, Y). However, for a given song example the observation is
always fixed, so it may be more sensible to model the conditional P (Y|X), relaxing
the necessity for the components of the observations to be conditionally independent.
In this way, discriminative models attempt to achieve accurate input (chromagram) to
output (chord sequence) mappings.
An additional potential benefit to this modelling strategy is that one may address
the balance between, for example, the hidden and observation probabilities, or take into
account more than one frame (or indeed an entire chromagram) in labelling a particular
frame. This last approach was explored in [115], where the recently developed SVMstruct algorithm was used as opposed to CRF, in addition to incorporating information
about future chromagram frames to show an improvement over a standard HMM.
2.4.7
Genre-Specific Models
Lee [57] has suggested that training a single model on a wide range of genres may lead
to poor generalisation, an idea which was expanded on in [55], where they found that if
genre information was given (for a range of 6 genres), performance increased almost 10
percentage points. Also, they note that their method can be used to identify genre in
a probabilistic way, by simply testing all genre-specific models and choosing the model
with largest likelihood. Although their classes were very unbalanced, they correctly
identified 24/28 songs as rock (85.71%).
38
2.5 Model Training and Datasets
2.4.8
Emission Probabilities
When considering the probability of a chord emitting a feature vector in graphical
models such as [63, 74, 99] one must specify a probability distribution. A common
method for doing this is to use a 12–dimensional Gaussian distribution, i.e the probability of a chord c emitting a chromagram frame x is set as P (x|c) ∼ N(µ, Σ), with µ a
12–dimensional mean vector for each chord and Σ a covariance matrix for each chord.
One may then estimate µ and Σ from data or expert knowledge and infer the emission
probability for a (chord, chroma) pair.
This technique has been very widely used in the literature (see, for example [4,
40, 45, 99]). A slightly more sophisticated emission model is to consider a mixture of
Gaussians, instead of one per chord. This has been explored in, for example, [20, 96,
107].
A different emission model was proposed in [11], that of a Dirichlet model. Given
a chromagram with pitch classes p = {c1 , . . . , c12 }, each with probability {p1 , . . . , p12 }
P
and 12
i=1 pi = 1, pi > 0 ∀i, a dirichlet distribution with parameters u = {u1 , . . . , u12 }
is defined as
12
1 Y ui −1
P (x|c) =
pi
Nu
(2.5)
i=1
where Nu is a normalisation term. Thus, a Dirichlet distribution is a distribution over
numbers which sum to one, and a good candidate for a chromagram feature vector. This
emission model was implemented in the chord recognition task in [12], with encouraging
results.
2.5
Model Training and Datasets
As mentioned previously, graphical models such as HMMs, two-chain HMMs, and Dynamic Bayesian Networks require training in order to infer the parameters necessary
39
2. BACKGROUND
to predict the chords to an unlabelled song. Various ways of training such models are
discussed in this section, beginning with expert knowledge.
2.5.1
Expert Knowledge
In early chord recognition work, when training data was very scarce, an HMM was used
in chord recognition by the authors of [4], where initial parameter settings such as the
state transition probabilities, mean and covariance matrices were set heuristically by
hand, and then enhanced using the Expectation-Maximisation algorithm [92].
A large amount of knowledge was injected into Shenoy and Wang’s key/chord/rhythm
extraction algorithm [100]. For example, they set high weights to primary chords in
each key (tonic, dominant and subdominant), additionally specifying that if the first
three beats of a bar are a single chord, the last beat must also be this chord, and that
chords non-diatonic to the current key are not permissible. They notice that by making
a rough estimate of the chord sequence, they were able to extract the global key of a
piece (assuming no modulations) with high accuracy (28/30 song examples). Using this
key, chord estimation accuracy increased by an absolute 15.07%.
Expert tuning of key-chord dependencies was also explored in [16], following the
theory set out in Lerdahl [58]. A study of expert knowledge versus training was conducted in [87], where they compared expert setting of Gaussian emissions and transition
probabilities, and found that expert tuning with representation of harmonics performed
the best. However, they only used 110 songs in the evaluation, and it is possible that
with the additional data now available, a trained approach may be superior.
Mauch and Dixon [63] also define chord transitions by hand, in the previously mentioned work by defining an expert transition probability matrix which has a preference
for chords to remain stable.
40
2.5 Model Training and Datasets
2.5.2
Learning from Fully-labelled Datasets
An early dataset for a many-song corpus was presented in [99], containing 20 early
works by the pop group The Beatles. Within, chord labels were annotated by hand
and manually aligned to the audio, for use in a chord recognition task. This was expanded in work by Harte et al. [36], where they introduced a syntax for annotating
chords in flat text, which has since become standard practice, and also increased the
number of annotated songs by this group to 180.
A small set of 35 popular music songs was studied by Veronika Zenz and Andreas
Rauber [119], where they incorporated beat and key information into a heuristic method
for determining chord labels and boundaries. More recently, the Structural Analysis of
Large Amounts of Music Information (SALAMI) project [13, 102] announced a large
amount of partially-labelled chord sequences and structural segmentations, amongst
other meta data. A total of 869 songs appearing in the Billboard Hot 100 were annotated at the structure level in Chris Harte’s format.
We define sets above as Ground Truth datasets (collections of time-aligned chord
sequences curated by an expert in a format similar to Figure 2.1.) Given a set of such
songs, one may attempt to learn model parameters and probability distributions from
these data. For instance, one may collect chromagrams for all time instances when
a C:maj chord is played, and learn how such a chord ‘sounds’, given an appropriate
emission probability model. Similarly for hidden features, one may count transitions
between chords and learn common chord transitions (as well as typical chord durations).
This method has become extremely popular in recent years as the number of training
examples has increased (see, for example [20, 40, 117? ]).
41
2. BACKGROUND
2.5.3
Learning from Partially-labelled Datasets
In addition to our previously published work [72, 74], Macrae and Dixon have been
exploring readily-available chord labels from the internet [2, 59] for ranking, musical
education, and score following. Such annotations are noisy and potentially difficult to
use, but offer much in terms of volume of data available and are very widely used by
musicians. For example, it was found in [60] that the most popular tab websites have
over 2.5 million visitors, whilst sheet music and MIDI sites have under 500,000 and
20,000 visitors respectively.
A large number of examples of each song are available on such sites, which we refer
to as redundancies of tabs. For example, the authors of [60] found 24,746 redundancies
for songs by The Beatles, or an average of 137.5 tabs per song, whilst in [72] it was
found that there were tabs for over 75,000 unique songs. The possibility of using such
data to train a chord recognition model will be investigated in chapter 5.
2.6
Evaluation Strategies
Given the output of a chord recognition system and a known and trusted ground truth,
methods of performance evaluation are required to compare algorithms and define the
state of the art. We discuss strategies for this in the current section.
2.6.1
Relative Correct Overlap
Fujishima [33] first introduced the concept of the ‘relative correct overlap’ measure for
evaluating chord recognition performance, defined as
RCO =
| correctly identified frames |
(×100%)
| total frames |
42
(2.6)
2.6 Evaluation Strategies
When dealing with a collection of more than one song, one may either average the performances over each song, or concatenate all frames together and measure performance
on this collection (mirco vs. macro average). The former treats each song equally
independent of song length, whilst the latter gives more weight to longer songs.
Mathematically, suppose we have a ground truth and prediction for i = {1, . . . , n, . . . N }
songs, denoted by G = {G1 , . . . , Gn , . . . , GN } and P = {P 1 , . . . , P n , . . . , P N }. Suppose
also that the nth ground truth and prediction each have ni frames. Then, given a
distance d(c1 , c2 ) between two chords we may define
ARCO =
ni
N
1 X 1 X
d(Gif , Pfi )
N
ni
i=1
(2.7)
f =1
as the Average Relative Correct Overlap, and
TRCO =
N
X
!−1
ni
i=1
ni
N X
X
d(Gif , Pfi )
(2.8)
i=1 f =1
as the Total Relative Correct Overlap. The most common distance measure is to filter
all chords in the ground truth and prediction according to a pre-defined alphabet and
sample per predicted beat, and set d(Gif , Pfi ) = 1 ⇐⇒ (Gif = Pfi ).
2.6.2
Chord Detail
An issue in the task of chord recognition is the level of detail on which to model and
evaluate. Clearly, there are many permissible chords available in music, and we cannot
hope to correctly classify them all.
Considering chords which do not exceed 1 octave, there are 12 pitch classes which
may or may not be present, leaving us with 212 possible chords. Such a chord alphabet
is clearly prohibitive for modelling (owing to the computational complexity) and also
poses issues in terms of evaluation. For these reasons, researchers in the field have
43
2. BACKGROUND
reduced their reference chord annotations to a subset of workable alphabet.
In early work, Fujishima considered 27 chord types, including advanced examples
such as A:(1,3,]5,7)/G. A step forward to a more workable alphabet came in 2003,
where Sheh and Ellis [99] considered 7 chord types (maj,min,maj7,min7,dom7,aug,dim),
although other authors have explored using just the 4 main triads maj,min,aug and
dim [12, 118]. Suspended chords were identified in [63, 107], the latter study additionally containing a ‘no chord’ symbol for silence, speaking or other times when no chord
can be assigned. A large chord alphabet of 10 chord types including inversions were
recognised by Mauch [65]. However, by far the most common chord alphabet is the set
of major and minor chords in addition to a ‘no chord’ symbol, which we collectively
denote as minmaj [54, 87].
2.6.3
Cross-validation Schemes
For systems which rely on training to learn model parameters, it is worth noting that
choosing ‘fair’ splits from fully-labelled sets is non-trivial. One notable effect is that
musical content can be quite different between albums, for a given artist. This is known
as the Album Effect and is a known issue in artist identification [46, 116]. In this case
it is shown that identification of artists is more challenging when the test set consists
of songs from an album not in the training set.
For ACE, the problem is less well-studied, although, although intuitively the same
property should hold. However, informal experiments by the author revealed that
training on a fixed percentage of each album and testing on the remainder resulted in
lower test set performance. Despite this, the MIREX evaluations are conducted in this
manner, which we emulate to make results comparable.
44
2.6 Evaluation Strategies
2.6.4
The Music Information Retrieval Evaluation eXchange (MIREX)
Since 2008, Audio Chord Estimation algorithms have been compared in an annual evaluation held in conjunction with the International Society for Music Information Retrieval
conference1 . Authors submit algorithms which are tested on a (known) dataset of audio and ground truths and results compared. We present a summary of the algorithms
submitted in Tables 2.6 - 2.7.
1
http://www.music-ir.org/mirex/wiki/MIREX_HOME
45
2009
Train/Test
2008
46
Pretrained
Train/Test
Expert
Pretrained
Train/Test
Pretrained
Train/test
Pretrained
Train/test
Train/test
Pretrained
Category
Year
WEJ4
WEJ2
WEJ3
MD
OGF2
KO2
OGF1
WEJ1
RUSUSL
KO1
DE
PVM1
PVM2
CH
UMS
DE
WD2
BP
MM
RK
PP
KO
WD1
KL2
KL
KL1
ZL
Sub.
A. Weller et al.
A. Weller et al.
A. Weller et al.
M. Mauch et al.
L. Oudre et al.
M. Khadkevich & M. Omologo
L. Oudre et al.
A. Weller et al.
J.T.Reed et al.
M. Khadkevich & M. Omologo
D. Ellis
J. Pauwels et al.
J. Pauwels et al.
C. Harte
Y. Uchiyama et al.
D. Ellis
J. Weil
J. P. Bello, J. Pickens
M. Mehnert
M. Ryynnen, A. Klapuri
H.Papadopoulos, G. Peeters
M. Khadkevich, M. Omologo
J. Weil
K.Lee
K. Lee
K.Lee
X. Jhang, C. Lash
Author(s)
Chroma, SVMstruct+
Chroma, SVMstruct
Chroma, Max-γ
Bass/Treble Chroma, DBN
Chroma, Template
Chroma, HMM
Chroma, Template
Chroma, HMM
Chroma, HMM
Chroma, HMM
Chroma, HMM
Chroma, Key-HMM
Chroma, Template
Chroma + Centroid, Template
Chroma, HMM
Chroma, HMM
Tonal Centroid, HMM
Chroma, HMM
Circular Pitch Space, HMM
Bass/Treble Chroma, HMM
Chroma, HMM
Chroma, HMM
Tonal Centroid, HMM
Chroma, HMM
Approach
0.742
0.723
0.723
0.712
0.711
0.708
0.706
0.704
0.701
0.697
0.697
0.682
0.654
0.654
0.72
0.66
0.66
0.66
0.65
0.64
0.63
0.62
0.60
0.59
0.58
0.56
0.36
Unmerged
0.777
0.762
0.760
0.748
0.777
0.741
0.770
0.743
0.760
0.734
0.731
0.710
0.698
0.698
0.77
0.70
0.70
0.69
0.68
0.69
0.66
0.65
0.66
0.65
0.65
0.60
0.46
Merged
Performance
Table 2.6: MIREX Systems from 2008-2009, sorted in each year by Total Relative Correct Overlap in the merged evaluation
(confusing parallel major/minor chords not considered an error). The best-performing pretrained/expert systems are underlined,
best train/test systems are in boldface. Systems where no data is available are shown by a dash (-).
2. BACKGROUND
2011
Expert
2010
47
Expert
Expert
Train/Test
Train/Test
Pretrained
Pretrained
Train/Test
Expert
Hybrid
Train/Test
Pretrained
Train/Test
Category
Year
NMSD2
KO1
NMSD3
NM1
CB2
CB3
KO2
CB1
NMSD1
UUOS1
PVM1
RHRC1
MD1
MM1
CWB1
KO1
EW4
EW3
UUOS1
OFG1
MK1
EW1
PVM1
EW2
PP1
Sub.
Y. Ni et al.
M. Khadkevich, M. Omologo
Y. Ni et al.
Y. Ni et al.
T. Cho, J. P. Bello
T. Cho, J. P. Bello
M. Khadkevich, M. Omologo
T. Cho, J. P. Bello
Y. Ni et al.
Y. Ueda et al.
J. Pauwels et al.
T. Rocher et al.
M. Mauch and S. Dixon
M. Mauch
T. Cho et al.
M. Khadkevich, M. Omologo
D. Ellis and A. Weller
D. Ellis and A. Weller
Y. Ueda et al.
L. Oudre et al.
M. Khadkevich, M. Omologo
D. Ellis and A. Weller
J. Pauwels et al.
D. Ellis and A. Weller
H. Papadopoulos, G. Peeters
Author(s)
Memorization of Ground Truth
Chroma, HMM
Bass/Treble Chroma, DBN
Bass/Treble Chroma, DBN
Chroma, HMM
Chroma, HMM
Chroma, HMM
Chroma, HMM
Bass/Treble Chroma, DBN
Chroma, Language Model
Chroma, Key-HMM + Templates
Bass/Treble Chroma, DBN
Bass/Treble Chroma, HMM
Bass/Treble Chroma, Language Model
Chroma, SVMstruct
Chroma, SVMstruct
Chroma, Key-HMM
Chroma, Template
Chroma, HMM
Chroma, SVMstruct
Chroma, SVMstruct
Chroma, Joint downbeat/chord estimate
Approach
0.9760
0.8285
0.8277
0.8199
0.8137
0.8091
0.7977
0.7955
0.7938
0.7689
0.7396
0.7289
0.8022
0.7963
0.7937
0.7887
0.7802
0.7718
0.7688
0.7551
0.7511
0.7476
0.7366
0.7296
0.5863
TRCO
0.9736
0.8163
0.8197
0.8114
0.8000
0.7957
0.7822
0.7786
0.7829
0.7564
0.7296
0.7151
0.7945
0.7855
0.7843
0.7761
0.7691
0.7587
0.7567
0.7404
0.7363
0.7337
0.7270
0.7158
0.5729
ARCO
Performance
Table 2.7: MIREX Systems from 2010-2011, sorted in each year by Total Relative Correct Overlap. The best-performing pretrained/expert systems are underlined, best train/test systems are in boldface. For 2011, systems which obtained less than 0.35
TRCO are omitted.
2.6 Evaluation Strategies
2. BACKGROUND
MIREX 2008
Ground truth data for the first MIREX evaluation was provided by Harte [36] and
consisted of 176 songs from The Beatles’ back catalogue. Approximately 2/3 of each
of the 12 studio albums in the dataset was used for training and the remaining 1/3
for testing. Chord detail considered was either the set of major and minor chords, or a
‘merged’ set, where parallel major/minor chords in the predictions and ground truths
were considered equal (i.e. classifying a C:maj chord as C:min was not considered an
error).
Bello and Pickens achieved 0.69 overlap and 0.69 merged scores using a simple
chroma and HMM approach, with Ryyn¨onen and Klapuri achieving a similar merged
performance using a combination of bass and treble chromagrams. Interestingly, Uchiyama
et. al. obtained higher scores under the train/test scenario (0.72/0.77 for overlap/merged).
Given that the training and test data were known in this evaluation, the fact that the
train/test scores are higher suggests that the pretrained systems did not make sufficient
use of the available data in calibrating their models.
MIREX 2009
In 2009, the same evaluations were used, although the dataset increased to include 37
songs by Queen and Zweieck. 7 songs whose average performance across all algorithms
was less than 0.25 were removed, leaving a total of 210. Train/test scenarios were also
evaluated, under the same major/minor or merged chord details.
This year, the top performing algorithm in terms of both evaluations was Weller
et al.’s system, where they used chroma features and a structured output predictor
which accounted for interactions between neighbouring frames. Pretrained and expert
systems again failed to match the performances of train/test systems, although the
OGF2 submission matched WEJ4 on the merged class. The introduction of Mauch’s
Dynamic Bayesian Network (submission MD) shows the first use of a complex graphical
48
2.6 Evaluation Strategies
model for decoding, and attains the best score for a pretrained system, 0.712 overlap.
MIREX 2010
Moving to the evaluation of 2010, the evaluation database stabilised to a set of 217
tracks consisting of 179 tracks by The Beatles (‘Revolution 9’, Lennon/McCartney,
was removed as it was deemed to have no harmonic content), 20 songs by Queen and
18 by Zweieck. This dataset shall henceforth be referred to as the MIREX dataset.
Evaluation in this year was performed using major and minor triads with either the
Total Relative Correct Overlap (TRCO) or Average Relative Correct Overlap (ARCO)
summary.
This year saw the first example of a pretrained system which became the state of
the art performance - Mauch’s MD1 system performed top in terms of both TRCO and
ARCO, beating all other systems by use of an advanced Dynamic Bayesian Network
and NNLS chroma. Interestingly, some train/test systems performed close to MD1
(Cho et al, CWB1).
MIREX 2011
The issue of overfitting the MIREX dataset (given that the test set was known) was
addressed by ourselves in our NMSD2 submission in 2011, where we exploited the fact
that the ground truth of all songs is known. Given this knowledge, the optimal strategy
is to simply find a map between the audio of the signal to the ground truth dataset.
This can be obtained by, for example, audio fingerprinting [113], although we took a
simpler approach of making a simple chord estimate and choosing the ground truth
which most closely matched this estimate. We did not achieve 100% because the CDs
we used to train our model did not exactly match those used to create the ground truth.
This year, the expected trend of pretrained systems outperforming their train/test
counterparts continued, with system KO1 obtaining a cutting-edge performance of
49
2. BACKGROUND
0.8285 TRCO, compared to the train/test CB3, which reached 0.8091.
2.7
The HMM for Chord Recognition
The use of Hidden Markov Models in the task of automatic chord estimation is so
common that we dedicate the current section to a discussion of how ACE may be
modelled as an HMM decoding process. Suppose we have a collection of N songs and
have calculated a chromagram X for each of them. Let
n
X = {Xn |Xn ∈ R12×T }N
n=1
(2.9)
be the chromagram collection, with T n indicating the length of the nth song (in frames).
We will denote the collection of corresponding annotations as
n
Y = {yn |yn ∈ AT }N
n=1 ,
(2.10)
where A is a chord alphabet. HMMs can be used to formalize a probability distribution
P (y, X|Θ) jointly for the chromagram feature vectors X and the annotations y of a
song, where Θ are the parameters of this distribution.
In this model, the chords y = [y1 , . . . , yt ] are modelled as a first-order Markovian
process, meaning that future chords are independent of the past given the present
chord. Furthermore, given a chord, the 12-dimensional chromagram feature vector in
the corresponding time window is assumed to be independent of all other variables in
the model. The chords are commonly referred to as the hidden variables and the chromagram feature vectors as the observed variables, as the chords are typically unknown
and to be inferred from the given chromagram feature vectors in the chord recognition
task.
Mathematically, the Markov and conditional independence assumption allows the
50
2.7 The HMM for Chord Recognition
factorisation of the joint probability of the feature vectors and chords (X, y) of a song
as follows:
P (X, y|Θ) = Pini (y1 |Θ) · Pobs (x1 |y1 , Θ) ·
|y|
Y
Ptr (yt |yt−1 , Θ) · Pobs (xt |yt , Θ).
(2.11)
t=2
Here, Pini (y1 |Θ) is the probability that the first chord is equal to y1 (the initial distribution), Ptr (yt |yt−1 , Θ) is the probability that a chord yt−1 is followed by chord yt in
the subsequent frame (the transition probabilities), and Pobs (xt |yt , Θ) is the probability density for chromagram vector xt given that the chord of the tth frame is yt (the
emission probabilities).
It is common to assume that the HMM is stationary, which means that Ptr (yt |yt−1 , Θ)
and Pobs (xt |yt , Θ) are independent of t. Furthermore, it is common to model the emission probabilities as a 12–dimensional Gaussian distribution, meaning that the parameter set Θ of an HMM for chord recognition are commonly given by
|A|
|A|
Θ = {T, pini , {µi }i=1 , {Σi }i=1 },
(2.12)
where we have gathered the parameters into matrix form: T ∈ R|A|×|A| are the
transition probabilities, pini ∈ R|A| is the initial distribution, and µ ∈ R12×|A| , and
Σ ∈ R12×12×|A| are mean vectors and covariance matrices for a multinomial Gaussian
distribution respectively.
We now turn attention to learning the parameters of this model. In the machine
learning setting, Θ can be estimated as Θ∗ on a set of labelled training data {X, Y},
using Maximum Likelihood Estimation. Mathematically,
Θ∗ = arg max P (X, Y|Θ),
Θ
where P (X, Y|Θ) =
N
Q
(2.13)
P (Xn , yn |Θ). The maximum likelihood solutions for the param-
n=1
51
2. BACKGROUND
n
n
n
i
eter set Θ given a fully-labelled training set {Xn , Yn }N
n=1 with X = [x1 , . . . , xTn ], Y =
[y1n , . . . , yTnn ] are as follows.
The initial distribution is found by simply counting occurrences of the first chord
over the training set:
(
p∗ini
=
N
1 X N
I(y1 = Aa )
|A|
)|A|
n=1
,
(2.14)
a=1
whilst the transition probabilities are calculated by counting transitions between chords:
(
T∗ =
N
)|A|
T
n
1 XX
I(ytn = Aa
|A|
&
i
yt−1
= Ab )
n=1 t=2
.
(2.15)
a,b=1
Emission probabilities are calculated by the known maximum likelihood solutions for
the normal distribution. For the mean vectors,
|A|
µ∗ = {mean of all chromagram frames for which Y = a}a=1 .
(2.16)
whilst for the covariance matrices:
|A|
Σ∗ = {covariance of all chromagram frames for which Y = a}a=1 .
(2.17)
Finally, given the HMM with parameters Θ∗ = {p∗ini , T∗ , µ∗ , Σ∗ }, the chord recognition task can be formalized as the computation of the chord sequence y∗ that maximizes
the joint probability with the chromagram feature vectors X of the given song
y∗ = arg max P (X, y|Θ∗ )
y
(2.18)
It is well known that this task can be solved efficiently using the Viterbi algorithm [92].
We show example parameters (trained on the ground truths from the 2011 MIREX
dataset) in Figure 2.11. Inspection of these features reveals that musically meaningful
52
2.8 Conclusion
parameters can be learned from the data, without need for expert knowledge. Notice,
for example, how the initial distribution is strongly peaked to starting on ‘no chord’, as
expected (most songs begin with no chord). Furthermore, we see strong self-transitions
in line with our expectation that chords are constant over several beats. Mean vectors
bear close resemblance to the pitches present within each chord and the covariance
matrix is almost diagonal, meaning there is little covariance between notes in chords.
2.8
Conclusion
In this chapter, we have discussed the foundations and definitions of chords, both in
the settings of musical theory and signal processing. We saw that there is no welldefined notion of a musical chord, but that it is generally agreed to be a collection of
simultaneous notes or arpeggio. We also saw how chords can be used to define the key
of a piece, or vice-versa. Incorporating these two musical facets has been fruitful in the
task of automatic chord recognition.
Following this, we conducted a study of the literature concerning chord recognition from audio, concentrating on feature extraction, modelling, evaluation, and model
training/datasets. Upon investigating the annual benchmarking system MIREX, we
found that that the dominant architectures are chromagram features with HMM decoding, although more complex features and modelling strategies have also been employed. We also saw that, since the testing data are known to participants, the optimal
strategy is to overfit the test data as much as possible, meaning that these results may
be misleading as a definition of the state of the art.
53
2. BACKGROUND
Figure 2.11: HMM parameters, trained using Maximum likelihood on the MIREX
dataset. Above, left: logarithm of initial distribution p∗ini . Above, right: logarithm
transition probabilities T∗ . Below, left: mean vectors for each chord µ∗ . Below, right:
covariance matrix Σ∗ for a C:maj chord. To preserve clarity, parallel minors for each chord
and accidentals follow to the right and below.
54
3
Chromagram Extraction
This section details our feature extraction process. By far the most prevalent features
used in ACE are known as chromagrams (see chapter 2). Our features are strongly
related to these but are rooted in a sound theoretical foundation, based on the human
perception of loudness of sound.
This chapter is arranged as follows. Section 3.1 informs our approach to forming
loudness-based chromagrams. Sections 3.2 to 3.9 deal with the details of our feature
extraction process, and in section 3.10, we conduct experiments to show the predictive
power of these features using our baseline recognition method. We conclude in section
3.11.
3.1
Motivation
We seek to compute features that are useful in recognising chords, but firmly rooted
in a sound theoretical basis. The human auditory system is complex, involving the
inner, middle and outer ears, hair cells, and the brain. However, evidence exists that
shows that humans are more sensitive to changes frequency magnitude, rather than
temporal representations [24]. One way of doing this computationally is to take a
55
3. CHROMAGRAM EXTRACTION
Fourier transform of the signal, which converts an audio sound x from the time domain
to the frequency domain, the result of which is a spectrogram matrix X.
In previous studies, the salience of musical frequencies was represented by the power
spectrum of the signal, i.e., given a spectrogram X, ||Xf,t ||2 was used to represent the
power of the frequency f of the signal at time t. However, there is no theoretical basis
for using the power spectrum as opposed to the amplitude, for example, where we
would use ||Xf,t ||.
This confusion is confounded by the fact that amplitudes are not additive in the
frequency domain, meaning that for spectrograms X, Y, ||Xf,t ||+||Yf,t || =
6 ||Xf,t +Yf,t ||.
This becomes an issue when summing over frequencies representing the same pitch class
(see section 3.7). Instead of using a loosely-defined notion of energy in this sense, we
introduce the concept of loudness-based chromagrams in the following sections. The
main feature extraction processes are shown in Figure 3.1.
Preprocessing
(3.2)
HPSS
(3.3)
Tuning
(3.4)
Constant-Q
(3.5)
Normalisation
(3.9)
Beat
Identification
(3.8)
A-weighting/
Octave
Summation
(3.7)
SPL
Calculation
(3.6)
Figure 3.1: Flowchart of feature extraction processes in this chapter. We begin with
raw audio, and finish with a chromagram feature matrix. Sections of this chapter which
describe each process are shown in the corresponding boxes in this Figure.
3.1.1
The Definition of Loudness
The loudness of a tone is an extremely complex quantity that depends on frequency,
amplitude and duration of tone, medium temperature, direction, and number of receivers; and can vary from person to person [30]. Loudness is typically measured in the
56
3.1 Motivation
unit of the Sone, whilst loudness level (loudness with respect to a reference) is measured
in Phons.
In this thesis, we note that perception of loudness is not linearly proportional to
the power or amplitude spectrum: and as a result existing chromagrams typically do
not accurately reflect human perception of the audio’s spectral content. Indeed, the
empirical study in [29] showed that loudness is approximately linearly proportional to
the so-called Sound Pressure Level (SPL), proportional to log10 of the normalised power
spectrum.
A further complication is that human perception of loudness does not have a flat
spectral sensitivity, as shown in the Equal-Loudness Contours in Figure 3.2. These
Figure 3.2: Equal loudness curves. Frequency in Hz increases logarithmically across the
horizontal axis, with Sound Pressure Level (dB SPL) on the vertical axis. Each line shows
the current standards as defined in the ISO standard (226:2003 revision [39]) at various
loudness levels. Loudness levels shown are at (top to bottom) 90, 70, 50, 30, 10 Phon, with
the limit of human hearing (0 Phon) shown in blue.
57
3. CHROMAGRAM EXTRACTION
curves come from experimental scenarios where subjects were played a range of tones
and asked how loud they perceived each to be. These curves may be interpreted in
the following way: each curve represents, at a given frequency, the SPL required to
perceive loudness equal to a reference tone at 1, 000 Hz. Note that less amplification
to reach the reference is required in the frequency range 1 − 5 kHz, which supports the
fact that human hearing is most sensitive in this range.
As a solution to this variation in sensitivity, a number of weighting schemes have
been suggested as industrial standard corrections. The most common of these is Aweighting [103], which we adopt in our feature extraction process. The formulae for
calculating the weights are given in subsection 3.7.
3.2
Preprocessing Steps
Before being passed on to the feature calculation stages of our algorithm, we first
collapse all audio to 1 channel by taking the mean over all channels and downsampling
to 11, 025 samples per second using the MATLAB resample command (which utilises a
polyphase filter). This downsampling is used to reduce computation time in the feature
extraction process.
3.3
Harmonic/Percussive Source Separation
It has been suggested by previous research that separating the harmonic components
of the signal from the percussive sounds could lead to improvements in melodic extraction tasks, including chord recognition [84]. The intuition behind this concept is that
percussive sounds do not contribute to the tonal qualities of the piece, and in this sense
can be considered noise.
Under this assumption, we will employ Harmonic and Percussive Sound Separation (HPSS) to extract the harmonic content of x as xh . We follow the method from
58
3.3 Harmonic/Percussive Source Separation
[84], where it is assumed that in a spectrogram, the harmonic component will have
low temporal variation but high spectral variation, with the converse true for percussive components. Given a spectrogram W, the harmonic/percussive components
H = Ht,f , P = Pt,f are found by minimizing
J(H, P) =
1 X
(Ht,f −1 − Ht,f −1 )2 +
2
2σH
t,f
subject to: Ht,f + Pt,f
Ht,f ≥ 0,
1 X
(Pt−1,f − Pt−1,f )2
2σP2
t,f
= Wt,f ,
Pt,f ≥ 0
The optimization scheme to solve this problem can be found in [84]. The HPSS algorithm has a total of 5 parameters, which were set as suggested in [84]:
• STFT window length. Window length for computation of spectrogram - 1024
samples
• STFT hop length. Hop length for computation of spectrogram - 512 samples
• α Balance between horizontal and vertical components - 0.3
• γ Range compression parameter - 0.3
• kmax Number of iterations of the HPSS algorithm - 50
To illustrate the concept behind HPSS, we show a typical spectrogram decomposition in Figure 3.3. Notice that the the harmonic component contains a more stable
horizontal component, whilst in the percussive component, more of the vertical components remain. Audio inspection of the resulting waveforms confirmed that the HPSS
technique had in fact captured much of the harmonic component in one waveform,
whilst removing the percussion.
59
60
(b) Harmonic components of signal
(c) Percussive components of signal
Figure 3.3: Illustration of Harmonic Percussive Source Separation algorithm. Three spectra are shown. In Figure 3.3a, we show
the spectrogram of a 30 second segment of ‘Hey Jude’ (Lennon-McCartney). Figures 3.3b and 3.3c show the resulting harmonic and
percussive spectrograms after performing HPSS, respectively.
(a) Entire spectrum
3. CHROMAGRAM EXTRACTION
3.4 Tuning Considerations
After computing the spectra of the harmonic and percussive elements, we can invert
the transforms to obtain the decomposition x = xh + xp . Discarding the percussive
component of the audio, we now work solely with the harmonic component.
3.4
Tuning Considerations
Before computing our Loudness Based Chromagrams, we must consider the possibility
that the target waveform is not tuned in standard pitch. Most modern recordings are
tuned with A4 = 440 Hz under the twelve-tone equal tempered scale [14]. Deviating
from this assumption could lead to note frequencies estimated incorrectly, meaning that
the chromagram bins are incorrectly estimated which could degrade performance.
Our tuning method follows that of [26], where an initial histogram is calculated of
all frequencies found, relative to standard pitch. The “correct” tuning is then found by
taking the bin with the largest number of entries. The centre frequencies of the spectrum can then be augmented according to this information. We provide an illustrative
example of the tuning algorithm in Figure 3.4.
3.5
Constant Q Calculation
Having pre-processed our waveform, we are ready to compute a spectral representation.
The most natural choice of transform to the frequency domain may be the Fourier
transform [9]. However, this transform has a fixed window size, meaning that if too
small a window is used, some low frequencies may be missed (as they will have a period
larger than the window). Conversely, if the window size used is too large, a poor time
resolution will be obtained.
A balance between time and frequency resolution can be found by having frequencydependent window sizes, a concept that can be implemented via a Constant-Q spectrum. The Q here relates to the ratio of successive window sizes, as explained in the
61
3. CHROMAGRAM EXTRACTION
6000
Note Frequency
5000
4000
3000
2000
1000
0
−50
−40
−30 −20 −10
0
10
20
30
Estimated Tuning discrepancy (cents)
40
50
Figure 3.4: Illustration of our tuning method, taken from [26]. This histogram shows
the tuning discrepancies found over the song “Hey Jude” (Lennon/McCartney), which
are binned into 5 cent bins. The estimated tuning is then found by choosing the most
populated tuning.
following.
Let F be the set of frequencies on the equal-tempered scale (possibly tuned to a
particular song, see subsection 3.4) over a given range. Then a typical chromagram
extraction approach first computes the energy (or amplitude) X ∈ R|F |×n for all frequencies f ∈ F at all time frame indices t ∈ {1, . . . , n}. Then Xf,t reflects the salience
at frequency f and frame t. Mathematically,
Xf,t
Lf −1
−j2πQm
1 X
=
x
.wm,f .e Lf
Lf
hdst − 2 e+m
Lf
(3.1)
m=0
is a constant Q transform [10], and wm,f is a hamming window, used to smooth the effects at the boundaries of the windows (note the dependency of w on f ). The frequency
62
3.6 Sound Pressure Level Calculation
dependent bandwidth Lf is defined as Lf = Q sr
f , where Q represents the constant resolution factor, and sr is the sampling rate of xh . d·e represents the ceiling function,
and j is the imaginary unit.
We note here that we do not use a “hop length” of the windows used in our constantQ spectrum. Instead, we centre the windows on every sample from the signal. In
addition to this, we found that by choosing larger windows than are specified by the
constant-Q ratios, performance increased. This was realised by multiplying all window
lengths by a constant factor to pick up more energy, which we call a “Power factor”,
optimised on the full beat-synchronised loudness-based chromagram. Note that this is
equivalent to using a larger value of Q and then decimating in frequency. We found
that a power factor of 5 worked well for treble frequencies, whilst 3 was slightly better
for bass frequencies, although results were not particularly sensitive to this parameter.
3.6
Sound Pressure Level Calculation
This section deals with our novel loudness calculation for chromagram feature extraction. As described in subsection 3.1.1, the key concept is to transform the spectrum in
such a way that it more closely relates to the human auditory perception of the loudness of the frequency powers. This is achieved by first computing the sound pressure
level of the spectrum, and then correcting for the fact that the powers of low and high
frequencies require higher sound pressure levels for the same perceived loudness as do
mid-frequencies [29].
Given the constant-Q spectrogram representation X, we compute the Sound Pressure Level (SPL) representation by taking the logarithm of the energy spectrum. A
reference pressure level pref is needed, but as we shall see in subsection 3.9, specifying
a specific value is in fact not required and so in practice can be set to 1. We compute
63
3. CHROMAGRAM EXTRACTION
the loudness of the spectrum therefore via:
SP Lf,t = 10 log10
kXf,t k2
kpref k2
, f ∈ F, t = 1, . . . , n,
(3.2)
where pref indicates a reference pressure level. A small constant may be added to
kXf,t k2 to avoid numerical problems in this calculation, although we did not experience
this issue in any of our data.
3.7
A-Weighting & Octave Summation
To compensate for the varying loudness sensitivity across the frequency range, we use
A-weighting [103] to transform the SPL matrix into a representation of the perceived
loudness of each of the frequencies:
Lf,t = SP Lf,t + A(f ), f ∈ F, t = 1, . . . , n,
(3.3)
where the A-weighting functions are as quoted from [103]:
RA (f ) =
(f 2 +20.62 )·
√
122002 ·f 4
(f 2 +107.72 )(f 2 +737.92 )·(f 2 +122002 )
,
(3.4)
A(f ) = 2.0 + 20 log10 (RA (f )).
We are left with a sound pressure level matrix that relates to the human perception
of the loudness of frequency powers in a musical piece. Taking advantage of octave
equivalence, we now sum over frequencies which belong to the same pitch class. It is
known that loudnesses are additive if they are not close in frequency [97]. This allows
us to sum up loudness of sounds in the same pitch class, yielding an octave-summed
loudness matrix, LO :
LO
p,t =
X
δ(M (f ) + 1, p)Lf,t ,
f ∈F
64
p = 1, . . . , 12,
t = 1, . . . , n.
(3.5)
3.8 Beat Identification
Here δ denotes an indicator function and
M (f ) =
!
f
12 log2
+ 0.5 + 69 mod 12.
fA
(3.6)
Exploiting the fact the chords rarely change between beats [35], we next beat synchronise our chromagram features.
3.8
Beat Identification
We use an existing technique to estimate beats in the audio [26], and therefore extract
a vector of estimated beat times b = (b1 , b2 , . . . , bT −1 ). To this we add artificial beats
at time 0, and the end of the song, and take the median chromagram vector between
subsequent beats to beat-synchronise our chromagrams. This yields an octave-summed,
beat synchronised feature composed of T frames:
O
LOB
f,t = {median of L between beats bt−1 and bt } for f = 1, . . . 12,
3.9
t = 2, . . . , T.
Normalisation Scheme
Finally, to account for the fact that overall sound level should be irrelevant in estimating
harmonic content, our loudness-based chromagram C ∈ R12×T is obtained by rangenormalizing LOB :
LOB
LOB
p,t − min
p0 ,t
0
Cp,t =
p
LOB
max
p0 ,t
p0
− min
LOB
p0 ,t
0
, ∀p, t.
(3.7)
p
Note that this normalization is invariant with respect to the reference level, and a
specific pref is therefore not required and can be set to 1 in practice. Note also that
the A-weighting scheme used is a non-linear addition, such that its effect is not lost in
the normalisation.
65
3. CHROMAGRAM EXTRACTION
3.10
Evaluation
In this section, we will evaluate our chromagram feature extraction process. We begin
by explaining how we obtained ground truth labels to match our features. Subsequently,
we comprehensively investigate all aspects of our chromagram feature vectors.
Ground Truth Extraction
Given a chromagram feature vector X = [x1 , . . . , xT ] for a song, we must decide what
the ground truth label for each frame is. This is easily obtained by sampling the ground
truth chord annotations (when available) according to the beat times extracted from
the procedure noted in subsection 3.8.
When a chromagram frame falls entirely within one chord label, we assign this chord
to the frame. When the chromagram frame overlaps two or more chords, we take the
label to be the chord that occupies the majority of time within this window. This
process is shown in Figure 3.5.
C:maj
F:maj
G:maj
A:min
Ground truth
C:maj
Beat locations
C:maj
C:maj
C:maj
F:maj
G:maj
G:maj
A:min
C:maj
Beat - Synchronised
Ground truth
Figure 3.5: Ground Truth extraction process. Given a ground truth annotation (top)
and set of beat locations (middle), we obtain the most prevalent chord label between each
beat to obtain beat-synchronous annotations.
Chords are then mapped to a smaller chord alphabet such as those listed in subsection 2.6.2. Chris Harte’s toolbox [36] was extremely useful in realising this.
66
3.10 Evaluation
Evaluation
To evaluate the effectiveness of our chromagram representation, we collected audio
and ground truth annotations for the MIREX dataset (179 songs by The Beatles1 , 19
by Queen, 18 Zweieck). Wishing to see the effect that each stage of processing had
on recognition accuracy, we incrementally increased the number of signal processing
techniques. We refer to the loudness-based chromagram described in Sections 3.2 to
3.9 as the Loudness Based Chromagram, or LBC. In summary the features used were:
• Constant-Q - a basic constant-Q transform of the signal, taken over frequencies
A1 (55 Hz) to G]6 (∼ 1661 Hz)
• Constant-Q + HPSS - as above, but computed on the harmonic component of
the audio, calculated using the Harmonic Percussive Sound Separation detailed
in subsection 3.3.
• Constant-Q + HPSS + Tuning - as above, with frequency bins tuned to the
nearest semitone by the algorithm in subsection 3.4.
• no A-weighting - as above, with the loudness of the spectrum calculated as the
log10 of the spectrum (without A-weighting).
• LBC - as above, with the loudnesses weighted according to human loudness sensitivity.
• Beat-synchronised LBC - as above, where the median loudnesses across each pitch
are taken between beats identified by the algorithm described in 3.8.
All feature vectors were range-normalised after computation. We show the chromagrams for a particular song for visual comparison in Figure 3.6. Performance in this song
increased from 37.37% to 84.02% by use of HPSS, tuning, loudness and A-weighting
(the ground truth chord label for the entirety of this section is A:maj).
1
“Revolution 9” (Lennon/McCartney) was removed as it was deemed to have no harmonic content
67
3. CHROMAGRAM EXTRACTION
In the first subplot we see that by working with the harmonic component of the
audio, we are able to pick up the C] note in the first beat, and lose some of the noise
in pitch classes A to B. Moving on, we see that the energy from the dominant pitch
classes (A and E) are incorrectly mapped to the neighbouring pitch classes, which is
corrected by tuning (estimated tuning for this song was -40 cents). Calculating the
loudness of this chromagram enhances the loudness of the pitches A and E, which is
further enhanced by A-weighting. Finally, beat-synchronisation means that each frame
now corresponds to a musically meaningful time scale. Ground truths were sampled
according to each feature set and reduced to major and minor chords only, with an
additional “no chord” symbol.
An HMM as per Section 2.7 was used to identify chords in this experiment, trained
and tested on the MIREX dataset. Chord similarity per song was simply measured by
number of correctly identified frames divided by total number of frames and we used
either ARCO or TRCO (see subsection 2.6.1) as the overall evaluation scheme. Overall
performances are shown in Table 3.1. We also conducted the Wilcoxon rank sum test
to test the significance of improvements seen.
Table 3.1: Performance tests for different chromagram feature vectors, evaluated using
Average Relative Correct Overlap (ARCO) and Total Relative Correct Overlap (TRCO).
p−values for the Wilcoxon rank sum test on successive features are also shown.
Performance (%)
Chromagram Type
Constant-Q
Constant-Q with HPSS
Constant-Q with HPSS and Tuning
LBC (no A-weighting)
LBC
Beat-synchronised LBC
ARCO
TRCO
59.40
58.27
61.55
79.92
80.19
80.97
59.08
57.95
61.17
80.02
80.27
80.91
Significance
0.40
0.01
2.95e − 43
0.78
0.34
Investigating the performances in Table 3.1, we see large improvements when using
advanced signal processing techniques, from 59.08% to 80.91% Total Relative Correct
68
3.10 Evaluation
Overlap. Investigating each component separately, we see that Harmonic Percussive
Sound Separation decreases the performance slightly over the full waveform. This
decrease is small in magnitude and can be explained by the suboptimal selection of
the power factor in the chromagram extraction1 . Tuning of musical frequencies shows
an improvement of about 3% over untuned frequency bins, confirming that the tuning
method we used correctly identifies and adjusts songs that are not tuned to standard
pitch.
By far the largest improvement can be seen by taking the log of the spectrum
(LBC, row 4), with a very slight improvement upon adding A-weighting. Although this
increase is not significant, we include it in the feature extraction to ensure the loudness
we calculate models the human perception of loudness. Finally, beat-synchronising
both features and annotations offers an improvement of slightly less than 1% absolute
improvement, and has the additional benefit of ensuring that chord changes occur
on (predicted) beats. Investigating the significance of our findings, we see that the
introduction of tuning and loudness calculation offer significant improvements at the
5% level (p < 0.05).
The results presented here are comparable to the pretrained or expert systems in
MIREX evaluations in section 2.6.4. A thorough investigation of train/test scenarios
is required to test if our model is comparable to train/test algorithms, although this is
postponed until future chapters.
1
Recall that this parameter was optimised on the fully beat-synchronised chromagram, A fixed
power factor of 5 was used throughout these experiments, which was found to perform optimally
over these experimental conditions. Although applying HPSS to the spectrogram degraded performance slightly, the change is small in magnitude (around 1-1.5% absolute) and is consistent with the
perceptually-motived model of harmony presented within this thesis, and is therefore included in all
future experiments
69
3. CHROMAGRAM EXTRACTION
3.11
Conclusions
In this chapter, we introduced our motivation for calculating loudness based chromagrams for the task of audio chord estimation. We saw how the notion of perception
of loudness was difficult to define, although under some relaxed assumptions we can
model it closely. One of the key findings of these studies was that the human auditory
response to the loudness of pitches was non-linear with respect to frequency. With
these studies in mind, we computed loudness based chromagrams that are rigorously
defined and follow the industrial standard of A-weighting of frequencies.
These techniques were enhanced by injecting some musical knowledge into the feature extraction. For example, we tuned the frequencies to correspond to the musical
scale, removed the percussive element of the audio, and beat-synchronised our features.
Experimentally, we saw that by introducing these techniques we achieve a performance
of 80.97% TRCO on a set of 217 songs.
70
3.11 Conclusions
Figure 3.6: Chromagram representations for the first 12 seconds of ‘Ticket to Ride’.
71
3. CHROMAGRAM EXTRACTION
72
4
Dynamic Bayesian Network
In this chapter, we describe our model for the recognition of chords, keys and bass
notes from audio. Having described our feature extraction process in chapter 3, we
must decide on how to assign a chord, key and bass label to each frame.
Motivated by previous work in Dynamic Bayesian Networks (DBNs, [65? ]), our
approach to the automatic recognition of chords from audio will involve the construction
of a graphical model with hidden nodes representing the musical features we wish to
discover, and observed nodes representing the audio signal.
As shown in subsection 2.4.4, DBNs have been shown to be successful in reconstructing chord sequences from audio when trained using expert knowledge [62]. However,
it is possible that these models overfit the available data by hand-tuning of parameters. We will counter this by employing machine learning techniques to infer parameter
settings from fully-labelled data, and testing our results using cross-validation.
The remainder of this chapter is arranged as follows: section 4.1 outlines the mathematical framework for our model. In section 4.2, we build up the DBN, beginning with
a simple HMM and adding nodes, incrementally increasing the model complexity. All
of this work will be based on the minmaj alphabet of 12 major chords, 12 minor chords
and a “No Chord” symbol; and we also discuss issues of computational complexity in
73
4. DYNAMIC BAYESIAN NETWORK
this section. Moving on to section 4.3, we extend the evaluation to more complex chord
alphabets and evaluation techniques. We conclude this chapter in section 4.4.
4.1
Mathematical Framework
We will present the mathematical framework of our proposed model here, before evaluating in the following sections. To test the effectiveness of each element, we will
systematically test simplified versions of the model with hidden and/or observed links
removed (realised by setting the relevant probabilities as zero). Our DBN, which we
call the Harmony Progression Analyser (HPA, [81]), is shown in Figure 4.1.
k1
k2
kT-1
kT
c1
c2
cT-1
cT
b1
b2
bT-1
bT
Xb1
Xc1
Xb2
XbT-1
Xc2
XcT-1
XbT
XcT
Figure 4.1: Model hierarchy for the Harmony Progression Analyser (HPA). Hidden nodes
(cicles) refer to chord (ci ), key (ki ) and bass note sequences (bi ). Chords and bass notes
emit treble (Xit ) and bass (Cib ) chromagrams, respectively.
4.1.1
Mathematical Formulation
As with the baseline Hidden Markov Model described in the chapter 2 , we assume
the chords for a song are a first-order Markovian process, but now apply the same
assumption to the bassline and key sequences. We further assume that the chords emit
74
4.1 Mathematical Framework
a treble chromagram, whilst the bass notes emit a bass chromagram. This is shown
by the fact that HPA’s adopted topology consists of three hidden and two observed
variables. The hidden variables correspond to the key K, the chord label C and the
bass B annotations.
Under this representation, a chord is decomposed into two aspects: chord label and
bass note. Taking the chord G:maj/b7 as an example, the chord state is c = G:maj and
the bass state is b = F. Accordingly, we compute two chromagrams for two frequency
ranges: the treble chromagram Xc , which is emitted by the chord sequence c and the
bass chromagram Xb , which is emitted by the bass sequence b. The reason of applying
this decomposition is that different chords can share the same bass note, resulting in
similar chroma features in the low frequency domain. We hope that by using separated
variables we can increase variation between chord states, so as to better recognise in
particular complex chords. Note that this definition of bass note is non-standard: we
are not referring to the note which the bass instrument (i.e. bass guitar, left hand
piano) is playing, but instead the pitch class of the current chord which has lowest
pitch in the chord.
HPA has a similar structure to the chord estimation model defined by Mauch [62].
Note however the lack of metric position (we are aware of no data to train this node),
and that that the conditional probabilities in the model are different. HPA has, for
example, no link from chord t − 1 to bass t, but instead has a link from bass pitch class
t − 1 to bass pitch class t.
Under this framework, the set Θ of HPA has the following parameters:
Θ =
pi (k1 ), pi (c1 ), pi (b1 ), ptr (kt |kt−1 ), ptr (ct |ct−1 , kt ), ptr (bt |ct ), ptr (bt |bt−1 ),
pe (Xct |ct ), pe (Xbt |bt ) ,
(4.1)
where pi , ptr and pe denote the initial, transition and emission probabilities, respec-
75
4. DYNAMIC BAYESIAN NETWORK
¯ c, X
¯ b } and the
tively. The joint probability of the chromagram feature vectors {X
corresponding annotation sequences {k, c, b} of a song is then given by the formula1
c
b
P (X , X , k, c, b|Θ) = pi (k1 )pi (c1 )pi (b1 )
T
Y
pt (kt |kt−1 )ptr (ct |ct−1 , kt )ptr (bt |ct ) ×
t=2
pt (bt |bt−1 )
T
Y
pe (Xct |ct )pe (Xbt |bt ).
(4.2)
t=1
4.1.2
Training the Model
For estimating the parameters in Equation 4.1, we use Maximum Likelihood Estimation, analogous to the HMM setting in section 2.7. Bass notes were extracted directly
from the chord labels, whilst for keys we used the corresponding key set from the
MIREX dataset2 (although this data is not available to participants of the MIREX
evaluations).
The amount of key data in these files is sparse when compared to chords. Considering only major and minor keys3 as well as a ‘No Key’ symbol, we discovered that almost
all keys appeared at least once (22/25 keys, 88%), although most key transitions were
not seen. Of the 252 = 625 possible key transitions we saw just 130, severely limiting
the amount of data we have for key transitions. To counteract this, following Ellis et.
al [26] in all models involving key information we first transposed each frame to an
arbitrary “home key” (we chose C:maj and A:min) and then learnt parameters in these
two canonical major/minor keys. Model parameters were then transposed 12 times,
leaving us with approximately 12 times as much training data for the hidden chain.
Key to chord transitions were also learnt in this way.
Bass note transitions and initial distribution were learnt using the same maximum
1
Note that we use the approximation ptr (bt |bt−1 , ct ) ∼ ptr (bt |ct )ptr (bt |bt−1 ), which from a purely
probabilistic perspective is not correct. However, this simplification reduces computational and statistical cost and results in better performance in practice.
2
publicly available at http://www.isophonics.net/
3
Modal keys such as “Within You Without You”, (Harrison, in a C] modal key) were assigned to
a related major or minor key to our best judgement
76
4.1 Mathematical Framework
likelihood estimation as described in chapter 2. Similarly, bass note emissions were
assumed to come from a 12–dimensional Gaussian distribution, which was learned from
chromagram/bass note pairs using maximum likelihood estimation.
4.1.3
Complexity Considerations
Given the large number of nodes in our graphical model, we must consider the computational practicalities of decoding the optimal chord, key and bass sequences from the
model. Given a chord, key and bass alphabet size of sizes |Ac |, |Ak |, |Ac |, respectively,
the time complexity for Viterbi decoding a song with T frames is O(|Ac |2 |Ak |2 |Ac |2 |T |),
which easily becomes prohibitive as the alphabets become of reasonable size. To counteract this, we employ a number of search space reduction techniques, detailed below.
Chord Alphabet Constraint
It is unlikely that any one song will use all the chords available in the alphabet in a
song. Therefore, we can reduce the number of chord nodes to search if a chord alphabet
is known, before decoding. To achieve this, we ran a simple HMM with max-gamma
decoder [92] over the observation probability matrix for a song (using the full frequency
0
range), and obtained such an alphabet, Ac . Using this, we are able to set the transition
probabilities for all chords not in this set to be zero, thus drastically reducing our search
space:
0
p (ct |ct−1 , k) =



p(ct |ct−1 , k)
if ct , ct−1 ∈ Ac


0
otherwise
0
(4.3)
Key Transition Constraint
Musical theory tells us that not all key transitions are equally likely, and that if a key
modulates it will most likely to be a related key [51]. Thus, we propose to rule out key
changes that are rarely seen in the training phase of our algorithm, a process known
77
4. DYNAMIC BAYESIAN NETWORK
as threshold pruning in dynamic programming [8]. Thus, we may devise new transition
probabilities as:
¯ =
p0 (k|k)



¯
p(k|k)
¯ >γ
if |kt = k, kt−1 = k|


0
otherwise
(4.4)
where γ ∈ {Z+ ∪ 0} is a threshold parameter that must be specified in advance.
Chord to Bass Constraint
Similarly, we expect that a given chord will be unlikely to emit all possible bass notes.
We may therefore apply another threshold τ to constrain the number of emissions we
consider here. Thus we may set:
p0 (b|c) =



p(b|c)
if |ct = c, bt = b| > τ.


0
otherwise
(4.5)
In our previous work [81], we discovered that by setting γ = 10, τ = 3 we obtain an
approximate 10-fold reduction in decoding time, whilst losing just 0.1% in performance.
We will therefore employ these parameters throughout the remainder of this thesis.
¯ and p0 (b|c) were subsequently normalised to sum to 1 to ensure
p0 (ct |ct−1 , k), p0 (k|k)
they met the probability criterion.
4.2
Evaluation
This section deals with our experimental validation of our model. We will begin with
a baseline HMM approach to chord recognition, which can be realised as using HPA
with all key and bass nodes disabled. To ensure that all frequencies were covered, we
ran this model using a chromagram that covered the entire frequency range (A1-G]6).
Next, we studied the effectiveness of a Key-HMM, which had additional nodes
78
4.2 Evaluation
for key to chord transitions and key self-transitions. Penultimately, we allowed the
model to detect bass notes, and split the chromagram into a bass (A1-G]3) and treble
(A4-G]6) range, before investigating the full HPA architecture. Note that the bass and
treble chromagrams are split arbitrarily into two three octave representations. Different
bass/treble definitions may lead to improved performance but are not considered in this
thesis.
4.2.1
Experimental Setup
We will first investigate the effectiveness of a simple HMM on the MIREX dataset under
a train/test scenario. Under this setting, each fully-labelled training song is designated
to be either a training song on which to learn parameters, or a test song for evaluation.
To achieve balanced splits, we took approximately 1/3 of each album into the test
set, with the remainder as training, and performed 3-fold cross-validation, ensuring that
our results were comparable to the MIREX evaluations. This procedure was repeated
100 times, and performance was measured on the frame level using either TRCO or
ARCO as the average over the three folds. As previously mentioned, to investigate the
effect that various hidden and observed nodes had on performance, we disabled several
of the nodes, beginning at first with a simple HMM as per chapter 3. In summary, the
4 architectures investigated are:
• HMM. A Hidden Markov Model with hidden nodes representing chords and an
emission chromagram ranging from A1 to G]6.
• Key-HMM. As above, with an additional hidden key chain and key to chord links.
• Key-Bass-HMM. As above, with distinct chroma for the bass (A1-G]3) and treble
(A4-G]6) frequencies, and an accompanying chord to bass node.
• HPA. Full Harmony Progression Analyser, i.e. the above with additional bass-tobass links.
79
4. DYNAMIC BAYESIAN NETWORK
We begin by discussing the chord accuracies of the above models.
4.2.2
Chord Accuracies
Chord accuracies for each model are shown in Table 4.1. As can be seen directly from
Table 4.1: Chord recognition performances using various crippled versions of HPA. Performance is measured using Total Relative Correct Overlap (TRCO) or Average Relative
Correct Overlap (ARCO), and averaged over 100 repetitions of a 3-fold cross-validation
experiment. Variances across these repetitions are shown after each result, and the best
results are shown in bold.
TRCO (%)
HMM
Key-HMM
Key-Bass HMM
HPA
ARCO (%)
Train
Test
Train
Test
81.25 ± 0.28
79.10 ± 0.28
82.34 ± 0.26
83.52 ± 0.28
78.40 ± 0.64
80.43 ± 0.56
80.26 ± 0.58
81.56 ± 0.58
81.22 ± 0.32
79.26 ± 0.30
82.60 ± 0.27
83.64 ± 0.30
78.93 ± 0.66
80.67 ± 0.60
81.03 ± 0.59
82.22 ± 0.63
Table 4.1, HPA attains the best performance under both evaluation schemes in both
training and testing phases. In general, we expect the training performance of the
model to increase as the complexity of the model increases down the rows, although
the HMM appears to buck this trend, offering superior performance to the Key-HMM
(rows 1 and 2). However, this pattern is not repeated in the test scenario, suggesting
that the HMM is overfitting the training data in these instances.
The fact that performance increases as the model grows in intricacy demonstrates
the power of the model, and also confirms that we have enough data to train it efficiently.
This result is encouraging, as it shows that it is possible to learn chord models from
fully-labelled data, and also gives us hope that we might build a flexible model capable
of performing chord estimation different artists and genres. The generalisation potential
of HPA will be investigated in chapter 5.
80
4.2 Evaluation
Statistical Significance
We now turn our attention to the significance of our findings. Over a given number
of cross-validations (in our case, 100), we wish to see if the improvements we have
found are genuine enhancements or could be due to random fluctuations in the data.
Upon inspecting the results in Table 4.1, performances were normally distributed across
repetitions of the 3-fold cross-validations.
Therefore, 1-sided, paired t-tests were conducted to assess if each stage of the algorithm was improving on the previous one. With the sole exception of HMM vs.
Key-HMM in training, all models exhibited statistically significant improvements, as
evidenced by p-values of less than 10−25 in both train and test experiments.
4.2.3
Key Accuracies
Each experimental setup except the HMM also outputs a predicted key sequence for
the song. We measured key accuracy in a frame-wise manner, but noticed that the
percentage of frames where the key was correctly identified was strongly non-Gaussian,
as we were generally either predicting the correct key for all frames or the incorrect
key. Providing a mean of such a result is misleading, so we chose instead to provide
the histograms which show the average performance over the 100 repetitions of 3-fold
cross-validation, shown in Figure 4.2.
The performance here is not as high as we may expect, given the accuracy attained
on chord estimation. Reasons for this may include that the key nodes (see Figure
4.1) have no input from other nodes and that evaluation is measured inappropriately
as correct or incorrect, whereas a more flexible metric allowing for related keys to be
considered may be more appropriate. Investigating these scenarios is part of our future
work.
81
4. DYNAMIC BAYESIAN NETWORK
80
60
40
20
0
0
20
40 60 80
Performance
100
100
Average Frequency
100
Average Frequency
Average Frequency
100
80
60
40
20
0
(a) Key-HMM
0
20
40 60 80
Performance
100
80
60
40
20
0
0
(b) Key-Bass-HMM
20
40 60 80
Performance
(c) HPA
Figure 4.2: Histograms of key accuracies of the Key-HMM (4.2a),Key-Bass-HMM (4.2b)
and HPA (4.2c) models. Accuracies shown are the averages over 100 repetitions of 3-fold
cross-validation.
4.2.4
Bass Accuracies
For each experiment which had a bass note node, we also computed bass note accuracies.
These are shown for the final two models in Table 4.2.
Table 4.2: Bass note recognition performances in models that recognise bass notes. Performance is measured either using Total Relative Correct Overlap (TRCO) or Average
Relative Correct Overlap (ARCO), and is averaged over 100 repetitions of a 3–fold cross–
validation experiment. Variances across these repetitions are shown after each result, and
best results in each column are in bold.
TRCO
Key–Bass–HMM
HPA
ARCO
Train
Test
Train
Test
82.34 ± 0.26
86.08 ± 0.26
80.27 ± 0.58
85.71 ± 0.57
82.61 ± 0.27
85.96 ± 0.29
81.03 ± 0.59
85.73 ± 0.63
It is clear that HPA’s bass accuracy is superior to that of a Key–Bass–HMM, shown
by an increase of around five percentage points when bass–to–bass transitions are added
to the model. The recognition rate is also high in general, peaking at 85.73% ARCO
in a test setting. This suggests that recognising bass notes is easier than recognising
chords themselves, which is as expected since the class size (13) is much smaller than
82
100
4.3 Complex Chords and Evaluation Strategies
in the chord recognition case (25). Paired t–tests were conducted as per subsection
4.2.2 to compare the Key–Bass HMM and HPA, and we observed p–values of less than
10−100 in all cases.
What remains to be seen is how bass note recognition affects chord inversion accuracy, although this has been noted by previous authors [65]. We will investigate this
hypothesis in HPA’s context in the following section.
4.3
Complex Chords and Evaluation Strategies
4.3.1
Increasing the chord alphabet
So far, all of our experiments have been conducted on an alphabet of major and minor
chords only. However, as mentioned in chapter 2, there are many other chord types
available to us. We therefore defined 4 sets of chord alphabets for advanced testing,
which are listed in Table 4.3.
Table 4.3: Chord alphabets used for evaluation purposes. Abbreviations: MM = Matthias
Mauch, maj = major, min = minor, N = no chord, aug = augmented, dim = diminished,
sus2 = suspended 2nd, sus4 = suspended 4th, maj6 = major 6th, maj7 = major 7th, 7 =
(dominant 7), min7 = minor 7th, minmaj7 = minor, major 7th, hdim7 = half-diminished
7 (diminished triad, minor 7th).
Alphabet A
|A|
Chord classes
Minmaj
Triads
MM
Quads
25
73
97
133
maj,min,N
maj,min,aug,dim,sus2,sus4,N
maj,min,aug,dim,maj6,maj7,7,min7,X,N
maj,min,aug,dim,sus2,sus4,maj7,min7,7,minmaj7,hdim7,N
Briefly, Triads is a set of major and minor thirds with optional diminished/perfect/augmented
fifths, as well as two “suspended” chords (sus2 = (1,2,5), sus4 = (1,4,5)). MM is
an adaptation of Matthias Mauch’s alphabet of 121 chords [62], although we do not
consider chord inversions such as maj/3, as we consider this to be an issue of evaluation.
Chords labelled as X are not easily mapped to one of the classes listed in [62], and are
83
4. DYNAMIC BAYESIAN NETWORK
always considered incorrect (examples include A:(1) and A:6). Quads is an extension
of Triads, with some common 4-note 7th chords.
We did not attempt to recognise any chords containing intervals above the octave,
since in a chromagram representation we can not distinguish between, for example,
C:add9 and Csus2. Also note that we do not consider inversions of chords such as
C:maj/3 to be unique chord types, although we will consider these chords in evaluation
(see 4.3.2). Reading the ground truth chord annotations and simplifying into one of
the alphabets in Table 4.3 was done via a simple hand-made map.
Larger chord alphabets such as MM pose an interesting question for evaluation. For
example, how should we score a frame whose true label is A:min7 but which we label as
C:maj6? Both chords share the same pitch classes (A,C,E,G) but have different musical
functions. For this reason, we now turn our attention to evaluation schemes.
4.3.2
Evaluation Schemes
When dealing with major and minor chords, it is straightforward to identify when a
mistake has been made. However, for complex chords the question is more open to
interpretation. How should we judge C:maj9/3 against C:maj7/5, for example? The
two chords share the same base triad and 7th, but the exact pitch classes differ slightly,
as well as the order in which they appear in the chord.
We describe here three different similarity functions for evaluating chord recognition
accuracy that, given a predicted and ground truth chord frame, will output a score
between these two chords (1 or 0). We begin with chord precision, which measures 1
only if the ground truth and predicted chord are identical (at the specified alphabet).
Next, Note Precision scores 1 if the pitch classes in the two chords are the same and
0 otherwise. Throughout this thesis, when we evaluate an HMM, we will assume root
position in all of our predictions (the HMM as defined cannot detect bass notes owing
to the lack of a bass node), meaning that this HMM can never label a frame whose
84
4.3 Complex Chords and Evaluation Strategies
ground truth chord is not in root position (C:maj/3, for example) correctly. Finally,
we investigate using the MIREX-style system, which scores 1 if the root and third are
equal in predicted and true chord labels (meaning that C:maj and C:maj7 are considered
equal in this evaluation), which we denote by MIREX.
4.3.3
Experiments
The results of using an HMM and HPA under various evaluation schemes are shown
in Table 4.4. In keeping with the MIREX tradition, we also increased the sample rate
of ground truth and predictions to 1,000 Hz in the following evaluations to reduce the
potential effect of the beat tracking algorithm on performance. We used the TRCO
overall evaluation over the 100 3-fold cross-validations, and also show comparative plots
of an HMM vs HPA in Figure 4.3.
85
Chord Precision (%)
86
55
60
65
70
75
80
82.56 ± 0.27
81.65 ± 0.33
74.31 ± 0.43
74.29 ± 0.48
79.41 ± 0.30
78.34 ± 0.37
71.77 ± 0.43
71.75 ± 0.48
Quads
55
60
65
70
75
80
85
82.56 ± 0.27
82.01 ± 0.31
81.87 ± 0.32
81.86 ± 0.34
80.36 ± 0.27
79.58 ± 0.32
79.23 ± 0.34
78.37 ± 0.31
MIREX
Minmaj
Triads
MIREX
Quads
HMM
HPA
80.66 ± 0.57
80.22 ± 0.59
79.89 ± 0.60
79.86 ± 0.66
77.58 ± 0.63
76.62 ± 0.60
75.41 ± 0.67
74.17 ± 0.69
MM
80.66 ± 0.57
78.85 ± 0.61
66.53 ± 0.71
66.50 ± 0.78
77.58 ± 0.63
74.09 ± 0.65
61.58 ± 0.95
60.51 ± 0.83
Note P.
(b) Test Note Precision
77.61 ± 0.66
75.85 ± 0.71
64.31 ± 0.73
64.28 ± 0.79
74.08 ± 0.70
70.70 ± 0.69
58.36 ± 0.96
57.76 ± 0.84
Chord P.
Test Performance (%)
Figure 4.3: Testing Chord Precision and Note Precision from Table 4.4 for visual comparison.
(a) Test Chord Precision
MM
80.36 ± 0.27
77.94 ± 0.56
71.35 ± 0.62
68.97 ± 0.48
76.44 ± 0.31
73.82 ± 0.58
66.55 ± 0.66
65.55 ± 0.47
HMM
HPA
Note P.
Chord P.
Triads
Minmaj
Triads
MM
Quads
HPA
Minmaj
Minmaj
Triads
MM
Quads
HMM
85
A
Note Precision (%)
Model
Training Performance (%)
Table 4.4: HMM and HPA models under various evaluation schemes evaluated at 1, 000 Hz under TRCO.
4. DYNAMIC BAYESIAN NETWORK
4.3 Complex Chords and Evaluation Strategies
The first observation we can make from Table 4.4 is that HPA outperforms an HMM
in all cases, with non-overlapping error bars of 1 standard deviation. This confirms
HPA’s superiority under all evaluation schemes and chord alphabets. Secondly, we
notice that performance of all types decreases as the chord alphabet increases in size
from minmaj (25 classes) to Quads (133 classes), as expected. Performance drops most
sharply when moving from Triads to MM, possibly owing to the inclusion of 7th chords
and their potential confusion with their constituent triads.
Comparing the different evaluation schemes, we see that Chord Precision is always
lower than Note Precision (as expected), and that the gap between an HMM and HPA
increases as the chord alphabet increases (3.52%−6.52% Chord Precision, 3.08%−5.99%
Note Precision), and is also largest for the Chord Precision metric, confirming that HPA
is more applicable to challenging chord recognition tasks with large chord alphabets
and when evaluation is most stringent.
A brief survey of the MIREX evaluation strategy shows relatively little variation
across models, highlighting a drawback of this evaluation: more complex models are
not “rewarded” for correctly identifying complex chords and/or bass notes. However,
it does allow us to compare HPA to the most recent MIREX evaluation.
Performance under the MIREX evaluation shows that under a train/test scenario,
HPA obtains 80.66 ± 0.57% TRCO (row 5 and final column of Table 4.4), which is to be
compared with Cho and Bello’s submission to MIREX 2011 (Submission CB3 in Table
2.7), which scored 80.91%. Although we have already highlighted the weaknesses of the
MIREX evaluations in the current section and in chapter 2, it is still clear that HPA
performs at a similar level to the cutting edge. The p−values under a paired t−test
for an HMM vs HPA, under all alphabets, the Note Precision and Chord Precision
metrics revealed a maximal value of 3.33 × 10−83 , suggesting that HPA significantly
outperforms an HMM in all of these scenarios.
We also ran HPA in a train/train setting on the MIREX dataset, and found it to
87
4. DYNAMIC BAYESIAN NETWORK
perform at 82.45% TRCO, comparable in magnitude to Khadkevich and Olmologo’s
KO1 submission, which attained 82.85% TRCO (see Table 2.7).
4.4
Conclusions
In this chapter, we revealed our Dynamic Bayesian Network, the Harmony Progression
Analyser (HPA). We formulated HPA mathematically as Viterbi decoding of a pair
of bass and treble chromagrams in a similar way to an HMM, but on a larger state
space consisting of hidden nodes for chord, bass and key sequences. We noted that
this increase in state space has a drawback: computational time increases significantly,
and we introduced machine-learning based techniques (two-stage prediction, dynamic
pruning) to select a subspace of the parameter space to explore.
Next, we tested the accuracy of HPA by gradually increasing the number of nodes,
and found that each additional node statistically significantly increased performance in
a train/test setting. Bass note accuracy peaked at 85.71% TRCO, which was investigated by studying both Chord Precision and Note Precision in the evaluation section
using a complex chord alphabet, where we attained results comparable to the state of
the art.
88
5
Exploiting Additional Data
We have seen that our Dynamic Bayesian Network HPA is able to perform at a cuttingedge level when trained and evaluated on a known set of 217 popular music tracks.
However, one of the main benefits of designing a machine-learning based system is that
it may be retrained on new data as it arises.
Recently, a number of new fully-labelled chord sequence annotations have been made
available. These include the USpop set of 194 tracks [7] and the Billboard dataset of
1,000 tracks, for which the ground truth has been released for 649 (the remainder
being saved for test data in future MIREX evaluations) [13]. We may also make use of
seven Carole King annotations1 and a collection of five tracks by the rock group, Oasis,
curated by ourselves [74].
In addition to these fully-labelled datasets, we have access to Untimed Chord Sequences (UCSs, see section 5.4) for a subset of the MIREX and Billboard datasets, as
well as for an additional set of 1, 822 songs. Such UCSs have been shown by ourselves
in the past to improve chord recognition when training data is limited [73].
There are many ways of combining the data mentioned above, and an almost limitless number of experiments we could perform with the luxury of these newly available
1
obtained with thanks from http://isophonics.net/
89
5. EXPLOITING ADDITIONAL DATA
training sources. To retain our focus we will structure the experiments in this chapter
to investigate the following questions:
1. How similar are the datasets to each other?
2. Can we learn from one of the datasets to test in another (a process known as out
of domain testing)?
3. How do an HMM and HPA compare in each of the above settings?
4. Are any sets similar enough to be combined into one unified training set?
5. How fast does HPA learn?
6. Can we use Untimed Chord Sequences as an additional source of information in
a test setting?
7. Can a large number of UCSs be used as an additional source of training data?
We will answer the above questions in this chapter by following the following structure. Section 5.1 will investigate the similarity between datasets and aims to see if
testing out of domain is possible, answering points 1-3 above. Section 5.2 briefly investigates point 4 by using leave-one-out testing on all songs for which we have key
annotations, whilst learning rates (point 5) are studied in section 5.3. The mathematical framework for using chord databases as an additional data source is introduced in
section 5.4 (point 6). We then move on to see how these data may be used in training
in section 5.5 (point 7) before concluding the chapter in section 5.6.
5.1
Training across different datasets
Machine-learning approaches to a recognition task require training data to learn mappings from features to classes. Such training data may come from varying distributions,
which may affect the type of model learnt, and also the generalisation of the model.
90
5.1 Training across different datasets
#
#
#
#
title: I Don’t mind
artist: James Brown
metre: 6/8
tonic: C
0.0 silence
0.073469387 A, intro, | A:min | A:min | C:maj | C:maj |
8.714013605 | A:min | A:min | C:maj | C:maj |
15.611995464 | A:min | A:min | C:maj | C:maj |
22.346394557 B, verse, | A:min | A:min | C:maj | C:maj |, (voice
29.219433106 | A:min | A:min | C:maj | C:maj |
Figure 5.1: Section of a typical Billboard dataset entry before processing.
For instance, one can imagine that given a large database of classical recordings
and corresponding chord sequences on which to train, a chord recognition system may
struggle to annotate the chords to heavy metal music, owing to the different instrumentation and chord transitions in this genre. In this section we will investigate how
well an HMM and HPA are able to transfer their learning to the data we have at hand.
5.1.1
Data descriptions
In this subsection, we briefly overview the 5 datasets we use in this chapter. A full
artist/track listing can be found in Appendix A.
Billboard
This dataset contains 654 tracks by artists which have at one time appeared on the US
Billboard Hot 100 chart listing, obtained with thanks from [13]. We removed 111 songs
which were cover versions (identified by identical title) as well as 21 songs which had
potential tuning problems (confirmed by the authors of [13]); we were left with 522 key
and chord annotations. Worth noting, however, is that this dataset is not completely
labelled. Specifically, it lacks exact onset times for chord boundaries, although segment
onset times are included. An example annotation is shown in Figure 5.1.
91
5. EXPLOITING ADDITIONAL DATA
Although section starts are time-stamped, exact chord onset times are not present.
To counteract this, we extracted chord labels directly from the text and aligned them
to the corresponding chromagram (many thanks to Ashley Burgoyne for running our
feature extraction software on the music source), assuming that each bar has equal
duration. This process was repeated for the key annotations to yield a set of annotations
in the style of Harte et al. [36].
MIREX
The MIREX dataset, as mentioned in previous chapters, contains 218 tracks with 180
songs by The Beatles, 20 by Queen and 18 by Zweieck. We omitted “Revolution Number
9” from the dataset as it was judged to have no meaningful harmonic content, and were
left with 217 chord and key annotations.
USpop
This dataset of 194 tracks has only very recently been made available, and is sampled
from the USpop2002 dataset of 8,752 songs [7]. Full chord labels are available, although
there is no data on key labels for these songs, meaning they unfortunately cannot be
used to train HPA. Despite this, we may train an HMM on these data, or use them
exclusively for testing purposes.
Carole King
A selection of seven songs by the folk/rock singer Carole King, with corresponding key
annotations. Although these annotations come from the same source as the MIREX
datasets, we do not include them in the MIREX dataset, as they are not included in
the MIREX evaluation and their quality is disputed1 .
1
quote from isophonics.net: [...the annotations] have not been carefully checked, use with care.
92
5.1 Training across different datasets
Oasis
A small set of five songs by the Britpop group Oasis, made by ourselves for one of
our previous publications [74]. These data are not currently complemented by key
annotations.
5.1.2
Experiments
In this subsection we will train an HMM and HPA on the sets of chord and (for HPA)
key annotations, and test on the remaining sets of data to investigate how flexible our
model is, and how much learning may be transferred from one dataset to another.
Unfortunately, we cannot train HPA on the USpop or Oasis datasets as they lack
key information. Therefore, we begin by deploying an HMM on all datasets. Results are
shown in Table 5.1, where we evaluated using Chord Precision and Note Precision; and
utilised TRCO as the overall evaluation metric, sampled at 1, 000 Hz, using all chord
alphabets from the previous chapter. Results for Chord Precession are also shown in
Figure 5.2.
93
Test
Billboard
MIREX
USpop
Carole King
Oasis
Billboard
MIREX
USpop
Carole King
Oasis
Billboard
MIREX
USpop
Carole King
Oasis
Billboard
MIREX
USpop
Carole King
Oasis
Billboard
MIREX
USpop
Carole King
Oasis
Train
Billboard
MIREX
USpop
94
Carole King
Oasis
42.85
52.61
43.62
32.08
79.51
51.59
57.67
52.06
66.82
54.09
65.40
71.86
70.87
57.95
65.47
66.04
75.81
69.10
57.66
64.53
67.97
72.84
69.36
57.17
62.02
Minmaj
42.69
52.58
42.22
31.90
80.79
50.58
56.48
50.57
65.65
55.13
61.35
68.16
65.55
54.74
61.11
62.78
72.75
64.88
55.18
60.99
63.28
68.84
63.98
53.17
57.79
Triads
32.81
44.52
34.26
08.31
80.56
20.08
22.44
20.74
56.25
15.90
48.22
55.87
61.64
33.71
45.49
48.28
65.14
53.88
29.71
46.67
55.04
57.77
54.96
38.88
47.13
MM
Chord Precision (%)
34.73
44.7
34.52
13.93
77.17
24.72
27.43
24.19
64.76
25.23
48.87
55.29
60.66
38.94
48.52
49.5
65.47
53.27
36.18
47.88
55.01
55.5
52.08
45.76
46.46
Quads
44.18
54.14
46.31
37.96
79.51
53.71
60.22
56.02
83.86
54.09
67.80
74.88
75.75
66.26
65.47
68.69
79.26
73.93
68.59
64.53
70.48
75.69
73.78
66.56
62.02
Minmaj
44.01
54.13
44.93
37.77
80.79
52.63
59.01
54.44
82.72
55.13
63.65
71.04
70.84
62.30
61.11
65.29
76.51
69.60
65.26
60.99
65.97
71.61
68.17
61.51
57.79
Triads
33.91
46.28
36.73
14.16
81.57
22.03
24.21
23.35
82.62
15.90
50.78
58.60
67.52
43.74
45.49
50.92
69.40
58.73
42.25
46.67
57.84
60.06
58.64
48.51
47.13
MM
Note Precision (%)
Table 5.1: Performances across different training groups using an HMM.
35.84
46.12
36.93
19.47
77.17
25.77
28.73
26.34
82.94
25.23
50.75
57.48
65.11
44.39
48.52
51.51
68.97
57.28
45.45
47.88
57.04
57.65
55.41
52.70
46.46
Quads
5. EXPLOITING ADDITIONAL DATA
5.1 Training across different datasets
We immediately see a large variation in the performances from Table 5.1 (8.31% −
79.51% Chord Precision and 14.16% − 79.51% Note Precision). Worth noting, however,
is that these extreme values are seen when there are few training examples (training
set Carole King or Oasis). In such cases, when the training and test sets coincide, it
is easy for the model to overfit the model (shown by high performances in train/test
Oasis and Carole King), whilst generalisation is poor (low performances when testing on
Billboard/MIREX/USpop). This is due to the model lacking the necessary information
to train the hidden or observed chain. It is extremely unlikely, for example, that the
full range of Quads chords are seen in the Oasis dataset, meaning that these chords are
rarely decoded by the Viterbi algorithm (although small pseudocounts of 1 chord were
used to try to counteract this).
These extreme cases highlight the dependence of machine-learning based systems
on a large amount of good quality training data. When testing on the small datasets
(Carole King and Oasis), this becomes even more of an issue, in the most extreme case
giving a training set performance of 81.57% and test set performance of 14.16% (test
artist Carole King, MM chord alphabet).
In cases where we have sufficient data however (train sets Billboard, MIREX and
USpop), we see more encouraging results (worst performance at minmaj was 65.40%
when training on USpop, testing on Billboard). Performance in TRCO generally decreases as the alphabet size increases as expected, with the sharpest decrease occurring
from the Triads alphabet to MM. We also see that each performance is highest when
the training/testing data coincide, as expected, and that this is more pronounced as
the chord alphabet increases in complexity. Training/testing performances for the Billboard, MIREX and USpop datasets appear to be quite similar (at most 11.46% difference in Chord Precision and minmaj alphabet, 10.41% for Note Precision), suggesting
that these data may be combined to give a larger training set.
We now move on to see how HPA deals with the variance across datasets. Since
95
5. EXPLOITING ADDITIONAL DATA
we require key annotations for training HPA, we shall restrict ourselves here to the
Billboard, MIREX and Carole King datasets. Results are shown in Table 5.2 and
Figure 5.3. We also show comparative plots between an HMM and HPA in Figure 5.4.
96
5.1 Training across different datasets
Note Precision (%)
Note Precision (%)
Note Precision (%)
Note Precision (%)
Minmaj
100
90
80
70
60
50
40
30
20
10
0
100
90
80
70
60
50
40
30
20
10
0
100
90
80
70
60
50
40
30
20
10
0
100
90
80
70
60
50
40
30
20
10
0
Billboard
MIREX
USpop
Test Set
Triads
Carole King
Oasis
Billboard
MIREX
USpop
Test Set
MM
Carole King
Oasis
Billboard
MIREX
USpop
Test Set
Quads
Carole King
Oasis
Billboard
MIREX
USpop
Test Set
Carole King
Oasis
Figure 5.2: TRCO performances using an HMM trained and tested on all combination
of datasets. Chord alphabet complexity increases in successive graphs, with test groups
increasing in clusters of bars. Training groups follow the same ordering as the test data.
97
Billboard
MIREX
Carole King
Carole King
98
Bill. MIREX C. K.
0
20
40
60
80
100
Triads
53.88
60.24
69.96
67.48
78.51
56.97
68.17
74.18
58.40
Triads
Bill. MIREX C. K.
51.63
57.92
74.52
69.06
79.41
63.36
70.84
76.56
59.96
Minmaj
0
20
40
60
80
100
26.42
28.63
74.80
53.79
70.81
41.72
58.79
60.90
46.69
MM
MM
56.47
63.89
81.82
71.26
82.45
68.64
72.77
79.17
64.60
Minmaj
Bill. MIREX C. K.
30.20
33.34
67.27
53.48
67.78
43.17
58.40
58.94
50.23
Quads
0
20
40
60
80
100
55.42
62.14
77.63
69.51
81.65
63.10
70.04
76.77
62.85
Triads
31.12
34.53
75.60
55.10
70.28
48.87
60.03
60.55
54.21
Quads
Bill. MIREX C. K.
Quads
29.51
32.38
81.28
55.43
73.20
45.05
60.26
62.60
49.66
MM
Figure 5.3: Note Precision performances from Table 5.2 presented for visual comparison. Test sets follow the same order as the
grouped training sets. Abbreviations: Bill. = Billboard, C.K. = Carole King.
0
20
40
60
80
Minmaj
Billboard
MIREX
Carole King
MIREX
100
Billboard
MIREX
Carole King
Billboard
Note Precision (%)
Test
Note Precision (%)
Train
Note Precision (%)
Note Precision (%)
Note Precision (%)
Chord Precision (%)
Table 5.2: Performances across all training/testing groups and all alphabets using HPA, evaluated using Note and Chord Precision.
5. EXPLOITING ADDITIONAL DATA
99
Note Precision (%)
Note Precision (%)
Triads
MM
Quads
MM
Quads
20
Minmaj
30
40
50
60
70
80
Triads
MM
Quads
Train on Carole King, test on Billboard
Triads
Train on MIREX, test on Billboard
20
Minmaj
30
40
50
60
70
80
20
Minmaj
30
40
50
60
70
HPA
HMM
Train on Billboard, test on Billboard
MM
Quads
MM
Quads
20
Minmaj
30
40
50
60
70
80
Triads
MM
Quads
Train on Carole King, test on MIREX
Triads
Train on MIREX, test on MIREX
20
Minmaj
30
40
50
60
70
80
Triads
Train on Billboard, test on MIREX
20
Minmaj
30
40
50
60
70
80
MM
Quads
Triads
MM
Quads
20
Minmaj
30
40
50
60
70
80
Triads
MM
Quads
Train on Carole King, test on Carole King
20
Minmaj
30
40
50
60
70
80
Train on MIREX, test on Carole King
20
Minmaj
30
40
50
60
70
Triads
Train on Billboard, test on Carole King
80
Figure 5.4: Comparative plots of HPA vs an HMM under various train/test scenarios and chord alphabets.
Note Precision (%)
Note Precision (%)
Note Precision (%)
Note Precision (%)
Note Precision (%)
Note Precision (%)
Note Precision (%)
80
5.1 Training across different datasets
5. EXPLOITING ADDITIONAL DATA
Comparing results for HPA with those for HMM, we see an improvement in almost
all cases, although when testing on the small set of Carole King it is difficult to tell which
method is best. The effect of overfitting on limited training data is most obviously seen
in Figure 5.4, bottom row. When training and testing on Carole King (lower right),
an HMM is able to attain above 80% on all chord alphabets. However, testing these
parameters on the Billboard or MIREX datasets (lower left and lower centre of Figure
5.4), performance does not exceed 65%.
In contrast to this, the Billboard and MIREX datasets offer more comparable performances under train/test. Indeed, the largest difference between train and test performances under the minmaj alphabet is at most 11.2% (train/test on MIREX vs train
on Billboard, test on MIREX). It is also encouraging to see that by training on the
Billboard data, we attain higher performance when testing on MIREX (76.56% minmaj
Chord Precision) than when testing on the Billboard dataset itself (70.84%), as this
means we may combine these datasets to form a large training set.
5.2
Leave one out testing
Before moving on to discuss the learning rate of HPA, we digress to a simple experiment
to test if all annotations with key annotations may be combined to form a large training
set. One method is to test on each data point, with the training set consisting of all
other examples, a process known as “leave-one-out” testing [48]. Results for these
experiments are shown in Table 5.3 and Figure 5.5.
100
101
0
0
0 20 40 60 80 100
Chord Precision (%)
50
100
150
200
Triads
57.18
19.16
MM
0
50
100
150
200
56.31
19.05
Quads
0 20 40 60 80 100
Chord Precision (%)
70.16
16.06
Triads
50
100
150
200
Minmaj
72.91
15.53
Frequency
Accuracy
Variance
Minmaj
Frequency
Note Precision (%)
MM
68.06
16.82
Triads
0 20 40 60 80 100
Chord Precision (%)
70.71
16.30
Minmaj
0
50
100
150
200
Quads
54.73
19.42
Quads
0 20 40 60 80 100
Chord Precision (%)
55.70
19.53
MM
Chord Precision (%)
Frequency
Figure 5.5: Distributions of data from Table 5.3. The number of songs attaining each decile is shown over each of the four
alphabets.
Frequency
Table 5.3: Leave-one-out testing on all data with key annotations (Billboard, MIREX and Carole King) across four chord alphabets.
Chord Precision and Note Precision are shown in the first row, with the variance across test songs shown in the second.
5.2 Leave one out testing
5. EXPLOITING ADDITIONAL DATA
Leave one out testing offers a trade-off between the benefit of a large training size
and the high variance of the prediction accuracies. The relatively high performances
seen in this setting of 70.84% Chord Precision shows that the MIREX and Billboard
datasets are fairly similar, although the variance is large, as expected from a leaveone-out setting. Upon inspecting the histograms in Figure 5.5, we see that most songs
perform at around 60 − 80% Chord Precision for the minmaj alphabet with a positive
skew. The variance across songs is shown by the width of the histograms, highlighting
the range of difficulty in prediction across this dataset.
5.3
Learning Rates
We have seen that it is possible to train HPA under various circumstances and attain
good performance under a range of training/test schemes. However, an important
question that remains to be answered is how quickly HPA learns from training data.
The current section will address this concern by incrementally increasing the amount
of training data that HPA is exposed to.
5.3.1
Experiments
The experiments for this section will follow that of 5.2, using HPA on all songs with
key annotations. We saw in this section that combining these datasets offers good performance when using leave-one-out testing, although the variance was large. However,
in the Billboard dataset, the number of songs is sufficiently large (522) that we may
perform train-test experiments. Instead of using a fixed ratio of train to test, we will
increase the training ratio to see how fast HPA and an HMM learn.
This is obtained by partitioning the set of 522 songs into disjoint subsets of increasing size, with the remainder being held out for testing. Since there are many ways
to do this, the process is repeated many times to assess variance. We chose training
102
5.3 Learning Rates
sizes of approximately [10%, 30%, . . . , 90%] with 100 repetitions of each training set
size. Results averaged over these repetitions are shown in Figure 5.6.
103
Note Precision
104
63
65
65
66
67
68
69
70
65
64
10 30 50 70 90
Training size (%)
Minmaj
10 30 50 70 90
Training size (%)
66
67
68
69
70
66
67
68
69
70
71
72
73
67
68
69
70
71
72
73
Note Precision
Chord Precision
71
10 30 50 70 90
Training size (%)
Triads
10 30 50 70 90
Training size (%)
Triads
Note Precision
Chord Precision
58
57
56
55
54
53
52
51
50
49
48
47
46
59
58
57
56
55
54
53
52
51
50
49
48
10 30 50 70 90
Training size (%)
MM
10 30 50 70 90
Training size (%)
MM
57
56
55
54
53
52
51
50
49
48
47
46
45
58
57
56
55
54
53
52
51
50
49
48
47
10 30 50 70 90
Training size (%)
Quads
10 30 50 70 90
Training size (%)
Quads
Figure 5.6: Learning rate of HPA when using increasing amounts of the Billboard dataset. Training size increases along the x axis,
with either Note or Chord Precision measured on the y axis. Error bars of width 1 standard deviation across the randomisations
are also shown.
Chord Precision
Minmaj
Note Precision
Chord Precision
74
5. EXPLOITING ADDITIONAL DATA
5.4 Chord Databases for use in testing
5.3.2
Discussion
Generally speaking, we see from Figure 5.6 that test performance improves as the
amount of data increases. Performance increases about 2.5 percentage points for the
minmaj alphabet, and around 4 percentage points for the MM/Quads alphabet. The
performance for the Triads alphabet appears to plateau very quickly to 65%, with
manual inspection revealing that the performance increased very rapidly from 0 to
10% training size. In all cases the increase slightly more pronounced under the Chord
Precision evaluation, which we would expect as it is the more challenging evaluation
and benefits the most from additional data.
5.4
Chord Databases for use in testing
Owing to the scarcity of fully labelled data until very recent times, some authors have
explored other sources of information to train models, as we have done in our previous
work [60, 72, 73, 74]. One such source of information is guitarist websites such as echords1 . These websites typically include chord labels and lyrics annotated for many
thousands of songs. In the present section we will investigate if such websites can be
used to aid chord recognition, following our previous work in the area [74].
5.4.1
Untimed Chord Sequences
e-chords.com is a website where registered users are able to upload the chords, lyrics,
keys, and structural information for popular songs2 . Although the lyrics may provide
useful information, we discard them in the current analysis.
Some e-chords annotations contain key information, although informal investigations have led us to believe that this information is highly noisy, so it will be discarded
1
www.e-chords.com
although many websites similar to e-chords exist, we chose to work with this owing to its size
(annotations for over 140, 000 songs) and the ease of extraction (chord labels are enclosed in html tags,
making them easy to robustly “scrape” from the web).
2
105
5. EXPLOITING ADDITIONAL DATA
G
D7/A
Em
Love, Love, Love
G
D7/A
Em
Love, Love, Love
D7/A
G
D7/A
Love, Love, Love
G
D7/F#
Em
There’s nothing you can do that can’t be done
G
D7/F#
Em
There’s nothing you can sing that can’t be sung
G
D7/A
D/F#
D7
Nothing you can say but you can learn to play the game
D7/A
D
It’s easy
There’s nothing you can make than can’t be made
No one you can save that can’t be saved
Nothing you can do but you can learn to be you in time
It’s easy
Chorus:
G
A7sus
D7
All you need is love
Figure 5.7: Example e-chords chord and lyric annotation for “All You Need is Love”
(Lennon/McCartney), showing chord labels above lyrics.
in this work. A typical section of an e-chords annotation is shown in Figure 5.7.
Notice that the duration of the chords is not explicitly stated, although an indication
of the chord boundaries is given by their position on the page. We will exploit this
information in section 5.4.2. Since timings are absent in the e-chords annotations,
we refer to each chord sequence as an Untimed Chord Sequence (UCS), and denote it
e ∈ A|e| , where A is the chord alphabet used. For instance, the UCS corresponding to
the song in Figure 5.7 (with line breaks also annotated) is
e = N C G D7/A Em [newline] G D7/A Em [newline] . . . D7 N C .
Note that we cannot infer periods of silence from a UCS. To counteract the need for
106
5.4 Chord Databases for use in testing
silence at the beginning and end of songs, we added a no-chord symbol at the start and
end of each UCS.
It is worth noting that multiple versions of some songs exist. A variation may have
different but similar-sounding chord sequences (we assume the annotations on e-chords
are uploaded by people without formal musical training), different recordings of the
same song, or in a transposed key (the last of these is common because some keys on
the guitar are easier to play in than others). We refer to the multiple files as song
redundancies, and to be exhaustive we consider each of the redundancies in every key
transposition. We will discuss a way of choosing the best key and redundancy in section
5.4.3.
The principle of this section is to use the UCSs to constrain, in a certain way, the
set of possible chord transitions for a given test song. Mathematically, this is done by
modelling the joint probability of chords and chromagrams of a song (X, y) by
0
P (X, y|Θ, e) = Pini (y1 |Θ) · Pobs (x1 |y1 , Θ) ·
|y|
Y
Ptr0 (yt |yt−1 , Θ, e) · Pobs (xt |yt , Θ). (5.1)
t=2
This distribution is the same as in Equation 2.11, except that the transition distribution
Ptr0 now also depends on the e-chord UCS e for this song, essentially by constraining
the transitions that are allowed, as we will detail in subsection 5.4.2.
An important benefit of this approach is that the chord recognition task can still be
solved by the Viterbi algorithm, albeit applied to an altered model with an augmented
transition probability distribution. Chord recognition using the extra information from
the UCS then amounts to solving
y∗ = arg max P 0 (X, y|Θ, e).
y
(5.2)
The more stringent the constraints imposed on Ptr0 , the more information from the UCS
is used, but the effect of noise will be more detrimental. On the other hand, if the extent
107
5. EXPLOITING ADDITIONAL DATA
of reliance on the UCS is less detailed, noise will have a smaller effect. The challenge
is to find the right balance and to understand which information from the UCSs can
be trusted for most of the songs. In the following subsections we will explore various
ways in which e-chords UCSs can be used to constrain chord transitions, in search for
the optimal trade-off. The empirical results will be demonstrated in subsection 5.4.4.
5.4.2
Constrained Viterbi
In this subsection, we detail the ways in which we will use increasing information for
the e-chords UCSs in the decoding process.
Alphabet Constrained Viterbi (ACV)
Given the e-chord UCS e ∈ A|e| for a test song, the most obvious constraint that can
be placed on the original state diagram is to restrict the output to only those chords
appearing in e. This is implemented simply by setting the new transition distribution
Ptr0 as
Ptr0 (aj |ai , Θ, e)
=



1
Z Ptr (ai , aj )

 0
if ai ∈ e & aj ∈ e
,
(5.3)
otherwise
with Z as a normalization factor1 . An example of this constraint for a segment of the
Beatles song “All You Need Is Love” (Figure 5.7) is illustrated in Figure 5.8 (a), where
the hidden states (chords) with 0 transition probabilities are removed. We call this
method Alphabet Constrained Viterbi, or ACV.
Alphabet and Transition Constrained Viterbi (ATCV)
We can also directly restrict the transitions that are allowed to occur by setting all
Ptr (ai , aj ) = 0 unless we observe a transition from chord ai to chord aj in the e-chords
1
0
0
normalization factor Z is used to re-normalize Ptr
so that Ptr
meets the probability criterion
P The
0
Ptr
(aj |ai , Θ, e) = 1. Similar operations are done for the three methods presented in this subsection.
aj ∈A
108
5.4 Chord Databases for use in testing
(c)
(a)
G
G
(L1)
(d)
G
(L1)
D/F#
(L1)
D/F#
(L1)
Em
(L1)
Em
(L1)
End of Line 1
D/F#
Em
A7sus
D7/A
D7
G
(L2)
G
(L2)
D/F#
(L2)
D/F#
(L2)
Em
(L2)
Em
(L2)
End of Line 2
(b)
G
D7/A
(L3)
D7/A
(L3)
G
(L3)
G
(L3)
D/F#
(L3)
D/F#
(L3)
D7
(L3)
D7
(L3)
Em
D/F#
End of Line 3
A7sus
D7/A
D7
(L4)
D7
(L4)
D7/A
(L4)
D7/A
(L4)
D7
End of Line 4
G
(L5)
G
(L5)
A7sus
(L5)
A7sus
(L5)
D7
(L5)
D7
(L5)
Figure 5.8: Example HMM topology for Figure 5.7. Shown here: (a) Alphabet Constrained Viterbi (ACV), (b) Alphabet and Transition Constrained Viterbi (ACV), (c) Untimed Chord Sequence Alignment (UCSA), (d) Jump Alignment (JA).
109
5. EXPLOITING ADDITIONAL DATA
file (e.g. Figure 5.8 (b)). This is equivalent to constraining Ptr0 such that,
Ptr0 (aj |ai , Θ, e)
=



1
Z Ptr (ai , aj )

 0
if ai aj ∈ e or ai = aj and ai ∈ e
,
(5.4)
otherwise
where ai aj denotes a transition pair and Z is the normalization factor. We call this
method Alphabet and Transition Constrained Viterbi, ATCV. The topology for this
method is shown in Figure 5.8(b).
Untimed Chord Sequence Alignment (UCSA)
An even more stringent constraint on the chord sequence y for a test song is to require
it to respect the exact order of chords as seen in the UCS e. Doing this corresponds
to finding an alignment of e to the audio, since all that remains for the decoder to do
is ascertain the duration of each chord. In fact, symbolic-to-audio sequence alignment
has previously been exploited as a chord recognition scheme and was shown to achieve
promising results on a small set of Beatles’ and classical music [99], albeit in an ideal
noise-free setting.
Interestingly, sequence alignment can be formalized as Viterbi inference in an HMM
with a special set of states and state transitions (see e.g., the pair-HMM discussed in
[25]). In our case, this new hidden state set A0 = {1, . . . , |e|} corresponds to the ordered
indices of the chords in the UCS e (see Figure 5.8 (c)). The state transitions are then
constrained by designing Ptr0 , such that,
Ptr0 (j|i, Θ, e) =



1
Z Ptr (ei , ej )

 0
if j ∈ {i, i + 1}
,
(5.5)
otherwise
where Z denotes the normalization factor for the new hidden state ei .
Briefly speaking, each state (i.e. each circle in Figure 5.8 (c)) can only undergo a
110
5.4 Chord Databases for use in testing
self-transition or move to the next state, constraining the chord prediction to follow
the same order as appeared in the e-chord UCS. This method is named Untimed Chord
Sequence Alignment (UCSA), and shown in Figure 5.8(c).
5.4.3
Jump Alignment
A prominent and highly disruptive type of noise in e-chords is that the chord sequence
is not always complete or in the correct order. As we will show in section 5.4.4, exact
alignment of chords to audio results in a decrease in performance accuracy. This is
due to repetition cues (e.g., “Play verse chords twice”) not being understood by our
scraper. Here we suggest a way to overcome this by means of a more flexible form of
alignment to which we refer to as Jump Alignment (JA)1 , which makes use of the line
information of the UCSs.2
In the UCSA setting, the only options were to remain on a chord, or progress to the
next one. As we discussed, the drawback of this is that we sometimes want to jump
to other parts of the annotation. The salient feature of JA is that instead of moving
from chord to chord in the e-chords sequence, at the end of an annotation line we allow
jumps to the beginning of the current line, as well as all previous and subsequent lines.
This means that it is possible to repeat sections that may correspond to repeating verse
chords, etc.
An example of a potential JA is shown in Figure 5.9. In the strict alignment method
(UCSA), the decoder would be forced to go from the D7 above “easy” to the G7 to
start the chorus (see Figure 5.8 (c)). We now have the option of “jumping back” from
1
Although Jump Alignment is similar to the jump dynamic time warping (jumpDTW) method
presented in [32], it is worth pointing out that the situation we encountered is more difficult than that
faced by music score-performance synchronization, where the music sections to be aligned are generally
noise-free, and where clear cues are available in the score as to where jumps may occur. Furthermore,
since the applications of JA and jumpDTW are in different areas, the optimisation functions and
topologies are different.
2
We should point out that our method depends on the availability of line information. However,
most online chord databases contain this, such that the JA method is applicable not only to UCSs from
the large e-chords database, but also beyond it.
111
5. EXPLOITING ADDITIONAL DATA
G
D7/A
Em
G
D7/A
Em
D7/A
G
D7/A
Love, Love, Love
Love, Love, Love
1
Love, Love, Love
G
D7/F#
Em
There’s nothing you can do that can’t be done
G
D7/F#
Em
There’s nothing you can sing that can’t be sung
G
D7/A
D/F#
D7
Nothing you can say but you can learn to play the game
D7/A
D
It’s easy
There’s nothing you can make than can’t be made
No one you can save that can’t be saved
Nothing you can do but you can learn to be you in time
It’s easy
2
Chorus:
G
A7sus
D7
All you need is love
Figure 5.9: Example application of Jump Alignment for the song presented in Figure 5.7.
By allowing jumps from ends of lines to previous and future lines, we allow an alignment
that follows the solid path, then jumps back to the beginning of the song to repeat the
verse chords before continuing to the chorus.
112
5.4 Chord Databases for use in testing
the D7 to the beginning of the first line (or any other line). We can therefore take the
solid line path, then jump back (dashed path 1), repeat the solid line path, and then
jump to the chorus (dashed path 2). This gives us a path through the chord sequence
that is better aligned to the global structure of the audio.
This flexibility is implemented by allowing transitions corresponding to jumps backward (green arrows in Figure 5.8 (d)) and jumps forward (blue arrows in Figure 5.8
(d)). The transition probability distribution Ptr0 (still on the new augmented state
space A0 = {1, . . . , |e|} introduced in section 5.4.2) is then expressed as,
Ptr0 (j|i, Θ, e)
=









1
Z Ptr (ei , ej )
if j ∈ {i, i + 1}
pf
Z Ptr (ei , ej )
if i + 1 < j & i is the end and j the beginning of a line

pb


Z Ptr (ei , ej )




 0
if i > j & i is the end and j the beginning of a line
otherwise.
(5.6)
Hence, if the current chord to be aligned is not the end of an annotation line,
the only transitions allowed are to itself or the next chord, which executes the same
operations as in UCSA. At the end of a line, an additional choice to jump backward
or forward to the beginning of any line is permitted with a certain probability. In
effect, Jump Alignment can be regarded as a constrained Viterbi alignment, in which
the length of the Viterbi path is fixed to be |X|.
This extra flexibility comes at a cost: we must specify a jump backward probability
pb and a jump forward probability pf to constrain the jumps. To tune these parameters,
we used maximum likelihood estimation, which exhaustively searches a pre-defined
(pb , pf ) matrix and picks up the pair that generates the most probable chord labelling
for an input X (note that UCSA is a special case of JA that is obtained by setting both
jump probabilities (pb , pf ) to 0).
The pseudo-code of the JA algorithm is presented in Table 5.4, where two additional matrices Pobs = {Pobs (xt |ai , Θ)|t = 1, . . . , |X| and i = 1, . . . , |A|} and P0tr =
113
,
5. EXPLOITING ADDITIONAL DATA
{Ptr0 (j|i, Θ, e)|i, j = 1, . . . , |e|}, are introduced for notational convenience.
Table 5.4: Pseudocode for the Jump Alignment algorithm.
Input: A chromagram X and its UCS e, the observation probability matrix
Pobs , the transition probability matrix Ptr , the initial distribution vector
Pini and the jump probabilities pb and pf
1) Restructure the transition probabilities
Initialise a new transition matrix P0tr ∈ R|e|×|e|
for i = 1, . . . , |e|
for j = 1, . . . , |e|
if i = j then P0tr (i, j) = Ptr (ei , ei )
if i = j − 1 then P0tr (i, j) = Ptr (ei , ej )
if i is the end of a line and j is the beginning of a line
if i > j then P0tr (i, j) = pb × Ptr (ei , ej )
if i < j then P0tr (i, j) = pf × Ptr (ei , ej )
else P0tr (i, j)=0
Re-normalise P0tr such that each row sums to 1
2) Fill in the travel grid
Initialise a travel grid G ∈ R|X|×|e|
Initialise a path tracing grid TR ∈ R|X|×|e|
for j = 1, . . . , |e|
G(1, j) = Pobs (x1 , ej ) × Pini (ej )
for t = 2, . . . , |X|
for j = 1, . . . , |e|
|e|
G(t, j) = Pobs (xt , ej ) × max(G(t − 1, i) × P0tr (i, j))
i=1
|e|
T R(t, j) = arg max(G(t − 1, i) × P0tr (i, j))
i=1
3) Derive the Viterbi path
The path probability P =
|e|)
G(|X|,
The Viterbi path V P = |e|
for t = |X|,
...,2
V P = T R(t, V P (1)), V P
V P = e(V P )
Output: The Viterbi path V P and the path likelihood P
Choosing the Best Key and Redundancy
In all the above methods we needed a way of predicting which key transposition and redundancy was the best to use, since there were multiple versions and key transpositions
in the database. Similar to the authors of [57], we suggest to use the log-likelihood as
a measure of the quality of the prediction (we refer to this scheme as “Likelihood”).
114
5.4 Chord Databases for use in testing
In the experiments in section 5.4.4 we investigate the performance of this approach
to estimate the correct transposition, showing that it is almost as accurate as using the
key and transposition that maximised the performance (which we call “Accuracy”).
5.4.4
Experiments
In order to evaluate the performance of using online chord databases in testing, we
must test on songs for which the ground truth is currently available. Being the most
prominent single artist in any of our datasets, we chose The Beatles as our test set.
We used the USpop dataset to train the parameters for an HMM and used these, in
addition with increasing amounts of online information, to decode the chord sequence
for each of the songs in the test set.
We found that 174 of the 180 songs had at least one file on e-chords.com, and we
therefore used this as our test set. Although a full range of complex chords are present
in the UCSs, we choose to work with the minmaj alphabet as a proof of concept.
We used either the true chord sequence (GTUCS), devoid of timing information, or the
genuine UCS; and chose the best key and redundancy using either the largest likelihood
or best performance. Results are shown in Table 5.5. From a baseline prediction level
Table 5.5: Results using online chord annotations in testing. Amount of information
increases left to right, Note Precision is shown in the first 3 rows. p–values using the
Wilcoxon signed rank test for each result with respect to that to the left of it are shown
in rows 4–6.
Model
HMM
ACV
ATCV
UCSA
JA
NP (%)
GTUCS
Accuracy
Likelihood
76.33
76.33
76.33
80.40
79.56
79.02
83.54
81.19
80.95
88.76
73.10
72.61
−
83.64
82.12
p-value
GTUCS
Accuracy
Likelihood
−
−
−
2.73e − 28
7.07e − 12
1.63e − 15
1.06e − 23
5.52e − 11
2.3e − 10
1.28e − 29
4.13e − 14
3.05e − 13
−
4.67e − 9
7.19e − 27
115
5. EXPLOITING ADDITIONAL DATA
of 76.33% Note Precision, we see a rapid improvement in recognition rates by using
the ground truth UCS (top row of Table 5.5, peaking at 88.76%). Note that JA is
neither possible nor necessary with the ground truths, as we know that the chords in
the Ground Truth are in the correct order.
When using genuine UCSs, we also see an improvement when using Alphabet Constrained Viterbi (ACV, column 2) and Alphabet and Transition Constrained Viterbi
(ATCV, column 3). However, when attempting to align the UCSs to the chromagram
(UCSA, column 4), performance decreases. Upon inspection of the decoded sequences,
we discovered that this was because complex line information (Play these chords twice,
etc.) were not understood by our scraper. To counteract this, we employed Jump Alignment (JA, final column) where we saw an increase in recognition rate, although the
recognition rate naturally does not match performance when using the true sequence.
Comparing the likelihood method to the accuracy (rows 2 to 3), we see that both
models are very competitive, suggesting that using the likelihood is often picking the
correct key and most useful redundancy of a UCS. Inspecting the p–values (rows 4–
6) shows that all increases in performances are statistically significant at the 1% level.
This is a significant result, as it shows that knowledge of the correct key and most informative redundancy offers only a slight improvement over the fully automatic approach.
However, statistical tests were also conducted to ascertain whether the difference between the Accuracy and Likelihood settings of Table 5.5 were significant on models
involving the use of UCSs. Wlicoxon signed rank tests yielded p-values of less than
0.05 in all cases, suggesting that true knowledge of the ‘best’ key and transposition
offers significant benefits when exploiting UCSs in ACE. We show the data from Table
5.5 in Figure 5.10, where the benefit of using additional information from internet chord
annotations and the similarity between the “likelihood” and “accuracy” schemes are
easily seen.
116
5.5 Chord Databases in Training
90
88
Note Precision (%)
86
84
82
80
78
Best Guess
Best Accuracy
Ground Truth
76
74
HMM
ACV
ATCV
JA
Figure 5.10: Results from Table 5.5, with UCSA omitted. Increasing amounts of information from e-chords is used from left to right. Information used is either simulated
(ground truth, dotted line) or genuine (dashed and solid lines). Performance is measured
using Note Precision, and the TRCO evaluation scheme is used throughout.
5.5
Chord Databases in Training
We have seen that it is possible to align UCSs to chromagram feature vectors by the
use of Jump Alignment, and that this leads to improved recognition rates. However,
an interesting question now arises: Can we align a large number of UCSs to form a new
large training set? This question will be investigated in the current section, the basis
of which is one of our publications [74].
As we will show, in this setting this basic approach unfortunately deteriorates performance, rather than improving it. The cause of this seems to be the high proportion
of low quality aligned UCSs. A key concept in this section is a resolution of this issue,
using a curriculum learning approach. We briefly introduce the concept of curriculum
learning before presenting the details of our experiments.
117
5. EXPLOITING ADDITIONAL DATA
5.5.1
Curriculum Learning
It has been shown that humans and animals learn more efficiently when training examples are presented in a meaningful way, rather than in a homogeneous manner [28, 50].
Exploiting this feature of learners is referred to as Shaping, in the animal training
community, and Curriculum Learning (CL), in the machine learning discipline [6].
The core assumption of the CL paradigm is that starting with easy examples and
slowly generalising leads to more efficient learning. In a machine learning setting this
can be realised by carefully selecting training data from a large set of examples. In
[6], the authors hypothesize that CL offers faster training (both in optimization and
statistical terms) in online training settings, owing to the fact that the learner wastes
less time with noisy or harder–to–predict examples. Additionally, the authors assume
that guiding the training into a desirable parameter space leads to better generalization.
Due to high variability in the quality of e–chords UCSs, CL seems a particularly
promising idea to help us make use of aligned UCSs in an appropriate preference order,
from easy to difficult. Until now we have not defined what we understand by “easy”
examples or how to sort the available examples in order of increasing difficulty. The
CL paradigm provides little formal guidance for how to do this, but generally speaking,
easy examples are those that the recognition system can already handle fairly well, such
that considering them will only incrementally alter the recognition system.
Thus, we need a way to quantify how well our chord recognition system is able to
annotate chords to audio for which we only have UCSs and no ground truth annotations.
To this end, we propose a new metric for evaluating chord sequences based on a UCS
only. We will refer to this metric as the Alignment Quality Measure.
In summary, our CL approach rests on two hypotheses:
1. Introducing “easy” examples into the training set leads to faster learning.
2. The Alignment Quality Measure quantifies how “easy” a song with associated
118
5.5 Chord Databases in Training
UCS is for the current chord recognition system, more specifically whether it is
able to accurately annotate the song with chords.
Both these hypotheses are non–trivial, and we will empirically confirm their validity
below.
5.5.2
Alignment Quality Measure
We first address the issue of determining the quality of UCS alignment without the aid
of ground truth. In our previous work [73], we used the likelihood of the alignment
(normalised by the number of frames) as a proxy for the alignment quality. In this
work we take a slightly different approach, which we have found to be more robust.
Let {AUCS}N
n=1 be a set of UCSs aligned using Jump Alignment. For each UCS
chromagram, we made a simple HMM prediction using the core training set to create
a set of predictions {HMM}N
n=1 . We then compared these predictions to the aligned
UCS to estimate how close the alignment has come to a rough estimate of the chords.
Thus, we define:
1
γi =
|AUCSi |
|AUCSi |
X
I(AUCSti = HMMti )
(5.7)
t=1
where I is an indicator function and AUCSti and HMMti represent the tth frame of the
ith aligned UCS and HMM prediction, respectively.
We tested the ability of this metric to rank the quality of the alignments, using
the set–up from the experiments in subsection 5.4.4 (ground truths were required to
test this method). We found the rank correlation between γ and the actual HMM
performance to be 0.74, with a highly significant p–value of p < 10−30 , indicating that
point 2 has been answered (i.e., we have an automatic method of measuring how good
the alignment of a UCS to a chromagram is).
119
5. EXPLOITING ADDITIONAL DATA
5.5.3
Results and Discussion
Confident that we now have a method for assessing alignment quality, we set about
aligning a large number of UCSs to form a new training set. We took the MIREX
dataset as the core training set, and trained an HMM on these data. These parameters
were then used to align 1, 683 UCSs for which we had audio (we only used UCSs that had
at least 10 chord symbols to clean the data, reducing the dataset from 1, 822 examples).
We then ran an HMM over these chroma and calculated the alignment quality γ for
each of the aligned UCSs. These were then sorted and added in descending order to
the core training set. Finally, an HMM was re–trained on the union of the core and
expansion sets and tested on the union of the USpop and Billboard datasets.
From our previous work [73], we know that expanding the training set is only
beneficial when the task is sufficiently challenging (a system that already performs well
has little need of additional training data). For this reason, we evaluated this task on
the MM alphabet. Results are shown in Figure 5.11.
Here we show the alignment quality threshold on the x–axis, with the number
of UCSs this corresponds to on the left y–axis. The baseline performance occurs at
alignment quality threshold ∞, i.e., when we use no UCSs and the threshold is shown
as a grey, dashed line; whilst performance using the additional UCSs is shown as a
solid black line, with performance being measured in both cases in TRCO on the right
y–axis.
The first observation is that there are a large number of poor–quality aligned UCSs,
as shown by the large number of expansion songs in the left–most bin for number of
expansion songs. Including all of these sequences leads to a large drop in performance,
from a baseline of 52.34% to 47.50% TRCO Note Precision. Fortunately, we can automatically remove these poor–quality aligned UCSs via the alignment quality measure
γ. By being more stringent with our data (γ → 1), we see that, although the number of
additional training examples drops, we begin to see a boost in performance, peaking at
120
5.5 Chord Databases in Training
55
53
1600
51
49
1200
47
800
45
43
400
Performance (% TRCO)
Number of Expansion Songs
2000
41
0
0
0.2
0.4
0.6
Alignment Quality Threshold
0.8
1
39
Figure 5.11: Using aligned Untimed Chord Sequences as an additional training source.
The alignment quality threshold increases along the x–axis, with the number of UCSs this
corresponds to on the left y–axis. Baseline performance is shown as a grey, dashed line;
performance using the additional UCSs is shown as the solid black line, with performance
being measure in TRCO on the right y–axis. Experiments using random training sets of
equal size to the black line with error bars of width 1 standard deviation are shown as a
black dot–and–dashed line.
54.66% when setting γ = .5. However, apart from the extreme case of using all aligned
USCs, each threshold leads to an improvement over the baseline, suggesting that this
method is not too sensitive to the parameter γ. The test performances were compared
to the baseline method in a paired t–test and, apart from the cases when we use all or
no UCSs (γ = 0, 1 resp.), all improvements were seen to be significant, as indicated by
p–values of less than 10−5 . The p−value for the best performing case when γ = 0.5 was
numerically 0, which corresponded to an improvement in 477 of the 715 test songs.
To see if curriculum learning genuinely offered improvements over homogeneous
learning, we also included aligned UCSs into the training set in random batches of the
same size as the previous experiment, and repeated 30 times to account for random
variations. The mean and standard deviations over the 30 repeats are shown as the
121
5. EXPLOITING ADDITIONAL DATA
dot–and–dashed line and bars in Figure 5.11. We can see that the specific ordering of
the expansion set offers substantial improvement over randomly selecting the expansion
set, and in fact, ordering the data randomly never reaches the baseline performance.
This is good evidence that curriculum learning is the method of choice for navigating a
large set of training examples, and also demonstrates that the first assumption of the
Curriculum Learning paradigm holds.
5.6
Conclusions
This chapter was concerned with retraining our model on datasets outside the MIREX
paradigm. We saw that training a model on a small amount of data can lead to
strong overfitting and poor generalisation (for instance, training on seven Carole King
tracks). However, when sufficient training data exists we attain good training and
test performances, and noted in particular that generalisation between the Billboard,
MIREX and USpop datasets is good. Across more complex chord alphabets, we see a
drop in performance as the complexity of chords increases, as is to be expected.
We also showed the dominance of HPA over the baseline HMM on all datasets that
contained key information on which to train. Using leave–one–out testing, we saw that
an overall estimate of the test set performance was 54.73% − 70.71% TRCO, depending
on the alphabet used, although the variance in this setting is large. Following this,
we investigated how fast HPA learns by constructing learning curves, and found that
the initial learning rate is fast, but appears to plateau for simpler alphabets such as
minmaj.
The next main section of this chapter looked at online chord databases as an additional source of information. We first investigated if chord sequences obtained from
the web could be used in a test setting. Specifically, we constrained the output of the
Viterbi decoder according to these sequences to see if they could aid decoding perfor-
122
5.6 Conclusions
mance. We experienced an increase in recognition performance from 76.33% to 79.02%
by constraining the alphabet, and 80.95% by constraining the alphabet and transitions,
but a drop to 72.61% when aligning the sequences to the audio. However, this drop
was resolved by the use of Jump Alignment, where we attained 82.12% accuracy. All
of the results above were obtained by choosing the key and redundancy for a UCS
automatically.
Next, we investigated whether aligning a large number of UCSs to audio could form
a new training set. By training on the MIREX dataset, we aligned a large number
of UCSs to chromagram feature vectors and experienced an increase of 2.5 percentage
points when using a complex chord alphabet. This was obtained by using an alignment
quality measure γ to estimate how successful an alignment of a UCS to audio was. These
were then sorted and added to the data in decreasing order, in a form of curriculum
learning. Performance peaked when using γ = 0.5, although using any number of
sequences apart from the worst ones led to an improvement. We also experimentally
verified that the curriculum learning setting is essential if we are to use UCSs as a
training source by adding aligned UCSs to the expansion set in random order.
123
5. EXPLOITING ADDITIONAL DATA
124
6
Conclusions
In this thesis, we have designed and tested a new method for the extraction of musical
chords from audio. To achieve this, we conducted a review of the literature in the field,
including the annual benchmarking MIREX evaluations. We also defined a new feature
for use in chord recognition, the loudness-based chromagram. Decoding was achieved by
Viterbi inference using our Dynamic Bayesian Network HPA (the Harmony Progression
Analyser); we achieved cutting-edge performance when deploying this method on the
MIREX dataset. We also saw that HPA may be re-trained on new ground truth data
as it arises, and tested this on several new datasets.
In this brief chapter, we review the main findings and results in section 6.1 and
suggest areas for further research in section 6.2.
6.1
Summary
Chapter 1: Introduction
In the opening chapter, we first defined the task of automatic chord estimation as the
unaided extraction of chord labels and boundaries from audio. We then motivated our
work as a combination of three factors: the desire to make a tool for amateur musi-
125
6. CONCLUSIONS
cians for educational purposes, the use of chord sequences in higher-level MIR tasks,
and the promise that recent machine-learning techniques have shown in tasks such as
image recognition and automatic translation. Next, we outlined our research objectives
and contributions, with reference to the thesis structure and main publications by the
author.
Chapter 2: Background
In chapter 2, we looked at chords and their musical function. We defined a chord
as occurring when three or more notes are sounded simultaneously, or functioning as
if sounded simultaneously [93]. This led into a discussion of musical keys, and we
commented that it is sometimes more convenient to think of a group of chords as
defining a key - sometimes conversely. Several authors have exploited this fact by
estimating chords and keys simultaneously [16, 57].
We next gave a chronological account of the literature for the domain of Automatic
Chord Estimation. We found that through early work on Pitch Class Profiles, Fujishima [33] was able to estimate the chords played on a solo piano by using pattern
matching techniques in real time. A breakthrough in feature extraction came in 2001
when [79] used a constant-Q spectrum to characterise the energy of the pitch classes
in a chromagram. Since then, other techniques for improving the accuracy of chord
recognition systems have included the removal of background spectra and/or harmonics [65, 96, 111], compensation for tuning [38, 44, 99], smoothing/beat synchronisation
[4, 52], mapping to the tonal centroid space [37], and integrating bass information
[63, 107].
We saw that the two dominant models in the literature are template-based methods
[15, 86, 106] and Hidden Markov Models [19, 87, 99]. Some authors have also explored
using more complex models, such as HMMs, with an additional chain for the musical
key [100, 119] or larger Dynamic Bayesian Networks [65]. In addition to this, some
126
6.1 Summary
research has explored whether a language model is appropriate for modelling chords
[98, 117], or if discriminative modelling [12, 115] or genre-specific models [55] offer
superior performance.
With regard to evaluation, the number of correctly identified frames divided by the
total number of frames is the standard way of measuring performance for a song, with
Total Relative Correct Overlap and Average Relative Correct Overlap being the most
common evaluation schemes when dealing with many songs. Most authors in the field
reduce their ground truth and predicted chord labels to major and minor chords only
[54, 87], although the main triads [12, 118] and larger alphabets [65, 99] have also been
considered.
Finally, we conducted a review in this chapter of the Music Information Evaluation eXchange (MIREX), which has been benchmarking ACE systems since 2008.
Significantly, we noted that the expected trend of pre-trained systems outperforming
train/test systems was not observed every year. This, however, was highlighted by
our own submission NMSD2 in 2011, which attained 97.60 TRCO, underscoring the
difficulty in using MIREX as a benchmarking system when the test data is known.
Chapter 3: Chromagram Extraction
In this chapter, we first discussed our motivation for calculating loudness-based chromagram feature vectors. We then detailed the preprocessing that an audio waveform
undergoes before analysis. Specifically, we downsample to 11, 025 samples per second,
collapse to mono, and employ Harmonic and Percussive Sound Separation to the waveform. We then estimate the tuning of the piece using an existing algorithm [26] to
modify the frequencies we search for in the calculation of a constant-Q based spectrogram. The loudness at each frequency is then calculated and adjusted for human
sensitivity by the industry-standard A-weighting [103] before octave summing, beatsynchronising and normalising our features.
127
6. CONCLUSIONS
Experimentally, we first described how we attained beat-synchronised ground truth
annotations to match our features. We then tested each aspect of our feature extraction process on the MIREX dataset of 217 songs, and found that the best performance
(80.91% TRCO) was attained by using the full complement of signal processing techniques.
Chapter 4: Dynamic Bayesian Network
A mathematical description of our Dynamic Bayesian Network (DBN), the Harmony
Progression Analyser (HPA), was the first objective of this chapter. This DBN has
hidden nodes for chords, bass notes, and key sequences and observed nodes representing
the treble and bass frequencies of a musical piece. We noted that this number of nodes
and links places enormous constraints on the decoding and memory costs of HPA, but
we showed that two-stage predictions and making use of the training data permitted
us to reduce the search space to an acceptable level.
Experimentally, we then built up the nodes used in HPA from a basic HMM. We
found that the full HPA model performed the best in a train/test setting, achieving
83.52% TRCO in an experiment comparable to the MIREX competition, and attaining
a result equal to the current state of the art. We also introduced two metrics for
evaluating ACE systems: chord precision (which measures 1 if the chord symbols in
ground truth and prediction are identical), and note precision (1 if the notes in the
chords are the same, 0 otherwise). We noted that the key accuracies for our model
were quite poor. Bass accuracies on the other hand were high, peaking at 86.08%.
Once the experiments on major and minor chords were complete (Section 4.2),
we moved on to larger chord alphabets, including all triads and some chords with 4
notes, such as 7th s. We found that chord accuracies generally decreased, which was
as expected, but that results were at worst 57.76% (chord precision, Quads alphabet,
c.f. Minmaj at 74.08%). Specifically, performance for the triads alphabet peaked
128
6.1 Summary
at 78.85% Note Precision TRCO, whilst the results for the MM and Quads alphabets
peaked at 66.53% and 66.50%, respectively. Not much change was seen across alphabets
when using the MIREX metric, which means that this method is not appropriate for
evaluating complex chord alphabets. We also saw that HPA significantly outperformed
an HMM in all tasks described in this chapter, and attained performance in line with
the current state of the art (82.45% TRCO c.f. KO1 submission in 2011, 82.85%).
Chapter 5: Exploiting Additional Data
In chapter 5, we tested HPA on a variety of ground truth datasets that have recently
become available. These included the USpop set of 194 ground truth annotations, and
Billboard set of 522 songs, as well as two small sets by Carole King (7 songs) and Oasis
(5 songs). We saw poor performances when training on the small datasets of Carole
King and Oasis, which highlights a disadvantage of using data-driven systems such as
HPA.
However, when training data is sufficient, we attain good performances on all
chord alphabets. Particularly interesting was that training and testing on the Billboard/MIREX datasets gave performances similar to using HPA (train Billboard, Test
MIREX = 76.56% CP TRCO, train MIREX, test Billboard = 69.06% CP TRCO in
the minmaj alphabet), although the difficulty of testing on varied artists is highlighted
by the poorer performance when testing on Billboard. This does, however, show that
HPA is able to transfer learning from one dataset to another, and gives us hope that it
has good potential for generalisation.
Through leave-one-out testing, we were able to generate a good estimate of how
HPA deals with a mixed-test set of the MIREX, Billboard and Carole King datasets.
Performances here were slightly lower than in earlier experiments, and the variance was
high, again underscoring the difficulty of testing on a diverse set. We also investigated
how quickly HPA learns. Through plotting learning curves, we found out that HPA is
129
6. CONCLUSIONS
able to attain good performances on the Billboard, and that learning is fastest when
the task is most challenging (MM and Quads alphabets).
We then went on to see how Untimed Chord Sequences (UCSs) can be used to
enhance prediction accuracy for songs, when available. This was conducted by using
increasing amounts of information from UCSs from e-chords.com, where we found that
prediction accuracy increased from a baseline of 76.33% NP to 79.02% and 80.95% by
constraining the alphabet, and then transitions, allowed in the Viterbi inference. When
we tried to align the UCSs to the audio, we experienced a drop in performance to
72.61%, which we attributed to our assumption that the chord symbols on the website
are in the correct order, with no jumping through the annotation required. However,
this problem was overcome by the use of the Jump Alignment algorithm, which was
able to resolve these issues and attained performance of 82.12%.
In addition to their use in a test setting, we also discovered that aligned UCSs may
be used in a training scenario. Motivated by the steep learning curves for complex
chord alphabets seen in 5.3 and our previous results [73], we set about aligning a set of
1, 683 UCSs to audio, using the MIREX dataset as a core training set. We then trained
an HMM on the core training set, as well as the union of the core and expansion set,
and tested on the USpop and Billboard datasets, where we experienced an increase in
recognition rate from 52.34% to 54.66% TRCO. This was attained by sorting the aligned
UCSs according to alignment quality, and adding to the expansion set incrementally,
beginning with the “easiest” examples first in a form of curriculum learning, that was
shown to lead to an improvement in learning as opposed to homogeneous training.
6.2
Future Work
Through the course of this thesis, we have come across numerous situations where
further investigation would be interesting or insightful. We present a summary of these
130
6.2 Future Work
concepts here.
Publication of Literature Summary
In the review of the field that we conducted in section 2.2, we collated many of the
main research papers conducted on automatic chord estimation, and also summarised
the results of the MIREX evaluations from the past four years. We feel that such work
could be of use to the research community as an overview or introduction to the field,
and hence worthy of publication.
Local Tuning
The tuning algorithm we used [26] estimates global tuning by peak selecting in the
histogram of frequencies found in a piece. However, it is possible that the tuning
may change within one song, and that a local tuning method may yield more accurate
chromagram features. “Strawberry Fields Forever” (Lennon/McCartney) is an example
of one such song, where the CD recording is a concatenation of two sessions, each with
slightly different pitch.
Investigation of Key Accuracies
In section 4.2.3, we found that the key accuracy of HPA was quite poor in comparison
to the results attained when recognising chords. It seems that we were either correctly
identifying the correct key for all frames, or completely wrong (see Figures 4.2a, 4.2b,
4.2c). The reason for this could be an inappropriate model or an issue of evaluation.
For example, an error in predicting the key of G Major instead of C Major is a distance
of 1 around the cycle of fifths and is not as severe as confusing C Major with F] Major.
This is not currently factored into the frame-wise performance metric employed in this
work (nor is it for evaluation of chords).
131
6. CONCLUSIONS
Evaluation Strategies
We introduced two metrics for ACE in this thesis (Note Precision and Chord Precision) to add to the MIREX-style evaluation. However, each of these outputs a binary
correct/incorrect label for each frame, whereas a more flexible approach is more likely
to give insight into the kinds of errors ACE systems are making.
Intelligent Training
In subsection 5.1.2, we saw that HPA is able to learn from one dataset (i.e., MIREX) and
test on another (USpop), yielding good performance when training data is sufficient.
However, within this section and throughout this thesis, we have assumed that the
training and testing data come from the same distribution, whereas this may not be
the case in reality.
One way of dealing with this problem would be to use transfer learning [82] to share
information (model parameters) between tasks, which has been used in the past on a
series of related tasks in medical diagnostics and car insurance risk analysis. We believe
that this paradigm could lead to greater generalisation than the training scheme offered
within this thesis.
Another approach would be to use a genre-specific model, as proposed by Lee [55].
Although genre tags are not readily available for all of our datasets, information could
be gathered from several sources, including last.fm1 , the echonest2 or e-chords3 . This
information could be used to learn one model per genre in training, with all genre
models being used for testing, and a probabilistic method being used to assign the
most likely genre/model to a test song.
1
www.last.fm
the.echonest.com
3
www.e-chords.com
2
132
6.2 Future Work
Key Annotations for the USpop data
It is unfortunate that we could not train HPA on the USpop dataset, owing to the lack
of key annotations. Given that this is a relatively small dataset, a fruitful area of future
work would be to hand-annotate these data.
Improving UCS to chromagram pairings
When we wish to obtain the UCS for a given song (defined as an artist/title pair), we
need to query the database of artists and song titles from our data source to see how
many, if any, UCSs are available for this song. Currently, this is obtained by computing
a string equality between the artist and song title in the online database and our audio.
However, this method neglects errors in spelling, punctuation, and abbreviations, which
are rife in our online source (consider the number of possible spellings and abbreviations
of “Sgt. Pepper’s Lonely Hearts Club Band”).
This pairing could be improved by using techniques from the named entity recognition literature [108], perhaps in conjunction with some domain specific heuristics such
as stripping of “DJ” (Disk Jockey) or “MC” (Master of Ceremonies). An alternative
approach would be to make use of services from the echonest or musicbrainz1 , who specialise in such tasks. Improvements in this area will undoubtedly lead to more UCSs
being available, and yield higher gains when these data are used in a testing setting via
Jump Alignment.
Improvements in Curriculum Learning
We saw in section 5.5.1 that a curriculum learning paradigm was necessary to see
improvements when using UCSs as an additional training source. The specification of
the alignment quality measure γ was noticed to show improvements for γ ≥ 0.15, but
1
musicbrainz.org/
133
6. CONCLUSIONS
a more thorough investigation of the sensitivity of this parameter and how it may be
set may lead to further improvements in this setting.
Creation of an Aligned Chord Database
As an additional resource to researchers, it would be beneficial to release a large number
of aligned UCSs to the community. Although we know that these data must be used
with care, releasing such a database would still be a valuable tool to researchers and
would constitute by far the largest and most varied database of chord annotations
available.
Applications to Higher-level tasks
We mentioned in the introduction that applications to higher-level tasks was one motivation for this work. Given that we now have a cutting-edge system, we may begin
to think about possible application areas in the field of MIR. Previously, for example,
the author has worked on mood detection [71] and hit song science [80], where predicted chord sequences could be used as features for identifying melancholy or tense
songs (large number of minor/diminished chords) or successful harmonic progressions
(popular chord n−grams)
134
References
[1] Techniques for note identification in polyphonic music. CCRMA, Department of
Music, Stanford University, 1985.
[2] M. Barthet, A. Anglade, G. Fazekas, S. Kolozali, and R. Macrae. Music recommendation for music learning: Hotttabs, a multimedia guitar tutor. Workshop on
Music Recommendation and Discory, collated with ACM RecSys 2011 Chicago,
IL, USA October 23, 2011, page 7, 2011.
[3] M.A. Bartsch and G.H. Wakefield. To catch a chorus: Using chroma-based representations for audio thumbnailing. In Applications of Signal Processing to Audio
and Acoustics, 2001 IEEE Workshop on the, pages 15–18. IEEE, 2001.
[4] J.P. Bello and J. Pickens. A robust mid-level representation for harmonic content in music signals. In Proceedings of the 6th International Society for Music
Information Retrieval (ISMIR), pages 304–311, 2005.
[5] J.P. Bello, G. Monti, and M. Sandler. Techniques for automatic music transcription. In International Symposium on Music Information Retrieval, pages 23–25,
2000.
[6] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning.
In Proceedings of International Conference on Machine Learning, pages 41–48.
ACM, 2009.
135
REFERENCES
[7] A. Berenzweig, B. Logan, D.P.W. Ellis, and B. Whitman. A large-scale evaluation
of acoustic and subjective music-similarity measures. Computer Music Journal,
28(2):63–76, 2004.
[8] R. Bisiani. Beam search. Encyclopedia of Artificial Intelligence, 2, 1987.
[9] E.O. Brigham and R.E. Morrow. The fast Fourier transform. Spectrum, IEEE,
4(12):63–70, 1967.
[10] J. Brown. Calculation of a constant q spectral transform. Journal of the Acoustical
Society of America, 89(1):425–434, 1991.
[11] J.A. Burgoyne and L.K. Saul. Learning harmonic relationships in digital audio
with Dirichlet-based hidden Markov models. In Proceedings of the International
Conference on Music Information Retrieval (ISMIR), pages 438–443, 2005.
[12] J.A. Burgoyne, L. Pugin, C. Kereliuk, and I. Fujinaga. A cross-validated study of
modelling strategies for automatic chord recognition in audio. In Proceedings of
the 8th International Conference on Music Information Retrieval, pages 251–254,
2007.
[13] J.A. Burgoyne, J. Wild, and I. Fujinaga. An expert ground truth set for audio
chord recognition and music analysis. In Proceedings of the 12th International
Society for Music Information Retrieval (ISMIR), pages 633–638, 2011.
[14] E.M. Burns and W.D. Ward. Intervals, scales, and tuning. The psychology of
music, 2:215–264, 1999.
[15] G. Cabral, F. Pachet, and J.P. Briot. Automatic x traditional descriptor extraction: The case of chord recognition. In Proceedings of the 6th international
conference on music information retrieval, pages 444–449, 2005.
136
REFERENCES
[16] B. Catteau, J.P. Martens, and M. Leman. A probabilistic framework for audiobased tonal key and chord recognition. Advances in Data Analysis, pages 637–644,
2007.
[17] C. Chafe and D. Jaffe. Source separation and note identification in polyphonic
music. In Acoustics, Speech, and Signal Processing, IEEE International Conference on, volume 11, pages 1289–1292. IEEE, 1986.
[18] E. Chew. Towards a mathematical model of tonality. PhD thesis, Massachusetts
Institute of Technology, 2000.
[19] T. Cho and J.P. Bello. Real-time implementation of HMM-based chord estimation
in musical audio. In Proceedings of the International Computer Music Conference
(ICMC), pages 16–21, 2009.
[20] T. Cho and J.P. Bello. A feature smoothing method for chord recognition using recurrence plots. In Proceedings of the 12th International Society for Music
Information Retrieval Conference (ISMIR), 2011.
[21] T. Cho, R.J. Weiss, and J.P. Bello. Exploring common variations in state of the
art chord recognition systems. In Proceedings of the Sound and Music Computing
Conferecne (SMC), 2010.
[22] D. Conklin and I.H. Witten. Prediction and entropy of music. Master’s thesis,
Department of Computer Science, University of Calgary, 1990.
[23] D. Cope. Hidden structure: music analysis using computers, volume 23. AR
Editions, 2008.
[24] D. Deutsch. The psychology of music. Academic Press, 1999.
[25] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis:
137
REFERENCES
probabilistic models of proteins and nucleic acids. Cambridge University Press,
1998.
[26] D. Ellis and A. Weller. The 2010 labROSA chord recognition system. Proceedings
of the 11th Society for Music Information Retrieval, Music Information Retrieval
Evaluation exchange paper), 2010.
[27] D.P.W. Ellis and G.E. Poliner. Identifying ‘cover songs’ with chroma features and
dynamic programming beat tracking. In Acoustics, Speech and Signal Processing,
IEEE International Conference on, volume 4, pages IV–1429. IEEE, 2007.
[28] J.L. Elman. Learning and development in neural networks: the importance of
starting small. Cognition, 48(1):71–99, 1993. ISSN 0010-0277.
[29] H. Fletcher. Loudness, its definition, measurement and calculation. Journal of
the Acoustical Society of America, 5(2):82, 1933.
[30] M. Florentine. It’s not recruitment-gasp!! it’s softness imperception. The Hearing
Journal, 56(3):10, 2003.
[31] D. Fogel, J.C. Hanson, R. Kick, H.A. Malki, C. Sigwart, M. Stinson, E. Turban,
and S.H. Chairman-Rubin. The impact of machine learning on expert systems.
In Proceedings of the 1993 ACM conference on Computer science, pages 522–527.
ACM, 1993.
[32] C. Fremerey, M. M¨
uller, and M. Clausen. Handling repeats and jumps in scoreperformance synchronization. In Proceedings of the 11th International Society for
Music Information Retrieval (ISMIR), pages 243–248, 2010.
[33] T. Fujishima. Realtime chord recognition of musical sound: a system using common lisp music. In Proceedings of the International Computer Music Conference,
pages 464–467, 1999.
138
REFERENCES
[34] E. G´
omez and P. Herrera. The song remains the same: Identifying versions of
the same piece using tonal descriptors. In Proceedings of the 7th International
Conference on Music Information Retrieval (ISMIR), pages 180–185, 2006.
[35] M. Goto and Y. Muraoka. Real-time beat tracking for drumless audio signals:
Chord change detection for musical decisions. Speech Communication, 27(3):
311–335, 1999.
[36] C. Harte, M. Sandler, S. Abdallah, and E. G´omez. Symbolic representation of
musical chords: A proposed syntax for text annotations. In Proceedings of the 6th
International Conference on Music Information Retrieval (ISMIR), pages 66–71.
Citeseer, 2005.
[37] C. Harte, M. Sandler, and M. Gasser. Detecting harmonic change in musical
audio. In Proceedings of the 1st ACM workshop on Audio and music computing
multimedia, pages 21–26. ACM, 2006.
[38] C.A. Harte and M. Sandler. Automatic chord identification using a quantised
chromagram. In Proceedings of the Audio Engineering Society, pages 291–301,
2005.
[39] BS ISO. 226: Acoustics normal equal loudness-level contours. International
Organization for Standardization, 2003.
[40] N. Jiang, P. Grosche, V. Konz, and M. M¨
uller. Analyzing chroma feature types
for automated chord recognition. In Proceedings of the 42nd Audio Engineering
Society Conference, 2011.
[41] N.F. Johnson. Two’s company, three is complexity: a simple guide to the science
of all sciences. Oneworld Publications Ltd, 2007.
[42] O. Karolyi. Introducing music. Penguin (Non-Classics), 1965.
139
REFERENCES
[43] K. Kashino and N. Hagita. A music scene analysis system with the MRF-based
information integration scheme. In Pattern Recognition, Proceedings of the 13th
International Conference on, volume 2, pages 725–729. IEEE, 1996.
[44] M. Khadkevich and M. Omologo. Phase-change based tuning for automatic chord
recognition. In Proceedings of Digital Audio Effects Conference (DAFx), 2009.
[45] M. Khadkevich and M. Omologo. Use of hidden Markov models and factored
language models for automatic chord recognition. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 561–
566, 2009.
[46] Y.E. Kim, D.S. Williamson, and S. Pilli. Towards quantifying the album effect in
artist identification. In Proceedings of the 7th International Conference on Music
Information Retrieval (ISMIR), pages 393–394, 2006.
[47] A. Klapuri and M. Davy. Signal processing methods for music transcription.
Springer-Verlag New York Inc, 2006.
[48] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation
and model selection. In International Joint Conference on Artificial Intelligence,
volume 14, pages 1137–1145, 1995.
[49] V. Konz, M. M¨
uller, and S. Ewert. A multi-perspective evaluation framework for
chord recognition. In Proceedings of the 11th International Conference on Music
Information Retrieval (ISMIR), pages 9–14, 2010.
[50] K.A. Krueger and P. Dayan. Flexible shaping: How learning in small steps helps.
Cognition, 110(3):380–394, 2009. ISSN 0010-0277.
[51] C.L. Krumhansl. Cognitive foundations of musical pitch. Oxford University Press,
USA, 2001.
140
REFERENCES
[52] S. Kullback and R.A. Leibler. On information and sufficiency. The Annals of
Mathematical Statistics, 22(1):79–86, 1951.
[53] C.L. Lawson and R.J. Hanson. Solving least squares problems, volume 15. Society
for Industrial Mathematics, 1995.
[54] K. Lee. Automatic chord recognition from audio using enhanced pitch class profile. In Proc. of the Intern. Computer Music Conference (ICMC), New Orleans,
USA, 2006.
[55] K. Lee. A system for automatic chord transcription from audio using genrespecific hidden Markov models. Adaptive Multimedial Retrieval: Retrieval, User,
and Semantics, pages 134–146, 2008.
[56] K. Lee and M. Slaney. A unified system for chord transcription and key extraction
using hidden Markov models. In Proceedings of the 8th International Conference
on Music Information Retrieval (ISMIR), 2007.
[57] K. Lee and M. Slaney. Acoustic chord transcription and key extraction from
audio using key-dependent HMMs trained on synthesized audio. Audio, Speech,
and Language Processing, IEEE Transactions on, 16(2):291–301, 2008.
[58] F. Lerdahl. Tonal pitch space. Oxford University Press, USA, 2005.
[59] R. Macrae and S. Dixon. A guitar tablature score follower. In Multimedia and
Expo (ICME), 2010 IEEE International Conference on, pages 725–726. IEEE,
2010.
[60] R. Macrae and S. Dixon. Guitar tab mining, analysis and ranking. In Proceedings
of the 12th International Society for Music Information Retrieval Conference
(ISMIR), 2011.
141
REFERENCES
[61] K.D. Martin. A blackboard system for automatic transcription of simple polyphonic music. Massachusetts Institute of Technology Media Laboratory Perceptual
Computing Section Technical Report, (385), 1996.
[62] M. Mauch. Automatic chord transcription from audio using computational models of musical context. unpublished PhD dissertation Queen Mary University of
London, pages 1–168, 2010.
[63] M. Mauch and S. Dixon. A discrete mixture model for chord labelling. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR), pages 45–50, 2008.
[64] M. Mauch and S. Dixon. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the 11th International Society for
Music Information Retrieval Conference (ISMIR), pages 135–140, 2010.
[65] M. Mauch and S. Dixon. Simultaneous estimation of chords and musical context
from audio. Audio, Speech, and Language Processing, IEEE Transactions on, 18
(6):1280–1289, 2010.
[66] M. Mauch and M. Levy. Structural change on multiple time scales as a correlate
of musical complexity. In Proceedings of the 12th International Conference on
Music Information Retrieval (ISMIR 2011), pages 489–494, 2011.
[67] M. Mauch, S. Dixon, C. Harte, M. Casey, and B. Fields. Discovering chord idioms
through beatles and real book songs. In Proceedings of the 8th International
Conference on Music Information Retrieval ISMIR, pages 225–258.
[68] M. Mauch, K. Noland, and S. Dixon. Using musical structure to enhance automatic chord transcription. In Proceedings of the 10th International Conference
on Music Information Retrieval, pages 231–236, 2009.
142
REFERENCES
[69] M. Mauch, H. Fujihara, and M. Goto. Lyrics-to-audio alignment and phrase-level
segmentation using incomplete internet-style chord annotations. In Proceedings
of the 7th Sound and Music Computing Conference (SMC), pages 9–16, 2010.
[70] M. Mauch, H. Fujihara, and M. Goto. Integrating additional chord information into HMM-based lyrics-to-audio alignment. Audio, Speech, and Language
Processing, IEEE Transactions on, pages 200–210, 2012.
[71] M. McVicar and T. De Bie. CCA and a multi-way extension for investigating
common components between audio, lyrics and tags. In Proceedings of the 9th
International Symposium on Computer Music Modelling and Retrieval (CMMR),
2003.
[72] M. McVicar and T. De Bie. Enhancing chord recognition accuracy using web
resources. In Proceedings of 3rd international workshop on Machine learning and
music, pages 41–44. ACM, 2010.
[73] M. McVicar, Y. Ni, R. Santos-Rodriguez, and T. De Bie. Leveraging noisy online databases for use in chord recognition. Proeedings of the 12th International
Society on Music Information Retrieval (ISMIR), pages 639–644, 2011.
[74] M. McVicar, Y. Ni, R. Santos-Rodriguez, and T. De Bie. Using online chord
databases to enhance chord recognition. Journal of New Music Research, 40(2):
139–152, 2011.
[75] M. McVicar, Y. Ni, R. Santos-Rodriguez, and T. De Bie. Automatic chord estimation from audio: A review of the state of the art (under review). Audio,
Speech, and Language Processing, IEEE Transactions on, 2013.
[76] Inc Merriam-Webster. Merriam-Webster’s dictionary of English usage. Merriam
Webster, 1995.
143
REFERENCES
[77] T.K. Moon. The expectation-maximization algorithm. Signal Processing Magazine, IEEE, 13(6):47–60, 1996.
[78] M. M¨
uller and S. Ewert. Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), pages 215–220,
2011.
[79] S.H. Nawab, S.A. Ayyash, and R. Wotiz. Identification of musical chords using
constant-q spectra. In Acoustics, Speech, and Signal Processing, IEEE International Conference on (ICASSP), volume 5, pages 3373–3376. IEEE, 2001.
[80] Y. Ni, R. Santos-Rodriguez, M McVicar, and T. De Bie. Hit song science once
again a science?
In Proceedings of 4th international workshop on Music and
Machine Learning, 2011.
[81] Y. Ni, M. McVicar, R. Santos-Rodriguez, and T. De Bie. An end-to-end machine
learning system for harmonic analysis of music. Audio, Speech, and Language
Processing, IEEE Transactions on, 20(6):1771 –1783, aug. 2012. ISSN 1558-7916.
doi: 10.1109/TASL.2012.2188516.
[82] A. Niculescu-Mizil and R. Caruana. Inductive transfer for Bayesian network
structure learning. In Eleventh International Conference on Artificial Intelligence
and Statistics (AISTATS-07), 2007.
[83] K. Noland and M. Sandler. Influences of signal processing, tone profiles, and chord
progressions on a model for estimating the musical key from audio. Computer
Music Journal, 33(1):42–56, 2009.
[84] N. Ono, K. Miyamoto, J. Le Roux, H. Kameoka, and S. Sagayama. Separation of
a monaural audio signal into harmonic/percussive components by complementary
144
REFERENCES
diffusion on spectrogram. In Proceedings of European Signal Processing Conference, 2008.
[85] L. Oudre, Y. Grenier, and C. F´evotte. Chord recognition using measures of fit,
chord templates and filtering methods. In Applications of Signal Processing to
Audio and Acoustics, IEEE Workshop on (W)., pages 9–12. IEEE, 2009.
[86] L. Oudre, Y. Grenier, and C. F´evotte. Template-based chord recognition: Influence of the chord types. In Proceedings of the 10th International Society for
Music Information Retrieval Conference (ISMIR), pages 153–158, 2009.
[87] H. Papadopoulos and G. Peeters. Large-scale study of chord estimation algorithms based on chroma representation and HMM. In Content-Based Multimedia
Indexing, IEEE Workshop on., pages 53–60. IEEE, 2007.
[88] H. Papadopoulos and G. Peeters. Simultaneous estimation of chord progression
and downbeats from an audio file. In Acoustics, Speech and Signal Processing,
IEEE International Conference on., pages 121–124. IEEE, 2008.
[89] H. Papadopoulos and G. Peeters. Joint estimation of chords and downbeats from
an audio signal. Audio, Speech, and Language Processing, IEEE Transactions on,
19(1):138–152, 2011.
[90] S. Pauws. Musical key extraction from audio. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR), 2004.
[91] C. Perez-Sancho, D. Rizo, and J.M. I˜
nesta. Genre classification using chords and
stochastic language models. Connection science, 21(2-3):145–159, 2009.
[92] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[93] D.M. Randel. The Harvard dictionary of music. Belknap Press, 2003.
145
REFERENCES
[94] C. Raphael. Automatic transcription of piano music. In Proceedings of the 3rd
International Conference on Music Information Retrieval (ISMIR), pages 13–17,
2002.
[95] C. Raphael. A graphical model for recognizing sung melodies. In Proceedings
of 6th International Conference on Music Information Retrieval (ISMIR), pages
658–663, 2005.
[96] J.T. Reed, Y. Ueda, S. Siniscalchi, Y. Uchiyama, S. Sagayama, and C.H. Lee.
Minimum classification error training to improve isolated chord recognition. Proceedings of the 10th International Society for Music Information Retrieval (ISMIR), pages 609–614, 2009.
[97] T. D. Rossing. The science of sound (second edition). Addison-Wesley, 1990.
[98] R. Scholz, E. Vincent, and F. Bimbot. Robust modelling of musical chord sequences using probabilistic n-grams. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 53–56. IEEE,
2009.
[99] A. Sheh and D.P.W. Ellis. Chord segmentation and recognition using EM-trained
hidden Markov models. In Proceedings of the 4th International Society for Music
Information Retrieval (ISMIR), pages 183–189, 2003.
[100] A. Shenoy and Y. Wang. Key, chord, and rhythm tracking of popular music
recordings. Computer Music Journal, 29(3):75–86, 2005.
[101] R.N. Shepard. Circularity in judgments of relative pitch. The Journal of the
Acoustical Society of America, 36:2346, 1964.
[102] J.B.L. Smith, J.A. Burgoyne, I. Fujinaga, D. De Roure, and J.S. Downie. Design
146
REFERENCES
and creation of a large-scale database of structural annotations. In Proceedings of
the 12th International Society for Music Information Retrieval Conference, 2011.
[103] M. T. Smith. Audio engineer’s reference book. Focal Press, 1999.
[104] A.M. Stark and M.D. Plumbley. Real-time chord recognition for live performance.
In Proceedings of International Computer Music Conference, number i, pages
585–593, 2009.
[105] S. Streich. Music complexity: a multi-faceted description of audio content. PhD
thesis, Universitat Pompeu Fabra, 2007.
[106] B. Su and S.K. Jeng. Multi-timbre chord classification using wavelet transform
and self-organized map neural networks. In Acoustics, Speech, and Signal Processing, IEEE International Conference on, volume 5, pages 3377–3380. IEEE,
2001.
[107] K. Sumi, K. Itoyama, K. Yoshii, K. Komatani, T. Ogata, and H. Okuno. Automatic chord recognition based on probabilistic integration of chord transition and
bass pitch estimation. In Proceedings of the International Conference on Music
Information Retrieval (ISMIR), pages 39–44, 2008.
[108] E.F. Tjong Kim Sang and F. De Meulder. Introduction to the CoNLL-2003
shared task: Language-independent named entity recognition. In Proceedings of
the seventh conference on natural language learning at HLT-NAACL 2003-Volume
4, pages 142–147. Association for Computational Linguistics, 2003.
[109] Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In Acoustics
Speech and Signal Processing, IEEE International Conference on, pages 5518–
5521. IEEE, 2010.
147
REFERENCES
[110] E. Unal, P.G. Georgiou, S.S. Narayanan, and E. Chew. Statistical modeling
and retrieval of polyphonic music. In Multimedia Signal Processing, IEEE 9th
Workshop on, pages 405–409. IEEE, 2007.
[111] M. Varewyck, J. Pauwels, and J.P. Martens. A novel chroma representation of
polyphonic music based on multiple pitch tracking techniques. In Proceedings
of the 16th ACM international conference on Multimedia, pages 667–670. ACM,
2008.
[112] G.H. Wakefield. Mathematical representation of joint time-chroma distributions.
In International Symposium on Optical Science, Engineering, and Instrumentation, SPIE, volume 99, pages 18–23, 1999.
[113] A.L.C. Wang and J.O. Smith III. System and methods for recognizing sound and
music signals in high noise and distortion, 2006. US Patent 6,990,453.
[114] J. Weil, T. Sikora, J.L. Durrieu, and G. Richard. Automatic generation of lead
sheets from polyphonic music signals. In Proceedings of the 10th International
Society for Music Information Retrieval Conference (ISMIR)., 2009.
[115] A. Weller, D. Ellis, and T. Jebara. Structured prediction models for chord transcription of music audio. In Machine Learning and Applications, International
Conference on, pages 590–595. IEEE, 2009.
[116] B. Whitman, G. Flake, and S. Lawrence. Artist detection in music with minnowmatch. In Neural Networks for Signal Processing XI, Proceedings of the IEEE
Signal Processing Society Workshop, pages 559–568. IEEE, 2001.
[117] K. Yoshii and M. Goto. A vocabulary-free infinity-gram model for nonparametric
Bayesian chord progression analysis. 2011.
148
REFERENCES
[118] T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata, and H.G. Okuno. Automatic
chord transcription with concurrent recognition of chord symbols and boundaries. In Proceedings of the 5th International Conference on Music Information
Retrieval (ISMIR), pages 100–105, 2004.
[119] V. Zenz and A. Rauber. Automatic chord detection incorporating beat and key
detection. In Signal Processing and Communications, IEEE International Conference on, pages 1175–1178. IEEE, 2007.
149
REFERENCES
150
Appendix A
Songs used in Evaluation
Artist
Title
Artist
Title
Oasis
Bring it on down
Cigarettes and alcohol
Don’t look back in anger
What’s the story morning glory
My big mouth
ABBA
ACDC
ACDC
Aerosmith
Alanis Morissette
Alanis Morissette
All Saints
Aqua
Backstreet Boys
Backstreet Boys
Super trouper
Hells bells
Have a drink on me
Falling in love
Ironic
Uninvited
Never ever
Doctor jones
I want it that way
Show me the meaning
of being lonely
No one else comes
close
God only knows
Surfin safari
Surfin USA
Loser
The rose
Eyes without a face
Piano man
Just the way you are
Only the good die
young
She’s always a woman
War pigs
Iron man
I believe
Table A.1: Oasis dataset, consisting of
5 chord annotations.
Backstreet Boys
Artist
Title
Carole King
I feel the earth move
So far away
It’s too late
Home again
Beautiful
Way over yonder
You’ve got a friend
Beach Boys
Beach Boys
Beach Boys
Beck
Bette Midler
Billy Idol
Billy Joel
Billy Joel
Billy Joel
Table A.2: Carole King dataset, consisting of 7 chord and key annotations.
Artist
Title
3 Doors Down
A Ha
ABBA
ABBA
ABBA
Kryptonite
Take on me
Dancing queen
I have a dream
Thank you for the
music
Fernando
ABBA
Billy Joel
Black Sabbath
Black Sabbath
Blessid Union of
Souls
Blink 182
Blondie
Bob Marley
Bob Marley
Bob Marley
151
Mutt
One way or another
Natural mystic
Jamming
No woman no cry
A. SONGS USED IN EVALUATION
Artist
Title
Artist
Title
Bon Jovi
Bon Jovi
Bon Jovi
Runaway
I’ll be there for you
You give love a bad
name
Total eclipse of the
heart
Baby one more time
When you’re gone
Elton John
I guess that’s why
they call it the blues
Santa bring my baby
back to me
A day without rain
Wild child
Only time
Flora’s secret
Tea-house moon
Watermark
Storms in africa
Evening falls
Layla
Wonderful tonight
Tears in heaven
Sweet dreams are
made of this
Father of mine
Santa monica
All I have to do
is dream
I want to know what
love is
Moonlight in
Vermont
Killing me softly
Rise
Only happy when
it rains
In too deep
Name
Black balloon
Drive
19 again
Purple haze
Bonnie Tyler
Britney Spears
Bryan Adams
with Melanie C
Bryan Adams
Bryan Adams
Bryan Adams
Carly Simon
Cat Stevens
Celine Dion
Celine Dion
Celine Dion
Celine Dion
Cher
Chicago
Christina Aguilera
Christina Aguilera
Coldplay
Corrs
Cranberries
Creedence
Clearwater Revival
Creedence
Clearwater Revival
Creedence
Clearwater Revival
Cyndi Lauper
Deep Purple
Dido
Dire Straits
Don Mclean
Don Mclean
Doors
Doors
Elton John
Elton John
Elton John
Elton John
Elvis Presley
Enya
Enya
Enya
Enya
Enya
Enya
Enya
Enya
Eric Clapton
Eric Clapton
Eric Clapton
Eurythmics
Summer of 69
Heaven
Everything I do
(I do it for you)
You’re so vain
Morning has broken
My heart will go on
It’s all coming back to
me now
Falling into you
All by myself
If I could turn back
time
If you leave me now
Genie in a bottle
What a girl wants
Yellow
One night
Zombie
Have you ever
seen the rain
Proud mary
Everclear
Everclear
Everly Brothers
Foreigner
Frank Sinatra
Fugees
Gabrielle
Garbage
Genesis
Goo Goo Dolls
Goo Goo Dolls
Incubus
Janet Jackson
Jimi Hendrix
Experience
Joe Cocker
John Denver
John Denver
Cotton fields
Girls just want to
have fun
Smoke on the water
Thank you
Romeo and juliet
And I love her so
Vincent
Riders on the storm
Light my fire
Daniel
Sorry seems to be the
hardest word
Candle in the wind
Your song
John Denver
John Denver
Kansas
Leann Rimes
Leann Rimes
Led Zeppelin
152
You are so beautiful
Annie’s song
Poems prayers and
promises
My sweet lady
Take me home country
roads
Dust in the wind
I need you
Can’t fight the
moonlight
Stairway to heaven
Artist
Title
Artist
Title
Lionel Richie
Live
Madonna
Mariah Carey
Mariah Carey
Mariah Carey
Endless love
I alone
Take a bow
One sweet day
Hero
Anytime you need
a friend
Without you
Nothing else matters
Heal the world
Beat it
Moonlight shadow
Unintended
Torn
Sweet caroline
Red red wine
Shilo
Play me
Song sung blue
I am I said
Love will keep us
together
Laughter in the rain
Smells like teen spirit
Just a girl
Don’t speak
Bye bye bye
This I promise you
All or nothing
Wonderwall
Self esteem
If you love me
I honestly love you
Last resort
Bridge over troubled
water
The sound of silence
Steam
I wish it would
rain down
In the air tonight
Message in a bottle
Every breath you
take
Roxanne
Lump
Procol Harum
A whiter shade
of pale
I believe I can fly
Karma police
Everybody hurts
Right here waiting
Livin la vida loca
Sailing
Have i told you lately
Honky tonk woman
Beast of burden
Blue bayou
Crying
Oh pretty woman
Adia
Angel
Dreaming of you
Angel
El condor pasa
(If I could)
Mariah Carey
Metallica
Michael Jackson
Michael Jackson
Mike Oldfield
Muse
Natalie Imbruglia
Neil Diamond
Neil Diamond
Neil Diamond
Neil Diamond
Neil Diamond
Neil Diamond
Neil Sedaka
Neil Sedaka
Nirvana
No Doubt
No Doubt
Nsync
Nsync
O Town
Oasis
Offspring
Olivia Newton John
Olivia Newton John
Papa Roach
Paul Simon
Paul Simon
Peter Gabriel
Phil Collins
Phil Collins
Police
Police
Police
Presidents of the
USA
R Kelly
Radiohead
Rem
Richard Marx
Ricky Martin
Rod Stewart
Rod Stewart
Rolling Stones
Rolling Stones
Roy Orbison
Roy Orbison
Roy Orbison
Sarah Mclachlan
Sarah Mclachlan
Selena
Shaggy
Simon and
Garfunkel
Simon and
Garfunkel
Simon and
Garfunkel
Sixpence None
The Richer
Soft Cell
Soundgarden
Spice Girls
Stevie Wonder
Stevie Wonder
Stevie Wonder
Stevie Wonder
Sting
Sublime
Sublime
Survivor
Temple Of The Dog
Third Eye Blind
Tom Petty
U2
Van Halen
Van Morrison
Vengaboys
Weezer
Weezer
153
Cecilia
The boxer
Kiss me
Tainted love
Black hole sun
2 become 1
Sir duke
Isn’t she lovely
You are the sunshine
of my life
Superstition
Fields of gold
Don t push
Santeria
Eye of the tiger
Hunger strike
Semi-charmed life
I won’t back down
With or without you
Jump
Brown eyed girl
We’re going to Ibiza
El scorcho
Hash pipe
A. SONGS USED IN EVALUATION
Artist
Title
Weezer
Whitney Houston
Island in the sun
Greatest love of all
Artist
Any time at all
I’ll cry instead
Things we said today
When I get home
You can’t do that
I’ll be back
No reply
I’m a loser
Baby’s in black
Rock and roll music
I’ll follow the sun
Mr moonlight
Kansas city hey hey
Eight days a week
Words of love
Honey don’t
Every little thing
I don’t want to spoil the party
What you’re doing
Everybody s trying to be my baby
Help
The night before
You’ve got to hide your love away
I need you
Another girl
You’re going to lose that girl
Ticket to ride
Act naturally
It’s only love
You like me too much
Tell me what you see
I’ve just seen a face
Yesterday
Dizzy miss lizzy
Drive my car
Norwegian wood (this bird has flown)
You won’t see me
Nowhere man
Think for yourself
The word
Michelle
What goes on
Girl
I’m looking through you
In my life
Wait
If I needed someone
Run for your life
Table A.3: USpop dataset, consisting
of 193 chord annotations.
Artist
Title
The Beatles
I saw her standing there
Misery
Anna go to him
Chains
Boys
Ask me why
Please please me
Love me do
P.S. I love you
Baby it’s you
Do you want to know
a secret
A taste of honey
There’s a place
Twist and shout
It won’t be long
All I’ve got to do
All my loving
Don’t bother me
Little child
Till there was you
Please mister postman
Roll over beethoven
Hold me tight
You really got a hold
on me
I wanna be your man
Devil in her heart
Not a second time
Money that’s what I want
A hard day’s night
I should have known better
If I fell
I’m happy just to dance
with you
And I love her
Tell me why
Can’t buy me love
Title
154
Artist
Title
Artist
Taxman
Eleanor Rigby
I’m only sleeping
Love you to
Here there and everywhere
Yellow submarine
She said she said
Good day sunshine
And your bird can sing
For no one
Doctor Robert
I want to tell you
Got to get you into my life
Tomorrow never knows
Sgt. Pepper’s lonely hearts club
band
With a little help from my
friends
Lucy in the sky with
diamonds
Getting better
Fixing a hole
She’s leaving home
Being for the benefit of
Mr Kite
Within you without you
When I’m sixty-four
Lovely Rita
Good morning good morning
Sgt. pepper’s lonely hearts club
band (reprise)
A day in the life
Magical mystery tour
The fool on the hill
Flying
Blue jay way
Your mother should know
I am the walrus
Hello goodbye
Strawberry fields forever
Penny lane
Baby you’re a rich man
All you need is love
Back in the USSR
Dear prudence
Glass onion
Ob-la-di ob-la-da
Wild honey pie
Title
The continuing story of bungalow
Bill
While my guitar gently weeps
Happiness is a warm gun
Martha my dear
I’m so tired
Blackbird
Piggies
Rocky raccoon
Don’t pass me by
Why don’t we do it in the road
I will
Julia
Birthday
Yer blues
Mother nature’s son
Everybody’s got something to hide
except me and my monkey
Sexy sadie
Helter skelter
Long long long
Revolution 1
Honey pie
Savoy truffle
Cry baby cry
Good night
Come together
Something
Maxwell’s silver hammer
Oh darling
Octopus’s garden
I want you (she’s so heavy)
Here comes the sun
Because
You never give me your money
Sun king
Mean Mr Mustard
Polythene Pam
She came in through the bathroom
window
Golden slumbers
Carry that weight
The end
Her majesty
Two of us
Dig a pony
Across the universe
I me mine
155
A. SONGS USED IN EVALUATION
Artist
Title
Artist
Dig it
Let it be
Maggie mae
I’ve got a feeling
One after 909
The long and winding road
For you blue
Get back
Queen
Zweieck
Title
Zuhause
Table A.4: MIREX dataset, consisting
of 217 chord and key annotations.
Bohemian rhapsody
Another one bites the dust
Fat bottomed girls
Bicycle race
You’re my best friend
Don’t stop me now
Save me
Crazy little thing called love
Somebody to love
Good old-fashioned lover boy
Play the game
Seven seas of rhye
We will rock you
We are the champions
A kind of magic
I want it all
I want to break free
Who wants to live forever
Hammer to fall
Friends will be friends
Artist
Title
25 or 6 to 4
ABBA
ABBA
ABBA
ABBA
ABBA
ABBA
Aerosmith
Al Green
Chicago
Chiquitita
Knowing me, knowing you
Honey honey
Fernando
On and on and on
Take a chance on me
Last child
Oh me, oh my
(dreams in my arms)
Undercover angel
Schoool’s out
Hey stoopid
Shadow dancing
Giving you the best
that I got
Caught up in
the rapture
Sweet love
I can’t stand the rain
Could I have this
dance
Daydream believer
A love song
I never loved a man
Chain of fools
Sweet soul music
How blue can you get
The thrill is gone
Roll on down the
highway
Heartaches
Alan O’Day
Alice Cooper
Alice Cooper
Andy Gibb
Anita Baker
Anita Baker
Anita Baker
Ann Peebles
Anne Murray
Spiel mir eine alte Melodie
Rawhide
She
Erbauliche Gedanken eines
Tobackrauchers
Andersherum
Tigerfest
Akne
Blass
Mr. Morgan
Liebesleid
Ich kann heute nicht
Jakob und Marie
Paparazzi
Santa Donna Lucia
Mobile
Es wird alles wieder gut, Herr
Professor
Zu leise f¨
ur mich
Duell
Anne Murray
Anne Murray
Aretha Franklin
Aretha Franklin
Arthur Conley
B.B. King
B.B. King
Bachman-Turner
Overdrive
Bachman-Turner
Overdrive
Bad Company
Badfinger
Baltimora
Bananarama
Bananarama
156
Rock ’n’ roll
fantasy
Maybe tomorrow
Tarzan boy
A trick of the night
Venus
Artist
Title
Artist
Barbara Lewis
Barbara Streisand
Barry White
Hello stranger
People
You’re the first,
the last, my
everything
Brass monkey
Gordon lightfoot
Amor
Oh my angel
The rose
Flesh for fantasy
White wedding
Catch my fall
Hot in the city
Pressure
Just the way you
are
Don’t ask me why
With you I’m born
again
Don’t say you love
me
The stroke
Silent night
Just a friend
One way or
another
Willie nelson
You can’t judge a
book by the cover
Gotta serve
somebody
Old time rock &
roll
Like a rock
Trying to live my
life without you
I love you so
Detroit city
That’s the way I
feel about cha
Sweet Caroline (good
times never seemed
so good)
Nick of time
Feelin’ satisfied
Motown philly
If
Bread
Brenda
Brenda
Brenda
Brenda
Brenda
Brenda
Brenda
Beastie Boys
Beautiful
Ben E. King
Bertha Tillman
Bette Midler
Billy Idol
Billy Idol
Billy Idol
Billy Idol
Billy Joel
Billy Joel
Billy Joel
Billy Preston
Billy Squier
Billy Squier
Bing Crosby
Biz Markie
Blondie
Blue Eyes Cryin’
Bo Diddley
Bob Dylan
Bob Seger
Bob Seger
Bob Seger & The Silver
Bullet Band
Bobbi Martin
Bobby Bare
Bobby Womack
Bobby Womack
Bonnie Raitt
Boston
Boyz II Men
Bread
157
Title
Lee
Lee
Lee
Lee
Lee
Lee
Lee
Brenda Lee
Brother Jack McDuff
Brownsville Station
Bruce Channel
Candi Staton
Canned Heat
Canned Heat
Carl Carlton
Charlie Rich
Cheap Trick
Cheap Trick
Cheap Trick
Cheap Trick
Cher
Cher
Chicago
Chicago
Chicago
Chico DeBarge
Chiffons
Chubby Checker
Chuck Berry
Chuck Berry
Clarence Carter
Clarence Carter
Cliff Richard
Commodores
Corey Hart
Sweet surrender
Sweet nothin’s
As usual
Dum dum
Losing you
Heart in hand
Too many rivers
Everybody loves me
but you
Coming on strong
Theme from electric
surfboard
Smokin’ in the
boys room
Hey baby
Young hearts run
free
Let’s work together
On the road again
Everlasting love
A very special
love song
I want you to
want me
Dream police
Stop this game
Surrender
Just like Jesse James
If I could turn
back time
Along comes a
woman
Feelin’ stronger
every day
Old days
Talk to me
Swing talkin’ guy
The twist
Sweet little
rock n’ roll
Almost grown
Patches
Too weak to
fight
Carrie
Still
In your soul
A. SONGS USED IN EVALUATION
Artist
Title
Artist
Title
Cornelius Brothers
& Sister Rose
Cream
Treat her like
a lady
Sunshine of
your love
I put a spell
on you
Bad moon
rising
Got it made
Suite
Southern cross
Teach your
children
Karma
chameleon
She bop
All through the
night
The goonies ‘r
good enough
True colors
Maneater
Sara smile
Six days on
the road
Space oddity
Golden years
Blue jean
Me myself and I
I will
World in
my eyes
Unforgettable
Where are you
Runaround sue
Love came to me
Where or when
Baby I’m burnin’
Starting over
again
Superman
Last dance
Sunshine
superman
A lesson in leavin
Years from now
Sexy eyes
If not you
Dr. John
Right place wrong
time
Lyin’ eyes
The long run
Getaway
September
Two tickets to
paradise
War
I had too much
to dream
Philadelphia
freedom
Levon
Creedence
Clearwater Revival
Creedence
Clearwater Revival
Crosby Stills & Nash
Crosby, Stills & Nash
Crosby, Stills & Nash
Crosby, Stills & Nash
Culture Club
Cyndi Lauper
Cyndi Lauper
Cyndi Lauper
Cyndi Lauper
Daryl Hall & John Oates
Daryl Hall & John Oates
Dave Dudley
David Bowie
David Bowie
David Bowie
De La Soul
Dean Martin
Depeche Mode
Dinah Washington
Dinah Washington
Dion
Dion
Dion
Dolly Parton
Dolly Parton
Donna Fargo
Donna Summer
Donovan
Dottie West
Dr. Hook
Dr. Hook
Dr. Hook
Eagles
Eagles
Earth Wind And Fire
Earth, Wind & Fire
Eddie Money
Edwin Starr
Electric Prunes
Elton John
Elton John
Elton John
Elton John
Elvis Presley
Elvis Presley
Elvis Presley
Elvis
Elvis
Elvis
Elvis
Elvis
Elvis
Elvis
Presley
Presley
Presley
Presley
Presley
Presley
Presley
Elvis Presley
Engelbert
Humperdinck
Eric Carmen
Eric Carmen
Eric Clapton
Eric Clapton
Eric Clapton
Eric Clapton
Eric Clapton
Etta James
Etta James
Etta James
158
Goodbye yellow
brick road
The bitch is back
Little sister
For ol’ times sake
I really don’t want
to know
One night
If I can dream
Judy
His latest flame
Ask me
My way
She thinks I
still care
There goes my
everything
After the lovin’
Hungry eyes
Sunrise
Let it rain
Promises
Forever man
Willie and the
hand jive
I can’t stand it
Stop the wedding
Fool that I am
Would it make
any difference
to you
Artist
Title
Artist
Title
Evelyn “Champagne”
King
Fats Domino
I’m in love
Grand Funk
Railroad
Harry Chapin
Walk like a man
Firehouse
Five Man
Electrical Band
Flatt & Scruggs
Floyd Cramer
Focus
Foghat
Fontella Bass
Freddie Jackson
Freddy Fender
Freddy Fender
Gary U.S. Bonds
General Public
Genesis
Genesis
George Benson
George Harrison
George Harrison
George Harrison
George Harrison
Gino Vanelli
Gino Vannelli
Gladys Knight &
The Pips
Gladys Knight &
The Pips
Gladys Knight
The Pips
Gladys Night &
The Pips
Glen Campbell
Glen Campbell
Glen Campbell
Glen Campbell
Gloria Gaynor
Graham Nash
I want to walk
you home
Don t treat
me bad
Absolutely right
Heart
Heart
Heart
Hi-Five
Foggy mountain
breakdown
Last date
Hocus pocus
Drivin’ wheel
Rescue me
Have you ever
loved somebody
Living it down
Secret love
Quarter to three
Tenderness
Tonight, tonight,
tonight
Misunderstanding
Breezin’
This song
Years ago
Years ago
I got my mind
set on you
Hurts to be in love
Black cars
Letter full of tears
Huey Lewis &
The News
INXS
Ike & Tina Turner
Ike & Tina Turner
Irma Thomas
Iron Butterfly
Isaac Hayes
Isaac Hayes
J. Frank Wilson &
The Cavaliers
J. Geils Band
Jackie Wilson
Jackson Browne
Jackson Browne
Jackson Browne
James Brown
James Brown
James Brown
James Brown
James Brown
James Brown
James Brown
Best thing that ever
happened to me
Baby don’t change
your mind
If I were your
woman
Galveston
Rhinestone cowboy
Sunflower
It’s only make
believe
Never can say
goodbye
Chicago
James Taylor
Jan & Dean
Jan & Dean
Jeff Beck
Jerry Jeff Walker
159
Sunday morning
sunshine
Crazy on you
Magic man
There’s the girl
I like the way
(the kissing game)
I want a new drug
Need you tonight
I want to take you
higher
It’s gonna work out
fine
Wish someone would
care
In-a-gadda-da-vida
Do your thing
The look of love
Last kiss
One last kiss
Baby workout
Here come those
tears again
Redneck friend
Boulevard
I don’t mind
Cold sweat - part 1
I got you (I feel good)
My thang
Baby you’re right
Think
Get up
(I feel like being like a)
sex machine (part 1)
Country road
Little old lady from
pasadena
The anaheim, azusa
& cucamonga sewing
circle, book review and
timing association
People get ready
Mr. Bojangles
A. SONGS USED IN EVALUATION
Artist
Title
Artist
Title
Jerry Reed
Jethro Tull
Jimmy Buffett
Jimmy Clanton
Jimmy Cliff
Ko-ko joe
Living in the past
Come monday
Just a dream
Wonderful world,
beautiful people
Handy man
What becomes of
the brokenhearted
Walk on the wild
side (part 1)
With a little help from
my friends
Annie’s song
Back home again
It amazes me
Seasons of the heart
Some days are diamonds
(some days are stone)
Rocky mountain high
The ways of a woman
in love
The battle of
New Orleans
Worried guy
I rise, I fall
Jimmy’s girl
Out of my mind
(You’re so square) baby,
I don’t care
Big yellow taxi
You’ve got another
thing comin’
Break it to me gently
Queen of hearts
Running up that hill
You decorated my life
Through the years
Scarlet fever
Sweet music man
I don’t need you
Lucille
Rocket ride
Jungle boogie
Led Zeppelin
Over the hills and
far away
Trampled under foot
Dyer maker
You make me feel like
dancing
California nights
Somthing about you
Bongo stomp
Jimmy Jones
Jimmy Ruffin
Jimmy Smith
Joe Cocker
John
John
John
John
John
Denver
Denver
Denver
Denver
Denver
John Denver
Johnny Cash
Johnny Horton
Johnny Tillotson
Johnny Tillotson
Johnny Tillotson
Johnny Tillotson
Joni Mitchell
Joni Mitchell
Judas Priest
Juice Newton
Juice Newton
Kate Bush
Kenny Rogers
Kenny Rogers
Kenny Rogers
Kenny Rogers
Kenny Rogers
Kenny Rogers
Kiss
Kool And
the Gang
LaVern Baker
LaVern Baker
Laura Branigan
Led Zeppelin
Led Zepplin
Leo Sayer
Leslie Gore
Level 42
Little Joey &
The Flips
Little River Band
Little River Band
Little River Band
Looking Glass
Louis Armstrong
Louis Prima &
Keely Smith
Lynyrd Skynyrd
Marc Cohn
Marianne Faithfull
Marky Mark &
The Funky Bunch
Marvin Gaye
Marvin Gaye &
Kim Weston
Marvin Gaye &
Tammy Terrell
Max Frost &
The Troopers
Meat Loaf
Meat Loaf
Mel Torme
Melba Montgomery
Metallica
Michael Jackson
Michael Jackson
Michael Jackson
Michael Jackson
Michael Johnson
Michael Sembello
Milli Vanilli
See see rider
I cried a tear
Gloria
160
We two
Help is on the way
The other guy
Brandy
Hello dolly
That old black magic
Sweet home alabama
Walking in memphis
Come and stay
with me
Good vibrations
I want you
It takes two
If I could build my
whole world around you
Shape of things
to come
You took the words
right out of my mouth
Paradise by the
dashboard light
Comin’ home baby
No charge
One
I just can’t stop
loving you
Wanna be startin’
somethin’
Beat it
Human nature
Almost by being in love
Maniac
Girl you know it’s
true
Artist
Title
Artist
Title
Naked Eyes
Always something there
to remind me
These boots are
made for walkin’
I’ve got love on
my mind
Kisses on the wind
Hot child in the city
Buy for me the rain
Pointer Sisters
Poison
Police
He’s so shy
Unskinny bop
Don’t stand so
close to me
Pretty in pink
Amie
Harden my heart
We are the
champions
Silent lucidity
Indiana wants me
Time for me to fly
Just when I needed
you most
Crying time
Let’s go get stoned
Eleanor Rigby
Ghostbusters theme
For the good times
Come and get your love
Give it to me baby
Super freak part one
Jessie’s girl
Don’t talk to strangers
Your love has lifted
me higher
Lonely eyes
Addicted to love
Feel like making love
Dirty water
Somebody’s watching
me
This should go on
forever
Maggie may
Twisting the night
away
You can’t roller skate
in a buffalo herd
It’s only rock and roll
(but I like it)
Wild horses
Dandelion
Waiting on a friend
Time is on my side
Not fade away
I wouldn’t have missed
it for the world
Nancy Sinatra
Nathalie Cole
Neneh Cherry
Nick Gilder
Nitty Gritty Dirt
Band
Nitty Gritty Dirt
Band
Oak Ridge Boys
Ocean
Oliver
Otis Redding
Otis Redding
Otis Redding
Paper Lace
Pat Benatar
Pat Benatar
Pat Benetar
Patrick Hernandez
Paul Anka
Paul
Paul
Paul
Paul
McCartney
McCartney
McCartney
Simon
Peaches And Herb
Peggy Lee
Peggy Lee
Pet Shop Boys
Pet Shop Boys
Pet Shop Boys
Peter Gabriel
Phil Collins
Pink Floyd
Psychedelic Furs
Pure Prairie League
Quarterflash
Queen
Queensryche
R Dean Tayloy
REO Speedwagon
Randy Vanwarmer
Make a little magic
Elvira
Put your hand in
the hand
Good morning
starshine
Ray Charles
Ray Charles
Ray Charles
Ray Parker Jr
Ray Price
Redbone
Rick James
Rick James
Rick Springfield
Rick Springfield
Rita Coolidge
I’ve been loving you
too long (to stop now)
(sittin’ on) the dock
of the bay
Chained and bound
The night chicago
died
Promises in the
dark
Little too late
Fire and ice
Born to be alive
Love me warm and
tender
Maybe I’m amazed
With a little luck
Press
50 ways to leave
your lover
Shake your groove
thing
Is that all there is
Fever
Always on my mind
Where the streets
have no names
Love comes quickly
Shock the monkey
Two hearts
Money
Robert John
Robert Palmer
Roberta Flack
Rock And Hyde
Rockwell
Rod Bernard
Rod Stewart
Rod Stewart
Roger Miller
Rolling Stones
Rolling Stones
Rolling Stones
Rolling Stones
Rolling Stones
Rolling Stones
Ronnie Milsap
161
A. SONGS USED IN EVALUATION
Artist
Title
Artist
Title
Roxette
Roxy
Roy Orbison
Run-D.M.C.
Rush
Sammy Hagar
Sammy Hager
The Animals
The look
Dance away
Cry softly lonely one
Walk this way
The spirit of radio
Give to live
I can’t drive 55
San Franciscan
nights
Evil ways
Sleep walk
Don gibson
Woman to woman
Sanctify yourself
Mrs. Robinson
The 5th Dimension
(Last night) I didn’t
get to sleep at all
If I could reach you
Eye in the sky
Santana
Santo & Johnny
Sea Of Heartbreak
Shirley Brown
Simple Minds
Simon &
Garfunkel
Sinead O Connor
Sly & The Family
Stone
Smokey Robinson
Snap
Soft Cell
Sonny & Cher
Spandau Ballet
Steppenwolf
Steve Miller Band
Stevie B
Stevie Wonder
Stevie Wonder
Stevie Wonder
Stevie Wonder
Sting
Styx
Swingin’ Medallions
Talking Heads
Talking Heads
Tanya Tucker
Teddy Pendergrass
Ten Years After
The 5th Dimension
The 5th Dimension
The Allan Parsons
Project
The Allman Brothers
The Amboy Dukes
The
The
The
The
The
Band
Beach
Beach
Beach
Beach
Boys
Boys
Boys
Boys
The Beach Boys
The Beach Boys
The emperor’s
new clothes
Hot fun in the
summertime
The Beatles
The Beatles
The Beatles
The Beatles
Cruisin’
The power
Tainted love
All I ever need is you
True
Born to be wild
The joker
Because I love you
Higher ground
If you really love me
That girl
Do I do
If you love somebody
set them free
Fooling yourself (the
angry young man)
Double shot
(of my baby’s love)
And she was
Burning down
the house
Here’s some love
I don’t love you
anymore
I’d love to change
the world
Never my love
The Beginning of
the End
The Box Tops
The Buckinghams
The Byrds
The Castaways
The Commodores
The Commodores
The Contours
The Cowsills
The Crystals
The Cure
The Doors
The Drifters
The Eagles
The Everly Brothers
The Everly Brothers
The Everly Brothers
The Falcons
The Fifth Dimension
The Fireballs
The Hollies
162
Straight from the
heart
Journey to the
center of the mind
Life is a carnival
Still cruisin
Sail on sailor
In my room
Bluebirds over the
mountain
Wendy
Surfin’ safari
Do you want to
know a secret
Come together
Eight days a week
I saw her standing
there
Funky nassau
Cry like a baby
Kind of a drag
Eight miles high
Liar, liar
Easy
Nightshift
Do you love me
Hair
He’s a rebel
Just like heaven
Riders on the storm
On broadway
Already gone
Walk right back
Bird dog
On the wings
of a nightingale
I found a love
Where do you
wanna go
Sugar shack
Long dark road
Artist
Title
Artist
Title
The
The
The
The
The
The
Carrie-anne
It’s your thing
Just can’t wait
Looking for a love
Dancing machine
Heaven’s just a
sin away
Better things
Till the end of
the day
Baby, baby
don’t cry
I don’t blame you
at all
I second that
emotion
Walk right in
The people in me
Love train
One bad apple
Some like it hot
People got to be
free
Someone
Unchained melody
The Temptations
I wish it would
rain
Disco inferno
Perfidia
It’s raining men
Pinball wizard
Happy jack
Heart full of soul
Shapes of things
Get together
The best
Private dancer
A dream goes on forever
I love
She’s a lady
Crystal blue persuasion
Mony mony
This house
Baby can I hold
you
Golden earring
With or without you
Red red wine
The way you do the
things you do
Deeper shade of soul
In the navy
Dance hall days
Theme from the
dukes of hazzard
Wake me up before
you go-go
Here I go again
Hold on
Don’t Knock
My Love - Pt. 1
I found a true love
Situation
La grange
Hollies
Isley Brothers
J. Geils Band
J. Geils Band
Jacksons
Kendalls
The Kinks
The Kinks
The Miracles
The Miracles
The Miracles
The
The
The
The
The
The
Moments
Music Machine
O’Jays
Osmonds
Power Station
Rascals
The Rembrandts
The Righteous
Brothers
The Righteous
Brothers
The Ritchie Family
The Robert Cray
Band
The Rolling Stones
The Rolling Stones
The Rolling Stones
The
The
The
The
The
The
The
The
The
The
Rolling Stones
Rolling Stones
Ronettes
Sopwith Camel
Staple Singers
String-A-Longs
Supremes
Supremes
Tee Set
Temptations
The Trammps
The Ventures
The Weather Girls
The Who
The Who
The Yardbirds
The Yardbirds
The Youngbloods
Tina Turner
Tina Turner
Todd Rundgren
Tom D Hall
Tom Jones
Tommy James
Tommy James
Tracie Spencer
Tracy Chapman
Twilight Zone
U2
UB40
UB40
Urban Dance Squad
Village People
Wang Chung
Waylon Jennings
Soul and
inspiration
The best disco
in town
Smoking gun
Wham!
Tumbling dice
Honky tonk women
Doo doo doo doo
doo
Going to a go-go
Miss you
Be my baby
Hello hello
City in the sky
Wheels
Floy joy
Stoned love
Ma belle amie
Ain’t too proud
to beg
Whitesnake
Wilson Phillips
Wilson Pickett
Wilson Pickett
Yaz
ZZ Top
Table A.5: Billboard dataset, consisting of 522 chord and key annotations.
163
A. SONGS USED IN EVALUATION
164
Appendix B
Relative chord durations
Figure B.1: Histograms of relative chord durations across the entire dataset of fullylabelled chord datasets used in this thesis (MIREX, USpop, Carole King, Oasis, Billboard)
Triads
60
60
g
au
di
m
/2
s4
su
m
m
N
C
0
N
0
in
20
aj
20
in
40
m
40
aj
% Duration
80
m
% Duration
Minmaj
80
Quads
60
60
% Duration
80
40
40
0
m
in
X
di
m
au
g
in
m
m
a
mj
m
in
in
7/ 7
m
aj
6
m N
aj
7
d
hd im
im
au au7
m g(b g
in 7
m )
au aj7
g
di (7
m )
(7
)
0
7/ 7
m
aj
6
N
C
m
aj
7
20
aj
20
m
% Duration
MM
80
165