EVALUATING AUTOMATICALLY ESTIMATED CHORD SEQUENCES

Johan Pauwels and Geoffroy Peeters
STMS IRCAM-CNRS-UPMC
[email protected], [email protected]

ABSTRACT

In this paper, we perform an in-depth evaluation of a large number of chord estimation algorithms that were submitted to the MIREX competitions in 2010, 2011 and 2012. To this end, we first present a rigorous scheme to describe evaluation methods in a sound, unambiguous way. It extends previous work specifically to take into account the large variance in chord estimation vocabularies and to perform evaluations on select sets of chords. Then we take a look at the evaluation metrics used so far and propose some alternative ones. Finally, we use these different methods to get a deeper insight into the strengths of each of the competing algorithms and show that the choice of evaluation measure greatly influences the ranking.

Index Terms: music information retrieval, chord estimation, evaluation procedure, large scale evaluation

1. INTRODUCTION

Chord estimation has established itself as one of a growing number of tasks in the field of Music Information Retrieval (MIR). Especially since its addition to the yearly MIREX contest in 2008, the field has seen a steady flow of published papers [1, 2] and data sets [3, 4]. However, this increase in available data has not led to a proportional increase in generally applicable knowledge. Among the reasons for this is the fact that not all automatic estimation algorithms generate the same vocabulary of chords and that the evaluation procedure in most papers is tailored to the chord vocabulary used, often described in an ad-hoc way that leaves some room for interpretation. It is therefore hard to reproduce the evaluation method exactly and to compare results between papers.

Furthermore, we feel that the richness of available data coming out of the MIREX contest has been under-exploited. It is unique in its comparison of a multitude of algorithms on the same data sets with the same evaluation methods, but only a minimal study of the results is being done each year. Therefore, we take a deeper look in this paper at the data of the 2010, 2011 and 2012 editions.

In order to make an evaluation reproducible, we need an unambiguous way to discuss metrics. The most thorough discussion of chord sequence evaluation up to now has been done by Harte [5]. He established a general framework to describe evaluation measures. To this end, he first studied the comparison of chord pairs and built upon that to arrive at the comparison of two chord sequences. For the latter problem, he proposes the “dictionary-based recall evaluation” (DBRE) to study the performance for user-defined subsets of chords. Alternative methods to get a deeper insight into the performance of algorithms include the reporting of results per chord type [6] and of the amount of confusion with related chords [7, 8, 9].

The scheme we propose builds on the work by Harte. More specifically, we formulate a new framework to describe the evaluation of chord sequences, in which we reuse his work on the comparison of chord pairs. It is intended as a replacement for his DBRE, especially designed to deal with differences in estimation vocabulary between distinct algorithms.

As far as large-scale evaluations of chord estimation algorithms are concerned, the best known is of course the MIREX contest itself, but as previously said, its study of the results is rather limited. More extensive evaluations have been performed by [5, 8], but at the time of their evaluations, the vocabularies of the systems under test were mostly limited to major and minor triads. Our work will pay special attention to the assessment of more complex chords.

Next, we will detail our scheme to describe evaluation methods in Section 2. Then the measures used so far and some alternative ones will be formulated according to this framework in Section 3. They will afterwards be used to analyse the raw algorithmic output in Section 4. Finally, we draw some conclusions in Section 5.

This work was partly supported by the Quaero Program funded by Oseo French agency.

2. EVALUATING: WHAT, WHERE AND HOW?

The objective of an evaluation procedure is to quantify the extent to which an estimated chord sequence resembles a certain reference sequence (preferably a manual annotation). This resemblance can be quantified according to a number of different definitions, such as resemblance in terms of chord segmentation [5, 6] or fragmentation and estimated vocabulary [8], but in this text we limit ourselves exclusively to similarity in harmonic content, mainly because the other ones have already been thoroughly treated by others.

In order to evaluate this similarity in harmonic content, we compare the estimated chord sequence with the reference sequence on a pair-wise basis. In the literature, these pairs have been calculated in a frame-based fashion (where both reference and estimation get discretised on a grid and every grid point gives a pair) or in a segment-based fashion (where segment pairs of variable length are formed by joining chord boundaries of both sequences). We adhere to previous conclusions [5, 6] that state that the latter should be preferred because of its advantages related to rounding errors and speed.

A clear evaluation strategy should then answer 3 questions about the comparison of these pairs of reference and estimated chords: “what”, “where” and “how” do we evaluate? These questions and their possible answers will be further explained in the following sections.

2.1. What: handling different chord vocabularies

Typically, the reference chord sequence is annotated much more richly than the estimated sequence, the former often having an unconstrained vocabulary only defined by the data itself while the latter has an algorithm-dependent vocabulary defined before the analysis starts.
An option to reduce the variability between the vocabularies of the reference and one or more estimated sequences is to define a mapping between chords. In practice, it is always a mapping of chord types: the root is preserved and complex chord types get mapped to simpler chords (i.e. a reduction of the number of constituting chromas). The “what” question thus asks what exactly will be compared: full chord labels or their reduced versions.

Some mappings that we will use later on are “triads”, where chords get reduced to their triads, and “tetrads & triads”, where every chord gets mapped to its tetrad or, in case that doesn’t exist, to its triad. The no-chord symbol is mapped to itself for both. The simplest mapping is of course “none”, leaving the chords as they are. In the remainder of this text, we will also use a toy example for illustrative purposes that consists of three mapping rules: 7→maj, maj→maj and min→min.

2.2. Where: selecting points of interest

The “where” question deals with which segments should be included in the evaluation. This should first and foremost be determined only by the reference sequence, in order to avoid making the evaluation depend on the estimated sequence.

A mapping itself imposes implicit constraints on which pairs get evaluated. With each mapping $M$, an input domain $C^M_I$ is associated for which the mapping is defined, and an output domain $C^M_O$ which lists all possible chords that can be produced by the mapping. The mapping can then be notated as a surjection $M : C^M_I \twoheadrightarrow C^M_O$. For our example mapping, $C^M_I = \{\text{maj}, \text{min}, 7\}$ and $C^M_O = \{\text{maj}, \text{min}\}$. If the mapping is executed on a chord that is not part of $C^M_I$, its output is undefined. For instance, our example does not mention what to do with diminished chords, therefore the mapping of diminished chords is undefined. Consequently, if a chord in the reference annotation does not belong to $C^M_I$, then that segment should be excluded from the evaluation.
On the other hand, if a chord in the estimated sequence does not belong to $C^M_I$, the evaluation should fail and a different mapping should be used. In summary, if we represent a chord of the reference vocabulary as $c_{ref} \in C_{REF}$ and one of the estimation vocabulary as $c_{est} \in C_{EST}$, then the segment pairs $(c_{ref}, c_{est})$ that will be evaluated are those where $c_{ref} \in C^M_I$. The actual labels that will be compared are $M(c_{ref})$ and $M(c_{est})$, where the mapping should be chosen such that $C_{EST} \subseteq C^M_I$.

The input domain for our previously defined “triads” and “tetrads & triads” mappings is the same, namely the no-chord and all chords that can be mapped to a triad. We concretize this as all chords that contain a major or minor third or one of a major second or perfect fourth, plus the no-chord. The “none” mapping obviously has the collection of all possible chords $C_{ALL}$ as input domain.

The most common use-case is to evaluate on all chords (or at least all for which a mapping is defined), as this makes maximal use of the available data. In other cases, however, it might be useful to restrict the evaluation to places where the reference chord belongs to a subset. This can lead to a better understanding of the performance of an algorithm on a certain category of chords, e.g. how well tetrads are recognized compared to triads. One way to do this would be to modify the mapping to map more input chords to “undefined” outputs, but that would mean that different mappings need to be used for reference and estimated chord sequences. Therefore we propose a more elegant solution where we keep the same mapping for both sequences, and the input domain of the mappings as general as possible. Instead, we introduce the additional requirement that evaluation only takes place when the reference chord $c_{ref}$ belongs to a user-defined input limiting set $C_{LI}$. In theory, this allows us to exactly specify the chords for evaluation, but for some cases this might be impractical.
Suppose we want to have a more detailed look at the performance on chords that are mapped to a major triad. For our example mapping, this would mean setting $C_{LI} = \{\text{maj}, 7\}$, which is still workable, but for mapping functions with a larger input domain, this could mean that we need to list every possible combination of a major triad and a number of tensions (e.g. maj7, maj9, maj11, etc.). This quickly becomes impracticable. Therefore we introduce a similar mechanism to limit the chords at the mapped side by specifying an output limiting set $C_{LO}$. Applied to our example, the large input limiting set $C_{LI}$ can then simply be replaced by the singleton $C_{LO} = \{\text{maj}\}$. Both limiting sets are optional; their default value is $C_{ALL}$, such that they have no influence. Combining both limiting domains, segment pairs $(c_{ref}, c_{est})$ will only be evaluated when $c_{ref} \in C_{LI} \cap C^M_I$ and $M(c_{ref}) \in M(C_{LI} \cap C^M_I) \cap C_{LO}$.

Taking the chord sequences in Figure 1, the initial number of evaluation segments is 6: (Bdim, Dmin), (Dmin, Dmin), (Dmin, Bmin), (G7, Bmin), (Cmaj, Bmin) and (Cmaj, Cmaj). When using our toy mapping, the first segment pair will be discarded because the mapping for Bdim is undefined. Additionally, setting $C_{LO}$ to $\{\text{maj}\}$ will further discard the second and third segment from the evaluation, because $M(\text{min}) = \text{min}$ is not part of $C_{LO}$. Only the last three reference-estimation pairs then remain and due to the mapping they will be evaluated as (Gmaj, Bmin), (Cmaj, Bmin) and (Cmaj, Cmaj).

segment      1      2      3      4      5      6
reference    Bdim   Dmin   Dmin   G7     Cmaj   Cmaj
estimation   Dmin   Dmin   Bmin   Bmin   Bmin   Cmaj

Fig. 1. Example chord sequences.

The introduction of the input limiting domain shows some similarity to Harte’s “dictionary-based recall evaluation” (DBRE). The main difference lies in the way chords that don’t belong to a user-defined dictionary are handled.
In DBRE, segments with out-of-dictionary chords in the annotation are discarded from evaluation, as we propose, but segments with out-of-dictionary chords in the estimated sequence get a score of 0 by default. For our example in Figure 1 with dictionary {maj, 7}, this would mean that the first 3 segments get discarded, but also that the (G7, Bmin) and (Cmaj, Bmin) segments get a score of 0 in all cases, whatever the evaluation measure might be. We argue that these 2 pairs should be evaluated as well, which can still lead to a score of 0, but does not have to. The resulting value will depend on the way the pairs are compared: one can argue that, on the one hand, an estimation of Bmin for a G7 reference fails to find the correct root, but on the other hand, that two of the three chromas of the estimated Bmin (B and D) are part of the G7 chord.

2.3. How: assigning a score value

The last part of an evaluation scheme should answer the question of how to assign a value to the retained reference-estimation pairs. To this end, a scoring function $S : C_{SR} \times C_{SE} \to \mathbb{R}^+$ should be defined, where $C_{SR}$ and $C_{SE}$ are the reference and estimation chord vocabularies of the scoring function, respectively. At the minimum, the scoring function should be able to handle all chord pairs coming out of the mapping subject to optional limiting. This means that $M(C^M_I \cap C_{LI}) \cap C_{LO} \subseteq C_{SR}$ for the reference sequence and $C^M_O \subseteq C_{SE}$ for the estimation sequence need to hold, but it is advisable to make the scoring function input domains as large as possible. Nonetheless, the combination of limiting the evaluation to a subset and a mapping can make it easier to define the actual comparison by reducing the number of reference-estimation combinations it needs to be able to handle. For instance, in [10] the evaluation is only performed on segments annotated as triads in order to avoid the need to define a scoring function between triads and more complex chords, in particular between triads and tetrads that contain those triads.
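As an illustration of the “what” and “where” steps, the following minimal sketch forms segment pairs by joining the chord boundaries of both sequences and then applies the toy mapping together with an output limiting set. It is not the implementation used for the results in this paper: the function names, the representation of a chord as a (root, type) tuple and the boundary times assigned to the Figure 1 sequences are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Set, Tuple

Chord = Tuple[str, str]                 # (root, chord type)
Timed = Tuple[float, float, str, str]   # (start, end, root, chord type)

# Toy mapping of Section 2.1; its keys form the input domain of the mapping.
TOY_MAPPING: Dict[str, str] = {"7": "maj", "maj": "maj", "min": "min"}

# Chord sequences of Figure 1. The boundary times are hypothetical.
REFERENCE: List[Timed] = [(0, 1, "B", "dim"), (1, 3, "D", "min"),
                          (3, 4, "G", "7"), (4, 6, "C", "maj")]
ESTIMATION: List[Timed] = [(0, 2, "D", "min"), (2, 5, "B", "min"),
                           (5, 6, "C", "maj")]


def segment_pairs(ref: List[Timed],
                  est: List[Timed]) -> List[Tuple[float, Chord, Chord]]:
    """Join the chord boundaries of both sequences into segment pairs."""
    bounds = sorted({t for seg in ref + est for t in seg[:2]})
    pairs = []
    for start, end in zip(bounds, bounds[1:]):
        r = next((s for s in ref if s[0] <= start and end <= s[1]), None)
        e = next((s for s in est if s[0] <= start and end <= s[1]), None)
        if r is not None and e is not None:  # skip gaps in either sequence
            pairs.append((end - start, (r[2], r[3]), (e[2], e[3])))
    return pairs


def select_and_map(pairs: List[Tuple[float, Chord, Chord]],
                   mapping: Dict[str, str],
                   limit_in: Optional[Set[str]] = None,
                   limit_out: Optional[Set[str]] = None
                   ) -> List[Tuple[float, Chord, Chord]]:
    """Keep the pairs selected by the reference chord and map both labels."""
    kept = []
    for duration, (r_root, r_type), (e_root, e_type) in pairs:
        if e_type not in mapping:
            # Estimated chord outside the input domain: the evaluation fails.
            raise ValueError(f"mapping undefined for estimated type {e_type!r}")
        if r_type not in mapping:
            continue  # reference chord outside the input domain: skip segment
        if limit_in is not None and r_type not in limit_in:
            continue  # reference chord not in the input limiting set
        if limit_out is not None and mapping[r_type] not in limit_out:
            continue  # mapped reference chord not in the output limiting set
        kept.append((duration,
                     (r_root, mapping[r_type]),
                     (e_root, mapping[e_type])))
    return kept


pairs = segment_pairs(REFERENCE, ESTIMATION)
kept = select_and_map(pairs, TOY_MAPPING, limit_out={"maj"})
for duration, ref_chord, est_chord in kept:
    print(duration, ref_chord, est_chord)
```

With these assumed boundaries, the sketch produces the six segment pairs of Figure 1 and retains the last three, mapped to (Gmaj, Bmin), (Cmaj, Bmin) and (Cmaj, Cmaj). The selection depends only on the reference chord, while an unmappable estimated chord raises an error instead of being silently skipped.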
This particular part of the evaluation scheme is thoroughly discussed in Harte’s thesis [5], so we’ll only briefly recap some major points. The measures can be divided into two categories: either chords are evaluated as a whole, where the chromas are organised with respect to the root, or as a bag of chromas, where all constituting chromas are considered equal (ordered and unordered set comparisons in Harte’s terms). A simple example of the first case assigns a score of 1 to an exact correspondence, taking enharmonic equivalence into account and ignoring the bass note, and 0 to other cases. This scoring function will be referred to as “exact” in the remainder of the text. For the bag of chromas approach, some examples of a scoring function are chroma based recall and precision. Returning to the (G7, Bmin) pair of the previous section, we see that its score would be 0 using the “exact” measure, but 0.5 and 0.67 for chroma based recall and precision, respectively: two of the four G7 chromas are found, and two of the three Bmin chromas belong to G7.

2.4. Summary

To summarize, an evaluation measure can be unambiguously described by the combination of a mapping function $M : C^M_I \to C^M_O$, a scoring function $S : C_{SR} \times C_{SE} \to \mathbb{R}^+$ and optionally input and output limiting domains $C_{LI}$ and $C_{LO}$. The average over multiple segments is computed by taking the average of all individual scores weighted by their segment lengths. Whether these segments come from the same or multiple songs is not important in this regard.

3. EVALUATION MEASURES

The results reported for MIREX editions 2010, 2011 and 2012 are limited to a single value per song. While the measures used for editions 2008 and 2009 are thoroughly discussed in Harte’s thesis [5], the only information about the measure of the 2010 edition (which has been kept for the following editions) is its source code¹. Therefore we start by describing it in terms of our newly defined scheme and comparing it with the previous evaluation versions. The advantages and drawbacks of these methods are then discussed, after which some alternative measures are proposed.

In contrast to the 2008 and 2009 versions, the 2010 edition doesn’t use a mapping. This is of course the simplest solution, but has the disadvantage that the scoring function needs to be defined for every possible chord pair. The 2008 and 2009 versions followed the opposite approach: all chords are mapped to either major, minor or the no-chord. A particularity of both mappings, although they are not exactly equal, is that every possible chord is mapped to one of these three options, so there is no input for which the mapping is undefined. While this has the advantage that absolutely all available data is used for the evaluation, some of the mappings are rather stretched and cause a bias towards major chord types. A more detailed description of both mappings as well as their criticisms can be found in [5].

The scoring function from MIREX 2010 is of the bag of chromas category. First, reference and estimated chords (including their bass notes) are converted into a set of chromas. Then the cardinality of the intersection is taken, ignoring pitch spelling. If it is 3 or more, the segment receives a score of 1, otherwise its score is 0. For segments whose reference chord can be mapped to an augmented or a diminished triad, the threshold is lowered to 2 or more chromas. Of course, segments annotated with the no-chord, thus having an empty chroma set, are only assigned a score of 1 when the estimation also equals the no-chord symbol.

This method suffers from some important drawbacks. First of all, like all bag of chromas scoring functions, the root chroma loses its special position among other chromas: no distinction is made between chords that differ in root, but contain the same chromas. Secondly, there are no musicological underpinnings that warrant the special treatment of the augmented and diminished chords, nor can two-chroma chords ever be estimated correctly. Lastly, generating superfluous chromas in the estimated chord is not penalized at all. In comparison, the MIREX editions before 2010 used the “exact” scoring function.

In addition to the “Mirex2010” evaluation measure, we will use some other metrics. All have a scoring function that ignores the pitch spelling. The first two, named “Triads” and “Tetrads”, use the aforementioned “triads” and “tetrads & triads” mappings. The mapped chords are then compared using the “exact” scoring function. From these, we derive two more measures that use a limiting set, in contrast to the first two. “TriadsInput” is equal to “Triads” with the addition of an input limiting set that consists of all triads, including inversions. Next is “TetradsOnly”, similar to “Tetrads” but with an output limiting set containing all tetrads, but no triads. The following two metrics don’t use a mapping, but have a simple scoring function. “Root” assigns a score of 1 only to chord pairs that have the same root and “Bass” does the same for bass notes. When a bass chroma is not explicitly given, we assume it is the root. The “Root” metric has also been used in MIREX 2008 and 2009 as an additional evaluation. Finally, we have two measures, “ChromaPrecision” and “ChromaRecall”, that don’t use a mapping and calculate precision and recall for the bags of chromas.

¹ http://nemadiy.googlecode.com/svn/

4. A NEW LOOK AT EXISTING DATA

Since the introduction of the NEMA (Networked Environment for Music Analysis) platform [23] for MIREX 2010, all algorithmic output and the ground truth used have been publicly available, offering us output from 42 systems for two data sets. The first one has been used in editions 2010, 2011 and 2012 and is named “Isophonics” [3]. It consists for the main part of Beatles songs (180) with some additional songs by Queen (19) and Zweieck (18).
The second one has only been used in 2012 and is called “Billboard” [4]. It contains 197 songs that charted in the Billboard 100 between 1958 and 1991. Unfortunately, only the annotations according to triads (without inversions) are publicly available for the latter set, therefore it is exclusively used in combination with the “Mirex2010”, “Triads”, “Root” and “ChromaRecall” metrics. For other mappings, it is impossible to verify whether any chromas that have been estimated in addition to the triad are correct and just haven’t been annotated or whether they result from an incorrect estimation.

Due to space constraints we have not printed all results here, but a selection of the most interesting ones can be found in Table 2 and Table 1 for the “Billboard” and “Isophonics” sets respectively. We retained the letter identifiers for the algorithms as used on the MIREX website², with the addition of the year in which they participated. The complete results for all the algorithms, as well as the algorithmic output used to calculate them, can be found on-line³.

Comparing the two data sets, we see that the best results on the “Billboard” set are 10 % lower than on the “Isophonics” set, and this for all evaluation measures. This can be explained by the fact that the “Isophonics” set was publicly available before the contest, while the “Billboard” set was kept secret. It is therefore likely that algorithms are optimised for the former.

The “Triads” scores are consistently lower than the “Mirex2010”⁴ ones, but the extent to which they are lower depends on the algorithm.
² http://www.music-ir.org/mirex/wiki/MIREX_HOME
³ https://github.com/jpauwels/mirex-tools
⁴ Our reported results for the “Mirex2010” score differ somewhat from the “Weighted average overlap ratio” score on the MIREX site because of differences in the evaluation of no-chords and in the chord parsing.

Algorithm               Mirex2010  Triads  Tetrads  TriadsInput  TetradsOnly   Root   Bass  ChromaRecall  ChromaPrecision
CWB1 (’10) [2]              77.60   76.54    65.95        78.21         0.00  80.28  77.68         82.10            85.60
EW4 (’10) [11]              77.79   77.13    66.79        79.29         0.00  80.99  79.39         81.97            85.33
KO1 (’10) [12]              76.95   76.53    65.08        77.30         0.00  79.65  78.62         80.49            83.99
MD1 (’10) [6]               77.96   76.42    65.56        78.98        17.66  79.69  76.83         82.91            84.49
MM1 (’10) [13]              77.42   74.46    53.39        76.81        36.32  79.54  78.88         84.05            80.47
OFG1 (’10) [14]             73.07   71.80    62.47        74.37         0.00  75.76  75.21         79.34            82.69
UUOS1 (’10/’11) [15]        77.09   75.99    65.93        78.41         0.00  79.34  77.88         81.04            84.46
CB2 (’11) [16]              79.29   78.08    67.55        80.15         0.00  81.65  79.93         82.95            86.47
KO1 (’11/’12) [17]          81.40   80.69    73.88        82.13        52.65  82.92  82.06         86.13            86.61
NM1 (’11) [18]              80.35   79.54    68.51        81.34         0.00  82.92  82.42         83.48            86.97
NMSD3 (’11) [18]            81.40   79.01    68.36        81.17        46.12  81.42  81.01         86.30            85.67
CCSS1 (’12) [19]            77.14   76.18    65.23        77.35         0.00  80.08  78.47         80.83            84.37
NG1 (’12) [20]              73.32   72.13    62.61        74.38         0.00  75.78  75.50         79.11            82.49
NMSD1 (’12) [18]            82.15   70.50    63.88        76.00        19.24  72.80  81.93         87.09            86.04
PMP1 (’12) [10]             74.76   73.68    64.00        76.19         0.00  76.80  75.57         80.07            83.43

Table 1. Evaluation scores for the “Isophonics” set.

Algorithm               Mirex2010  Triads   Root  ChromaRecall
CCSS1 (’12) [19]            66.21   65.90  71.33         75.97
DMW1 (’12) [21]             62.65   62.33  69.71         74.81
KO1 (’11/’12) [17]          69.80   69.31  73.91         79.03
NG1 (’12) [20]              62.49   62.27  66.99         74.56
NMSD4 (’12) [22]            72.51   60.45  63.72         81.40
PMP1 (’12) [10]             67.29   67.01  70.96         77.10

Table 2. Evaluation scores for the “Billboard” set.

For example, the algorithm with the best “Mirex2010” score, “NMSD1/4 (’12)”, decreases the most in the ranking.
This is because the “Mirex2010” score does not require the root to be correctly estimated to get a good score, whereas the “Triads” score does. This explanation is corroborated by the “Root” scores. Consequently, we conclude that this algorithm is very capable of retrieving chromas in a chord, but does not manage to identify the root very well.

Overall, the score lowers even more for “Tetrads”, but here too some distinctions can be made between algorithms. For most of them, the relative decrease is almost 15 %, but there are outliers in both directions. The relative decrease for “MM1 (’10)” is almost 30 %, significantly worse than if no tetrads had been estimated at all, while the decrease for “KO1 (’11/’12)” and “NMSD1 (’12)” is less than 10 %. The picture gets clearer when we only look at chords that are mapped to tetrads (the “TetradsOnly” score). Most algorithms that decrease 15 % in score actually cannot generate tetrads at all. The 15 % just corresponds to the proportion of the annotations that are mapped to a tetrad. Only 5 algorithms do generate tetrads, but their scores on tetrads are lower than on exact triads (“TriadsInput” score) in all cases, ranging between 18 % and 53 % instead of 77–82 %. However, the performance on the tetrads themselves does not entirely reflect the ranking of “Tetrads” (in this case, “MM1 (’10)” is significantly better than “NMSD1 (’12)”). This brings us to the conclusion that, due to the relatively small proportion of tetrads to triads, care should be taken not to overestimate triads as tetrads before any possible gain due to the increased chord vocabulary can be expected. It should be noted that the frequency of occurrence of tetrads in relation to triads is strongly dependent on genre (for example, tetrads are used more in jazz than in rock music), so the results are likely to change for other data sets, and as a result, different algorithms can have different “preferences” for certain genres.
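The relative decreases quoted above can be checked directly against the “Triads” and “Tetrads” columns of Table 1. The short sketch below does so for four algorithms; the function name and the selection of algorithms are ours, and the scores are simply copied from Table 1.

```python
# (Triads, Tetrads) scores copied from Table 1 ("Isophonics" set).
SCORES = {
    "CWB1 ('10)": (76.54, 65.95),
    "MM1 ('10)": (74.46, 53.39),
    "KO1 ('11/'12)": (80.69, 73.88),
    "NMSD1 ('12)": (70.50, 63.88),
}


def relative_decrease(triads: float, tetrads: float) -> float:
    """Decrease from the Triads to the Tetrads score, in percent of Triads."""
    return 100.0 * (triads - tetrads) / triads


for name, (triads, tetrads) in SCORES.items():
    print(f"{name}: {relative_decrease(triads, tetrads):.1f} %")
```

For an algorithm that never outputs tetrads, such as “CWB1 (’10)”, the decrease is close to the 15 % share of annotations that are mapped to a tetrad, whereas “MM1 (’10)” loses roughly twice as much.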
So far, we have ignored all bass notes during evaluation. There are however algorithms that have extended their vocabulary to be able to generate chord inversions. The measure “Bass” is specifically designed to evaluate this. It is somewhat surprising that the two best performing algorithms, “NM1 (’11)” and “KO1 (’11/’12)”, cannot generate inversions, so by default they always estimate root position. The explanation is that the ratio of inversions to root positions is rather small, just like the tetrads/triads ratio. So similarly to the latter, the benefit of possibly estimating the right inversion is not enough to outweigh the risk of wrongly estimating the bass of a chord in root position, at least not with the current implementations.

As can be seen with the two previous metrics, a strongly skewed chord frequency makes it hard to evaluate improvements on less frequently occurring chords. A way to deal with this imbalance is to weigh all chords of the output domain $C^M_O$ equally instead of according to duration. The drawback is that the mapping needs to be chosen as a function of the estimation algorithm such that there are no mapped chords that cannot be generated by the algorithm (i.e. $M(C_{EST}) = C^M_O$), as that would strongly influence the average. Consequently, it is useful for per-algorithm evaluations, but not for a large scale one, as that would necessitate a mapping that produces the lowest common estimation vocabulary. In our case, it would be a mapping with $C^M_O = \{\text{maj}, \text{min}\}$ only, thereby defeating the purpose of a more in-depth evaluation.

Lastly, we take a look at the bag of chromas evaluation measures. “ChromaRecall” is similar to the “Mirex2010” metric, but without the threshold that binarises the score. Therefore the results are consistently higher. The best performing algorithms are those that can generate tetrads; the others are penalised because they simply cannot generate enough chromas at times.
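The relation between the bag of chromas measures can be made concrete with a small sketch. It is an illustration under our own assumptions, not the MIREX or NEMA code: chords are represented as sets of pitch classes (0 = C), the function names are hypothetical, and the handling of the no-chord in the recall and precision functions is assumed by analogy with the “Mirex2010” rule described in Section 3.

```python
from typing import FrozenSet

Chromas = FrozenSet[int]  # pitch classes, 0 = C, ignoring pitch spelling

G7: Chromas = frozenset({7, 11, 2, 5})     # G B D F
G_MAJ: Chromas = frozenset({7, 11, 2})     # G B D
C6: Chromas = frozenset({0, 4, 7, 9})      # C E G A
A_MIN7: Chromas = frozenset({9, 0, 4, 7})  # A C E G
NO_CHORD: Chromas = frozenset()


def chroma_recall(ref: Chromas, est: Chromas) -> float:
    """Fraction of the reference chromas that are present in the estimation."""
    if not ref:  # assumption: a no-chord only matches a no-chord
        return 1.0 if not est else 0.0
    return len(ref & est) / len(ref)


def chroma_precision(ref: Chromas, est: Chromas) -> float:
    """Fraction of the estimated chromas that are present in the reference."""
    if not est:  # assumption: a no-chord only matches a no-chord
        return 1.0 if not ref else 0.0
    return len(ref & est) / len(est)


def thresholded_score(ref: Chromas, est: Chromas, threshold: int = 3) -> float:
    """Binarised overlap in the style of the "Mirex2010" scoring function.

    The caller lowers the threshold to 2 for references that map to an
    augmented or diminished triad.
    """
    if not ref:
        return 1.0 if not est else 0.0
    return 1.0 if len(ref & est) >= threshold else 0.0


# A triad estimated for a tetrad reference: full thresholded score,
# but a recall of only 0.75 because one chroma is missing.
print(thresholded_score(G7, G_MAJ), chroma_recall(G7, G_MAJ))
# A tetrad estimated for a triad reference: the superfluous chroma is
# only visible in the precision.
print(chroma_recall(G_MAJ, G7), chroma_precision(G_MAJ, G7))
# Same chromas, different root: indistinguishable for bag of chromas measures.
print(thresholded_score(C6, A_MIN7))
```

The first two cases show why an algorithm limited to triads loses “ChromaRecall” on tetrad references, and why overestimating triads as tetrads only shows up in “ChromaPrecision”.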
In combination with the “ChromaPrecision” measure, we can conclude that of those 5 tetrad generating algorithms, “MM1 (’10)” clearly has a tendency to overestimate triads as tetrads, while the others are better balanced.

5. CONCLUSIONS AND FURTHER WORK

We started this paper by presenting a scheme to describe chord evaluation measures, extending the work of Harte [5]. Then the measures as used in the MIREX competitions from 2008 to 2012, as well as some alternative ones, were described according to the proposed scheme. Finally, we used these metrics to analyse the raw data coming from the competition. This showed us that the effect of a change in evaluation procedure is not equal for all algorithms: some maintain a consistent ranking, while others only excel in one particular measure. In general, we can conclude that extending the vocabulary of algorithms towards tetrads and inversions does not necessarily result in a better estimation, although the state-of-the-art is able to estimate some tetrads correctly, albeit not as well as triads. We therefore advise verifying future algorithmic improvements in that domain with evaluation measures similar to those presented here.

In the future, we’d like to carry out a similarly detailed evaluation of tetrads and inversions on the “Billboard” set if we can get access to more detailed annotations. We also hope to integrate this extended evaluation directly into the NEMA framework so that future MIREX editions can directly benefit from a more extensive evaluation.

6. REFERENCES

[1] Hélène Papadopoulos and Geoffroy Peeters, “Large-scale study of chord estimation algorithms based on chroma representation and HMM,” in Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI), 2007, pp. 53–60.
[2] Taemin Cho, Ron J. Weiss, and Juan P. Bello, “Exploring common variations in state of the art chord recognition systems,” in Proceedings of the Sound and Music Computing Conference, 2010.
[3] Matthias Mauch, Chris Cannam, Matthew Davies, Simon Dixon, Chris Harte, Sefki Kolozali, Dan Tidhar, and Mark Sandler, “OMRAS2 metadata project 2009,” in Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR), 2009.
[4] John Ashley Burgoyne, Jonathan Wild, and Ichiro Fujinaga, “An expert ground-truth set for audio chord recognition and music analysis,” in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011, pp. 633–638.
[5] Christopher Harte, Towards automatic extraction of harmony information from music signals, Ph.D. thesis, Queen Mary, University of London, August 2010.
[6] Matthias Mauch, Automatic chord transcription from audio using computational models of musical context, Ph.D. thesis, School of Electronic Engineering and Computer Science, Queen Mary, University of London, March 2010.
[7] Hélène Papadopoulos, Joint estimation of musical content information from an audio signal, Ph.D. thesis, Université Pierre et Marie Curie, July 2010.
[8] Laurent Oudre, Template-based chord recognition from audio signals, Ph.D. thesis, Télécom ParisTech, November 2010.
[9] Maksim Khadkevich, Music signal processing for automatic extraction of harmonic and rhythmic information, Ph.D. thesis, University of Trento, 2011.
[10] Johan Pauwels, Jean-Pierre Martens, and Marc Leman, “Improving the key extraction accuracy of a simultaneous key and chord estimation system,” in Proceedings of the International Workshop on Advances in Music Information Research (AdMIRe), 2011.
[11] Adrian Weller, Daniel Ellis, and Tony Jebara, “Structured prediction models for chord transcription of music audio,” in Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), 2009, pp. 590–595.
[12] Maksim Khadkevich and Maurizio Omologo, “Use of hidden Markov models and factored language models for automatic chord recognition,” in Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR), 2009, pp. 561–566.
[13] Matthias Mauch and Simon Dixon, “Approximate note transcription for the improved identification of difficult chords,” in Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR), 2010, pp. 135–140.
[14] Laurent Oudre, Cédric Févotte, and Yves Grenier, “Probabilistic template-based chord recognition,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2249–2259, November 2011.
[15] Yushi Ueda, Yuki Uchiyama, Nobutaka Ono, and Shigeki Sagayama, “Mirex 2010: Joint recognition of key and chord from music audio signals using key-modulation HMMs,” in Proceedings of the Music Information Retrieval Evaluation Exchange (MIREX), 2010.
[16] Taemin Cho and Juan Pablo Bello, “A feature smoothing method for chord recognition using recurrence plots,” in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011, pp. 651–656.
[17] Maksim Khadkevich and Maurizio Omologo, “Time-frequency reassigned features for automatic chord recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), December 2011, pp. 181–184.
[18] Yizhao Ni, Matt McVicar, Raúl Santos-Rodríguez, and Tijl De Bie, “An end-to-end machine learning system for harmonic analysis of music,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 6, pp. 1771–1783, August 2012.
[19] Ruofeng Chen, Weibin Shen, Ajay Srinivasamurthy, and Parag Chordia, “Chord recognition using duration-explicit hidden Markov models,” in Proceedings of the 13th International Conference on Music Information Retrieval (ISMIR), 2012, pp. 445–450.
[20] Nikolay Glazyrin, “Audio chord estimation using chroma reduced spectrogram and self-similarity,” in Proceedings of the Music Information Retrieval Evaluation Exchange (MIREX), 2012.
[21] W. Bas de Haas, José Pedro Magalhães, and Frans Wiering, “Improving audio chord transcription by exploiting harmonic and metric knowledge,” in Proceedings of the 13th International Conference on Music Information Retrieval (ISMIR), 2012, pp. 295–300.
[22] Yizhao Ni, Matt McVicar, Raúl Santos-Rodríguez, and Tijl De Bie, “Using hyper-genre training to explore genre information for automatic chord estimation,” in Proceedings of the 13th International Conference on Music Information Retrieval (ISMIR), 2012, pp. 109–114.
[23] Kris West, Amit Kumar, Andrew Shirk, Guojun Zhu, J. Stephen Downie, Andreas Ehmann, and Mert Bay, “The networked environment for music analysis (NEMA),” in IEEE Fourth International Workshop on Scientific Workflows (SWF), 2010, pp. 314–317.