Song Similarity Classification Using Music Information Retrieval on the Million Song Dataset

Authored by
Richard Nysäter, 070-4229705, Cardellgatan 3, 11436 Stockholm
Tobias Reinhammar, 070-6648678, Abrahamsbergsvägen 87, 16830 Bromma

Supervisor
Anders Askenfelt

School of Computer Science and Communications
Royal Institute of Technology
Bachelor Degree Project in Computer Science, DD143X
May 24, 2013

Abstract

The purpose of this study was to investigate the possibility of automatically classifying the similarity of song pairs. The machine learning algorithm K-Nearest Neighbours, combined with both bootstrap aggregating and an attribute selection classifier, was first trained by combining the acoustic features of 45 song pairs extracted from the Million Song Dataset with user-submitted similarity ratings for each pair. The trained algorithm was then used to predict the similarity between 50 hand-picked and about 4000 randomly chosen pop and rock songs from the Million Song Dataset. Finally, the algorithm was subjectively evaluated by asking users to identify which of two randomly ordered songs, one with a low and one with a high predicted similarity, they found most similar to a target song. The users picked the same song as the algorithm 365 out of 514 times, giving the algorithm an accuracy of 71%. The results indicate that automatic and accurate classification of song similarity may be possible and could therefore be used in music applications. Further research on improving the current algorithm or finding alternative algorithms is warranted to draw further conclusions about the viability of using automatically classified song similarity in real-world applications.

Sammanfattning (Summary)

The purpose of this study was to investigate whether it is possible to automatically compute how similar two songs are.
The study used the machine learning algorithm k-nearest neighbours together with bootstrap aggregating and a classifier that filters out irrelevant attributes. The algorithm was first trained by combining a number of acoustic parameters with users' similarity judgements for 45 song pairs, which were created by pairing 10 songs taken from the Million Song Dataset with one another. The trained algorithm was then used to compute the similarity between 50 hand-picked and approximately 4000 randomly selected pop and rock songs from the Million Song Dataset. Finally, the results were evaluated in a second phase of user testing. Users were asked to listen to a target song, one of the 50 hand-picked songs, followed by one song that the algorithm had matched as very similar and one that it had matched as very dissimilar, in random order. The user then chose which of the two songs seemed most similar to the target song. The algorithm and the user chose the same song in 365 of 514 cases, which gives the algorithm an accuracy of 71%. The results suggest that it may be possible to develop an algorithm that automatically classifies the similarity between songs with high precision and that could therefore be used in music applications. Further development of the algorithm, or research on alternative algorithms, is necessary to draw further conclusions about how useful automatic estimation of song similarity is for real-world applications.

Statement of Collaboration

Tobias Reinhammar recruited the test subjects for the web application and chose the tracks used in the user data gathering phase and the application evaluation. Richard Nysäter wrote most of the code used in the application, with critical input from Tobias regarding design, usability and functionality.
This paper, the project specification, and the major project decisions and evaluation were collaborative efforts to which both parties contributed equally.

Contents

1 Introduction
  1.1 Background
    1.1.1 Terminology
    1.1.2 Related work
  1.2 Problem statement
    1.2.1 Hypothesis
2 Method
  2.1 Million Song Dataset
    2.1.1 Features
  2.2 Machine Learning
    2.2.1 Supervised Machine Learning algorithms
  2.3 Project phases
    2.3.1 Gathering input data
    2.3.2 Utilizing the user data
    2.3.3 Subjective evaluation
3 Results
  3.1 User ratings
  3.2 Automated similarity rating
  3.3 User evaluation
4 Discussion
  4.1 Confounding factors
    4.1.1 Million Song Dataset
    4.1.2 Feature usage
    4.1.3 User data
    4.1.4 Learning tracklist
    4.1.5 Machine learning algorithm
5 Conclusions and future work
  5.1 Future work
Acknowledgements
References
6 Appendix
  6.1 Appendix A - Million Song Dataset Field List
  6.2 Appendix B - Web Application
    6.2.1 User rating application
    6.2.2 User evaluation application
  6.3 Appendix C - The Evaluation Tracklist

Chapter 1 Introduction

1.1 Background

With the massive number of music tracks available today, ways of letting users discover new music according to their personal preferences by automatically analyzing songs are in high demand. While recommending similar artists is a prominent feature in popular music applications, recommending similar songs is, as of now, quite uncommon. Thanks to the distribution of the Million Song Dataset, a large amount of acoustic features and metadata is now freely available to researchers.

One way of analyzing songs to discover related music is to define a similarity between songs and then recommend songs with a high similarity. While groups of experts or users are used today to manually tag related artists, manually rating song similarity is infeasible because the time requirement is immense. As such, using an automated system to classify the similarity between songs is a far more sensible approach. Although such a system would be highly beneficial, measuring similarity is not a simple task due to the complexity of the many patterns that create the notion of similarity.

1.1.1 Terminology

Music information retrieval - MIR
MIR is the science of extracting information from music. The field of MIR has become increasingly relevant with the current development of various music services available over the internet. With the massive number of songs available, useful tools for discovering new music according to personal preferences are in high demand. Thus, effective methods of automatically analyzing songs are required in order to process large quantities of data.
Million Song Dataset - MSD
The purpose of the MSD is, as described by the creators LabROSA and The Echo Nest [1],
• to encourage research on algorithms that scale to commercial sizes.
• to provide a reference dataset for evaluating research.
• as a shortcut alternative to creating a large dataset with APIs.
• to help new researchers get started in the MIR field.
The MSD contains a large quantity of musical features and metadata extracted from one million songs provided by The Echo Nest. The Echo Nest is the largest repository of dynamic music data in the world, containing data on over 34 million songs [2]. The full list of features and metadata is available in Appendix A. This project does not use the full Million Song Dataset due to time and computational constraints.

Machine Learning
Machine learning is a field that seeks to answer the question "How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?" [3]. Machine learning is typically applied when designing an algorithm by hand would be too complex for a human, or when an application needs to adapt to new environments without human input. Furthermore, current knowledge may not be relevant in the future, and when continuously redesigning a system is not feasible, machine learning may allow the system to adapt on its own [4].

Supervised Machine Learning
Supervised machine learning is an area of machine learning where algorithms are first trained on externally supplied data in order to make predictions about future data [5]. It accomplishes this by first associating vectors of observations, the training data, with class labels and then creating a mathematical function. The function can then be used to predict the value of missing labels in future data [6].
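The idea of associating observation vectors with class labels and deriving a prediction function can be illustrated with a minimal Python sketch. It uses a nearest-centroid rule purely as an example; this is not the algorithm used in this study, and the function name `train`, the two numeric features and the labels are hypothetical.

```python
from collections import defaultdict
from math import dist


def train(observations, labels):
    """Learn a prediction function from labelled observation vectors."""
    grouped = defaultdict(list)
    for vector, label in zip(observations, labels):
        grouped[label].append(vector)

    # One centroid (component-wise mean) per class label.
    centroids = {
        label: tuple(sum(column) / len(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }

    def predict(vector):
        # Assign the label whose centroid lies closest to the new vector.
        return min(centroids, key=lambda label: dist(vector, centroids[label]))

    return predict


# Hypothetical training set: two made-up numeric features per observation.
training_vectors = [(1.0, 0.0), (1.1, 0.0), (1.9, 1.0), (2.2, 1.0)]
training_labels = ["similar", "similar", "dissimilar", "dissimilar"]

predict = train(training_vectors, training_labels)
print(predict((1.05, 0.0)))
print(predict((2.0, 1.0)))
```

The returned `predict` function plays the role of the mathematical function described above: it is built once from the training data and then applied to observations whose labels are missing.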
Training set
The externally supplied data, which contains known values for the class labels that the supervised machine learning algorithm will later predict, is called the training set.

Test set
The data with unknown class labels, which are predicted by the machine learning algorithm, is called the test set.

Waikato Environment for Knowledge Analysis - WEKA
WEKA is a suite of machine learning software created by the University of Waikato in New Zealand. WEKA contains machine learning tools for data preprocessing, classification, regression, clustering, association rules, and visualization [7]. Since WEKA is easily accessible and contains many of the popular machine learning algorithms used within Music Information Retrieval, it was chosen as the main tool for creating the similarity algorithm.

1.1.2 Related work

Music Similarity
Aucouturier and Pachet performed similar research on automatically classifying the similarity between songs [8]. Their study used a set of 17,075 songs, on which the authors compared the timbre of different pairs of songs to determine the similarity. They performed a subjective evaluation with 10 users in which the users were presented with a target song, followed by two test songs. The two test songs were chosen so that one was measured as similar to the target song, while the other was measured as dissimilar. Afterwards, users were asked to decide which of the two test songs they found most similar to the target song. In this experiment the similar test song was picked by the users 80% of the time.

The success of Aucouturier and Pachet's work inspired this project in a number of ways, but mainly in two respects. Their research on the relevance of the timbre feature was influential when selecting features for this study. Furthermore, the evaluation method used by the authors was deemed suitable for this study's evaluation.
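The evaluation scheme described above can be scored as the share of trials in which the listener picked the test song that the similarity measure ranked as closer to the target. The following is a minimal Python sketch of that scoring; the function name `agreement_rate`, the song identifiers and the trial data are hypothetical and are not taken from either study.

```python
def agreement_rate(trials):
    """Fraction of trials where the user's choice matches the predicted similar song.

    trials: list of (predicted_similar_id, user_choice_id) tuples.
    """
    hits = sum(1 for predicted, chosen in trials if predicted == chosen)
    return hits / len(trials)


# Hypothetical trials: the user agreed with the prediction in 3 of 5 cases.
trials = [("s1", "s1"), ("s4", "s3"), ("s5", "s5"), ("s8", "s8"), ("s9", "s10")]
rate = agreement_rate(trials)
print(f"{rate:.0%}")
```

Because each trial offers exactly two alternatives, a listener choosing at random would be expected to agree with the prediction about half of the time, which is the baseline against which such a rate should be read.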
Music Title Identification
A related line of work in this domain is music title identification [9, 10], where the artist and title of a music source are identified by matching the audio fingerprint of the music to the known fingerprints of a large number of songs. An identification rate above 98% has been achieved for the MPEG-7 standard [9]. While music title identification works well for finding the title of a music source, it does not address the problem of finding similar music, since it does not consider the factors that make humans perceive two songs as similar [11]. However, it is clear that many of the features used in music title identification, such as timbre and loudness, can be used when determining music similarity.

Music Genre Classification
Automatic classification of the genre of a song is one of the more commonly researched topics in the field of music information retrieval today. Many studies have managed to correctly classify genres with a high degree of accuracy. Tzanetakis and Cook achieved an accuracy of 60% [12] and other studies have achieved an accuracy of at least 80% [13, 14]. Genre classification is closely related to music similarity, with similar predictive machine learning algorithms and musical features used for both tasks.

1.2 Problem statement

The main purpose of this project was to investigate the possibility of using the data contained in the MSD in combination with user-supplied song similarity ratings to create a machine learning algorithm able to classify the similarity between songs belonging to the pop and rock genres with reasonable accuracy. The project was limited to two closely related genres, as including more genres would require a quantity of user ratings and computation time beyond the scope of this study.

1.2.1 Hypothesis

Our hypothesis was that the developed application would be able to determine whether songs are very different, but might not accurately select the most similar songs.
This hypothesis was based on several confounding factors encountered in the initial stages of the project (see Chapter 4, Discussion).

Chapter 2 Method

This chapter describes the approach taken when the algorithm was created and evaluated, and gives a detailed explanation of the machine learning algorithm and the features used from the MSD.

2.1 Million Song Dataset

The MSD was chosen as the dataset since it is both freely available and contains a vast number of songs and relevant musical information. The features used to create the similarity rating are described below.

2.1.1 Features

The following is a detailed description of the MSD features which were examined and employed in this research. Some features were excluded because they were not considered relevant to the perceived similarity or because their potential applications were deemed too complex, and some were excluded due to their low availability in the MSD.

Tempo
Tempo is the speed of a musical composition; in the MSD it is measured in beats per minute (BPM). A note value is first designated as the beat, and the BPM is the number of such beats played per minute. This feature was chosen due to its significant impact on a musical piece, as a happy and a sad song often differ significantly in tempo. Happy songs are generally played faster than sad songs, and at a more energizing pace. The tempo of a pair of songs was given as a ratio, which was calculated by dividing the tempo of the quicker song by the tempo of the slower song.

Loudness
The overall loudness of a song is derived from a weighted combination of the individual loudness values of notes (start and max loudness). The loudness values are mapped to a human auditory model and measured in decibels [15]. This feature was chosen because the difference in loudness between two songs is likely a contributing factor to the perceived similarity of songs.
The loudness for a pair of songs was calculated as a ratio using the same method as for tempo.

Key
In the context of music theory, the key refers to the tonic note and chord. Keys range from C to B (C, C#, D, D#, ..., B), and in the MSD, C corresponds to the value 0 and B to 11. This feature was chosen because the relation between the keys may influence the perceived similarity of songs. A pair of songs was assigned a key value equal to the distance between the songs' keys on the chromatic circle. For example, C (0) and B (11) are adjacent on the circle and therefore have a distance of 1.

Mode
The mode indicates the modality (major or minor) of a track, that is, the type of scale from which its melodic content is derived [15]. This feature was chosen because the mode greatly affects the overall mood and feeling of a song. A pair of songs was assigned a mode value of 0 if the two songs were of the same mode and a value of 1 if their modes differed.

Timbre
Timbre is the quality of a musical note or sound that distinguishes different types of musical instruments or voices. It is also referred to as sound color, texture, or tone quality, and is derived independently of pitch and loudness. In the MSD the timbre feature is represented as a 12-dimensional array. The twelve elements emphasize different musical features such as brightness, flatness and loudness, and are ordered by importance [15]. This feature was chosen because it is likely one of the most important aspects, as it describes the musical instruments and vocal performance, both of which are of utmost importance to the perceived similarity of two songs. The timbre values for a pair of songs were calculated as the absolute value of the difference between the elements that emphasize the same feature in each song's array.

Timbre Confidence
Because the data in the MSD is automatically generated from songs, there is a varying degree of uncertainty in the timbre values. The confidence value lies between 0 and 1, and a low confidence means the value should be considered speculative.
This feature was chosen because it allows the algorithm to take the confidence into consideration when estimating the similarity based on the timbre. The timbre confidence of a pair of songs was calculated as the sum of the individual confidences of the two songs.

2.2 Machine Learning

Machine learning is a powerful tool both for handling large quantities of data and for creating algorithms which are too complex for humans to design by hand. It was therefore deemed the most convenient and efficient way to automatically classify songs, as considering each timbre value manually is a very complex task. In this study the WEKA suite is used to create and apply the various machine learning algorithms.

2.2.1 Supervised Machine Learning algorithms

Supervised machine learning is used in this study because users are likely good at determining the similarity of song pairs, and their ratings can then be used to train a classifying algorithm. The supervised machine learning algorithm uses the training set to weigh the importance of the relations between the similarity rating and the other musical features. The trained algorithm is then used to predict a similarity for the other song pairs in the test set, for which the similarity is unknown.

K-nearest neighbours (k-NN)
k-NN is an instance-based classifier [16] that works by creating a vector with N values for each known instance, where the N values are all values except the one being predicted, and placing each of these instances in an N-dimensional space. Once the known instances have been placed, the algorithm also places the instances being predicted in the N-dimensional space and assigns the unknown value of each instance according to the average of the known values of its k nearest neighbours [17].
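The prediction step of k-NN can be sketched in a few lines of Python. This is an illustrative sketch only and not the WEKA implementation used in this study; the function name `knn_predict`, the optional `weighted` flag (inverse-distance weighting, a common k-NN variant) and the example data are assumptions made for the illustration.

```python
from math import dist


def knn_predict(known, query, k, weighted=False):
    """Predict a value for `query` from its k nearest known instances.

    known: list of (feature_vector, value) pairs with known values.
    query: feature vector whose value is unknown.
    """
    # Sort known instances by Euclidean distance to the query and keep k.
    nearest = sorted(known, key=lambda item: dist(item[0], query))[:k]

    if not weighted:
        # Plain k-NN: average of the neighbours' known values.
        return sum(value for _, value in nearest) / len(nearest)

    # Inverse-distance weighting: an exact match would get infinite weight,
    # so its value (or the mean of several exact matches) is returned directly.
    exact = [value for vector, value in nearest if dist(vector, query) == 0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1 / dist(vector, query) for vector, _ in nearest]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, nearest))
    return weighted_sum / sum(weights)


# Hypothetical instances: (feature vector, similarity rating).
known = [((1.0, 0.0), 90.0), ((0.0, 2.0), 60.0), ((3.0, 4.0), 10.0)]
query = (0.0, 0.0)

plain = knn_predict(known, query, k=2)
weighted = knn_predict(known, query, k=2, weighted=True)
print(plain, weighted)
```

In the example, the two nearest neighbours lie at distances 1 and 2, so the plain average of their ratings is 75, while inverse-distance weighting pulls the prediction towards the closer neighbour and gives 80.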
The algorithm used in this project uses the Euclidean distance between instances, which means that every included attribute is treated as equally important, and it also weights the nearest neighbours by the inverse of their distance to the instance being classified. This causes closer neighbours to be more important when predicting the unknown instance, which is particularly helpful when the number of known values is low. The k-value used in this project was 7; it was chosen iteratively by minimizing the root mean square error while still trying to maximize the correlation coefficient when cross-validating the training data over 10 folds. This algorithm was chosen because it is fast and simple, and still achieved better results than a few other algorithms, such as support vector machines and decision trees, in a small preliminary test.

Bagging
Bagging, also known as Bootstrap Aggregating, was chosen to enhance the k-NN algorithm because it lowers the prediction error by reducing overfitting. This is accomplished by creating a new weighted learning set [17]. In machine learning, overfitting occurs when an algorithm overvalues features that increase the accuracy on the training set but are irrelevant when predicting values for a test set [18].

Attribute selection
Attribute selection tries to eliminate redundant or irrelevant features from the feature set, which reduces overfitting [19]. Because the k-NN algorithm uses Euclidean distance, it is very important to include only relevant features, and because the actual importance of the features was unknown, attribute selection was applied.

2.3 Project phases

The project was conducted in three phases. First, data was gathered in the form of user-submitted similarity ratings. The user data was then used to teach the machine learning algorithm which parameters carry the most significance when determining how similar two songs are.
Finally, the algorithm was applied to a larger set of songs and the results were evaluated in a final phase of user testing. In the evaluation, the users were first presented with a song from the evaluation tracklist followed by two songs from the subset; one of these songs had been determined to be among the most similar to the target song, and the other among the least similar. The users were then asked to pick which of the two songs from the subset they considered most similar to the target song.

2.3.1 Gathering input data

User data was gathered by hosting a web application, detailed in Appendix B - Web Application, which enabled the users to listen to pairs of song samples retrieved through the 7digital API [20]. These songs, the learning tracklist, consisted of a selected set of 10 unique songs. Every song in the set was matched against every other song, adding up to a total of 45 unique pairs for the user to rate on a scale from 0 to 100. The songs selected for this phase were all chosen from the pop and rock genres. These genres were selected because they share many similarities and are quite familiar to most users, which is likely to improve the quality of the user-submitted data. The 10 songs were chosen to provide a reasonable coverage of the pop and rock genres through variations in speed, mood, vocal performance and instrumental composition. Table 2.1 lists the artists and tracks that constitute the learning tracklist.

Track                   Artist
The Unforgiven II       Metallica
The Trooper             Iron Maiden
White Flag              Dido
A New Day Has Come      Céline Dion
About you now           Timo Räisänen
Basket Case             Green Day
Wind Of Change          Scorpions
Here I go again         Whitesnake
Smoke                   Natalie Imbruglia
Wonderwall              Oasis

Table 2.1: The learning tracklist

The user is first introduced to the application through three sample pairs which display a roughly estimated rating, in order to give the user some insight into what kind of songs are present in the set.
Subsequently, the real sample set is introduced and the ratings are saved. The rating session is tied to the user's IP address in order to limit the number of ratings supplied by each user.

2.3.2 Utilizing the user data

Firstly, the training set for the machine learning algorithm was created by extracting the differences between the two songs in every pair, as described in section 2.1.1. Secondly, a data field containing the average of the user-submitted similarity ratings was added to every pair. Additionally, every song was matched with itself and given a similarity rating of 100, in order to supply the algorithm with a few perfect matches. Lastly, the algorithm trained on this set was used to automatically classify the similarity between pairs composed of the evaluation songs and the songs in the subset.

In order to limit the size of the subset and keep the research within the pop and rock genres, the evaluation tracklist was only compared against songs by artists who carried a pop or rock tag both among the tags supplied by users at MusicBrainz.org [21] and among the Echo Nest genre tags. Furthermore, some tags were excluded from the search because they lie at the very edges of the genres and are therefore not well represented in the learning set. The excluded subgenres were the following: Grindcore, Deathgrind, Black metal, Doom metal, Sludge metal, Noise, Screamo, Glitch, Glitchcore, Aggrotech, Metalcore and Death metal.

2.3.3 Subjective evaluation

In addition to the 10 songs used in the first phase, another 40 songs from the pop and rock genres were added for the evaluation, as presented in Appendix C - The Evaluation Tracklist. In the same manner as in the first phase, the songs were chosen to provide a reasonably good coverage of the genres. The songs were chosen to be both fairly well known and popular in order to improve the user experience and thereby make users more likely to continue rating.
The user evaluations were gathered through a slightly modified version of the web application from the first phase. The users were presented with a target song followed by two test songs from a subset of approximately 4000 songs extracted from the Million Song Dataset. One of the test songs was randomly drawn from the 10 songs with the highest predicted similarity to the target song, and the other from the 10 songs with the lowest predicted similarity.

Chapter 3 Results

In this chapter the results of the study are presented in three parts, one for each of the three phases of the research.

3.1 User ratings

In total, 28 users submitted 965 similarity ratings of the 45 song pairs. The lowest number of ratings for a pair was 18 and the highest was 26. The average similarity rating over all pairs was 37.3 and the average standard deviation was 11.9. The histogram in Figure 3.1 illustrates the distribution of the average user-submitted similarity rating of the song pairs in the training set. Most song pairs received a rather low similarity rating.

The pair "White Flag - Dido" and "Smoke - Natalie Imbruglia" received the highest average similarity rating, 77. The pair "A New Day Has Come - Céline Dion" and "The Trooper - Iron Maiden" received the lowest average similarity rating, 7.

Both "White Flag - Dido" and "Smoke - Natalie Imbruglia" are performed by female pop artists and are quite close in terms of tempo. Both songs are rather mellow, and the instrumental compositions both feature strings and drums. "A New Day Has Come - Céline Dion" is a slow-paced pop ballad with female vocals. The song is mainly accompanied by piano and a background of strings. "The Trooper - Iron Maiden", on the other hand, is a fast-paced hard rock song with male vocals. Its instrumental composition consists of electric guitar, drums and bass guitar.
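The per-pair statistics reported in this section, an average rating and a standard deviation for each song pair, can be computed from the raw ratings as in the following Python sketch. The function name `summarize` and the ratings are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(ratings):
    """Group (pair, score) ratings by pair and return {pair: (mean, std_dev)}."""
    by_pair = defaultdict(list)
    for pair, score in ratings:
        by_pair[pair].append(score)
    return {pair: (mean(scores), pstdev(scores)) for pair, scores in by_pair.items()}


# Hypothetical ratings on the 0-100 scale; song names are placeholders.
ratings = [
    (("song_a", "song_b"), 70),
    (("song_a", "song_b"), 80),
    (("song_a", "song_b"), 90),
    (("song_a", "song_c"), 5),
    (("song_a", "song_c"), 15),
]

summary = summarize(ratings)
for pair, (average, spread) in summary.items():
    print(pair, round(average, 1), round(spread, 1))
```

Averaging the per-pair means and the per-pair standard deviations over all pairs would then give overall figures of the kind quoted above.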
Figure 3.1: Histogram detailing the distribution of user ratings for the song pairs

3.2 Automated similarity rating

The algorithm predicted 196,000 ratings, as 3920 songs were compared to each of the 50 songs in the evaluation tracklist. The attributes selected by the attribute selection classifier as the most important were: Tempo, Loudness, Mode, 6 out of the 12 timbre elements and the similarity value. The predicted similarity ratings were distributed evenly, with the lowest ratings near 17 and the highest ratings near 98.

3.3 User evaluation

For the evaluation we gathered 514 user comparisons from 21 unique users. The song that the algorithm had determined to be the most similar to the target song was picked by the users 365 times. This gives the algorithm a success rate of 71%. Appendix C - The Evaluation Tracklist lists the prediction accuracy for each of the 50 target songs. Figure 3.2 illustrates the distribution of the accuracy of the algorithm for the song pairs in the evaluation set.

Figure 3.2: Histogram detailing the distribution of accuracy for the song pairs

Chapter 4 Discussion

The purpose of the study was to find out whether it is possible to automatically classify the similarity between songs with reasonable accuracy. While the model used in this study did not conclusively prove that this is possible, as a few songs received an accuracy below 50%, which cannot be considered reasonably accurate, the results indicate that it may be feasible in future work. The data gathered from the user ratings suggests that users tend to share a common opinion on which pairs of songs they deem to be similar, suggesting that perceived song similarity is not solely an individual notion.

Although this study strove to limit the tracks to the pop and rock genres, several tracks that could be considered neither pop nor rock were included.
Because the genre tags were only associated with the artist, the subset also contained tracks which would not be considered actual songs and are therefore not relevant to an application for rating song similarity. An example of this problem is a track which consists of an interview with a rock artist.

In comparison with the previous work of Aucouturier and Pachet [8], the algorithm of this study performed slightly worse, with 71% accuracy compared to their 80%. However, they analyzed the similarity using only the timbre of the songs, unlike this study, which took many additional variables into consideration.

Finally, the hypothesis stated in the initial stages of this project proved to be mostly correct. While the algorithm still often left a lot to be desired when matching a target song against a supposedly similar song, it was quite effective at finding songs which deviated a great deal from the target song. Only 12 of the 50 target songs had an accuracy of 50% or worse, which suggests that there were certain elements in these songs that caused them to be greatly mismatched. In fact, the number of songs with an accuracy of 50% or below was smaller than the number of songs with an accuracy of 90% or above. Decreasing the variance by improving the accuracy of the worst performing songs would greatly increase the overall precision of the algorithm, likely to the point where it would have a respectable accuracy.

4.1 Confounding factors

4.1.1 Million Song Dataset

Most features included in the MSD are automatically extracted from audio provided by The Echo Nest. As such, many fields are approximated, which can compromise the accuracy of the data. An example of a bad approximation is a pair consisting of the same song recorded on two different occasions, "Whitesnake - Here I go again (2008 Digital Remaster)" and "Whitesnake - Here I go again '87 (2007 Digital Remaster)", which has a BPM ratio of 3 according to the MSD.
Furthermore, the target song which had the worst accuracy in the evaluation phase was "Take Me Out - Franz Ferdinand", which has a tempo of 210 BPM according to the data in the MSD. However, the general consensus among public sources [22, 23] and our peers is that the song has a tempo of 105 BPM. Ronnow and Twetman encountered similar issues [24] when evaluating genre classification. This indicates that miscalculated BPM values in the MSD may cause the algorithm to perform poorly. Unfortunately, there is no BPM confidence value which could allow the algorithm to place less weight on potentially erroneous values.

4.1.2 Feature usage

It is possible that additional features present in the MSD could be used to increase the accuracy of the algorithm. Additionally, the way the features of two songs were compared in this study may not be optimal. For example, the loudness of two songs, which was compared as a ratio in this study, might be better compared as an absolute difference.

4.1.3 User data

The user data analyzed in this study had a standard deviation of up to 19, which means that the perceived similarity may vary a great deal between different users. This spread could potentially corrupt the training set and therefore cause the algorithm to incorrectly classify the importance of certain features.

4.1.4 Learning tracklist

The learning tracklist used in this project was limited to only 10 songs, which seemed to be insufficient for covering the entire spectrum of the pop and rock genres. Songs which were significantly different from all the songs in the training set were often incorrectly classified. In the initial stages of the study, this occasionally resulted in pop ballads and grindcore metal songs being matched as similar.

4.1.5 Machine learning algorithm

The machine learning algorithms used in this research were selected through a small set of empirical studies.
Therefore, the chosen algorithm may not be the most effective one. In addition, the parameters of this study’s algorithms could possibly be tuned further to increase predictive accuracy.

Chapter 5 Conclusions and future work

The k-NN algorithm created in this study successfully distinguished between similar and dissimilar songs 71% of the time, with 28% of the evaluated target songs receiving an accuracy of 90% or above. However, 24% of the target songs received an accuracy of 50% or below, which means that the algorithm cannot be considered accurate. On the other hand, the results achieved by this study are a strong indication that creating an algorithm that very accurately predicts song similarity is possible. While the user-submitted similarity had a rather high standard deviation, the average rating seemed to be a good indicator of perceived song similarity. Therefore, using user ratings to train an algorithm is likely a viable method when a large number of users is available.

5.1 Future work

If the factors that caused the predictions for a small subset of the songs to be greatly inaccurate were identified, the algorithm presented in this study could be much improved. Future studies could achieve further improvement by addressing the confounding factors encountered in this research. This could be accomplished by expanding and improving upon the learning tracklist, gathering a larger quantity of user input and validating the data used from the MSD, especially for the learning phase. Furthermore, other compositions of machine learning algorithms may be more suitable than k-NN for predicting similarity.

Acknowledgments

We would like to give special thanks to the many Anders at the Department of Speech, Music and Hearing (TMH) at the Royal Institute of Technology, Stockholm, Sweden. To Anders Askenfelt for giving us a head start and providing valuable insight and feedback.
To Anders Friberg and Anders Elowsson for their invaluable input regarding both machine learning and adapting the data in the Million Song Dataset, which aided us greatly in putting it to proper use.

Furthermore, we give our sincerest thanks to everyone who has provided this project with valuable user data. Thanks to their diligence through the, at times, tedious task of rating song similarity, the project got a solid foundation to start from and a useful evaluation.

References

[1] Bertin-Mahieux, Thierry & Ellis, Daniel P.W. & Whitman, Brian & Lamere, Paul, The Million Song Dataset, LabROSA, Electrical Engineering Department, Columbia University, New York, USA & The Echo Nest, Somerville, USA, 2011. http://www.columbia.edu/~tb2332/Papers/ismir11.pdf (2013-04-10)

[2] The Echo Nest, the source of the data used in the Million Song Dataset. http://echonest.com/company/ (2013-04-11)

[3] Mitchell, Tom M., The Discipline of Machine Learning, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA, 2006. https://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf (2013-04-10)

[4] Nilsson, Nils J., Introduction to Machine Learning, Robotics Laboratory, Department of Computer Science, Stanford University, Stanford, USA, p 1-5, 1998. http://robotics.stanford.edu/~nilsson/MLBOOK.pdf (2013-04-10)

[5] Mohri, Mehryar, Lecture on: Foundations of Machine Learning: Lecture 1, Courant Institute & Google Research, 2013. http://www.cs.nyu.edu/~mohri/mls/lecture_1.pdf (2013-04-10)

[6] Gentleman, R. & Huber, W. & Carey, V. J., Bioconductor Case Studies - Supervised Machine Learning, Springer Science+Business Media LLC, 2008, p 121-123. http://link.springer.com/chapter/10.1007%2F978-0-387-77240-0_9 (2013-04-10)

[7] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka (2013-04-10)

[8] Aucouturier, Jean-Julien & Pachet, Francois, Music Similarity Measures: What’s the Use?, SONY Computer Science Laboratory, Paris, France, 2002.
http://web.cs.swarthmore.edu/~turnbull/cs97/f09/paper/Aucouturier02.pdf (2013-04-10)

[9] Allamanche, Eric & Herre, Jürgen & Hellmuth, Oliver & Fröba, Bernhard & Kastner, Thorsten & Cremer, Markus, Content-based Identification of Audio Material Using MPEG-7 Low Level Description, Computer Science Department, Brandeis University, Germany, 2001. http://www.cs.brandeis.edu/~dilant/cs175/%5BAlexander-Friedlander%5D.pdf (2013-04-10)

[10] Cano, Pedro & Batlle, Eloi & Kalker, Tom & Haitsma, Jaap, A Review of Algorithms for Audio Fingerprinting, Universitat Pompeu Fabra, Barcelona, Spain & Philips Research Eindhoven, Eindhoven, The Netherlands, 2002. http://ucbmgm.googlecode.com/svn-history/r7/trunk/Documentos/Fingerprint-Cano.pdf (2013-04-10)

[11] Cano, Pedro & Koppenberger, Markus & Wack, Nicolas, Content-based Music Audio Recommendation, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain, 2005. http://dl.acm.org/citation.cfm?id=1101181 (2013-04-10)

[12] Tzanetakis, George & Cook, Perry, Musical Genre Classification of Audio Signals, IEEE Transactions on Speech and Audio Processing, 2002. http://dspace.library.uvic.ca:8080/bitstream/handle/1828/1344/tsap02gtzan.pdf?sequence=1 (2013-04-10)

[13] Li, Tao & Ogihara, Mitsunori & Qi, Li, A Comparative Study on Content-Based Music Genre Classification, Computer Science Department, University of Rochester, Rochester, USA & Department of CIS, University of Delaware, Newark, USA, 2003. http://dl.acm.org/citation.cfm?id=860487&bnc=1 (2013-04-10)

[14] Soltau, Hagen & Schultz, Tanja & Westphal, Martin & Waibel, Alex, Recognition of Music Types, Interactive Systems Laboratories, University of Karlsruhe, Germany & Carnegie Mellon University, USA, 1998. http://www.ri.cmu.edu/pub_files/pub1/soltau_hagen_1998_2/soltau_hagen_1998_2.pdf (2013-04-10)

[15] Documentation for the Analyzer used to create the MSD. http://docs.echonest.com.s3-website-us-east-1.amazonaws.com/_static/AnalyzeDocumentation.pdf (2013-04-10)

[16] Kotsiantis, S. B., Supervised Machine Learning: A Review of Classification Techniques, Department of Computer Science and Technology, University of Peloponnese, Greece, 2007. http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf (2013-04-10)

[17] Steele, Brian M., Exact bootstrap k-nearest neighbor learners, Springer Science+Business Media LLC, 2008. http://link.springer.com/content/pdf/10.1007%2Fs10994-008-5096-0 (2013-04-10)

[18] Singh, Aarti, Lecture on: Practical Issues in Machine Learning Overfitting and Model Selection, Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA, 2010. http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture8.pdf (2013-04-10)

[19] Guyon, Isabelle & Elisseeff, André, An Introduction to Variable and Feature Selection, Empirical Inference for Machine Learning and Perception Department, Clopinet, Berkeley, USA & Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2003. http://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf (2013-04-10)

[20] API utilized for track previews. http://developer.7digital.net/ (2013-04-10)

[21] MusicBrainz.org, a community-maintained open source encyclopedia of music information. http://musicbrainz.org/ (2013-04-11)

[22] A BPM database. http://songbpm.com/franz-ferdinand/take-me-out (2013-04-10)

[23] A BPM database. http://www.bpmdatabase.com/search.php?begin=0&num=1&numBegin=1&artist=franz+ferdinand&title=take+me+out (2013-04-11)

[24] Rönnow, Daniel & Twetman, Theodor, Automatic Genre Classification From Acoustic Features, Royal Institute of Technology, Stockholm, Sweden, 2012.
http://www.csc.kth.se/utbildning/kth/kurser/DD143X/dkand12/Group7Anders/final/Ronnow_Twetman_grp7_final.pdf (2013-04-10)

Chapter 6 Appendix

6.1 Appendix A - Million Song Dataset Field List

Field name                  Type            Description
analysis sample rate        float           sample rate of the audio used
artist 7digitalid           int             ID from 7digital.com or -1
artist familiarity          float           algorithmic estimation
artist hotttnesss           float           algorithmic estimation
artist id                   string          Echo Nest ID
artist latitude             float           latitude
artist location             string          location name
artist longitude            float           longitude
artist mbid                 string          ID from musicbrainz.org
artist mbtags               array string    tags from musicbrainz.org
artist mbtags count         array int       tag counts for musicbrainz tags
artist name                 string          artist name
artist playmeid             int             ID from playme.com, or -1
artist terms                array string    Echo Nest tags
artist terms freq           array float     Echo Nest tags freqs
artist terms weight         array float     Echo Nest tags weight
audio md5                   string          audio hash code
bars confidence             array float     confidence measure
bars start                  array float     beginning of bars, usually on a beat
beats confidence            array float     confidence measure
beats start                 array float     result of beat tracking
danceability                float           algorithmic estimation
duration                    float           in seconds
end of fade in              float           seconds at the beginning of the song
energy                      float           energy from listener point of view
key                         int             key the song is in
key confidence              float           confidence measure
loudness                    float           overall loudness in dB
mode                        int             major or minor
mode confidence             float           confidence measure
release                     string          album name
release 7digitalid          int             ID from 7digital.com or -1
sections confidence         array float     confidence measure
sections start              array float     largest grouping in a song, e.g. verse
segments confidence         array float     confidence measure
segments loudness max       array float     max dB value
segments loudness time      array float     time of max dB value
segments loudness start     array float     dB value at onset
segments pitches            2D array float  chroma feature, one value per note
segments start              array float     musical events, note onsets
segments timbre             2D array float  texture features (MFCC+PCA-like)
similar artists             array string    Echo Nest artist IDs
song hotttnesss             float           algorithmic estimation
song id                     string          Echo Nest song ID
start of fade out           float           time in sec
tatums confidence           array float     confidence measure
tatums start                array float     smallest rhythmic element
tempo                       float           estimated tempo in BPM
time signature              int             estimate of number of beats per bar
time signature confidence   float           confidence measure
title                       string          song title
track id                    string          Echo Nest track ID
track 7digitalid            int             ID from 7digital.com or -1
year                        int             release year from MusicBrainz

6.2 Appendix B - Web Application

6.2.1 User rating application

6.2.2 User evaluation application

6.3 Appendix C - The Evaluation Tracklist

Track                               Artist                            Accuracy
Carrie                              Europe                            100%
Dr. Feelgood                        Mötley Crüe                       100%
Fast Car                            Tracy Chapman                     100%
Highway Star                        Deep Purple                       100%
It Takes A Fool To Remain Sane      The Ark                           100%
More Than A Feeling                 Boston                            100%
This Love (Will be your downfall)   Ellie Goulding                    100%
White Flag                          Dido                              100%
Whenever, Wherever                  Shakira                           90.91%
My Immortal                         Evanescence                       90.91%
Shoreline                           Anna Ternheim                     90.91%
A New Day Has Come                  Céline Dion                       90%
A Thousand Miles                    Vanessa Carlton                   90%
About You Now                       Timo Räisänen                     90%
Black Velvet                        Alannah Myles                     88.89%
Destiny Calling                     Melody Club                       88.89%
Misery Business                     Paramore                          88.89%
The Downeaster “Alexa”              Billy Joel                        88.89%
Flux                                Bloc Party                        87.5%
Only You                            Joshua Radin                      83.33%
Here I Go Again                     Whitesnake                        83.33%
Tom’s Diner                         Suzanne Vega, DNA                 81.82%
Wonderwall                          Oasis                             81.82%
Angels                              Within Temptation                 81.82%
Crazy On You                        Heart                             80%
It’s My Life                        Bon Jovi                          76.92%
Erase / Rewind                      The Cardigans                     75%
The Trooper                         Iron Maiden                       75%
Africa                              Toto                              75%
Cats In The Cradle                  Ugly Kid Joe                      70%
Basket Case                         Green Day                         70%
4 In The Morning                    Gwen Stefani                      66.67%
Learning To Fly                     Tom Petty And The Heartbreakers   66.67%
Scarborough Fair/Canticle           Simon & Garfunkel                 62.5%
Glory To The Brave                  Hammerfall                        60%
Bark At the Moon                    Ozzy Osbourne                     58.33%
I Want To Know What Love Is         Foreigner                         55.56%
Slow Dancing In A Burning Room      John Mayer                        55.56%
Good Riddance                       Green Day                         50%
Chariot                             Gavin DeGraw                      50%
The Unforgiven II                   Metallica                         44.44%
Unwritten                           Natasha Bedingfield               44.44%
18 And Life                         Skid Row                          36.36%
Hero                                Enrique Iglesias                  33.33%
Smoke                               Natalie Imbruglia                 30%
You’ll Be In My Heart               Phil Collins                      27.27%
Wind Of Change                      Scorpions                         20%
Make Your Own Kind Of Music         Mama Cass                         20%
Rooftops                            lostprophets                      20%
Take Me Out                         Franz Ferdinand                   18.18%
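The summary figures quoted in the discussion and conclusions can be recomputed from the accuracy values in the evaluation tracklist. The short Python sketch below is an illustration added for the reader and was not part of the study's application. The accuracy values are copied from Appendix C in table order, and the helper name `share` is a hypothetical choice.

```python
# Per-song accuracies in percent, copied from the evaluation tracklist in table order.
accuracies = [
    100, 100, 100, 100, 100, 100, 100, 100,
    90.91, 90.91, 90.91, 90, 90, 90,
    88.89, 88.89, 88.89, 88.89, 87.5,
    83.33, 83.33, 81.82, 81.82, 81.82,
    80, 76.92, 75, 75, 75, 70, 70,
    66.67, 66.67, 62.5, 60,
    58.33, 55.56, 55.56, 50, 50,
    44.44, 44.44, 36.36, 33.33, 30,
    27.27, 20, 20, 20, 18.18,
]


def share(values, predicate):
    """Return (count, fraction) of the values that satisfy the predicate."""
    count = sum(1 for value in values if predicate(value))
    return count, count / len(values)


high = share(accuracies, lambda a: a >= 90)  # target songs at 90% or above
low = share(accuracies, lambda a: a <= 50)   # target songs at 50% or below

print(len(accuracies))  # 50
print(high)             # (14, 0.28)
print(low)              # (12, 0.24)

# Overall accuracy reported in the study: 365 matching choices out of 514
print(round(365 / 514 * 100))  # 71
```

The counts agree with the text: 14 of the 50 target songs reached 90% or above, and 12 had an accuracy of 50% or below.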