This paper looks at the methods and results of recent work on using machine learning to predict music preferences. There is quite a bit of research in this arena, and this review attempts to collect a representative sample. Unlike more purely academic subjects, this field is being driven in part by corporate interests, and for this reason the paper also pays some attention to successful commercial applications. The paper begins with a general introduction followed by summaries of the subjects of interest and test results. It was not possible to cover all the papers of interest, and a listing with brief descriptions is included before the bibliography.
Music is one of the fundamental methods of human expression. With billions of dollars tied up in the entertainment industry around the world, music is big business as well. The question of what exactly makes good music is open to considerable debate. Personal experiences, cultural conditioning, social pressures and very likely even brain structure all play important roles in influencing an individual's response to a piece of music. Despite this variety, there seem to be patterns in musical preferences. Machine learning holds some promise for being able to discern and leverage those patterns in useful ways.
Machine learning, as a field, deals with designing processes whereby machines learn. Several types of learning are possible, but in the domain of preference prediction the most applicable is inductive learning, in which the computer is given a set of evidence and attempts to determine its salient characteristics.
Music production starts with musicians and ends with the listeners. Machine learning has the potential to provide useful information to both ends of this process.
People write music for reasons ranging from love to angst to patriotism. Record companies, however, produce records for essentially one reason: profit. For a given song, a record company would like to know how popular it is going to be and what needs to be done to make it more popular. Commercial systems are already being developed to give musicians a hit rating [1] for their new songs, and many record companies are taking this into account.
Hundreds of new songs are produced every day. It is literally impossible for a listener to hear and evaluate them all. Listeners would like to be able to generate suggestions, based on their preferences, of songs that they might enjoy. Products are being developed [2] that allow users to create custom radio stations that incrementally tailor themselves to the user based on feedback.
Academic research is more active in the area of estimating user preferences. There's work using a variety of algorithms including anchor space clustering [6], support vector machines [5], and Bayesian inference [4].
A machine learning algorithm is only as strong as the characteristics the algorithm has to work with. There is a variety of research on the usefulness of characteristics including the more commonplace Mel coefficients [7] and Fourier transforms [8], but also creative applications such as latent semantic analysis [5,4].
Vicenç Gaitán Alcalde, Carlos María López Ullod, Antonio Trias Bonet, Antonio Trias Llopis, Jesús Sanz Marcos, Daniel Caldentey Ysern, and Dominic Arkwright. Method and System for Music Recommendation. United States Patent #7,081,579, 25 July 2006. Polyphonic Human Media Interface, S.L.
Music Intelligence Solutions (MIS) is the American expansion of the Spain-based Polyphonic HMI and holder of US patent #7,081,579: Method and System for Music Recommendation. MIS is currently marketing several targeted product offerings. Their initial product, still in production, is Hit Song Science, a fee-based popularity prediction service for musicians. They are expanding their technology into the arena of music recommendation with products such as Music Information Universe, which allows the development of custom recommendation systems for clients with specific content or subscribers.
The primary characterization used in MIS's computations is the discrete Fourier transform [8]. The song is divided into parts, various characteristics are extracted from each part, and the progression of the averages for the different characteristics over time generates a signature. These signature vectors are then grouped by the "sum of differences squared," the distance criterion that k-means clustering minimizes.
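The signature-and-grouping idea can be sketched in a few lines. This is only an illustration under stated assumptions: mean absolute amplitude stands in for the spectral characteristics (it is not one of the features named in the patent), and the two cluster centers are made-up values rather than anything learned from data.

```python
import math

def signature(samples, parts=4):
    """Split a signal into equal parts and record one average per part.

    Mean absolute amplitude is a hypothetical stand-in characteristic,
    chosen only to keep the sketch short.
    """
    size = len(samples) // parts
    return [
        sum(abs(x) for x in samples[i * size:(i + 1) * size]) / size
        for i in range(parts)
    ]

def sum_of_squared_differences(a, b):
    """Distance between two signature vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(sig, centers):
    """Index of the closest center (the assignment step of k-means)."""
    return min(range(len(centers)),
               key=lambda i: sum_of_squared_differences(sig, centers[i]))

# A tone that fades in versus one that stays loud throughout.
fade_in = [(n / 400) * math.sin(0.2 * n) for n in range(400)]
steady = [math.sin(0.2 * n) for n in range(400)]

# Illustrative centers: "gets louder over time" and "uniformly loud".
centers = [[0.1, 0.2, 0.4, 0.5], [0.6, 0.6, 0.6, 0.6]]
```

Calling nearest(signature(fade_in), centers) and nearest(signature(steady), centers) places the two tones in different groups, because their averages progress differently over time even though both are built from the same sine wave.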
The primary strength of MIS's system lies in the determination of which spectral components correspond to perceptual auditory characteristics such as "brightness, tempo, volume, rhythm and octave." Another important asset is their dataset, which includes in excess of 3.5 million songs.
Ike, Elephant. "Tim Westergren Interview." Available online at http://videos.howstuffworks.com/reuters/1069-how-pandora-radio-works-video.htm. Tiny Mix Tapes, Jan. 2006.
Music Genome Project Website. About the Music Genome Project. Available online at http://www.pandora.com/mgp.shtml. Visited December 2007.
The metaphor that Pandora uses is an adaptive radio station that trains itself to a listener's tastes. Unlike many other systems [1], Pandora includes a variety of expert-identified characteristics such as genre and the number and gender of artists. Users interact with Pandora via the web. A user specifies either an artist or a song to model the channel on, and positive or negative feedback on the songs that are selected helps to tune the station to the listener's preferences.
The Music Genome Project serves as the basis for the categorization in Pandora. It aims to be "the most comprehensive analysis of music ever." [3] Relatively few specific statistics have been published, but in an interview with Reuters [2], Pandora creator Tim Westergren claimed the system examines "close to 400" qualities and has a database of "hundreds of thousands of songs."
Qing Li, Sung Hyon Myaeng, Dong Hai Guan, and Byeong Man Kim. A Probabilistic Model for Music Recommendation Considering Audio Features. In Information Retrieval Technology, pages 72-83. Springer Verlag, Lecture Notes in Computer Science 3689, 2005.
Content-based filters are a system for producing recommendations where a set is filtered based on explicit or implied characteristics of the searcher. For example, a keyword-based search engine takes a term and searches for pages containing that term. Collaborative filtering on the other hand looks at the behaviors of other people and attempts to infer preferences. For example, Amazon has a section when viewing a product titled, "other users who bought this also bought…"
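The difference between the two kinds of filter can be made concrete with a toy sketch. The function names, the catalog and the purchase baskets are all hypothetical; real systems weight and normalize far more carefully than this.

```python
from collections import Counter

def keyword_filter(catalog, term):
    """Content-based: keep titles whose description contains the term."""
    return [title for title, text in catalog.items()
            if term.lower() in text.lower()]

def also_bought(baskets, item):
    """Collaborative: rank items by how often they co-occur with `item`
    in other people's baskets."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(other for other in basket if other != item)
    return [name for name, _ in counts.most_common()]

catalog = {
    "Kind of Blue": "modal jazz trumpet",
    "Blue Train": "hard bop jazz saxophone",
    "Nevermind": "grunge rock guitar",
}
baskets = [
    ["Kind of Blue", "Blue Train"],
    ["Kind of Blue", "Blue Train", "Nevermind"],
    ["Nevermind"],
]
```

The keyword filter looks only at the description of each item, while also_bought never inspects the items at all and relies entirely on the behavior of other users.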
Both types of filters suffer from the so-called "cold start problem": it is not possible to infer anything about someone about whom nothing is known. Li et al. describe a system that cannot overcome the problem entirely, but which "warms" more quickly and starts making effective recommendations even with sparse information.
The approach centers on two groups of clusters: one abstracting listeners and the other abstracting songs. The songs are clustered using k-means, with the addition of fuzzy set theory to allow a song to belong to every cluster to some, perhaps zero, extent. Communities of listeners are clustered using k-medoids for its robustness to the effects of outliers.
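Two of the ideas in this paragraph, graded cluster membership and the robustness of a medoid, can be shown briefly. The membership rule below is the standard fuzzy c-means formula with fuzzifier m, used here as an assumption; the paper's exact formulation may differ.

```python
def memberships(point, centers, m=2.0):
    """Degree to which one point belongs to each cluster (sums to 1)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) ** 0.5
             for center in centers]
    zeros = dists.count(0.0)
    if zeros:  # the point sits exactly on one or more centers
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists)
            for d_i in dists]

def medoid(values):
    """The member with the smallest total distance to all other members."""
    return min(values, key=lambda v: sum(abs(v - other) for other in values))

ratings = [1, 2, 3, 100]           # one extreme outlier
mean = sum(ratings) / len(ratings)  # dragged to 26.5 by the outlier
center = medoid(ratings)            # stays among the typical values
```

A point one unit from the first center and three units from the second receives memberships of roughly 0.9 and 0.1, so it contributes to both clusters. The medoid must be an actual member of the set, which is why a single outlier moves it far less than it moves the mean.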
Several characteristics are considered in clustering the audio data including MFCC [7], additional spectral components, rhythmic features and pitch features. Listener clusters are based solely on rankings of songs.
Testing was done on 760 pieces with 4,340 ratings made by 128 users. Each of 25 trials omitted a single test user, and the accuracy was averaged over the trials. Different complexities were examined for both users and songs. The error dropped from 87% to 64% going from 1 to 50 communities; it then rose semi-linearly until the maximum sample size of 160. The error with increased complexity of audio features also went up and down, but the range was only 0.02. Compared to an item-based content filtering scheme, there was a 5% increase in effectiveness.
Ruth Dhanaraj and Beth Logan. HPL-2005-149: Automatic Prediction of Hit Songs. In Proceedings of the International Conference on Music Information Retrieval. London, UK, August 2005.
The bulk of work on music preference determination is based on musical characteristics, be they automated characteristics such as MFCC [7] or expert-identified data such as genre. Relatively little attention has been given to the lyric content of the songs.
Dhanaraj and Logan examine the role that semantic content plays in a song's popularity. Two methods are compared. In the first, MFCC features taken from several points in a variety of songs were used to generate a set of k-means clusters. Each cluster was then given an identifier, and a song's identification vector was a composite of the identifiers of the clusters to which its series of MFCCs belong. This permits both a generalization of MFCC types and a more compact vector. The comparison method involved the application of latent semantic analysis [8], which uses probabilistic methods to correlate word frequency with likely semantic concepts present in a document. The identification vector was developed using a similar k-means approach.
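One way to read the "composite of cluster identifiers" is as a histogram over a codebook: each frame is replaced by the index of its nearest cluster, and the song is described by how often each index occurs. The sketch below assumes that reading; the two-dimensional frames and the centers are invented for illustration and are far smaller than real MFCC vectors.

```python
def nearest_cluster(frame, centers):
    """Index of the center closest to one feature frame."""
    return min(
        range(len(centers)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(frame, centers[i])),
    )

def identification_vector(frames, centers):
    """Fraction of a song's frames that fall into each cluster."""
    counts = [0] * len(centers)
    for frame in frames:
        counts[nearest_cluster(frame, centers)] += 1
    return [count / len(frames) for count in counts]

# Hypothetical codebook of two clusters and a four-frame "song".
centers = [[0.0, 0.0], [5.0, 5.0]]
song = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2]]
vector = identification_vector(song, centers)
```

However many frames a song has, the resulting vector has one entry per cluster, which is the compactness the paragraph refers to.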
The identification vectors were then classified using support vector machines [9]. Additionally the same vectors were classified using "boosting classifiers" consisting of optimal combinations of weak linear learners. [10]
Tests were conducted on a 1,700-song dataset. Semantic content proved a slightly more effective determinant (68% effective versus 66%) of song popularity than audio content. Repeated tests were done with different numbers of k-means clusters and different vector sizes. There was relatively little variation in effectiveness, but in general boosting tended to decrease in performance as the complexity increased.
One of the more interesting findings was that the semantic content that proved most effective for the weak learners was terms that acted as negative bounds, that is, terms that were almost certain not to occur in a popular song.
Adam Berenzweig, Daniel P.W. Ellis and Steve Lawrence. Anchor Space for Classification and Similarity Measurement of Music. In Proceedings of the International Conference on Multimedia and Exposition pages I-29-32. Baltimore, Maryland, July 2003.
Anchor-spaces deal with a common issue in clustering. Algorithms such as k-means and support vector machines [9] rely on conceptualizing the data as vectors in an n-dimensional space. This makes it difficult to deal with certain types of data such as music genre which do not have a natural ordered mapping to the real numbers. Anchor space attempts to deal with this by using an expert to select representative anchors for a given property. Clusters can then be built that express their finding in terms of the relationships to those anchors which are frequently more intuitive than raw spectral data. For example, the results of an anchor space clustering of music data might enable one to say, "the song has a strong blues influence with some folk elements."
The clustering was done with two configurations of neural nets. For the M anchors being considered, one method was an M-way classifier configured so that strong activation of one feature would suppress the others, and the other was M separate classifiers. The input to the classifiers was MFCCs and the deltas between those MFCCs. Feedback into the neural networks was gathered via a web interface where listeners could rank the relative similarity of songs.
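The contrast between the two configurations can be illustrated with their output stages alone. The sketch assumes a softmax for the M-way classifier and independent sigmoids for the M separate classifiers; the paper does not state these choices here, and the scores are invented.

```python
import math

def m_way(scores):
    """Softmax: outputs compete, so one strong anchor suppresses the rest."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def m_separate(scores):
    """Independent sigmoids: each anchor is judged on its own."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

# Raw scores for three hypothetical anchors, e.g. blues, folk, rock.
balanced = [1.0, 1.0, 1.0]
strong_blues = [3.0, 1.0, 1.0]
```

Raising the first score lowers the other two outputs of m_way, whereas the corresponding outputs of m_separate are unchanged. Either way, the output vector is the song's position in anchor space.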
Tests were conducted on several different quantities of data with different modeling methods. The best results were seen using linear weights within the neural nets (as opposed to logarithmic) where the system achieved 38% effectiveness (as opposed to a 0.25% random possibility).
The Fourier series is based on the idea that any waveform can be approximated to arbitrary precision with a combination of sine and cosine waves. The Fourier transform produces the weights needed to represent a waveform as a summation of these curves. It takes complex analog signals and allows them to be represented as a series of numbers.
Digital audio recordings are generally not represented as the actual waveforms. Rather, thousands of samples are taken each second and used to reproduce an approximation of the original signal. Compact discs, for example, are sampled at 44.1 kHz. The normal Fourier series is designed to deal with continuous curves, whereas discrete Fourier transforms are specifically designed to handle sequenced sets of discrete samples.
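A direct, unoptimized implementation of the discrete Fourier transform makes the idea concrete. This is a minimal sketch from the textbook definition, not a fast Fourier transform, and the 32-sample test signal is an assumption chosen to keep the arithmetic small.

```python
import cmath
import math

def dft(samples):
    """Discrete Fourier transform computed directly from its definition.

    Entry k measures how strongly a wave completing k cycles over the
    whole window is present in the samples.
    """
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples))
        for k in range(n)
    ]

# A cosine that completes exactly 3 cycles in a 32-sample window.
n = 32
signal = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
magnitudes = [abs(value) for value in dft(signal)]
```

For this signal the magnitudes are essentially zero everywhere except at index 3 and its mirror image at index 29, which is how the transform turns a waveform into a short list of weights.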
Wikipedia Website Article. Mel Frequency Cepstral Coefficients. Available online at http://en.wikipedia.org/wiki/Mel_frequency_cepstral_coefficient. Visited December 2007.
For frequencies above 500 Hz, larger and larger increases in frequency are perceived by the human auditory system as equivalent increases in pitch. The Mel Frequency Cepstral Coefficients (MFCCs) simply group the bands of the Fourier spectrum on a logarithmic scale that approximates human hearing, the mel scale.
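The warping can be sketched with one widely used formula for the mel scale, m = 2595 log10(1 + f/700). Other variants exist, and this may not be the exact form used in the sources reviewed here; the function names are hypothetical.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (a common formula; variants exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, bands):
    """Band edges in Hz that are equally spaced on the mel scale."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / bands
    return [mel_to_hz(low + i * step) for i in range(bands + 1)]

edges = mel_band_edges(0.0, 8000.0, 10)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

Each band covers the same distance in mels, so the widths in Hz grow steadily toward the high end of the spectrum, mirroring the way the ear needs larger frequency steps to hear the same change in pitch.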
There is a surprisingly large amount of information on this topic on the internet once one begins to look.