Data mining musical profiles
Anthony Liekens, March 28 to April 2, 2007
Abstract. Here's a preliminary data mining analysis of the musical social networking service Last.fm. An automated classification into clusters or sub populations with related musical genres reveals the structure of musical preferences among the users in a relatively large sample population. Musical tag clouds are adopted to characterise users and populations, which makes the results highly descriptive and easier to interpret.
Last.fm allows its users to register their musical listening habits: every time a user plays a music track, this track's information is sent to Last.fm's servers. Based on the information gathered over a longer period, the Last.fm service generates a list of users (neighbours) who listen to similar artists and tracks. On the basis of this information, Last.fm then recommends artists and songs to its users. This approach works surprisingly well and provides very valuable information for users who want to expand their musical horizons. If two users listen to a large common pool of artists, it is indeed highly probable that each will also enjoy the artists that only the other listens to.
Here, we study the structure of musical preferences in a sample population of this social network. Based on the information in Last.fm's database, we can describe a user's profile by the common tags or labels of his artists, with proportions of, e.g., "rock," "jazz" or "hip-hop" as his preferred music genres. This approach is similar to, but not the same as, the methodology adopted by Last.fm itself, where data mining is based on shared artists and tracks. By adopting tags that describe a user's preferred musical genres, a more descriptive and dimensionally less complex (and thus mathematically simpler) description of the population and its main structure can be given.
Here, we have taken a sample of 2840 Last.fm users, their 28302 recorded artists and these artists' commonly used tags. The sample was taken in the last week of March 2007. We show how one can determine important groups of musical genres that can classify Last.fm's user base into separate groups, adopting a range of elementary data mining algorithms, such as principal components analysis and K-means clustering.
Don't be fooled by what's hidden in the data, and remember that 70% of all statistics is made up (that's a joke)
As a result, we show that the population of Last.fm users can be separated into 5 clearly distinct populations of music lovers, one for each of the genres "indie", "rock", "metal", "hip-hop" and "electronic". We discuss some of the initial observations based on this preliminary data mining effort.
This sort of information shows the (economic) value of large online communities. If you are a music label, festival organiser, or otherwise at work in the corporate music business, this information provides insights into the global market, and shows how to reach the right target audience for your commercial activities. This is not necessarily a bad thing for the consumer of these products. Last.fm's service is completely free and legally co-operates with labels on a unique basis. The consumers receive a lot of free services in return, learn to enjoy new musical genres and artists, which they are likely to pay for, and can develop a social network based on their musical preferences. Consumers and producers both gain. (It's probably very obvious that I'm a big fan of Last.fm!)
Musical tag vectors and clouds
In my previous instalment on analysing Last.fm data, I explored how one could study the musical tag cloud of a user's musical preferences to discover new bands. These musical tag clouds or musical tag vectors will now serve as the basis of our data mining analysis.
The musical tag cloud or tag vector describing a user's musical preferences is constructed as follows. For a user, consider his top 50 artists, and the number of tracks of each artist played by this user. Last.fm users can tag artists with keywords, called tags. For each artist in the user's top 50, consider this artist's tags and the number of times each tag was applied to the artist, linearly rescaled such that the artist's top tag has weight (or popularity) 1. The tag vector is the weighted sum of these tag weights over the top artists: the value at each dimension (identified by a tag) is the sum, over the top artists, of the tag's weight for the artist multiplied by the number of tracks played by that artist. Finally, the tag vector is normalised such that its length equals 1.
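The construction above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used for this article: the function name `tag_vector`, the two input dictionaries and the example artists are made up, and I assume that "length" means the Euclidean length of the vector.

```python
import math
from collections import defaultdict

def tag_vector(top_artists, artist_tags):
    """Build a unit-length tag vector for one user.

    top_artists: {artist: number of tracks played by the user} (hypothetical input)
    artist_tags: {artist: {tag: raw tag count for that artist}} (hypothetical input)
    """
    vec = defaultdict(float)
    for artist, plays in top_artists.items():
        tags = artist_tags.get(artist, {})
        top_count = max(tags.values(), default=0)
        if top_count <= 0:
            continue  # artist without usable tags contributes nothing
        for tag, count in tags.items():
            # Rescale so the artist's top tag has weight 1,
            # then weight by the number of tracks played.
            vec[tag] += plays * count / top_count
    # Assumption: "length" is the Euclidean norm.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return {}
    return {tag: w / norm for tag, w in vec.items()}

if __name__ == "__main__":
    # Made-up example data, for illustration only.
    plays = {"Artist A": 30, "Artist B": 10}
    tags = {
        "Artist A": {"electronic": 100, "ambient": 50},
        "Artist B": {"rock": 80, "electronic": 20},
    }
    for tag, weight in sorted(tag_vector(plays, tags).items(), key=lambda kv: -kv[1]):
        print(f"{tag}: {weight:.3f}")
```

Storing the vector as a dictionary keeps it sparse, which is convenient because a single user only touches a small fraction of all tags.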
The following are the 10 most important factors of my personal musical tag vector, which provide a clear description of my preferred music styles:
10 largest components in the tag vector describing user aliekens
A tag cloud is a descriptive illustration of these tags, where the font size is scaled linearly by the tag's weight in the tag vector. The following is my personal musical tag cloud, with my top 10 tags.
The top 10 tags in my personal profile as a tag cloud
This tag cloud indeed gives a good indication of my personal listening profile. I have also written a more detailed discussion of single user tag clouds, and finding recommendations based on tags. There are also more examples of musical tag clouds, representing weekly statistics of several friends.
In the rest of this article, we only use the 10 largest components of the tag vector, to simplify the computations and limit the number of dimensions required to describe populations of users. In a small test, 15 Last.fm users typically shared over 600 tags among their tag vectors, although the weights of most tags can be discarded as marginal. With our random sample consisting of 2840 users, 735 distinct tags remain in use across all users if each user's tags are limited to his 10 most important. Each of the 2840 users in the random sample is now considered as a 735-dimensional vector, in which only the 10 most important tag vector components are nonzero.
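As a rough sketch of this step, the following Python keeps each user's k largest components and lays all users out over the union of the remaining tags. The names `top_components` and `to_matrix` are made up for this example, and the sketch does not renormalise the truncated vectors, since the text above does not say whether that is done.

```python
def top_components(vec, k=10):
    """Keep only the k largest components of a sparse tag vector."""
    largest = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(largest)

def to_matrix(user_vectors, k=10):
    """Turn {user: {tag: weight}} into a dense users-by-tags matrix.

    Returns (users, tags, rows); rows[i][j] is the weight of tags[j]
    for users[i], or 0.0 if that tag is not among the user's top k.
    """
    truncated = {user: top_components(vec, k) for user, vec in user_vectors.items()}
    users = sorted(truncated)
    # The dimensions are the union of all tags that survive the truncation.
    tags = sorted({tag for vec in truncated.values() for tag in vec})
    rows = [[truncated[user].get(tag, 0.0) for tag in tags] for user in users]
    return users, tags, rows
```

With the sample described above, `rows` would have 2840 rows and 735 columns, with at most 10 nonzero entries per row.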
The following is a tag cloud representing the top tags in the common pool of genres in our sample population.
The top tags in the sample population of 2840 users
The sample of users was gathered by a random walk on friends and neighbours lists in Last.fm's database, seeded by a set of random users. "Alternative", "rock" and "indie" are clearly the most common genres in the sampled Last.fm audience. It shows that this sample population (and probably all of Last.fm's user base) is not a good representation of the whole population of music fans. It's the demographic with internet access: people who listen to their music on a computer, and who have signed up to have all their statistics gathered on a public server, open for everyone to see. Only the music played by this small population, via this specific medium, is considered in this study of Last.fm's social music service. Personally, my Last.fm statistics are gathered while I'm at work on the computer, which requires a specific choice of genres. I listen to radio and CDs while in the car, which offers different genres, for a different mood. The demographic represented by Last.fm's statistics is, however, an important target audience for marketing: a young, influential and evolving population of dedicated music lovers.
Principal components analysis
Picturing 2840 points in a 735-dimensional space is quite troublesome, as we're only used to visualising data in at most 3 dimensions. We need a way to flatten the 735 dimensions down to 2 or 3 dimensions in order to view them. Principal components analysis (PCA) is such a technique, and it has some extras that come in handy here. It allows us to flatten the high-dimensional space down to its principal components, which gives a new set of dimensions, where the first dimension is the one that retains most of the initial data's variance. This trick thus allows us to map the vectors to a lower-dimensional space, and if significant differences among sub populations exist in the initial data, we are likely to also observe them in this lower-dimensional representation. Nice!
The following picture depicts the data, flattened onto a 2-dimensional picture, which explains a quarter of the initial data's variance. The X-axis represents the most important principal component, the Y-axis shows the second most important component. For the 10 most important tags in the whole population, we show how their unit vectors are mapped onto the first two components. My personal profile is also highlighted and labelled "aliekens".
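For readers who want to try this themselves, here is a minimal PCA sketch using NumPy's singular value decomposition. It is an illustration under my own assumptions, not necessarily how the plots in this article were produced: the function name `pca` is made up, and the data is centred on its mean before the decomposition.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X onto their first principal components.

    Returns (scores, components, explained), where scores are the
    coordinates of each row in the reduced space, components holds one
    principal direction per row, and explained is the fraction of the
    total variance captured by each returned component.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # centre the data
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    variances = S ** 2
    total = variances.sum()
    if total > 0:
        explained = variances[:n_components] / total
    else:
        explained = np.zeros(len(components))
    return scores, components, explained
```

The two columns of `scores` can be used directly as X and Y coordinates in a scatter plot, and `explained.sum()` tells how much of the variance such a picture retains. Column j of `components` gives the direction in which the unit vector of tag j points in the reduced plot.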
Sample population plotted by its first two principal components
The first component separates users that listen to "indie" and "rock" (on the left) from those that listen to "electronic", "hip-hop" and "metal" (on the right). The second component separates "indie" (top) from "rock" (bottom) and it differentiates between "metal" (bottom) on one hand and "hip-hop" and "electronic" (top) on the other. There is no clear distinction between "electronic" and "hip-hop", so it is unclear how the large "blob" at the top right is divided between "hip-hop" and "electronic" fans.
It seems that there is a spectrum of musical genres in the population of Last.fm users. The spectrum goes from "indie" via "alternative" to "rock" and "metal", and then on to "hip-hop" and "electronic" music, with a sparse gap back to "indie". Users seem distributed over this spectrum. The closer a data point is located to the centre of the data, the less biased the user is towards a single genre, and the more he mixes in others. When we cluster the data later on, we will automatically determine this separation of musical preferences and sub populations. Note that the tag vector for the "pop" genre is at the centre of our plot, denoting that "pop" averages over the above genres.
The "seen live" tag is very popular among Last.fm users, labelling artists that they have seen live at a concert or festival. This tag doesn't denote a musical genre, but it is related to "rock"/"indie" listeners. This is probably because artists in these genres perform live more often than artists in other styles, but I have no data to support this claim.
The genres "hip-hop" and "electronic" are not clearly separated in this 2-dimensional representation of our data. We can show the data mapped onto its 3rd and 4th principal components, as follows, which shows the separation of "hip-hop" and "electronic" music styles very clearly.
Sample population plotted by its third and fourth principal components
The k-means clustering algorithm is a straightforward technique that attempts to find a classification of the vectors, putting them in clusters of users that are similar in their musical preferences. Users end up in the same cluster when they are all closer to that cluster's centre point than to any other cluster's centre (with respect to Euclidean distance). When the number of clusters is set to 5, we get a clear separation of sub populations in Last.fm. Below is a depiction of the clusters, where each colour denotes a cluster. It is clear that the clustering algorithm found "indie", "rock" and "metal" to be three significant sub populations of Last.fm users.
Clustering of data into 5 clusters (principal components 1 and 2)
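The k-means procedure described above can be sketched in plain Python as follows. This is the standard iterative (Lloyd's) variant with randomly chosen starting centres; the function name `kmeans` and its parameters are made up for this example, and the text above does not depend on this particular implementation. It works on any list of equal-length vectors, so it can be fed either the full tag vectors or their principal component scores.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Cluster a list of points into k groups; returns (labels, centres)."""
    rng = random.Random(seed)
    # Start from k distinct data points picked at random.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its closest centre.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so we have converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres
```

Calling `kmeans(rows, 5)` on the user vectors would give one label per user. Note that k-means only finds a local optimum, so different seeds can produce somewhat different clusters.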
The two other clusters are not separated clearly when depicted with respect to the first two principal components, but the third and fourth components show that the remaining clusters separate users that listen to "hip-hop" from those that don't. It is clear that I belong to the "electronic" camp. "Electronica" and "electronic" are confusing labels, and are used interchangeably to tag the same music. "Electronic" is the adjective used for music that is produced electronically, whereas "electronica" is the actual genre.
Clustering of data into 5 clusters (principal components 3 and 4)
For each of these clusters, we can build the common tag cloud, describing the average of the musical genres that are enjoyed by the users in that cluster.
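A possible way to compute such a common tag cloud is to average the tag vectors of all users in a cluster and keep the heaviest tags. The sketch below assumes the users' tag vectors are stored as dictionaries and that `labels` holds one cluster label per user; the function name `cluster_tag_clouds` is made up for this example.

```python
from collections import defaultdict

def cluster_tag_clouds(user_vectors, labels, top_n=10):
    """Average the tag vectors per cluster and keep the top_n tags.

    user_vectors: list of {tag: weight} dictionaries, one per user
    labels: list of cluster labels, in the same order as user_vectors
    Returns {label: [(tag, average weight), ...]}, heaviest tag first.
    """
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for vec, label in zip(user_vectors, labels):
        sizes[label] += 1
        for tag, weight in vec.items():
            sums[label][tag] += weight
    clouds = {}
    for label, tag_sums in sums.items():
        # A tag a user does not have counts as weight 0 in the average.
        averages = {tag: s / sizes[label] for tag, s in tag_sums.items()}
        ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
        clouds[label] = ranked[:top_n]
    return clouds
```

The average weights can then be mapped to font sizes in the same way as for a single user's tag cloud.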
The clustering clearly separates different musical styles, with little or no overlap with respect to the musical genres represented in these tag clouds:
The following is a list of random observations
This article will probably never be finished, and will evolve over time as I find out more stuff about these sub populations and their musical tag clouds. If you want to comment, share an idea, or know more about all this, e-mail me at
All contents copyrighted 2000-2006 Anthony Liekens unless otherwise noted.
Page last modified on April 02, 2007, at 10:32 AM.