Data mining musical profiles

Anthony Liekens, March 28 to April 2, 2007

Abstract. Here's a preliminary data mining analysis of the musical social networking service Last.fm. An automated classification into clusters or sub populations with related musical genres reveals the structure of musical preferences among the users in a relatively large sample population. Musical tag clouds are adopted to characterise users and populations, which makes the results highly descriptive and easier to interpret.

Introduction

Last.fm allows its users to register their musical listening habits: every time a user plays a music track, this track's information is sent to Last.fm's servers. Based on the information gathered over a longer period, the Last.fm service generates a list of users (neighbours) who listen to similar artists and tracks. On the basis of this information, Last.fm then recommends artists and songs to its users. This approach works surprisingly well and provides very valuable information for users who want to expand their musical horizon. If two users listen to a big common pool of artists, it is indeed highly probable that each will also enjoy the artists that only the other listens to.

Here, we study the structure of musical preferences in a sample population of this social network. Based on the information in Last.fm's database, we can describe a user's profile by the common tags or labels of his artists, with proportions of e.g., "rock," "jazz" or "hip-hop" as his preferred music genres. This approach is similar to, but different from, the methodology adopted by Last.fm itself. In the latter approach, data mining is based on shared artists and tracks. By adopting tags that describe a user's preferred musical genres, a more descriptive and dimensionally less complex (and thus mathematically simpler) description of the population and its main structure can be given.

Here, we have taken a sample of 2840 Last.fm users, their 28302 recorded artists and these artists' commonly used tags. The sample was taken in the last week of March 2007. We show how one can determine important groups of musical genres that can classify Last.fm's user base into separate groups, adopting a range of elementary data mining algorithms, such as principal components analysis and K-means clustering.

Don't be fooled by what's hidden in the data, and remember that 70% of all statistics is made up (that's a joke)

As a result, we show that the population of Last.fm users can be separated into 5 clearly distinct populations of music lovers, one for each of the genres "indie", "rock", "metal", "hip-hop" and "electronic". We discuss some of the initial observations based on this preliminary data mining effort.

This sort of information shows the (economic) value of large online communities. If you are a music label, festival organiser, or otherwise at work in the corporate music business, this information provides insights into the global market, and shows how to reach the right target audience for your commercial activities. This is not necessarily a bad thing for the consumer of these products. Last.fm's service is completely free and legally co-operates with labels on a unique basis. The consumers receive a lot of free services in return, learn to enjoy new musical genres and artists, which they are likely to pay for, and can develop a social network based on their musical preferences. Consumers and producers both gain. (It's probably very obvious that I'm a big fan of Last.fm!)

Musical tag vectors and clouds

In my previous instalment on analysing Last.fm data, I explored how one could study the musical tag cloud of a user's musical preferences to discover new bands. These musical tag clouds or musical tag vectors will now serve as the basis of our data mining analysis.

The musical tag cloud or tag vector describing a user's musical preferences is constructed as follows. For a user, consider his top 50 artists, and the number of tracks of each artist played by this user. Last.fm users can tag artists with keywords, called tags. For each artist in the user's top 50, consider this artist's tags and the number of times each tag was applied to the artist, linearly calibrated such that the artist's top tag has weight or popularity 1. The tag vector is the weighted sum of these tag weights over all top artists: the value at each dimension (identified by a tag) is the sum, over the top artists, of the weight of the tag for the artist multiplied by the number of tracks played of that artist. Finally, the tag vector is calibrated such that its length equals 1.
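The construction above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name and the two input structures (a list of `(artist, play_count)` pairs and a dictionary of raw tag counts per artist) are hypothetical stand-ins for whatever Last.fm's data feeds return, and "length" is taken to mean the Euclidean length of the vector.

```python
import math
from collections import defaultdict


def tag_vector(top_artists, artist_tags):
    """Build a unit-length tag vector for one user.

    top_artists: list of (artist, play_count) pairs, e.g. the user's top 50.
    artist_tags: dict mapping artist -> {tag: raw tag count}.
    Both structures are hypothetical; adapt them to the real data source.
    """
    vector = defaultdict(float)
    for artist, plays in top_artists:
        tags = artist_tags.get(artist, {})
        if not tags:
            continue
        top_count = max(tags.values())
        if top_count <= 0:
            continue
        for tag, count in tags.items():
            # Tag weight is scaled so that the artist's top tag has weight 1,
            # then weighted by the number of tracks played of this artist.
            vector[tag] += plays * (count / top_count)

    # Calibrate the vector to Euclidean length 1 (assumed meaning of "length").
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0:
        return {}
    return {tag: w / norm for tag, w in vector.items()}


# Made-up example data, not taken from the sample.
example = tag_vector(
    [("Artist A", 30), ("Artist B", 10)],
    {
        "Artist A": {"electronic": 100, "ambient": 50},
        "Artist B": {"rock": 80, "electronic": 20},
    },
)
```

In this toy example "electronic" ends up as the largest component, because it is the top tag of the most played artist and also occurs, with a lower weight, for the second artist.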

The following are the 10 most important factors of my personal musical tag vector, which provide a clear description of my preferred music styles:

electronic	= 0.649133
electronica	= 0.472839
trance	= 0.306708
techno	= 0.281446
ambient	= 0.257368
dance	= 0.179112
alternative	= 0.132631
rap metal	= 0.0949674
rock	= 0.0919889
psytrance	= 0.0912921

10 largest components in the tag vector describing user aliekens

A tag cloud is a descriptive illustration of these tags, where the font size is scaled linearly by the tag's weight in the tag vector. The following is my personal musical tag cloud, with my top 10 tags.

alternative ambient dance electronic electronica psytrance rap metal rock techno trance

The top 10 tags in my personal profile as a tag cloud
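As a rough sketch of how such a cloud can be laid out, the following maps tag weights to font sizes. The weights are the ones listed above for user aliekens; the function name and the point sizes `min_pt` and `max_pt` are arbitrary assumptions, and the mapping is linear in the weight with the heaviest tag getting the largest font.

```python
def tag_cloud_sizes(vector, top_n=10, min_pt=10.0, max_pt=36.0):
    """Return (tag, font size) pairs in alphabetical order.

    The font size grows linearly with the tag weight; min_pt and max_pt
    are hypothetical layout parameters, not values from the article.
    """
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    if not top:
        return []
    heaviest = top[0][1]
    sizes = [
        (tag, min_pt + (max_pt - min_pt) * weight / heaviest)
        for tag, weight in top
    ]
    return sorted(sizes)  # clouds are usually listed alphabetically


aliekens = {
    "electronic": 0.649133, "electronica": 0.472839, "trance": 0.306708,
    "techno": 0.281446, "ambient": 0.257368, "dance": 0.179112,
    "alternative": 0.132631, "rap metal": 0.0949674, "rock": 0.0919889,
    "psytrance": 0.0912921,
}
cloud = tag_cloud_sizes(aliekens)
```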

This tag cloud indeed gives a good indication of my personal listening profile. I have also written a more detailed discussion of single user tag clouds, and finding recommendations based on tags. There are also more examples of musical tag clouds, representing weekly statistics of several friends.

In the rest of this article, we only use the 10 largest components of each tag vector, to simplify the computations and limit the number of dimensions required to describe populations of users. In a small test, 15 Last.fm users typically shared over 600 tags among their tag vectors, although the weights of most tags are marginal and can be discarded. In our random sample of 2840 users, 735 distinct tags remain across all users once each user's tags are limited to the 10 most important. All 2840 users in the random sample are now considered as 735-dimensional vectors, in which only their 10 most important tag vector components are nonzero.
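A minimal sketch of this step, assuming each user's tag vector is available as a dictionary of tag weights: keep the 10 largest components per user, collect the tags that survive, and stack the users into a matrix with one column per tag (2840 rows and 735 columns for the sample described here). The function names are hypothetical, and whether the truncated vectors were renormalised afterwards is not stated, so this sketch leaves the weights untouched.

```python
import numpy as np


def truncate(vector, keep=10):
    """Keep only the `keep` largest components of a {tag: weight} vector."""
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:keep]
    return dict(top)


def build_matrix(user_vectors, keep=10):
    """Stack truncated tag vectors into a users-by-tags matrix.

    user_vectors: dict mapping user name -> {tag: weight}.
    Returns (users, tags, X) where X[i, j] is the weight of tags[j]
    for users[i], and zero for tags outside that user's top `keep`.
    """
    truncated = {u: truncate(v, keep) for u, v in user_vectors.items()}
    tags = sorted({t for v in truncated.values() for t in v})
    column = {t: j for j, t in enumerate(tags)}
    users = sorted(truncated)
    X = np.zeros((len(users), len(tags)))
    for i, user in enumerate(users):
        for tag, weight in truncated[user].items():
            X[i, column[tag]] = weight
    return users, tags, X
```

Most entries of this matrix are zero, so a sparse representation would also work; a dense array is used here only to keep the sketch short.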

The following is a tag cloud representing the top tags in the common pool of genres in our sample population.

alternative alternative rock ambient black metal chillout classic rock death metal electronic electronica emo experimental female vocalists folk gothic metal hardcore heavy metal hip-hop idm indie indie rock industrial japanese jazz melodic death metal metal metalcore pop power metal progressive metal progressive rock punk punk rock rap rock seen live singer-songwriter soundtrack

The top tags in the sample population of 2840 users

The sample of users was gathered by a random walk on the friends and neighbours lists in Last.fm's database, seeded by a set of random users. "Alternative", "rock" and "indie" are clearly the most common genres in the sampled Last.fm audience. This shows that the sample population (and probably all of Last.fm's user base) is not a good representation of the whole population of music fans. It is the demographic with internet access: people who listen to their music on a computer and have signed up to send all their statistics to a public server, open for everyone to see. Only the music played by this small population, via this specific medium, is considered in this study of Last.fm's social music service. Personally, my Last.fm statistics are gathered while I'm at work on the computer, which requires a specific choice of genres. I listen to radio and CDs while in the car, which offers different genres, for a different mood. The demographic represented by Last.fm's statistics is, however, an important target audience for marketing: a young, influential and evolving population of dedicated music lovers.
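For illustration only, the random walk used to gather the sample might look like the sketch below. The details of the actual crawl are not described here, so everything in the code is an assumption: the `graph` dictionary stands in for lookups of a user's friends and neighbours lists, and the restart rule at dead ends is one possible choice among many.

```python
import random


def random_walk_sample(graph, seeds, target_size, rng=None, max_steps=100000):
    """Collect users by walking randomly over friends/neighbours links.

    graph: dict mapping user -> list of linked users (hypothetical stand-in
           for fetching a user's friends and neighbours lists).
    seeds: list of starting users.
    """
    rng = rng or random.Random()
    seeds = list(seeds)
    sampled = set(seeds)
    current = rng.choice(seeds)
    steps = 0
    while len(sampled) < target_size and steps < max_steps:
        linked = graph.get(current, [])
        if linked:
            current = rng.choice(linked)   # step to a random friend/neighbour
        else:
            current = rng.choice(seeds)    # dead end: restart from a seed
        sampled.add(current)
        steps += 1
    return sampled
```

A walk like this visits well-connected users more often than isolated ones, which is one more reason to treat the sample as a picture of the active Last.fm community rather than of music fans in general.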

Principal components analysis

Picturing 2840 points in a 735-dimensional space is quite troublesome, as we're only used to visualising data in at most 3 dimensions. We need a way to flatten the 735 dimensions down to 2 or 3 dimensions in order to view them. Principal components analysis (PCA) is such a technique, and it has some extras that come in handy here. It allows us to flatten the high-dimensional space down to its principal components, which gives a new set of dimensions, where the first dimension is the one that retains most of the initial data's variance, the second retains most of what is left, and so on. This trick thus allows us to map the vectors to a lower-dimensional space, and if significant differences among sub populations exist in the initial data, we are likely to observe them in this lower-dimensional representation as well. Nice!

The following picture depicts the data, flattened onto a 2-dimensional picture, which explains a quarter of the initial data's variance. The X-axis represents the most important principal component, the Y-axis shows the second most important component. For the 10 most important tags in the whole population, we show how their unit vectors are mapped onto the first two components. My personal profile is also highlighted and labelled "aliekens".

Sample population plotted by its first two principal components
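A projection like the one plotted here can be computed with a singular value decomposition of the centred data. The sketch below is one standard way to do PCA with NumPy and is not necessarily how the figures in this article were produced; the function name is hypothetical, and `X` is assumed to be the users-by-tags matrix with one row per user.

```python
import numpy as np


def pca(X, n_components=2):
    """Project the rows of X onto their first principal components.

    Returns (scores, components, explained, mean):
      scores      coordinates of every row in the reduced space
      components  one principal direction per row, in tag space
      explained   fraction of the total variance kept by each component
      mean        the column means removed before the projection
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centre the data
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T
    total = (S ** 2).sum()
    if total > 0:
        explained = (S[:n_components] ** 2) / total
    else:
        explained = np.zeros(n_components)
    return scores, components, explained, mean
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives a picture of this kind, and `explained.sum()` is the share of variance such a picture accounts for. Column `j` of `components` shows the direction in which the unit vector of tag `j` points in the plot.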

The first component separates users that listen to "indie" and "rock" (on the left) from those that listen to "electronic", "hip-hop" and "metal" (on the right). The second component separates "indie" (top) from "rock" (bottom) and it differentiates between "metal" (bottom) on one hand and "hip-hop" and "electronic" (top) on the other. There is no clear distinction between "electronic" and "hip-hop", so it is unclear how the large "blob" at the top right is divided between "hip-hop" and "electronic" fans.

It seems that there is a spectrum of musical genres in the population of Last.fm users. The spectrum goes from "indie" over "alternative" to "rock" and "metal", and then onto "hip-hop" and "electronic" music, with a sparse gap back to "indie". Users seem distributed over this spectrum. The closer a data point lies to the centre of the data, the less the user is biased towards a single genre, and the more other genres are mixed in. When we cluster the data later on, we will automatically determine this separation of musical preferences and sub populations. Note that the tag vector for the "pop" genre is at the centre of our plot, denoting that "pop" averages over the above genres.

The "seen live" tag is very popular among Last.fm users, labelling artists that they have seen live at a concert or festival. This tag doesn't denote a musical genre, but it is related to "rock"/"indie" listeners. This is probably because artists in these genres are generally thought to perform live more often than artists in other styles, but this claim is unsupported.

The genres "hip-hop" and "electronic" are not clearly separated in this 2-dimensional representation of our data. We can show the data mapped onto its 3rd and 4th principal components, as follows, which shows the separation of "hip-hop" and "electronic" music styles very clearly.

Sample population plotted by its third and fourth principal components

K-Means clustering

The k-means clustering algorithm is a straightforward technique that attempts to find a classification of the vectors, putting them in clusters of users that are similar in their musical preferences. Users end up in the same cluster when they are all closer to that cluster's centre point than to any other cluster's centre (with respect to Euclidean distance). When the number of clusters is set to 5, we get a clear separation of sub populations in Last.fm. Below is a depiction of the clusters, where each colour denotes a cluster. It is clear that the clustering algorithm found "indie", "rock" and "metal" to be three significant sub populations of Last.fm users.

Clustering of data into 5 clusters (principal components 1 and 2)
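For readers who want to try this themselves, here is a minimal sketch of the textbook k-means procedure (Lloyd's algorithm) with NumPy. It is not necessarily the implementation behind these figures: the function name, the random initialisation from data points, and the parameters `n_iter` and `seed` are assumptions made for the example.

```python
import numpy as np


def kmeans(X, k=5, n_iter=100, seed=0):
    """Cluster the rows of X into k groups by Euclidean distance.

    Returns (labels, centres). Initial centres are k distinct rows of X
    picked at random, so different seeds can give different clusterings.
    """
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = X[labels == c]
            if len(members):               # an empty cluster keeps its centre
                centres[c] = members.mean(axis=0)
    return labels, centres
```

Because the result depends on the starting centres, it is common to run the algorithm several times with different seeds and keep the clustering with the smallest total distance to the centres. The distance computation here builds a points-by-clusters-by-dimensions array, which is fine for a sample of this size but wasteful for much larger data.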

The two other clusters are not separated clearly when depicted with respect to the first two principal components, but the third and fourth components show that the remaining clusters separate users that listen to "hip-hop" from those that don't. It is clear that I belong to the "electronic" camp. "Electronica" and "electronic" are confusing labels, and are used interchangeably to tag the same music. "Electronic" is the adjective used for music that is produced electronically, whereas "electronica" is the actual genre.

Clustering of data into 5 clusters (principal components 3 and 4)

For each of these clusters, we can build the common tag cloud, describing the average of musical genres that are enjoyed by the users in the clusters.

alternative black metal death metal doom metal folk metal gothic metal heavy metal industrial melodic death metal metal metalcore power metal progressive metal rock seen live symphonic metal thrash metal

alternative indie indie rock rock seen live singer-songwriter

alternative alternative rock classic rock emo indie metal pop punk rock seen live

hip hop hip-hop rap rnb rock

80s alternative ambient anime chillout classic rock classical dance ebm electro electronic electronica experimental female vocalists gothic hardcore hip-hop house idm indie industrial j-pop japanese jazz jpop metal new wave pop progressive rock psytrance punk rnb rock russian seen live soul soundtrack techno trance trip-hop visual kei
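A cluster's common tag cloud, as described before the clouds above, can be derived by averaging the tag vectors of the cluster's members. The sketch below assumes a users-by-tags matrix `X`, one cluster label per user and the list of tag names; the function name and the `top_n` cut-off are hypothetical, and the clouds shown here may well have been trimmed by a different rule.

```python
import numpy as np


def cluster_tag_clouds(X, labels, tags, top_n=10):
    """Average the tag vectors per cluster and list the heaviest tags.

    Returns {cluster label: [(tag, mean weight), ...]} with the tags of
    each cluster sorted from heaviest to lightest.
    """
    labels = np.asarray(labels)
    clouds = {}
    for c in np.unique(labels):
        mean_vec = X[labels == c].mean(axis=0)
        order = np.argsort(mean_vec)[::-1][:top_n]
        clouds[int(c)] = [
            (tags[j], float(mean_vec[j])) for j in order if mean_vec[j] > 0
        ]
    return clouds
```

The number of users per cluster follows from the same labels, for instance with `np.bincount(labels)`.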

The clustering clearly separates different musical styles, with little or no overlap with respect to the musical genres represented in these tag clouds:

Electronic/pop (828 users)
Rock (792 users)
Indie (589 users)
Metal (479 users)
Hip-hop (152 users)

The following is a list of random observations:

A lot of big musical styles ("classical", "jazz", "blues", "folk", "reggae", to name a few) are not clearly represented in these tag clouds. Either there is no significant population of listeners of these genres at Last.fm, or fans of these genres mix them with different styles, causing these genres to fade from the classification.
I had never heard of "indie" music. I thought it had a lot in common with "rock," and would have ended up in the same cluster, but this expectation was false. The target audience for independent music (over 20% of the sample population) is significantly separated from the others, e.g., "rock". This can also be observed in the relative mutual exclusivity of "rock" and "indie" in their tag clouds.
The "electronic" cluster packs a lot of different genres, and is the most versatile cluster. Attempts to separate this "electronic" cluster into smaller ones by choosing a higher number of clusters failed; this cluster remained intact. Instead, the first cluster to split up is the "metal" cluster, with significant sub populations of general "metal" fans and "black metal" fans. A re-clustering of this "electronic"/"pop"/"punk" sub population can be found here. The 6 main sub clusters of this big cluster are "pop", "japanese", "ambient", "electronic", "industrial" and "punk." Sub populations of the other main clusters also define more specific target audiences with significantly different musical needs.
In contrast with the "electronic" cluster, the "hip-hop" cluster is the most clearly defined cluster, with "rap" and "rnb," its close neighbours (and "hip hop," a different spelling). Fans of "hip-hop" do not usually mix other genres with theirs. They are a clearly defined group of fans that can serve as an easy target for marketing. If you're organising a "hip-hop" festival, don't hire a "punk rock" band or your festival will end in a drive-by shooting bonanza.

More?

This article will probably never be finished, and will evolve over time as I find out more stuff about these sub populations and their musical tag clouds. If you want to comment, share an idea, or know more about all this, e-mail me at

take my first name
add an "at" character
take my last name
add a "dot" character
add "net"