What music do people listen to? How does their taste change with time? Where do new music styles come from? In this post we go through the methodology and technique used to create A.Track.Tion, a data visualization aimed at shedding light on these deep and interesting questions.
Data gathering and processing
Our objective was to measure and visualize music popularity. As a proxy, we settled on the industry’s definition: popular music sells. As our core dataset, we started with the Whitburn Project list: a collective effort to gather the historical weekly music sales rankings published by the Billboard company (the project is now maintained at the Bullfrogs Pond). The list goes back to 1890, even though Billboard’s Top 100 data officially starts after 1954. The dataset required a little cleaning, as some songs have inconsistent date formats and others are missing their weekly position in the chart. The subset we used (only songs after 1954, after dropping a couple thousand incomplete records) contains 33560 songs.
For our purposes, however, the data was incomplete in two ways:
1) First, we do not have sales data, at least not directly: Billboard only publishes weekly rankings, which are ordered by some formula that depends on total sales. The formula has become more complex over the last two decades, as online presence is now also taken into account. We need to invert this unknown formula (OK, not fully unknown, but it changes not only through the years but even from week to week, so it might as well be unknown to us). For this reverse-engineering, we used some publicly available sales data for a few weeks and years (like this, this, or this). The data is scarce: the company that measures it, Nielsen SoundScan, sells it as a subscription service to record companies and the like, and it is probably expensive.
However, we are optimists, and we shall try to estimate from this data the percentage of sales that a given ranking position entails. Many rankings based on economic figures follow a power law (also known as Zipf’s law). Indeed, our sales data shows a power law, but the two usual problems of power laws are also visible. One, for small ranks the power law tapers off (because real things do not diverge at zero). Two, for large ranks the tail behaves strangely, especially with as few data points as we have. Still, we can optimistically say that the curves follow more or less the same slope, which means they follow the same power law! We estimate the magnitude of this slope at 0.75; that is, sales decay with rank as sales = C * rank^(-0.75), where C is a normalization factor.
By choosing C so that all sales add up to 100, we have a convenient estimator of the percentage of sales associated with a given position in the weekly ranking. We just need to remember that for the top 5 or so positions we overestimate the sales, and that the whole thing is just an estimate anyway; fitting more elaborate functions is not fully justified without more data. See below for a sanity check on our estimator.
2) Our second problem is that the original database contains only partial information about the genre of each song: only a few broad genres are listed, and only for a small percentage of the songs. To find this genre information, we searched and parsed thousands of Wikipedia articles, one for each song and artist in the list, and in this way collected data about the genre (or genres) that people have assigned to each song. This new dataset is actually richer than the originally assigned genres, as many songs are now cross-genre, or belong to a niche sub-sub-genre of a broader music style (hello cowpunk). We found almost 800 musical genres. Many are associated with songs in the database; the rest are included because they are directly related to a genre already in the database. (This, together with rounding to zero, leads to some genres in the final plot appearing to have 0% popularity. We are working on fixing that.) Because our parser was a little crude and naive, we had to curate and clean the list manually, separating actual genres from text the parser mistook for genres (like the names of record companies). During this manual clean-up we also cleaned and re-assigned the relationships between genres. We think that it would not be so hard to perform an automated search and parse to complete our list.
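The rank-to-sales estimator from point 1 can be sketched in a few lines. This is only an illustration under stated assumptions, not the original processing script: the function names, the chart size of 100, and the synthetic sales figures are all made up for the example. The exponent is negative because sales fall as the rank number grows.

```python
import math

def fit_power_law_exponent(ranks, sales):
    """Least-squares slope of log(sales) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(s) for s in sales]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def rank_to_share(exponent=-0.75, chart_size=100):
    """Percentage of weekly sales per chart position, normalized to sum to 100."""
    raw = [rank ** exponent for rank in range(1, chart_size + 1)]
    c = 100.0 / sum(raw)  # the normalization factor C
    return {rank: c * value for rank, value in enumerate(raw, start=1)}

# Synthetic, noise-free sales figures (NOT real data), built only to show
# that the log-log fit recovers the exponent used to generate them.
synthetic_ranks = list(range(1, 51))
synthetic_sales = [250_000 * r ** -0.75 for r in synthetic_ranks]
slope = fit_power_law_exponent(synthetic_ranks, synthetic_sales)

shares = rank_to_share()
print(f"fitted exponent: {slope:.3f}")
print(f"estimated share of position 1: {shares[1]:.2f}%")
```

Real sales data would be noisy and would deviate from the straight line at both ends, as described above, so a fit like this one is only meaningful over the middle of the chart.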
Coupling our improved genre information with our estimate for percentage of sales, we are now able to estimate the percentage of sales for each musical genre and subgenre. Because many songs were linked to multiple genres, we decided to split the popularity equally among them.
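The equal split can be sketched as follows. The song names, the three-position chart, and the share values are hypothetical, and putting songs with no known genre into an "unknown" bucket is an assumption of this sketch, not something the original processing is known to do.

```python
from collections import defaultdict

def genre_shares(weekly_chart, song_genres, rank_share):
    """Split each song's share of sales equally among its genres."""
    totals = defaultdict(float)
    for rank, song in weekly_chart:
        genres = song_genres.get(song, [])
        if not genres:
            # Assumption: keep unclassified songs visible instead of dropping them.
            totals["unknown"] += rank_share[rank]
            continue
        per_genre = rank_share[rank] / len(genres)
        for genre in genres:
            totals[genre] += per_genre
    return dict(totals)

# Toy inputs (hypothetical): a three-position chart whose shares add up to 100.
toy_rank_share = {1: 50.0, 2: 30.0, 3: 20.0}
toy_chart = [(1, "Song A"), (2, "Song B"), (3, "Song C")]
toy_genres = {"Song A": ["pop", "rock"], "Song B": ["rock"]}

result = genre_shares(toy_chart, toy_genres, toy_rank_share)
print(result)
```

Because each song's share is divided rather than duplicated, the genre totals still add up to 100 for the week.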
The last piece of the puzzle is to connect our estimated percentages to actual money figures. For this, we rely on the estimates of global recorded music sales from the TsorT World Music Charts compilations. These estimates are very useful because they go all the way back to 1954, and they are close to the actual self-reported numbers from the industry: for 2007, TsorT estimates over 24 billion USD in total world sales, while the industry reported 19.4 billion USD. We normalized the TsorT data with this figure. TsorT only provides estimates up to 2007; for the following years we used: 2008, 2009, 2010, 2011, 2012, and 2013. All our figures are inflation-adjusted to 2013 US dollars.
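One way to read this normalization is as a single scale factor applied to every TsorT yearly estimate. The sketch below makes that assumption explicit. It rounds "over 24 billion" to 24.0, uses hypothetical function names, and leaves out the inflation adjustment.

```python
# Figures quoted in the text, in billions of USD, for 2007.
TSORT_2007 = 24.0      # approximation of "over 24 billion"
INDUSTRY_2007 = 19.4   # industry self-reported total

# Assumption: one multiplicative factor rescales every TsorT yearly estimate.
SCALE = INDUSTRY_2007 / TSORT_2007

def normalized_total(tsort_estimate):
    """Rescale a TsorT yearly estimate to the industry-reported level."""
    return tsort_estimate * SCALE

def genre_revenue(genre_share_percent, tsort_estimate):
    """Turn a genre's percentage of sales into billions of USD for that year."""
    return genre_share_percent / 100.0 * normalized_total(tsort_estimate)

# A genre holding 50% of 2007 sales, before any inflation adjustment.
print(f"{genre_revenue(50.0, TSORT_2007):.2f} billion USD")
```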
Some results and sanity check
Our sales estimator allows us to compute the aggregated popularity of songs over several weeks (which assumes that the total volume of sales is roughly constant over the period). This gives us an interesting opportunity to cross-check our sales estimator, and at the same time gossip and compare artists and songs! How could we pass up this chance?
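Aggregating over weeks amounts to summing each song's weekly share and sorting the totals. A minimal sketch, with hypothetical songs and a toy three-position chart standing in for the real weekly rankings:

```python
from collections import defaultdict

def all_time_ranking(weekly_charts, rank_share):
    """Sum each song's weekly share and sort from most to least popular."""
    totals = defaultdict(float)
    for chart in weekly_charts:
        for rank, song in chart:
            totals[song] += rank_share[rank]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Toy inputs (hypothetical): three weeks of a three-position chart.
toy_rank_share = {1: 50.0, 2: 30.0, 3: 20.0}
toy_weeks = [
    [(1, "Song A"), (2, "Song B"), (3, "Song C")],
    [(1, "Song B"), (2, "Song A"), (3, "Song C")],
    [(1, "Song A"), (2, "Song C"), (3, "Song B")],
]

ranking = all_time_ranking(toy_weeks, toy_rank_share)
for position, (song, total) in enumerate(ranking, start=1):
    print(position, song, total)
```

Adding percentages across weeks treats every week as equally large, which is exactly the constant-volume assumption mentioned above.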
Why can we cross-check? Well, it turns out that Billboard itself has done the same calculation and published (for their 50th anniversary, and many other times, actually) an all-time ranking of songs. This is the one that best compares to our time range (1955-2013). The Billboard list of top 20 songs:
- “The Twist” – Chubby Checker
- “Smooth” – Santana feat. Rob Thomas
- “Mack the Knife” – Bobby Darin
- “How Do I Live” – LeAnn Rimes
- “Party Rock Anthem” – LMFAO feat. Lauren Bennett & GoonRock
- “I Gotta Feeling” – The Black Eyed Peas
- “Macarena (Bayside Boys Mix)” – Los Del Rio
- “Physical” – Olivia Newton-John
- “You Light Up My Life” – Debby Boone
- “Hey Jude” – The Beatles
- “We Belong Together” – Mariah Carey
- “Un-Break My Heart” – Toni Braxton
- “Yeah!” – Usher feat. Lil Jon & Ludacris
- “Bette Davis Eyes” – Kim Carnes
- “Endless Love” – Diana Ross & Lionel Richie
- “Tonight’s the Night (Gonna Be Alright)” – Rod Stewart
- “You Were Meant for Me / Foolish Games” – Jewel
- “(Everything I Do) I Do It for You” – Bryan Adams
- “I’ll Make Love to You” – Boyz II Men
- “The Theme from ‘A Summer Place’” – Percy Faith
And the list coming out of our estimator:
- “Smooth”, Santana
- “I Gotta Feeling”, The Black Eyed Peas
- “Macarena (Bayside Boys Mix)”, Los Del Rio
- “We Belong Together”, Mariah Carey
- “Un-Break My Heart”, Toni Braxton
- “Yeah!”, Usher
- “One Sweet Day”, Mariah Carey
- “I’ll Make Love To You”, Boyz II Men
- “Somebody That I Used To Know”, Gotye
- “Candle In The Wind 1997”, Elton John
- “Something About The Way You Look Tonight”, Elton John
- “Party Rock Anthem”, LMFAO
- “We Found Love”, Rihanna
- “Low”, Flo Rida
- “Call Me Maybe”, Carly Rae Jepsen
- “End Of The Road”, Boyz II Men
- “I Will Always Love You”, Whitney Houston
- “Boom Boom Pow”, The Black Eyed Peas
- “Rolling In The Deep”, Adele
- “The Boy Is Mine”, Brandy & Monica
We observe that many of the songs appear in similar or close positions (mental note: we can do a visualisation of how the songs change positions between Billboard’s list and ours). Our biggest problem seems to be that we are miscalculating the popularity of some older songs (“Hey Jude” comes in at position 79 in our list, what a disgrace to The Beatles). We knew this would be a problem, because Billboard has changed their methodology a few times in the past and our data could be seriously skewed towards newer songs (in fact, Billboard mentions explicitly that “certain eras are weighted differently”). This is probably why “The Twist” is Billboard’s top song, yet it only comes in at position 344 in our list. However, given our rough estimations, the fact that we do not change our formula over time, and that Billboard applies a liberal dose of subjective hand-adjusting to their data, we are quite satisfied with the level of agreement.
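The position changes between the two lists can be tabulated with a short script. The sketch below uses only positions that appear in this post and assumes that the order of each list above is its rank order. Titles are lowercased before matching, because capitalization differs between the two sources.

```python
def normalize(title):
    """Lowercase titles so capitalization differences do not block a match."""
    return title.lower()

# First ten entries of Billboard's list, in the order shown above.
billboard_top10 = [
    "The Twist", "Smooth", "Mack the Knife", "How Do I Live",
    "Party Rock Anthem", "I Gotta Feeling", "Macarena (Bayside Boys Mix)",
    "Physical", "You Light Up My Life", "Hey Jude",
]
# First twelve entries of our list, in the order shown above.
ours_top12 = [
    "Smooth", "I Gotta Feeling", "Macarena (Bayside Boys Mix)",
    "We Belong Together", "Un-Break My Heart", "Yeah!", "One Sweet Day",
    "I'll Make Love To You", "Somebody That I Used To Know",
    "Candle In The Wind 1997", "Something About The Way You Look Tonight",
    "Party Rock Anthem",
]
# Positions in our list that are quoted in the text.
ours_mentioned = {"Hey Jude": 79, "The Twist": 344}

billboard_pos = {normalize(t): i for i, t in enumerate(billboard_top10, start=1)}
our_pos = {normalize(t): i for i, t in enumerate(ours_top12, start=1)}
our_pos.update({normalize(t): p for t, p in ours_mentioned.items()})

# Positive shift: the song sits lower in our list than in Billboard's.
shifts = {t: our_pos[t] - billboard_pos[t] for t in billboard_pos if t in our_pos}

for title, shift in sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{title}: {shift:+d}")
```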
Billboard also has a list of top 100 artists that we can compare to. The first 20 artists in their ranking are
1 THE BEATLES
3 ELTON JOHN
4 ELVIS PRESLEY
5 MARIAH CAREY
6 STEVIE WONDER
7 JANET JACKSON
8 MICHAEL JACKSON
9 WHITNEY HOUSTON
10 THE ROLLING STONES
11 PAUL MCCARTNEY/WINGS
12 BEE GEES
16 THE SUPREMES
17 DARYL HALL JOHN OATES
19 ROD STEWART
20 OLIVIA NEWTON-JOHN
while our list is
- Elvis Presley
- Mariah Carey
- The Beatles
- Elton John
- Whitney Houston
- Michael Jackson
- Stevie Wonder
- The Rolling Stones
- Katy Perry
- Janet Jackson
- Bee Gees
- Boyz II Men
- The Black Eyed Peas
- Rod Stewart
- R. Kelly
We are very happy with the agreement.
D3.js has been the bestest of friends. We love you Mike Bostock, and also the rest of the lovely, helpful people who helped us so much by posting examples online.
We created A.Track.Tion for the web, but we showed it on a large touch-screen table at Sónar+D 2014, where it was very well received by the festival audience as well as by the music experts.
Our dataset and our processing impose a few limitations that we are aware of (and probably some we haven’t realised yet).
One large gap was already discussed: we still don’t have all the musical genres listed in Wikipedia, only those that we were able to crawl from our song list. There are many more, and some relationships may change as older music styles start appearing. For example, Jazz and Rock and Roll are said to derive from Blues, which in turn comes from American Folk Music.
Another large blind spot is that our database lists only songs that ranked in Billboard’s lists, which do not carry classical music and some other styles. Those are indeed popular, but we do not have the information to include them. Furthermore, the lists only include US sales, which further limits our findings to American musical taste (which is why Country music is so popular) and introduces a few oddities (“world music”, for example).
On the influence between genres, notice that we can only count how many genres have been influenced by a particular one, but this (probably) does not correlate with how many records have been produced in each genre, or how many musicians work in it.
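Counting influenced genres is a simple graph traversal. The sketch below encodes only the example relationships mentioned in this post (Blues deriving from American Folk Music, Jazz and Rock and Roll deriving from Blues). The dictionary layout and function names are hypothetical, not the structure used by the visualization.

```python
# Each genre maps to the genres said to derive directly from it.
influences = {
    "American Folk Music": ["Blues"],
    "Blues": ["Jazz", "Rock and Roll"],
    "Jazz": [],
    "Rock and Roll": [],
}

def direct_influence(graph, genre):
    """Number of genres derived directly from the given genre."""
    return len(graph.get(genre, []))

def total_influence(graph, genre):
    """Number of genres reachable from the given genre, directly or not."""
    seen = set()
    stack = list(graph.get(genre, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    seen.discard(genre)  # guard against cycles that lead back to the start
    return len(seen)

for name in influences:
    print(name, direct_influence(influences, name), total_influence(influences, name))
```

Both counts measure positions in the genre graph only; neither says anything about how many records or musicians a genre has.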
Another criticism we received from music professionals while at Sónar+D was that we based our influence and genre information mostly on Wikipedia, instead of some other list curated by experts. While there is some truth to this (Wikipedia has been known to contain a few errors now and then), we found the curated hierarchies to be much more rigidly structured, closer to a tree, with most songs associated with a single genre. We think this is artificial, and that the complexity of the Wikipedia data better reflects the reality of music.
Another clear limitation is our sales estimator, which we produced from a rather small dataset (small in time, space, and number of records). Also, since the formula used by Billboard has changed many times in the past, we cannot expect it to hold for all our data. Perhaps we could find and produce a better estimate, but for our purposes our estimation is already good enough.
(EDIT) Note: We are aware of a small bug in the processing scripts that lost a few hundred records because some text appears in the original database where numbers were expected (in the position-by-week columns, if you need to know). We are working to fix this as soon as possible.
We thank the many anonymous contributors to the Whitburn project, and BSC for supporting this project.