What can 10 000 data points we have collected and analysed tell us about the spending habits of those willing to pay for the music that they listen to? I hope I don’t need to tell you that most people do not buy music. We have gathered and presented this data repeatedly. There are all sorts of reasons for this unfortunate phenomenon. We won’t go into those in this article. If you are interested in that, we will include a link to some of our previous content.
Besides the general interest that I have in the subject, as well as my interest in data analysis and visualisation, there are other reasons why our team wanted to do this. Our website is vested in the information that is revealed by this study.
We champion data-driven decision making. As we continue to invest financially and in time, we want to know what the odds are. Not just for ourselves, but for the sake of the artists that we work with and represent. By constantly researching and gathering data we are able to make a decision with their best interests at the core of them. Data-driven decision making allows us to do the right things, at the right time and in the right way.
We may break the results down into the first 3000 data points, 5000 data points, then 10000 data points. To take this a step further, we will also look at how our results match up when test other countries.
What are the correlations between musical spending habits of people in different countries? The first one to be looked at will be South Africa. How does the difference between the socioeconomic statuses of the two countries affect the way they spend their money on music.
Abstract
Using data points gathered from iTunes sales records, we seek to find out what sort of music sells best. The results are considered in terms of the year of release, genre, the country where the artist is from and the language of the song. Due to the sheer volume of data collected we limited the language to be either English or vernacular.
Hypothesis
My hypothesis was formed after a quick glance at the first 200 data points. This is what encouraged me to start this research. We expect that most of the music sales will be from international artists. The bulk of the local music sales will be of vernacular.
Close to half of the sales will not be of music that was recently released. The bulk of these sales will likely range between the 1970s and the 1990s. Popular/Mainstream local artists with a vast following on social media, as well as radio rotation will get a good portion of sales.
Method
We started by collecting 200 data points each day. These are the rankings of music based on sales from the previous 24 hours. These are entered into a spreadsheet, similar to figure 1. There are four columns, one for each of the metrics that we deemed relevant to prove or disprove our hypothesis.
There are a few websites that you can use to extract the sales data from iTunes for the past 24 hours. We made use of https://kworb.net/charts/itunes/zw.html and www.digitalsalesdata.com . This decision was mainly because it cuts out some of the data that we do not need and presents the ones that we do need.
We started by creating a spreadsheet on Google Docs. This would allow us to collaborate with others on the collation of data.
Data Gathering
The data gathering process involved daily harvesting of sales data. This was manually pasted onto a Google Sheet. The artist name and song title went into column A. They were followed by Genre, Language, Country and Release, respectively.
To make sure that we would not have to repeatedly copy and paste data manually, we used an excel vLookUp formula to automate this. It involves creating an index of all the data points that we have created and populated. Once a value is indexed. Anytime that we would enter the same value in column A, the vLookUp would automatically populate columns B, C, D and E.
The next step in our data gathering once column A was filled in was to search on Google for information related to columns B, C, D and E. These would then be entered into our index for the first time.
Simply adding the word release at the end of the search term, as illustrated above, gave the results.
The results would be somewhat conflicting. Image 1 shows the year 2000 on the knowledge graph. Upon further research and scrolling down the page, you will realise that the release for this was in fact 1989. 2000 was the year in which there was a live release. This is something we identified about halfway through the collection of data and knew would be significant to data cleaning.
10000 data points
The data shown above is on a live document. It will keep being updated until we have 10000 results and all columns sorted.
Data Cleaning
Collecting data is fairly tedious as it is. Data cleaning is another level. Once I was getting close to the part where the data is almost at our first threshold of 3000 data points, I needed to make sure that everything was in place for analysis.
The first thing to do was making sure that all columns are filled in. Most of the songs recorded within our data was appearing repeatedly. This meant that the best way to analyse the data without consuming time was to automate the filling of the rest of the columns. Whenever the same song appears again, the formula would automatically fill in the rest of the fields.
As previously mentioned, we went with the vLookUp formula for this one.
=VLOOKUP(A5,H5:L205,2,FALSE)
It says, read the value in column A.
Go and look for a match for that value in column H
Fill in the next column where the match is found.
It has to be an exact match.
The formula worked and had most of the columns and rows auto-completing. As I looked through, I started to identify some errors. There were release dates that were way off. No matter what I tried to do to improve or amend the formula, this problem was ubiquitous.
Won’t tell you how long it took me to figure out what was going wrong. Turns out the issue was down to the fact that the values in the column be were not unique, which meant that the formula would be finding the same value in other places. This would lead to the wrong values being used.
The values highlighted in red were the tell-all signs. Lord knows I wish Lucky Dube had still been releasing music in 2016, but that isn’t the reality.
To resolve this, I went back to basics. There are less than 500 songs sold in this batch of 3000 sales. Most of them were repeated. Instead of having the songs listed in the order of the days that they were sold, I went with alphabetical order. This allowed me to see the songs in clusters. Once I found details of a song, I could paste over multiple rows in one go.
Once the system was in place, it didn’t take more than an hour to have all the data. With the first 3000 data points in place, visualisation of the data was next.
Analysis and Visualisation
I was very excited about this. As soon as the data was ready, I opened up Tableau, ready to drop some knowledge. Unfortunately, it didn’t take long for me to realise that I needed to further categorise and simplify the data for visualisation. To do this, I needed to know how many instances each genre occurred within our dataset. I also needed to know how many instances each country occurred.
To achieve this, I used excel’s CountIf formula as shown here:
=COUNTIF(B2:B3001,”EDM”)
Look in column B
Start from Row 1 and stop at Row 30001
Find where EDM is mentioned
Count each instance
Worked like a charm!
Next was to make sense of the release years. I decided that it was best to do these in decades as opposed to individual years. We made use of the same COUNTIF formula.
=COUNTIFS(E3:E3002,”>2009″, E3:E3002,”<2020″)
Look in Column E
Start from Row E3 and stop at Row 3002
Look for a number greater than 2009
Make sure that number is not greater than 2020
Count instance that meets this criterion.
Results & Anomalies
The first anomaly is a statistical one. I will likely figure out where it has gone wrong with a bit more digging. We are 18 data points over where we should be for the Years metric. This will mean that our data will be inaccurate to a degree of 0.6%. I have repeatedly checked each category within this metric to see if one has been overstated, but nothing is showing up. I’ve also rerun the formula to make sure no error was made there, and nothing came up there either.
The saving grace is that this is the only metric that has an error in it. Looking at the results from afar, one can say that the deviation is negligible. The results are still a true representation of the data.
At the time of writing this, there is a local artist on the top of the charts. ExQ leads the pack with Wakatemba (feat. Tocky Vibes). He features again in 10th with Zuva Rese which he did with Ti Gonzi. To have 2 Hip Hop artists in there is massive.
As we have previously discussed in another post, this sadly does not translate to much financially. There are factors responsible for this, such as the financial state of the nation. Most people simply do not have disposable income. The large part of the demography that buys music does not do it through iTunes. We could go on. Be sure to read the other post to delve into the data.
Our key findings are as follows.
About 80% of the top 10 songs on the charts at any given time are by international artists.
A number-one single generates an average of <$10 in a week of sales.
The top-selling songs are in English.
The top-selling local songs are in Shona.
The top-selling local genre is Gospel.
The most consistent local gospel artist on the charts is Minister Michael Mahendere
The most consistent local artists on the charts are Minister Michael Mahendere and Jah Prayzah.
The most consistent African artists on the charts are Davido and AKA.
The most consistent International artists on the charts are Adele and Ed Sheeran.
South African Hip Hop sells more than Zim Hip Hop in Zimbabwe.
Discussion
There is a clear relationship between religion and economics. This, of course, is mirrored in Zimbabwean society at large. As a long term fan of Hip Hop and Zim DanceHall, it is refreshing to see these genres begin to climb up the charts consistently.
This is a phenomenon that will likely continue in to grow as the younger Hip Hop and DanceHall heads become more financially independent. An uptick in the economy will undoubtedly see this speed up significantly. No one can say when or if this is to ever occur. We can only hope and wait.
Another key thing that we expected and noticed is that the music that my father and his father would have listened to still sells. I am talking about James Chimombe, System Tazvida, Simon Chimbetu, Four Brothers, Tuku and so on. The economics have yet to trickle a generation down, which would be one explanation for this.
Conclusion
I think the data speaks for itself. We are the only nation in the iTunes Charts where international acts consistently eclipse local acts. Let that sink in. We simply do not support our own. In some ways, I feel as though we do not appreciate our own artistry until it is appreciated by the world.
Incredible artists such as Rationale are killing it globally. Would he be doing as well if he was based in Zim? No! Would the talent be diminished? No, yet in the eyes of his own people, it would be. There has to be a shift in the mentality of the audience before we can see a proper change.