Showing Some Respect for Data Munging, Part 2: Pitchfork Dataset Cleaning

2019, Jul 09    

So far on this blog, we’ve used the Pitchfork music review data (available on Kaggle) for a number of different data analyses. I’ve found this to be a particularly interesting dataset, and I’ve learned a great deal doing all of these small, fun projects.

However, before I could do any data analysis, I had to spend some time and effort on data preparation. Specifically, the raw data from Kaggle are contained in a number of separate tables in an SQLite database, and they therefore required a bit of work to clean and transform into a single tidy dataset.

The goal of this post is to give an overview of the data munging process for the Pitchfork data. Data preparation (or data munging) is an important topic in data science that receives far too little attention in most conversations about applied analytics. Indeed, most of the posts on this blog are about the “sexier” aspects of data analysis, but from time to time it’s good to highlight the important prep work that is necessary before one can actually analyze data.

Code Repo on Github

All the code is available on Github. The repo contains a Jupyter notebook that reads the data (available on Kaggle - I can’t distribute it myself) and performs the extraction, cleaning, and merging of the Pitchfork review data to create a final tidy dataset.

We will not go over all of the code in this blog post. Rather, we will focus on a few of the key steps that illustrate some of the issues which need to be solved in order to create the tidy data structure of our final dataset.

Extracting the Data from SQLite

The first steps are to import the libraries we’ll need and define the directory where the data are stored. We then connect to the SQLite database and print the names of the tables contained therein.

  
# import needed libraries
import sqlite3, datetime
import pandas as pd
import numpy as np

# define directory where the data are stored
in_dir = 'C:\\Directory\\'

# What are the tables in the database?
con = sqlite3.connect(in_dir + 'database.sqlite')
pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", con)

Which returns:

name
0 reviews
1 artists
2 genres
3 labels
4 years
5 content

There are 6 different tables in our SQLite database: reviews, artists, genres, labels, years, and content. We will extract all of the data from each table into a separate data frame, and close the connection to the SQLite database:

  
# extract the tables from sqlite to Pandas dataframes
reviews = pd.read_sql('SELECT * FROM reviews', con)
artists = pd.read_sql('SELECT * FROM artists', con)
genres = pd.read_sql('SELECT * FROM genres', con)
labels = pd.read_sql('SELECT * FROM labels', con)
years = pd.read_sql('SELECT * FROM years', con)
content = pd.read_sql('SELECT * FROM content', con)
# and close the connection to the SQLite database
con.close()

We now have 6 different data frames, each one containing all of the data from a single table in the original SQLite database.
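
As a quick extra sanity check (not part of the original notebook), we can print the shape of each dataframe to confirm that the extraction worked:

# optional sanity check: print the shape of each extracted dataframe
for name, df in [('reviews', reviews), ('artists', artists), ('genres', genres),
                 ('labels', labels), ('years', years), ('content', content)]:
    print(name, df.shape)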

Munging the Reviews Data

We will first munge the dataset entitled “reviews.” The dataframe looks like this (first 5 rows shown):

reviewid title artist url score best_new_music author author_type pub_date pub_weekday pub_day pub_month pub_year
0 22703 mezzanine massive attack http://pitchfork.com/reviews/albums/22703-mezz... 9.3 0 nate patrin contributor 2017-01-08 6 8 1 2017
1 22721 prelapsarian krallice http://pitchfork.com/reviews/albums/22721-prel... 7.9 0 zoe camp contributor 2017-01-07 5 7 1 2017
2 22659 all of them naturals uranium club http://pitchfork.com/reviews/albums/22659-all-... 7.3 0 david glickman contributor 2017-01-07 5 7 1 2017
3 22661 first songs kleenex, liliput http://pitchfork.com/reviews/albums/22661-firs... 9.0 1 jenn pelly associate reviews editor 2017-01-06 4 6 1 2017
4 22725 new start taso http://pitchfork.com/reviews/albums/22725-new-... 8.1 0 kevin lozano tracks coordinator 2017-01-06 4 6 1 2017

The code below checks whether there are any duplicate review IDs. In fact, there are 4. We then print the rows for one of the duplicated review IDs:

  
# there are four duplicated review id's
reviews.reviewid.duplicated().sum()

# which review ids are they?
reviews.reviewid[reviews.reviewid.duplicated()]

# look at the first one
reviews[reviews.reviewid == 9417]

Which gives us the following:

reviewid title artist url score best_new_music author author_type pub_date pub_weekday pub_day pub_month pub_year
12116 9417 radiodread easy star all-stars http://pitchfork.com/reviews/albums/9417-radio... 7.0 0 joe tangari contributor 2006-10-11 2 11 10 2006
12120 9417 radiodread easy star all-stars http://pitchfork.com/reviews/albums/9417-radio... 7.0 0 joe tangari contributor 2006-10-11 2 11 10 2006

It looks like, for whatever reason, four of the rows are duplicated in the reviews data. My guess is this has something to do with the scraping program that collected the data from the Pitchfork website, but it’s hard to know for sure.

We will simply remove the duplicated rows (and reset the index on the Pandas dataframe). I always print the shapes of the dataframes when doing this type of operation, just to make sure I’ve actually done the subsetting properly. The following code drops the duplicate rows and prints the shapes of the original and de-duplicated dataframes:

  
# conclusion- there are 4 duplicated reviewid's in the reviews database
# but as the lines are the same- we can simply drop duplicates
# we also reset the index
reviews_final = reviews.drop_duplicates().reset_index(drop = True)
# check that we've removed them
# looks ok!
print(reviews.shape)
print(reviews_final.shape)

Which returns:

 
(18393, 13)
(18389, 13)

We have in fact removed 4 rows!
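
If we want an explicit guarantee, we can also assert that the review IDs are now unique (a small extra check, not in the original code):

# extra check: every review id should now appear exactly once
assert reviews_final.reviewid.is_unique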

Munging the Genres Data

The genres data looks like this:

reviewid genre
0 22703 electronic
1 22721 metal
2 22659 rock
3 22661 rock
4 22725 electronic

This format seems pretty straightforward. However, I again checked for duplicate review IDs (code below), and found that there were 4,291 of them. Why does this happen? The code below does the duplicate check, extracts the first 5 duplicated review IDs, and displays the data for one of them:

  
# many duplicated review id's
genres.reviewid.duplicated().sum()

# what are some of the review ids?
genres.reviewid[genres.reviewid.duplicated()][0:5]

# example duplicate review id
genres[genres.reviewid == 8005]

Which gives us the following result:

reviewid genre
22607 8005 jazz
22608 8005 pop/r&b
22609 8005 electronic

This seems quite obvious, in retrospect. Not all albums contain music from a single genre; when an album contains a mix of genres, it is assigned multiple genres (and therefore multiple rows) in the data.
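
To see how common multi-genre albums are, a quick check (again, not part of the original notebook) counts the genre rows per review id:

# distribution of the number of genre rows per review id
genres.groupby('reviewid')['genre'].size().value_counts()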

We are ultimately interested in producing a dataset with one row per album. We therefore need to reshape these data from the long format (with one row per review id-genre combination) to the wide format (with one row per review id). We will use the Pandas get_dummies function to produce dummy variables (also sometimes called “one-hot encoding”) from this information.

We can produce the matrix of dummy variables, which puts each unique genre value in its own column, with a value of “1” when the row contains that genre and “0” when it does not:

  
# make pandas dummy variables out of the genres
# still multiple rows per review id here
dummy_genres = pd.get_dummies(genres.genre)
# the number of rows is still the same as the genres table
print(dummy_genres.shape)
print(genres.shape)

Which returns the following to the console:

 
(22680, 9)
(22680, 2)

We have kept the same number of rows, but increased the number of columns. Our dummy_genres dataframe looks like this:

electronic experimental folk/country global jazz metal pop/r&b rap rock
0 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 1
4 1 0 0 0 0 0 0 0 0

We now need to aggregate the data to the level of the review id - this will give us the genre data with one row per album, which is what we need. Because of the dummy (0/1) format of our data, if we group the dataframe by review id and take the sum of the dummy columns, we should get the data we want: a single row per review id, with binary indicators for genre, including the cases where albums have multiple genres.

The following code concatenates the review id with the dummy variable data, groups the data by review id and sums all of the dummy columns, and then checks the maximum value of each dummy column. If we have understood the data structure and done everything properly, the dummy columns should each have a maximum value of 1.

  
# merge the dummies back into the genre database
genres_wdummies = pd.concat([genres.reviewid,dummy_genres], axis = 1)
print(genres_wdummies.shape)

# aggregate to the reviewid level; we take the sum of the dummies
# each dummy should only exist once
# we essentially get boolean indices for each reviewid 
# of the genres represented by the album for a given reviewid
genres_wdummies_gb = genres_wdummies.groupby('reviewid').sum()
print(genres_wdummies_gb.shape)

# some have more than 1 entry per genre!
genres_wdummies_gb.max()

The last line of code returns the following result:

 
electronic      1
experimental    1
folk/country    1
global          2
jazz            1
metal           2
pop/r&b         1
rap             1
rock            2
dtype: uint8

Three of the genre variables have maximum values of 2! How can this possibly happen?

Let’s check out the review that contains the “global” genre twice:

  
# one review with global twice
# we recognize this review id!
# it's one of the ones that was repeated above
genres_wdummies_gb.index[genres_wdummies_gb['global'] > 1]

This returns the review id 9417. This is one of the duplicate review IDs we saw when munging the reviews data above!

What do the raw data look like for this review id?
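
We can pull up the raw rows with the same kind of filter we used earlier:

# raw genre rows for the duplicated review id
genres[genres.reviewid == 9417]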

reviewid genre
14626 9417 global
14631 9417 global

This is, as we saw with the reviews data above, a problem of duplicate entries in our data. It is again unclear why certain reviews appear multiple times in the database, but this issue of duplicates is something that we have to stay attentive to throughout the data munging process.

The solution I chose here was to keep the data I had produced and simply binarize the genre columns. In other words, a value of 0 remains 0; all other values are set to 1.

The code below performs this binarization and checks the maximum of the dummy columns to verify that it worked.

  
# conclusion: we can just binarize everything - if it's 2, it should be 1
# binarize the dataframe: set all values to 0 or 1
genres_final = genres_wdummies_gb.apply(lambda x: np.where(x == 0, 0, 1), axis = 0)
print(genres_final.shape)
genres_final.reset_index(level=0, inplace=True)
print(genres_final.shape)

# now we're good!
genres_final.max()

The last line of code returns the following:

  
reviewid        22745
electronic          1
experimental        1
folk/country        1
global              1
jazz                1
metal               1
pop/r&b             1
rap                 1
rock                1
dtype: int64

It looks like we have solved the issue!

The head of our cleaned genres dataset, called genres_final, looks like this:

reviewid electronic experimental folk/country global jazz metal pop/r&b rap rock
0 1 0 0 0 0 0 0 0 0 1
1 6 1 0 0 0 0 0 0 0 0
2 7 1 0 0 0 0 0 0 0 0
3 8 1 0 0 0 0 0 0 0 0
4 10 0 0 0 0 1 0 0 0 0

Final Dataset

The Jupyter notebook in the Github repo contains the code for munging all 6 of the component tables and merging them into a single final dataframe with one row per album/review id.
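
As a rough sketch of that final merging step (assuming the cleaned component dataframes built above; the actual notebook also merges the cleaned artists, labels, years, and content tables), the pieces are joined on reviewid:

# sketch: merge the cleaned reviews and genres data on reviewid
# (the full notebook joins all 6 cleaned component tables in this way)
master = reviews_final.merge(genres_final, on = 'reviewid', how = 'left')
print(master.shape)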

The final output is a master dataset containing reviews of 18,389 albums. The head of the final cleaned data looks like this:

reviewid title artist url score best_new_music author author_type pub_date pub_weekday pub_day pub_month pub_year content year1 year2 electronic experimental folk/country global jazz metal pop/r&b rap rock label1 label2 label3 label4 label5 artist1 artist2 artist3 artist4 artist5 artist6 artist7
0 22703 mezzanine massive attack http://pitchfork.com/reviews/albums/22703-mezz... 9.3 0 nate patrin contributor 2017-01-08 6 8 1 2017 “Trip-hop” eventually became a ’90s punchline,... 1998.0 NaN 1 0 0 0 0 0 0 0 0 virgin NaN NaN NaN NaN massive attack NaN NaN NaN NaN NaN NaN
1 22721 prelapsarian krallice http://pitchfork.com/reviews/albums/22721-prel... 7.9 0 zoe camp contributor 2017-01-07 5 7 1 2017 Eight years, five albums, and two EPs in, the ... 2016.0 NaN 0 0 0 0 0 1 0 0 0 hathenter NaN NaN NaN NaN krallice NaN NaN NaN NaN NaN NaN
2 22659 all of them naturals uranium club http://pitchfork.com/reviews/albums/22659-all-... 7.3 0 david glickman contributor 2017-01-07 5 7 1 2017 Minneapolis’ Uranium Club seem to revel in bei... 2016.0 NaN 0 0 0 0 0 0 0 0 1 static shock fashionable idiots NaN NaN NaN uranium club NaN NaN NaN NaN NaN NaN
3 22661 first songs kleenex, liliput http://pitchfork.com/reviews/albums/22661-firs... 9.0 1 jenn pelly associate reviews editor 2017-01-06 4 6 1 2017 Kleenex began with a crash. It transpired one ... 2016.0 NaN 0 0 0 0 0 0 0 0 1 kill rock stars mississippi NaN NaN NaN kleenex liliput NaN NaN NaN NaN NaN
4 22725 new start taso http://pitchfork.com/reviews/albums/22725-new-... 8.1 0 kevin lozano tracks coordinator 2017-01-06 4 6 1 2017 It is impossible to consider a given release b... 2016.0 NaN 1 0 0 0 0 0 0 0 0 teklife NaN NaN NaN NaN taso NaN NaN NaN NaN NaN NaN

This is the dataset that has served as input for all of the different posts I have done using the Pitchfork data!

Summary and Conclusion

This post was completely dedicated to the data munging process. We started with an SQLite database containing 6 different tables, and cleaned them one by one, merging them all to produce a final tidy dataset with one row per album and all of the information about each album contained in the columns. The complete code is available on Github, and the data are available on Kaggle.

In popular discussions of data science or statistics, one often starts the narrative with a cleaned dataset that is appropriate for the analysis to be presented. In real life, however, this is rarely the case. For this reason, we focused here on munging 2 of the 6 tables: the reviews table and the genres table, which provide a good illustration of some common pitfalls.

Both the reviews table and the genres table contained duplicate data. I wasn’t expecting this, and the documentation for the data made no mention of the issue. Based on the data collection procedure, my assumption is that something went wrong during the scraping of the data, such that 4 of the albums were scraped twice. Odd “surprises” like this are very common when working on applied data problems, and being vigilant and doing checks like the ones described above is critical to producing high-quality, analysis-ready datasets. It is very dangerous to assume that the data someone else gave you (or that you found on the internet) are correct and “ready-to-analyze.”

I made a poor assumption in my first check of the genres dataset above. I did check for duplicate review IDs, but I assumed the duplication was simply due to the fact that each album can have multiple genres. I assumed everything was fine and continued with the transformation from long to wide format. It was only in a subsequent check that I realized there were duplicate rows in the original table. Because I checked multiple times, I was able to catch the problem and correct it. I think this serves as a nice example of the importance of checking and testing data throughout the munging process!

Coming Up Next

In the next post, we will analyze data on my walking behavior from two different sources: the Accupedo app on my phone, and the Fitbit I wear on my wrist. We will use visualization and basic statistical techniques (in R!) to examine whether these two devices give similar measurements of the number of steps I take each day.

Stay tuned!