Nas vs. DOOM: A Model-Based Text Analysis with Python
In this post, we’ll return to analyzing rap lyrics using statistical and data analytic tools (the first posts of this blog dealt primarily with this topic). Specifically, we will look at the collective work, in the form of the songs from all official studio albums, of two elder rap statesmen: Nas and DOOM. The analysis has three goals: 1) clean and explore the text data, 2) build a model to predict the rapper (i.e. Nas or DOOM) of a song, given the song lyrics, and 3) use the results of our model to understand the linguistic choices that differentiate the two artists.
I considered all of the text from official studio albums for each artist (list created by consulting the respective artist Wikipedia pages). This analysis was undertaken in the beginning of November 2017, and all albums released at that point were used. For Nas, the following albums were included: Illmatic, It Was Written, I Am…, Nastradamus, Stillmatic, The Lost Tapes, God’s Son, Street’s Disciple, Hip Hop is Dead, Untitled, and Life Is Good.
I included the following DOOM albums: Operation: Doomsday, Take Me to Your Leader, Vaudeville Villain, Madvillainy, Venomous Villain, MM..FOOD, The Mouse and the Mask, Born Like This, Key to the Kuffs, and NehruvianDOOM. Note that albums of instrumentals and work released prior to the adoption of the DOOM persona are not considered, while albums created with collaborators (e.g. Madvillainy, The Mouse and the Mask, Key to the Kuffs) are included.
The lyrics were scraped from genius.com, a website that allows users to transcribe and annotate song lyrics. Although I won’t go into the details in this blog post, I used the Python module Beautiful Soup to scrape the lyrics. Beautiful Soup works really well, is relatively easy to use, and made the task of obtaining these data very straightforward.
The lyric dataset contains one row for each song. For the purposes of the present analysis, we will focus on two main columns in our dataset. The first is the column containing the song lyrics, called “lyrics_clean” (though, as we’ll see below, this field still requires some serious cleaning before being ready for analysis). The second is the column called “rapper,” which can take on two values: “Nas” or “DOOM.”
My scraping program returned 340 songs in total; our original raw dataset therefore contains this many rows.
The head of this dataset can be seen below:
Data Pre-Processing & Exploratory Analysis
As can be seen in the above screenshot, the text is pretty messy. There are newline characters (indicated by “\n”), information about the producers (e.g. “Faith N & Nas” in the first line), the artist (e.g. “Nas,” “Grand Wizard + Nas”), and the song structure (“Intro,” “Hook,” etc.).
In order to clean up the text, we’ll adapt a function we’ve used in previous posts. This function removes the carriage returns in the text, removes the text between brackets (e.g. [Intro]), removes non-letters and numbers, converts the text to lower case, removes stopwords, stems the text, and removes any remaining words with only 1 character. Note that we have to slightly adapt our stemming algorithm, otherwise it stems “Nas” (one of the artists in our dataset) to “na.”
Our function looks like this:
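A minimal sketch of such a cleaning function follows. It is an illustration under stated assumptions rather than the exact code used for this analysis: the function name clean_lyrics, the short stopword set, and the PROTECTED set are all hypothetical, and I read “removes non-letters and numbers” as keeping letters only. Stemming uses NLTK’s PorterStemmer, with “nas” exempted so that it is not reduced to “na.”

```python
import re

from nltk.stem import PorterStemmer

# Small illustrative stopword set (assumption); a fuller list would be used in practice.
STOPWORDS = {"a", "an", "and", "the", "i", "my", "me", "is", "are",
             "to", "of", "in", "it", "from", "just"}

# Tokens exempt from stemming, so that "nas" is not stemmed to "na" (assumption).
PROTECTED = {"nas"}

stemmer = PorterStemmer()


def clean_lyrics(raw_text):
    """Return a cleaned, stemmed, space-separated version of raw_text."""
    text = raw_text.replace("\n", " ")        # drop line breaks
    text = re.sub(r"\[.*?\]", " ", text)      # drop bracketed tags such as [Intro]
    text = re.sub(r"[^a-zA-Z]", " ", text)    # keep letters only
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [t if t in PROTECTED else stemmer.stem(t) for t in tokens]
    tokens = [t for t in tokens if len(t) > 1]  # drop one-character leftovers
    return " ".join(tokens)


print(clean_lyrics("[Intro: Nas]\nNas, the STREETS!"))
```

Line breaks are replaced before the bracket removal so that a tag spanning several lines is still matched as a single bracketed span.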
Which we apply to our dataset (called nas_doom) with a list comprehension:
We can look at the first text in our raw data, which begins like this (I can’t share all the lyrics because I don’t have the rights to them):
And the data cleaned by the function, which begins like this:
We appear to have successfully removed the mess and retained only the essential parts of the text!
Exploring Word Counts by Artist
Before proceeding to the modeling, let’s first explore the word counts for the songs in our dataset. We can create a count of the words in the cleaned texts with the function presented below. When exploring the word counts, I noticed that some were quite low. This happened with instrumental tracks for which there are no or very few lyrics. I therefore removed these songs from the data. The following code counts the words of the cleaned texts and removes rows for which the word count is 10 or fewer (there were 6 such songs):
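As a hedged illustration of this step, here is a sketch using pandas on a toy stand-in for the dataset. The column names cleaned_text and word_count are assumptions, and the three rows are made up.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical rows and column names).
nas_doom = pd.DataFrame({
    "rapper": ["Nas", "DOOM", "DOOM"],
    "cleaned_text": [
        " ".join(["word"] * 12),   # 12 words: kept
        "beat",                    # 1 word: dropped (e.g. an instrumental)
        " ".join(["rhyme"] * 11),  # 11 words: kept
    ],
})

# Count whitespace-separated words in each cleaned text.
nas_doom["word_count"] = nas_doom["cleaned_text"].str.split().str.len()

# Keep only songs with more than 10 words.
nas_doom = nas_doom[nas_doom["word_count"] > 10].reset_index(drop=True)
print(nas_doom[["rapper", "word_count"]])
```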
With these rows removed, our dataset contains 334 songs: 185 by Nas and 149 by DOOM. Let’s use seaborn to examine the distributions of word counts for the two artists:
Which yields the following plot:
It is clear that Nas songs tend to have more words than DOOM songs. The mean (cleaned) word count for Nas is 313.14 (SD = 88.34) while the mean for DOOM is 240.75 (SD = 103.11).
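Per-artist means and standard deviations like these can be obtained with a grouped aggregation. The sketch below uses made-up word counts (not the real data) and assumed column names.

```python
import pandas as pd

# Made-up word counts (hypothetical values, not the real data).
songs = pd.DataFrame({
    "rapper": ["Nas", "Nas", "Nas", "DOOM", "DOOM", "DOOM"],
    "word_count": [300, 320, 340, 200, 240, 280],
})

# Mean and sample standard deviation of word count for each artist.
summary = songs.groupby("rapper")["word_count"].agg(["mean", "std"])
print(summary.round(2))
```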
Most Frequent Words
As a final exploratory analysis, let’s examine the most frequently-occurring words in our cleaned data. We can use sklearn’s word count vectorizer to count the occurrences of each word, and visualize the results using seaborn.
The code below calculates the frequency of each word (called ‘unigrams’ in the natural language processing world) and two-word combination (i.e. ‘bigrams’). It then stores the unigrams/bigrams and their respective frequencies in a dataframe, and plots the 15 most frequent ones.
Which yields the following plot:
“Like” is the most frequently occurring word in our data, appearing 1368 times (on average 4.02 times per song). In my experience, comparing objects, people or actions is quite frequent in rap lyrics (indeed, the first line from Nas’ Illmatic quoted above -Street’s disciple, my raps are trifle / I shoot slugs from my brain just like a rifle- is a great example). I note that others have also pointed out that simile use is quite common in hip-hop.
Other notable frequently-occurring words include “get” and “got.” Clearly, obtaining other things is an important subject in many of these songs. It was interesting to see that “time” (or times - remember that our stemming converts both to “time”) occurred so frequently in these data. Some examples include: time to grind, hard times, tough times, took more time to write in my book of rhymes, triple that times three, doing time [serving a prison sentence], and have an iller rhyme, at least by Miller Time.
Document-Term Matrix and Train-Test Split
Before we can begin modeling, we need to represent our text data in a matrix format. I go over this in more detail in a previous blog post; in short, we create a document-term matrix (documents in the rows, words in the columns) and include a column for every word in our dataset. We will extract binary indicators: if a word is present in a song, the relevant column for the given row takes on a value of 1; otherwise it takes on a value of zero. We only retain words which appear in 25 or more songs.
We use sklearn’s count vectorizer, specifying we want binary indicators, unigrams and bigrams (as we did above), and then split the data into training and test samples with the following code:
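The following sketch shows one way this could look on toy data. The documents and labels are invented, MIN_DF is lowered to 2 so that the tiny example retains some features (the analysis itself keeps terms appearing in 25 or more songs), and the 25% test size, the random seed, and the stratified split are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy cleaned texts and labels (hypothetical); 1 = Nas, 0 = DOOM.
docs = [
    "street ghetto nas", "ghetto street", "nas street reason", "street nas",
    "villain beat doom", "doom beat", "villain probabl doom", "beat villain",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

MIN_DF = 2  # minimum number of songs a term must appear in (25 in the analysis)

# Binary indicators for unigrams and bigrams that occur in at least MIN_DF songs.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2), min_df=MIN_DF)
dtm = vectorizer.fit_transform(docs)

# Hold out a quarter of the songs for testing (assumed proportion).
X_train, X_test, y_train, y_test = train_test_split(
    dtm, labels, test_size=0.25, random_state=42, stratify=labels
)

print("songs x features:", dtm.shape)
```

The second element of dtm.shape is the number of text features that survive the document-frequency cutoff.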
We can see how many different features were extracted using this method with the following code:
With our data and the parameters supplied above, we extract 589 different text features.
We will first create a random forest model with our training data to predict the rapper, using the words from the song lyrics as predictive features. We will use a similar approach for building the random forest model as we did in a previous post. We then extract the feature importances from our model and plot them using seaborn. The following code accomplishes all of this:
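Here is a minimal sketch of the modeling step, with the plotting left out. The tiny binary feature matrix is invented, and the hyperparameters (200 trees, a fixed seed) are assumptions rather than the settings used in the analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy binary document-term matrix (hypothetical features and rows).
feature_names = ["ghetto", "street", "beat", "villain"]
X_train = np.array([
    [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0],   # Nas-like songs
    [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1],   # DOOM-like songs
])
y_train = np.array(["Nas", "Nas", "Nas", "DOOM", "DOOM", "DOOM"])

# Fit the forest (hyperparameters are assumptions).
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

# Collect the impurity-based feature importances, largest first.
importances = (
    pd.DataFrame({"feature": feature_names,
                  "importance": forest.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

Note that these importances are non-negative and sum to one, so they rank features by predictive usefulness without saying which artist a feature points to.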
And gives us the following plot of the feature importances:
We can see that the rappers’ names are (of course) very predictive: Nas, DOOM, and Villain (a moniker sometimes used by DOOM) are among the most-predictive features in our text data. References to urban life (often typical of Nas’ work, e.g. ghetto and street) also feature prominently.
Let’s use our model to predict the rappers of the songs in our hold-out test data and create a confusion matrix:
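A sketch of the confusion matrix step with sklearn. To keep the example self-contained, the predictions are hard-coded stand-ins for what a fitted model’s predict method would return on the test set; the labels are invented.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for six test songs.
y_true = ["Nas", "Nas", "Nas", "DOOM", "DOOM", "DOOM"]
y_pred = ["Nas", "Nas", "DOOM", "DOOM", "DOOM", "Nas"]

classes = ["DOOM", "Nas"]

# Rows are the true artist, columns are the predicted artist.
cm = confusion_matrix(y_true, y_pred, labels=classes)
cm_table = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in classes],
    columns=[f"predicted {c}" for c in classes],
)
print(cm_table)
```

Passing labels explicitly fixes the row and column order, which avoids misreading which off-diagonal cell is which.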
Which gives us the following confusion matrix:
On the whole, we do quite well! There are 2 songs in our test data that are by Nas but which our model predicts are by DOOM. There are 4 songs in our test data which are by DOOM but which our model predicts are by Nas.
We will now create a LASSO regression model to predict the rapper, given the song lyrics. We model using the same training and test sets as we did for the random forest model above. (For more details about LASSO regression please see this previous post).
We then extract the features with the largest positive and negative penalized coefficients and plot them. The following code accomplishes all of this:
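As a hedged sketch, an L1-penalized (LASSO-style) logistic regression can be fit with sklearn as shown below. The toy data, the penalty strength C, and the names TOP_K, most_nas, and most_doom are assumptions; the outcome is coded 1 for Nas and 0 for DOOM so that positive coefficients point toward Nas.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy binary document-term matrix (hypothetical features and rows).
feature_names = ["ghetto", "street", "beat", "villain"]
X_train = np.array([
    [1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0],   # Nas-like songs
    [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1],   # DOOM-like songs
])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = Nas, 0 = DOOM

# L1 penalty; C (inverse penalty strength) is an assumed value.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=10.0,
                           random_state=42)
lasso.fit(X_train, y_train)

# Sort the penalized coefficients from most negative to most positive.
coefs = (
    pd.DataFrame({"feature": feature_names, "coefficient": lasso.coef_[0]})
    .sort_values("coefficient")
    .reset_index(drop=True)
)

TOP_K = 2
most_doom = coefs.head(TOP_K)   # largest negative coefficients
most_nas = coefs.tail(TOP_K)    # largest positive coefficients
print(coefs)
```

In a real analysis the penalty strength would typically be chosen by cross-validation rather than fixed by hand.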
And returns the following plot of the largest positive and negative penalized coefficients:
Unlike the random forest feature importances, the penalized regression coefficients give us a sense of the direction of a given feature, i.e. whether it is positively or negatively predictive of Nas (vs. DOOM) being the rapper. The LASSO regression coefficients are therefore more informative, as they allow us to understand the nature of the relationship between a given predictor and the outcome variable. With the random forest feature importances shown above, we simply know that a feature is predictive, but not in which direction.
Note that I have colored the positive coefficients (i.e. features which, if present, indicate that Nas is the likely rapper) in orange and the negative coefficients (i.e. features which, if present, indicate that DOOM is the likely rapper) in green. This makes it quick and easy to see which words are more predictive of Nas vs. DOOM, respectively. I would argue that the LASSO regression and the plot of the penalized coefficients provide more insight into our problem than the random forest and the resulting feature importances plot.
We can use the LASSO regression model to predict the rapper for each song in our hold-out test dataset and create a confusion matrix with the following code:
Which produces the following confusion matrix:
The results seem comparable to the random forest model above!
Let’s compare the model performance in a more structured way via ROC curves.
We first calculate the predicted probabilities for both models, and then create and visualize the ROC curves with the following code (adapted from that used in previous posts):
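A sketch of the ROC computation with sklearn, leaving out the plotting. The predicted probabilities below are hard-coded, made-up stand-ins for the output of each model’s predict_proba method, so the resulting numbers say nothing about the real models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical test labels (1 = Nas, 0 = DOOM) and predicted P(Nas) per model.
y_test = np.array([0, 0, 1, 1])
predicted_proba = {
    "random forest": np.array([0.10, 0.40, 0.35, 0.80]),
    "LASSO": np.array([0.20, 0.30, 0.60, 0.90]),
}

roc_results = {}
for model_name, proba in predicted_proba.items():
    # False and true positive rates at each probability threshold.
    fpr, tpr, _ = roc_curve(y_test, proba)
    roc_results[model_name] = {
        "fpr": fpr,
        "tpr": tpr,
        "auc": roc_auc_score(y_test, proba),
    }
    print(model_name, "AUC:", round(roc_results[model_name]["auc"], 3))
```

Plotting each model’s tpr against its fpr on the same axes gives the usual ROC comparison, and the area under each curve summarizes it in a single number.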
Which gives us the following plot:
The random forest and the LASSO regression model have identical areas under the ROC curve!
Substantive Interpretation - What Have We Learned From This Analysis?
Finally, what does this analysis teach us about the language and thematic content of songs on the Nas and DOOM albums?
We first see that name-checks (i.e. rappers mentioning themselves in the lyrics) are the best predictors of the song artist. Indeed, Nas, DOOM and Villain are the most predictive features in this analysis.
Second, Nas’ depiction of the difficulties of urban life (a theme which characterized his ground-breaking debut album Illmatic and one which is viewed as an important subject in all of his subsequent work) comes across quite clearly. This theme, exemplified by the term ghetto, is clearly a lyrical differentiator between Nas and DOOM, with Nas using the term much more frequently.
Third, Nas’ use of what some might call “offensive language” is a prominent linguistic differentiator. Indeed, b-tch and what I’ll term the n-word are both among the top words that Nas uses more than DOOM. I’m not the right person to comment on the broader cultural and societal implications of this. I will say that, as someone who listens to a lot of hip-hop music, this type of language is quite often used, almost like a lyrical trope in some cases. Nas’ lyrics are clearly more characteristic of this style than are DOOM’s.
Fourth, we perhaps get a sense of the rappers’ thinking styles. DOOM seems more comfortable with uncertainty, using the term probably to indicate likelihood but not certainty (this way of thinking is near-and-dear to an old statistician’s heart). Here is an illustrative example of this use from the Vaudeville Villain album: If these walls could talk, they’d probably still ignore me. Nas, in contrast, is more likely to use the term reason. Examination of the lyrics in this dataset yields a number of examples of Nas seeking explanations for why things occur (e.g. Only reason I’m here now is cause God chose me), suggesting an importance to understanding process and structure in the world. Why are things the way they are? Nas wants to know; DOOM just says - “it’s probably like this.”
Fifth, there are two references to musical qualities. DOOM makes reference to the beat, an important musical quality in hip-hop and one for which DOOM (a prominent producer in his own right) is known. Nas, meanwhile, makes references to singing. This is an interesting and unconventional (although not entirely inappropriate) way to describe rapping, and suggests an importance placed on the human element of the musical performance in hip-hop (as opposed to the more mechanical or electronically-produced creation of beats). An illustrative example from Nas’ It Was Written album: They use me wrong so I sing this song.
In this post, we analyzed songs on the official studio albums of Nas and DOOM. We first cleaned the text of the lyrics, examined the cleaned word counts of songs by the respective artists, and visualized the frequent words in our corpus. We then made two models to predict the rapper given the song lyrics, and evaluated the model performance on our hold-out test dataset. While the performance of both models was quite good, the LASSO regression gave results that were more insightful and helped us to better understand the relationships between the predictive text features and our outcome (song artist). Examination of the features that distinguish Nas songs from DOOM songs gave us some insight into the themes the artists frequently use, their thinking styles, and which musical qualities the artists emphasize in their lyrics.
Coming Up Next
In the next post, we’ll return to the data on Pitchfork music reviews, and will use R and the tidytext framework to analyze the sentiment of the review texts.