In the previous two posts, I described some analyses of a dataset containing characteristics of 2000 different wines. We used easily-analyzable data such as year of production and appellation region to predict wine price (a regression problem) and to classify wines as red vs. white (a classification problem).
[Edit: the data used in this blog post are now available on GitHub.]
In this blog post, we will again use the wine dataset and the random forest algorithm to classify wines as red vs. white. However, in contrast to the previous two blog posts in which we used features of the type traditionally used in statistical and data analysis (e.g. quantitative information such as year of production or categorical variables with a limited number of levels), in this post we will use words describing the wine (the “Winemaker’s Notes”) as input for our classification model. We will again use Python for our analysis.
As described in the previous posts, the dataset contains information on 2000 different wines. Half of these wines are red wines, and the other half are white. The dataset also contains “Winemaker’s Notes” for each wine. These texts are provided by the vintner and aim to describe the wine in an appealing way. I see them as being partly descriptive of the characteristics of the wine, and partly a marketing text used to stimulate wine purchase.
Here is an example of the Winemaker’s Notes for one of the wines in the dataset:
One of the challenges of text analysis involves extracting information from the texts in such a way that it can be used in a data analytic (and, in our case, a predictive modeling) setting. The approach described in this blog post can broadly be described as making use of the bag-of-words model. First, we will process the text in order to make all letters lowercase, remove numbers, and stem the words. We will then extract individual words (unigrams) to create a document-term matrix (dtm) which will constitute the feature matrix for our classification model. This dtm will have 2000 rows, one for each wine. The terms (i.e. the words in the texts) will be contained in the columns of the matrix, with each word represented by one column.
Let’s first import the main libraries we’ll need:
The nltk library has a number of interesting functions for text analysis. We will use a mix of the nltk library and functions available in the sklearn library to conduct the analysis. (If you’re interested in understanding computer science approaches to language analysis, I can recommend the freely-available book Natural Language Processing with Python, written by the authors of the nltk library).
We’ll define a function to process the text. This code is adapted from this wonderful tutorial. In sum, we keep only the text, turn all the letters to lowercase, remove stop words (i.e. words that occur frequently but have no substantive content, such as “the”), and stem the words (i.e. remove the endings of words to retain only the root form).
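To make these steps concrete, here is a minimal sketch of such a processing function using only the Python standard library. It is not the function used for the results in this post: the stop-word list is a tiny hand-written one, and `crude_stem` is a simple suffix stripper standing in for a real stemmer (such as a Porter stemmer). The example sentence and all names are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real analysis would use a
# fuller list, such as the English stop words shipped with nltk.
STOP_WORDS = {"the", "a", "an", "and", "of", "with", "is", "are", "this", "in", "to", "on"}

# Assumption: crude suffix stripping as a stand-in for a real stemmer.
# It only illustrates the idea of reducing words to a root form.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_text(raw_text):
    """Keep letters only, lowercase, drop stop words, and stem each word."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    kept = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return " ".join(kept)


# Hypothetical sentence, not taken from the dataset.
example = "The 2012 vintage shows ripe cherries and firm tannins."
processed = process_text(example)
print(processed)
```

The year is dropped because only letters are kept, “the” and “and” disappear as stop words, and plural endings are stripped, which is why processed text reads less naturally than the original.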
In order to get a sense of what the function does, let’s take the 5th Winemaker’s Notes in the dataset, and examine the original vs. processed text. The original text was already shown above, but let’s look at it again to compare it with the processed version.
We can see the effects of converting letters to lowercase, removing the stop words, and stemming. The text is much less readable, but the goal here is to reduce the information in the text to its most essential elements in order to use it for analysis. Now that we have seen what the pre-processing function does, let’s apply this method to all of the Winemaker’s Notes in our dataset:
Now that the text is processed, we need to prepare the data for modeling. We will use the sklearn function CountVectorizer to create the document-term matrix, which as noted above will contain documents in the rows and words in the columns. We will extract single words here, though other types of text features such as bi-grams (two-word combinations such as “Napa Valley”) or tri-grams (three-word combinations) can be extracted and used as input for the model.
In this blog post, we will create binary indicators for each word. If a word is present in a given document, it will be represented as a “1” in the document-term matrix, regardless of how many times the word appeared in that document. If a word is not present in a given document, it will be represented as a “0” in the dtm. Other possibilities for word representation include: the count of each word in the document, “normalized” counts (i.e. the term count divided by the total word count in the document), and the term frequency-inverse document frequency (tf-idf) value for each word. All can be useful and are relatively easy to implement in Python, but for this example we’ll stick with binary indicators for their ease of implementation and their straightforward interpretation.
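As a small illustration of how these representations differ, the sketch below computes the raw count, binary indicator, and normalized count rows for a single made-up document. The document and vocabulary are hypothetical, and tf-idf is left out because it depends on the whole corpus rather than on one document.

```python
from collections import Counter

# Hypothetical processed document (already lowercased and stemmed).
doc = "cherri plum cherri oak tannin cherri"
vocabulary = ["cherri", "citrus", "oak", "plum", "tannin"]

counts = Counter(doc.split())
total_words = sum(counts.values())

# Raw count of each vocabulary term in the document.
count_row = [counts[term] for term in vocabulary]
# Binary indicator: 1 if the term occurs at all, 0 otherwise.
binary_row = [1 if counts[term] > 0 else 0 for term in vocabulary]
# Normalized count: term count divided by the document's total word count.
normalized_row = [counts[term] / total_words for term in vocabulary]

print(count_row)
print(binary_row)
print(normalized_row)
```

Note how “cherri” appears three times in the count row but only as a single “1” in the binary row.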
Finally, we will make a selection of words to include as predictors. Document-term matrices can often be very wide (e.g. with a great many terms in the columns, depending on the corpus or body of texts analyzed) and very sparse (as there are typically many terms which occur infrequently, and therefore provide little information). Dealing with these wide, sparse matrices is a key challenge in text analysis.
There are a number of ways to reduce the feature space by selecting only the most interesting or important words. We can set thresholds for any one of a number of different metrics which assess how “important” a given word is in a collection of documents. Word frequency, the number of documents a word appears in, and tf-idf scores can all be used to make this selection. In this example, we will simply set a minimum threshold for the number of documents a word must appear in before it is included in our dtm. Below, I set this value to 50: a word is included in our extracted document-term matrix if it appears in at least 50 of our 2000 different Winemaker’s Notes texts:
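A minimal sketch of this kind of extraction with sklearn’s CountVectorizer follows. The four toy documents and the threshold of 2 are assumptions chosen so the example runs on its own; they are not the data or the exact call behind the results reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical processed texts standing in for the Winemaker's Notes.
docs = [
    "cherri plum tannin",
    "citrus appl crisp",
    "cherri tannin oak",
    "citrus pear crisp",
]

# binary=True records presence/absence rather than counts.
# min_df=2 keeps only terms found in at least 2 documents
# (the post uses a threshold of 50 on 2000 documents).
vectorizer = CountVectorizer(analyzer="word", binary=True, min_df=2)
dtm = vectorizer.fit_transform(docs)

# vocabulary_ maps each retained term to its column index.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(dtm.shape)
print(dtm.toarray())
```

Terms that appear in only one of the toy documents (“plum,” “appl,” “oak,” “pear”) are dropped, which is exactly how the minimum document frequency narrows the matrix.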
We can print the shape of the processed text vector with this call: print(pred_feat_unigrams.shape). In this case, with a minimum document frequency requirement of 50, we extract 307 words from these data.
Let’s see what the most common words are. We are extracting binary indicators here, so our code will return the number of documents (out of the 2000 available in the dataset) that contain the most-frequent words. We’ll first create a sorted dataframe with this information:
Now let’s make a plot of the top 15 words which occur most frequently in the 2000 Winemaker’s Notes. (I’ve tried to define a color-scheme that is somewhat wine-colored!)
I’m no œnologue, but these all make sense to me. Smell, taste and color all turn up in lots of the texts, and seem like important qualities to describe if you’re trying to create an appealing portrait of your wine!
The document-term matrix we extracted above will serve as our predictor matrix. We are now ready to split our data into training and test sets for modeling. In order to compare our results here (using text features) with the results described in the previous post (using non-text features), we will use the same seed (called random_state in sklearn) when splitting our data. This ensures that the training and test sets here contain the same observations as those used to create and evaluate the models in the previous post. The predictive performance of all models on the test sets is therefore directly comparable.
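The point about the seed can be checked with a short sketch. Assuming two feature sets that describe the same observations in the same row order, splitting each with the same `random_state` and `test_size` yields the same partition. The data, sizes and seed value below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_wines = 20
row_ids = np.arange(n_wines)

# Two hypothetical feature sets describing the same 20 wines, in the same row order.
text_features = np.random.default_rng(0).integers(0, 2, size=(n_wines, 5))
numeric_features = np.random.default_rng(1).normal(size=(n_wines, 3))

# Assumption: the test_size and random_state values are illustrative only.
_, _, ids_train_a, ids_test_a = train_test_split(
    text_features, row_ids, test_size=0.25, random_state=42
)
_, _, ids_train_b, ids_test_b = train_test_split(
    numeric_features, row_ids, test_size=0.25, random_state=42
)

# Same seed and same number of rows give the same partition of observations.
print(np.array_equal(ids_test_a, ids_test_b))
print(sorted(ids_test_a))
```

This only holds if the rows are in the same order in both feature sets and the split settings are otherwise identical.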
Predicting Wine Class
Let’s create a random forest model to predict wine type (red vs. white) using most of the default hyper-parameters in sklearn. The one exception, as before, is that we will increase the number of trees to 500. These settings are the same as described in the “default” model in the previous blog post. The code below creates the model object, runs the model, and computes the confusion matrix on the test data:
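For readers who want to see the shape of this step, here is a minimal sketch on synthetic data. The binary matrix, the label rule and the seeds are all made up, so the resulting confusion matrix says nothing about the wine data; only the 500-tree setting mirrors the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a binary document-term matrix: 200 "documents", 10 "words".
X = rng.integers(0, 2, size=(200, 10))
# Synthetic label driven by the first column, so there is signal to learn.
y = X[:, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default hyper-parameters except for the number of trees, set to 500.
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1])
print(cm)
```

The diagonal of the matrix holds the correctly classified test cases, and the off-diagonal cells hold the two kinds of error.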
The resulting confusion matrix:
|Predicted Class:||Red Wines||White Wines|
We can see that this appears to give a much better classification than the previous set of models, which used the non-text features (alcohol by volume, appellation region, year of production, etc.). The signal is much stronger in the Winemaker’s Notes, and so the text-based features are better able to distinguish between red and white wines.
Let’s extract the feature importances for our model and plot them for the 25 most-important predictor words:
These all appear logical. Even for non-wine-lovers, it’s clear that *cabernet* identifies red wines and *chardonnay* identifies white wines. I found it interesting that there are a number of features related to fruit in the Winemaker’s Notes (e.g. cherry, citrus, apple, berries, plum, etc.) that appear to distinguish red from white wines.
One of the top features of the model is “red,” which obviously is a terrific feature for predicting red vs. white wines. I wondered whether including this feature is “cheating” in some sense: of course such a word will be highly predictive, but how well does our model do without a feature that directly names the to-be-predicted outcome? [1]
In order to examine this question, I decided to remove “red” from the feature set and re-run the random forest model without it. To what extent would this harm predictive performance?
First, let’s redefine the predictor matrix without the word “red”:
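Conceptually, this amounts to dropping one column from the document-term matrix. A minimal sketch with plain Python lists is shown here; the toy vocabulary and matrix are hypothetical, and the actual analysis operates on the sklearn matrix rather than on nested lists.

```python
# Hypothetical vocabulary and binary document-term matrix (one row per document).
vocabulary = ["cherri", "citrus", "red", "tannin"]
dtm = [
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
]

# Locate the column for "red" and drop it from both the vocabulary and every row.
drop_index = vocabulary.index("red")
vocabulary_no_red = [t for i, t in enumerate(vocabulary) if i != drop_index]
dtm_no_red = [[v for i, v in enumerate(row) if i != drop_index] for row in dtm]

print(vocabulary_no_red)
print(dtm_no_red)
```

The rows stay in the same order, so the same training and test observations can still be selected afterwards.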
Now let’s create training and test sets (using the same observations in the test and training sets as above and as in the previous post), run the random forest model, and look at the confusion matrix.
The confusion matrix:
|Predicted Class:||Red Wines||White Wines|
Our predictions are still very accurate! Without the word “red,” what terms did the model choose as most important in distinguishing between wine types? Let’s extract the features and plot them:
Which yields the following plot:
The predictors displayed are basically the same as those in our original model!
AUC and ROC Curves
The confusion matrix above suggests that the model without the word “red” performs just about as well as the model with the word “red.” In order to make a more direct comparison, let’s use the same approach as in the previous post, and compare the AUC values and ROC curves of the two models. The code is essentially the same as we used in the previous post:
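As a reminder of what the AUC measures, the sketch below computes it directly as the share of (positive, negative) pairs in which the positive case receives the higher predicted probability, with ties counted as one half. The labels and scores are made up, and in practice a library routine such as sklearn’s `roc_auc_score` does this work.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test labels and predicted probabilities from two models.
y_true = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
scores_b = [0.95, 0.7, 0.45, 0.6, 0.3, 0.05]

auc_a = auc_from_scores(y_true, scores_a)
auc_b = auc_from_scores(y_true, scores_b)
print(auc_a, auc_b)
```

Because the AUC depends only on how cases are ranked, two models with different predicted probabilities can share the same AUC, as these two toy score vectors do.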
And produces the following ROC plot:
To three decimal places, the AUC values are identical. Removing the word “red” did not hurt the predictive power of the model!
In this post, we will examine one additional metric for evaluating the performance of classification models: the Brier score. The Brier score is simply the mean squared difference between A) the actual value of the outcome (coded 0/1) for each case in the test set and B) the predicted probability from the model for that case, making it essentially a mean squared error (MSE) for classification models. Smaller Brier scores indicate better model performance and larger scores indicate worse performance, which makes the metric useful for comparing modeling approaches on a given dataset.
We compute the Brier scores for our two models on the test dataset with the following code:
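As a minimal sketch of the definition (not necessarily the code used for the results reported here), the Brier score can be computed in a few lines of plain Python. The labels and probabilities below are made up; sklearn also provides this metric as `brier_score_loss`.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = red, 0 = white) and predicted probabilities of "red".
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.3]

score = brier_score(y_true, y_prob)
# An uninformative model that always predicts 0.5 serves as a reference point.
baseline = brier_score(y_true, [0.5] * len(y_true))
print(score, baseline)
```

A model that always predicts 0.5 scores 0.25 on any 0/1 outcome, which gives a sense of scale when reading small Brier scores.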
The Brier scores are also identical to three decimal places: both are 0.019, which is certainly low for a mean squared error!
The confusion matrices, AUC values/ROC curves and Brier scores all converge on the same conclusion: the models using features derived from text analysis are very accurate at classifying red vs. white wines. In particular, these models are much more accurate than the models described in the previous post which used the traditional types of features one might use for predictive modeling: abv, price, year of production, appellation region, etc.
Insight into a Top Predictor
As we have done in previous posts, let’s explore the meaning behind an important predictor. In both of the models described in this post, the top predictor was “tannin.” Let’s examine the percentage of texts with the word tannin, separately for red and white wines, and produce a plot showing this difference:
Which yields the following plot:
The term “tannin” is clearly much more frequent for red than for white wines. Indeed, 52.9% (529) of the 1000 red wine texts mention the word “tannin,” while only 0.5% (5) of the 1000 white wine texts do. As red wines have much higher levels of tannins than do white wines, it makes sense that this feature is an important predictor in classifying red vs. white wines.
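Percentages of this kind are simply per-class shares of documents containing the term. A minimal sketch with made-up texts and labels (not the wine data) looks like this:

```python
# Hypothetical processed texts and their wine types.
texts = [
    "cherri tannin oak",
    "plum tannin",
    "berri spice",
    "tannin dark fruit",
    "citrus crisp",
    "appl pear",
    "soft tannin finish",
    "melon citrus",
]
labels = ["red", "red", "red", "red", "white", "white", "white", "white"]


def share_containing(term, texts, labels, target):
    """Percentage of documents of class `target` that contain `term` as a word."""
    docs = [t.split() for t, lab in zip(texts, labels) if lab == target]
    hits = sum(1 for words in docs if term in words)
    return 100.0 * hits / len(docs)


red_share = share_containing("tannin", texts, labels, "red")
white_share = share_containing("tannin", texts, labels, "white")
print(red_share, white_share)
```

Matching on whole words after splitting avoids counting a term that merely appears inside a longer word.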
In this post, we returned to our classification problem of distinguishing red vs. white wines. We used text analysis, extracting unigrams (one-word features) which we fed as predictors to our random forest algorithm. We computed two different models: one including all words which were present in at least 50 documents, and one which included those same words with the exclusion of “red.” Both models were highly accurate, and gave identical predictive performance for the AUC and Brier score metrics.
There are three interesting points to be taken from what we’ve done here.
Signal strength across multiple features. Even though “red” was the second-most-important predictor in our first model, removing it in the second model did not harm classification accuracy. Essentially, because there was a strong enough signal in the other features, the random forest algorithm was able to find other combinations of predictors that resulted in an accurate classification of our outcome.
Hyper-parameter tuning vs. feature choice. In the previous post using numeric and categorical features to predict wine type, we went through an awful lot of trouble to search for optimal hyper-parameters, and found that this only increased model performance a slight amount. In this post, we used the same data but changed our feature set to the text of the Winemaker’s Notes. Using a basic set of model hyper-parameters, we were able to much more accurately predict wine type. This points to an insight that many others have made, and which I’ve seen put most succinctly in this Kaggle blog post about best practices in predictive modelling: “What gives you most ‘bang for the buck’ is rarely the statistical method you apply, but rather the features you apply it to.” By providing richer and more discriminating data to our model, a simpler analytical approach resulted in a better classification accuracy.
Text as a rich but unstructured data source. Finally, let’s note that the rich feature set we developed in this post was the fruit of text analytic methods. Using basic text cleaning techniques and a simple bag-of-words approach, we were able to derive predictors that very accurately distinguished red from white wines. This required more effort than using the more-easily-usable numeric and categorical variables in our dataset, but the results leave no doubt as to the potential power of text analysis in applied predictive problems.
Coming Up Next
In the next post, we’ll switch languages (back to R!), analytical methods (from predictive modelling to exploratory data analysis) and data (we’ll be looking at a rich dataset concerning consumer perceptions of beverages). We’ll first focus on an often-neglected side of data science practice: data munging. We’ll then use traditional statistical techniques to create a map of consumer perceptions of drinks within the beverage category.
[1] In an applied setting, I’d be happy to have such a feature and would probably leave it at that. If we were to “productionalize” a system where we needed to predict wine type given Winemaker’s Notes texts, any word in those texts would be available at scoring and therefore “fair game” for an applied predictive model.