Data Science and Wine!

Ever need a fun dataset to lose yourself in? Ever need a good dataset to test your skills in basic statistics or advanced machine learning? If you don’t already know them, these dataset repositories are worth a look:

  1. FiveThirtyEight (538)
  2. UCI Machine Learning Repository
  3. Kaggle

But what if you are interested in both data science and wine? What can you do?

It turns out that you are not alone and there are a couple of nice datasets available. In summary, those datasets are:

  1. Modeling wine preferences by data mining from physicochemical properties, (Cortez et al., Decision Support Systems, November 2009, Elsevier, 47(4):547-553. ISSN: 0167-9236). Here is a link to the publication. This is a dataset containing properties of Portuguese wine that was constructed with the explicit goal of testing machine learning algorithms (support vector machines and neural networks in this case) for predicting wine quality based on physicochemical measurements. Wines are from the vinho verde DOC, and both red and white wines are included; often you will encounter these two (red and white) datasets separately. I have mirrored the data on my GitHub page, and also provide a combined red+white dataset. Wine properties are: acidity (three types), residual sugar, chlorides, sulfur compounds (three types), density, pH, alcohol and quality. While this is a good dataset, it lacks many features you might desire: everything from malic acid to grape variety to price.
  2. PARVUS – An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Forina, M. et al., Via Brigata Salerno, 16147 Genoa, Italy. As you might have guessed, this dataset contains information on Italian wines. It includes 13 features: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines and proline. More about this dataset can be found at UCI.
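If you just want a quick look at the Italian data, scikit-learn ships a copy of the UCI wine dataset, so no download is needed. The snippet below is a minimal sketch, assuming scikit-learn is installed; it only loads the data and prints its shape and column names, and is not taken from my notebook.

```python
from sklearn.datasets import load_wine

# Load the copy of the UCI (Italian) wine dataset bundled with scikit-learn.
wine = load_wine()

X = wine.data      # NumPy array of measurements, one row per wine
y = wine.target    # integer class label for each wine

print(X.shape)             # (number of wines, number of features)
print(wine.feature_names)  # names of the 13 features listed above
print(wine.target_names)   # labels of the classes
```

From here, `X` and `y` can be passed straight to any scikit-learn classifier.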

I have organized the wine data here.
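Since the red and white Portuguese wines usually come as two separate files, here is a rough sketch of how one might stack them into a single table. It assumes the semicolon-separated layout with quoted headers used by the UCI copies. The two tiny inline samples, their values, the reduced set of columns, the helper `read_wines` and the added `color` column are all placeholders for illustration; the combined file I provide may be laid out differently.

```python
import csv
import io

# Placeholder samples standing in for the real red and white files.
# Values and the reduced column set are illustrative only.
RED_SAMPLE = '''"fixed acidity";"pH";"alcohol";"quality"
7.4;3.51;9.4;5
7.8;3.20;9.8;5
'''
WHITE_SAMPLE = '''"fixed acidity";"pH";"alcohol";"quality"
7.0;3.00;8.8;6
6.3;3.30;9.5;6
'''


def read_wines(handle, color):
    """Read semicolon-separated wine rows and tag each with its color."""
    reader = csv.DictReader(handle, delimiter=";")
    rows = []
    for row in reader:
        # Every measurement in these files is numeric.
        parsed = {name: float(value) for name, value in row.items()}
        parsed["color"] = color  # hypothetical column name for the wine type
        rows.append(parsed)
    return rows


# With real data, pass open("<path to file>", newline="") instead of StringIO.
combined = (
    read_wines(io.StringIO(RED_SAMPLE), "red")
    + read_wines(io.StringIO(WHITE_SAMPLE), "white")
)

for wine in combined:
    print(wine)
```

Keeping the color as an explicit column means you can still split the data back into red and white later, or use it as a feature.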

Here is a Jupyter notebook I built around the Portuguese wine dataset. If you want the notebook itself, you can get it from here (it is also a bit easier to read there).

While there is a lot one can do with these wine datasets, they are limited. The two datasets have quite different features, and each covers wines from a narrow geographical range, which makes it difficult to draw any broad conclusions. This situation is common with such datasets: they are expensive and time-consuming to create, and sometimes there are legal issues. Let’s hope that more people will be motivated to create datasets like these. Maybe you?