This is the notebook accompanying the slidecast for the Data Science and IPython session of the Tools of the Trade meetup.
To start, download the red and white wine datasets from the UCI Machine Learning Repository. This data is originally from:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547-553, 2009.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
--2014-02-12 01:40:58--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.1.87
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.1.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84199 (82K) [text/csv]
Saving to: `winequality-red.csv'

100%[======================================>] 84,199      133K/s   in 0.6s

2014-02-12 01:40:59 (133 KB/s) - `winequality-red.csv' saved [84199/84199]
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
--2014-02-12 01:41:12--  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.1.87
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.1.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 264426 (258K) [text/csv]
Saving to: `winequality-white.csv'

100%[======================================>] 264,426     98.7K/s   in 2.6s

2014-02-12 01:41:15 (98.7 KB/s) - `winequality-white.csv' saved [264426/264426]
!ls
simple demo.ipynb    winequality-red.csv
wine analysis.ipynb  winequality-white.csv
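As a quick sanity check, peek at the first couple of lines of the raw file. The UCI wine files are semicolon-delimited rather than comma-delimited, which is why we pass sep=';' to read_csv below.

!head -2 winequality-red.csv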
Load the CSV files using Pandas.
import pandas
reds = pandas.read_csv('winequality-red.csv', sep=';')
reds.head(5)
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
5 rows × 12 columns
whites = pandas.read_csv('winequality-white.csv', sep=';')
whites.head(5)
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45 | 170 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14 | 132 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30 | 97 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47 | 186 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
5 rows × 12 columns
What are the basic descriptive stats for all of the wine properties?
wines = whites.append(reds)
len(wines)
6497
len(reds) + len(whites)
6497
wines.describe()
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 | 6497.000000 |
| mean | 7.215307 | 0.339666 | 0.318633 | 5.443235 | 0.056034 | 30.525319 | 115.744574 | 0.994697 | 3.218501 | 0.531268 | 10.491801 | 5.818378 |
| std | 1.296434 | 0.164636 | 0.145318 | 4.757804 | 0.035034 | 17.749400 | 56.521855 | 0.002999 | 0.160787 | 0.148806 | 1.192712 | 0.873255 |
| min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 1.000000 | 6.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
| 25% | 6.400000 | 0.230000 | 0.250000 | 1.800000 | 0.038000 | 17.000000 | 77.000000 | 0.992340 | 3.110000 | 0.430000 | 9.500000 | 5.000000 |
| 50% | 7.000000 | 0.290000 | 0.310000 | 3.000000 | 0.047000 | 29.000000 | 118.000000 | 0.994890 | 3.210000 | 0.510000 | 10.300000 | 6.000000 |
| 75% | 7.700000 | 0.400000 | 0.390000 | 8.100000 | 0.065000 | 41.000000 | 156.000000 | 0.996990 | 3.320000 | 0.600000 | 11.300000 | 6.000000 |
| max | 15.900000 | 1.580000 | 1.660000 | 65.800000 | 0.611000 | 289.000000 | 440.000000 | 1.038980 | 4.010000 | 2.000000 | 14.900000 | 9.000000 |
8 rows × 12 columns
Across all wines, what is the distribution of the quality ratings?
wines.quality.value_counts()
6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
dtype: int64
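To see the same distribution as proportions rather than raw counts, divide by the total (dividing by a float keeps the result fractional under Python 2):

wines.quality.value_counts() / float(len(wines))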
What are the properties of the wines that got a rating of 9?
wines[wines.quality == 9]
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 774 | 9.1 | 0.27 | 0.45 | 10.6 | 0.035 | 28 | 124 | 0.99700 | 3.20 | 0.46 | 10.4 | 9 |
| 820 | 6.6 | 0.36 | 0.29 | 1.6 | 0.021 | 24 | 85 | 0.98965 | 3.41 | 0.61 | 12.4 | 9 |
| 827 | 7.4 | 0.24 | 0.36 | 2.0 | 0.031 | 27 | 139 | 0.99055 | 3.28 | 0.48 | 12.5 | 9 |
| 876 | 6.9 | 0.36 | 0.34 | 4.2 | 0.018 | 57 | 119 | 0.98980 | 3.28 | 0.36 | 12.7 | 9 |
| 1605 | 7.1 | 0.26 | 0.49 | 2.2 | 0.032 | 31 | 113 | 0.99030 | 3.37 | 0.42 | 12.9 | 9 |
5 rows × 12 columns
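A quick way to see what sets these wines apart is to subtract the overall column averages from theirs; the 9-rated wines skew noticeably higher in alcohol, for example:

wines[wines.quality == 9].mean() - wines.mean()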
%matplotlib inline
import matplotlib.pyplot as plt
What does the pH look like across the red wines?
fig, ax = plt.subplots(figsize=(10, 5))
plt.plot(reds.index, reds.pH, 'ro')
ax.set_title('Wines vs pH')
ax.set_xlabel('Red wine index')
ax.set_ylabel('pH')
<matplotlib.text.Text at 0x34ffe90>
Show only the first ten reds.
reds[:10].pH.plot(kind='bar', title='Wine vs pH')
<matplotlib.axes.AxesSubplot at 0x3634810>
Sort the reds by pH and show the top 25.
reds.sort('pH', ascending=False).pH[:25].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x42ce850>
Show scatter plots for all pairs of red wine characteristics.
from pandas.tools.plotting import scatter_matrix
tmp = scatter_matrix(reds, alpha=0.2, figsize=(20,20))
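As a numeric companion to the scatter matrix, the pairwise correlation matrix summarizes the same relationships in a single table:

reds.corr()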
Are the reds and whites distinguishable given some of their properties?
from pandas.tools.plotting import parallel_coordinates
Label the reds and whites in a new column called `kind`.
reds['kind'] = 'red'
whites['kind'] = 'white'
Combine them into one big DataFrame.
wines = reds.append(whites)
Select a few columns to visualize.
sub_wines = wines[['alcohol', 'pH', 'density', 'chlorides', 'kind']]
Render every red and white wine as a line traversing all of its column values on separate vertical axes. Show the reds and whites as different colored lines.
parallel_coordinates(sub_wines, 'kind', alpha=0.2)
<matplotlib.axes.AxesSubplot at 0x1aebe090>
It's hard to see a pattern in the above because the scales of the axes are so different. Let's apply a normalization.
sub_wines = wines[['alcohol', 'pH', 'density', 'chlorides']]
sub_wines = (sub_wines - sub_wines.mean()) / (sub_wines.max() - sub_wines.min())
sub_wines['kind'] = wines['kind']
The two types of wines seem to differ most in chlorides (for the columns we picked at least).
parallel_coordinates(sub_wines, 'kind', alpha=0.2)
<matplotlib.axes.AxesSubplot at 0x345728d0>
Let's train a model to predict if a wine is red or white given its characteristics.
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score
Define the wine feature vectors as all columns except for the `kind` column. Slice those out into `X`.
X = wines.ix[:, 0:-1]
X.head()
|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
5 rows × 12 columns
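An equivalent, name-based way to slice out the features, one that doesn't depend on `kind` being the last column, is to drop it explicitly:

X = wines.drop('kind', axis=1)  # same columns as the .ix slice above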
Define the target as the wine color, the `kind` column.
y = wines.kind
Convert the color name into an integer, 0 for white and 1 for red.
y = y.apply(lambda val: 0 if val == 'white' else 1)
y.head()
0    1
1    1
2    1
3    1
4    1
Name: kind, dtype: int64
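An equivalent one-liner: comparing the column against 'red' yields booleans, which cast directly to 0/1.

y = (wines.kind == 'red').astype(int)  # True/False -> 1/0, same labels as the apply() above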
Create an instance of the LogisticRegression classifier. See http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression.
clf = LogisticRegression()
Train and evaluate 5 model instances, using different subsets of the data for training and testing each time. See http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
scores = cross_val_score(clf, X, y, cv=5)
print scores.mean(), scores.std()
0.981837626577 0.00428937901531
We didn't standardize the feature vectors. We should: many learning algorithms assume standardized inputs. See http://scikit-learn.org/stable/modules/preprocessing.html.
from sklearn.preprocessing import scale
Standardize the features to zero-mean and unit variance.
X_std = scale(X)
We no longer have a DataFrame here. `scale` has given us its underlying NumPy representation. That's fine.
X_std
array([[ 0.14247327,  2.18883292, -2.19283252, ...,  0.19309677, -0.91546416, -0.93722961],
       [ 0.45103572,  3.28223494, -2.19283252, ...,  0.99957862, -0.58006813, -0.93722961],
       [ 0.45103572,  2.55330026, -1.91755268, ...,  0.79795816, -0.58006813, -0.93722961],
       ...,
       [-0.55179227, -0.6054167 , -0.88525328, ..., -0.47897144, -0.91546416,  0.20799905],
       [-1.32319841, -0.30169391, -0.12823371, ..., -1.016626  ,  1.9354021 ,  1.35322771],
       [-0.93749534, -0.78765037,  0.42232597, ..., -1.41986693,  1.09691202,  0.20799905]])
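As a sanity check, the standardized columns should now have (approximately) zero mean and unit variance:

print X_std.mean(axis=0).round(6)  # all ~0
print X_std.std(axis=0).round(6)   # all ~1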
Train and test again.
scores_std = cross_val_score(clf, X_std, y, cv=5)
About 1% more accurate given the standardized input features.
print scores_std.mean(), scores_std.std()
0.993997157577 0.00259479882976
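One caveat: we scaled the entire dataset before cross-validating, so each training fold saw summary statistics computed from its test fold. A minimal sketch of the leak-free version (assuming scikit-learn's Pipeline and StandardScaler, both present in releases of this era) re-fits the scaler inside each fold instead:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit on each training fold only, never on its held-out fold
pipe = Pipeline([('scale', StandardScaler()),
                 ('logreg', LogisticRegression())])
scores_pipe = cross_val_score(pipe, X, y, cv=5)
print scores_pipe.mean(), scores_pipe.std()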