We'll now do a basic round of supervised classification using scikit-learn. We start by loading the data. We actually have the final classifications in this dataset, so that we can figure out what our accuracy rate was, but we'll ignore it initially and pretend we're starting from scratch.
import pandas as pd
df = pd.read_csv('singapore-roadnames-final-classified.csv')
df
| | Unnamed: 0 | road_name | has_malay_road_tag | classification | comment |
---|---|---|---|---|---|
0 | 0 | Abingdon | 0 | British | NaN |
1 | 1 | Abu Talib | 1 | Malay | NaN |
2 | 2 | Adam | 0 | British | NaN |
3 | 3 | Adat | 1 | Malay | NaN |
4 | 4 | Adis | 0 | Other | Indian Jewish |
5 | 5 | Admiralty | 0 | British | NaN |
6 | 6 | Ah Hood | 0 | Chinese | NaN |
7 | 7 | Ah Soo | 1 | Chinese | NaN |
8 | 8 | Ahmad Ibrahim | 1 | Malay | NaN |
9 | 9 | Aida | 0 | Other | NaN |
10 | 10 | Airport | 0 | Generic | NaN |
11 | 11 | Alexandra | 0 | British | NaN |
12 | 12 | Aliwal | 0 | Indian | Battle of Aliwal in the Indo-Sikh war |
13 | 13 | Aljunied | 0 | Other | Arab |
14 | 14 | Allanbrooke | 0 | British | NaN |
15 | 15 | Allenby | 0 | British | NaN |
16 | 16 | Almond | 0 | Generic | NaN |
17 | 17 | Alnwick | 0 | British | NaN |
18 | 18 | Alps | 0 | Other | NaN |
19 | 19 | Ama Keng | 0 | Chinese | NaN |
20 | 20 | Amber | 0 | Other | after the Amber Trust fund established for poo... |
21 | 21 | Amoy | 0 | Chinese | NaN |
22 | 22 | Ampang | 1 | Malay | NaN |
23 | 23 | Ampas | 1 | Malay | NaN |
24 | 24 | Ampat | 1 | Malay | NaN |
25 | 25 | Anak Bukit | 1 | Malay | NaN |
26 | 26 | Anak Patong | 1 | Malay | NaN |
27 | 27 | Anamalai | 0 | Indian | NaN |
28 | 28 | Anchorvale | 0 | Generic | marine theme |
29 | 29 | Anderson | 0 | British | NaN |
... | ... | ... | ... | ... | ... |
1721 | 1721 | Woodgrove | 0 | Generic | NaN |
1722 | 1722 | Woodland | 0 | Generic | NaN |
1723 | 1723 | Woodlands | 0 | Generic | NaN |
1724 | 1724 | Woodleigh | 0 | British | NaN |
1725 | 1725 | Woodsville | 0 | Generic | NaN |
1726 | 1726 | Woollerton | 0 | British | NaN |
1727 | 1727 | Worthing | 0 | British | NaN |
1728 | 1728 | Xilin | 0 | Chinese | NaN |
1729 | 1729 | Yan Kit | 0 | Chinese | NaN |
1730 | 1730 | Yarrow | 0 | British | NaN |
1731 | 1731 | Yarwood | 0 | British | NaN |
1732 | 1732 | Yasin | 1 | Malay | NaN |
1733 | 1733 | Yio Chu Kang | 0 | Chinese | NaN |
1734 | 1734 | Yishun | 0 | Chinese | NaN |
1735 | 1735 | Yong Siak | 0 | Chinese | NaN |
1736 | 1736 | York | 0 | British | NaN |
1737 | 1737 | Youngberg | 0 | British | NaN |
1738 | 1738 | Yuan Ching | 0 | Chinese | NaN |
1739 | 1739 | Yuk Tong | 0 | Chinese | NaN |
1740 | 1740 | Yung An | 0 | Chinese | NaN |
1741 | 1741 | Yung Ho | 0 | Chinese | NaN |
1742 | 1742 | Yung Kuang | 0 | Chinese | NaN |
1743 | 1743 | Yung Sheng | 0 | Chinese | NaN |
1744 | 1744 | Yunnan | 0 | Chinese | NaN |
1745 | 1745 | Zamrud | 1 | Malay | NaN |
1746 | 1746 | Zehnder | 0 | Other | Eurasian |
1747 | 1747 | Zion | 0 | Other | NaN |
1748 | 1748 | Zubir Said | 0 | Malay | NaN |
1749 | 1749 | kukoh | 1 | Malay | NaN |
1750 | 1750 | one-north Gateway | 0 | Generic | NaN |
1751 rows × 5 columns
In this step, we'll use about 10% of the data to mimic the process I actually used.
# let's pick a random 10% to train with
import random
random.seed(1965)
train_test_set = df.loc[random.sample(list(df.index), len(df) // 10)]
X = train_test_set['road_name']
y = train_test_set['classification']
list(zip(X, y))[::10]
[('Opal', 'Generic'), ('Club', 'Generic'), ('Minto', 'Other'), ('Woodlands', 'Generic'), ('Hai Sing', 'Chinese'), ('Batalong', 'Malay'), ('Hikayat', 'Malay'), ('Bassein', 'Other'), ('Mount Echo', 'Generic'), ('Kallang Pudding', 'Malay'), ('Republic', 'Generic'), ('Wan Tho', 'Chinese'), ('Rengkam', 'Malay'), ('Keong Saik', 'Chinese'), ('Sedap', 'Malay'), ('Stratton', 'British'), ('Seagull', 'Generic'), ('Manila', 'Other')]
You never actually train and test on the same data. So we'll split this dataset even further. scikit-learn provides a convenient function for this.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_true = train_test_split(X, y)
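By default, `train_test_split` holds out a quarter of the rows for testing. A minimal sketch with toy data (the names here are made up purely for illustration):

```python
from sklearn.model_selection import train_test_split

# Toy data just to show the default split proportions (test_size=0.25).
data = list(range(100))
labels = ['x'] * 100
d_train, d_test, l_train, l_test = train_test_split(data, labels, random_state=0)
print(len(d_train), len(d_test))  # 75 25
```

That 3:1 default split is why, out of our ~175 sampled rows, roughly three quarters end up in the training set.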
Deciding on the labels was actually one of the trickiest parts of the process. Something to bear in mind is that some of the streets could be classified in multiple ways. For example, is Queen Street "British" or "Generic"? In this case I chose "British" because it was specifically named after Queen Victoria. I tried to be consistent in my criteria, but up to ~5% of the roads might be arguable. Also, for some of the roads there was insufficient information, so I went with my gut feel about the orthotactics of the word (the letter patterns). These are the labels I finally settled on:
df.classification.value_counts()
Malay      614
British    518
Generic    255
Chinese    217
Other      119
Indian      28
dtype: int64
What we're doing is basically language classification. Often, people use n-grams as features for this. scikit-learn conveniently provides a function that counts n-grams for us.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,4), analyzer='char')
# fit_transform for the training data
X_train_feats = vect.fit_transform(X_train)
# transform for the test data
# because we need to match the ngrams that were found in the training set
X_test_feats = vect.transform(X_test)
print(type(X_train_feats))
print(X_train_feats.shape)
print(X_test_feats.shape)
<class 'scipy.sparse.csr.csr_matrix'>
(131, 1410)
(44, 1410)
According to the scikit-learn algorithm cheat sheet, we should be starting out with Linear SVC.
from sklearn.svm import LinearSVC
clf = LinearSVC()
Use the classifier to fit a model based on the feature matrix `X_train_feats` and the label vector `y_train`.
model = clf.fit(X_train_feats, y_train)
Now that we have our model, we can use it to predict labels on a fresh test set.
y_predicted = model.predict(X_test_feats)
y_predicted
array(['Malay', 'Malay', 'British', 'Malay', 'British', 'British', 'British', 'British', 'British', 'British', 'Malay', 'Chinese', 'British', 'Chinese', 'British', 'Other', 'Generic', 'Malay', 'Malay', 'Chinese', 'British', 'British', 'Malay', 'British', 'British', 'Generic', 'Other', 'British', 'British', 'British', 'British', 'British', 'Malay', 'Generic', 'Malay', 'Generic', 'Malay', 'British', 'Malay', 'British', 'British', 'Malay', 'Malay', 'Generic'], dtype=object)
scikit-learn comes with a bunch of evaluation metrics. Which one should be chosen depends on what we're trying to minimise/maximise. In this case, we want to make as few errors as possible, so it makes sense to use accuracy as our metric.
$$ accuracy = \frac{\# correct}{\# classified} $$

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_predicted)
0.59090909090909094
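Under the hood, `accuracy_score` is just the mean of elementwise matches between the true and predicted labels. A tiny sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (invented for illustration): 3 of 4 predictions are correct.
toy_true = np.array(['Malay', 'British', 'Malay', 'Chinese'])
toy_pred = np.array(['Malay', 'Malay', 'Malay', 'Chinese'])
print(accuracy_score(toy_true, toy_pred))  # 0.75
print(np.mean(toy_true == toy_pred))       # 0.75
```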
So we got about 59% accuracy on this particular split. Let's try a few more train/test splits to see whether this is typical.
def classify(X, y):
    # do the train-test split
    X_train, X_test, y_train, y_true = train_test_split(X, y)
    # get our features
    X_train_feats = vect.fit_transform(X_train)
    X_test_feats = vect.transform(X_test)
    # train our model
    model = clf.fit(X_train_feats, y_train)
    # predict labels on the test set
    y_predicted = model.predict(X_test_feats)
    # return the accuracy score obtained
    return accuracy_score(y_true, y_predicted)
scores = list()
num_expts = 100
for i in range(num_expts):
    score = classify(X, y)
    scores.append(score)
print(sum(scores) / num_expts)
0.551818181818
The accuracy we obtain with this set of features and this classifier is about 55%. This isn't completely terrible. With 6 categories, a completely random classifier should expect to get only 16.6% of them right. But 55% accuracy also means that I'd have to go through and correct every other label. How can we improve this?
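As a sanity check (not part of the original workflow), we can sketch two baselines directly from the `value_counts()` figures above: the uniform random guesser, and the stronger "always guess the most common class" strategy.

```python
# Baselines computed from the value_counts() figures above (1751 rows).
counts = {'Malay': 614, 'British': 518, 'Generic': 255,
          'Chinese': 217, 'Other': 119, 'Indian': 28}
total = sum(counts.values())                    # 1751
print(round(max(counts.values()) / total, 3))   # 0.351: always guess "Malay"
print(round(1 / len(counts), 3))                # 0.167: uniform random guess
```

So always guessing "Malay" would already score about 35%, which makes our 55% a more modest win than the comparison with random guessing suggests.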
There are a few ways that spring to mind: