We'll now do a basic round of supervised classification using scikit-learn. We start by loading the data. We actually have the final classifications in this dataset, so that we can figure out what our accuracy rate was, but we'll ignore it initially and pretend we're starting from scratch.
import pandas as pd
df = pd.read_csv('singapore-roadnames-final-classified.csv')
df
| | Unnamed: 0 | road_name | has_malay_road_tag | classification | comment |
---|---|---|---|---|---|
0 | 0 | Abingdon | 0 | British | NaN |
1 | 1 | Abu Talib | 1 | Malay | NaN |
2 | 2 | Adam | 0 | British | NaN |
3 | 3 | Adat | 1 | Malay | NaN |
4 | 4 | Adis | 0 | Other | Indian Jewish |
5 | 5 | Admiralty | 0 | British | NaN |
6 | 6 | Ah Hood | 0 | Chinese | NaN |
7 | 7 | Ah Soo | 1 | Chinese | NaN |
8 | 8 | Ahmad Ibrahim | 1 | Malay | NaN |
9 | 9 | Aida | 0 | Other | NaN |
10 | 10 | Airport | 0 | Generic | NaN |
11 | 11 | Alexandra | 0 | British | NaN |
12 | 12 | Aliwal | 0 | Indian | Battle of Aliwal in the Indo-Sikh war |
13 | 13 | Aljunied | 0 | Other | Arab |
14 | 14 | Allanbrooke | 0 | British | NaN |
15 | 15 | Allenby | 0 | British | NaN |
16 | 16 | Almond | 0 | Generic | NaN |
17 | 17 | Alnwick | 0 | British | NaN |
18 | 18 | Alps | 0 | Other | NaN |
19 | 19 | Ama Keng | 0 | Chinese | NaN |
20 | 20 | Amber | 0 | Other | after the Amber Trust fund established for poo... |
21 | 21 | Amoy | 0 | Chinese | NaN |
22 | 22 | Ampang | 1 | Malay | NaN |
23 | 23 | Ampas | 1 | Malay | NaN |
24 | 24 | Ampat | 1 | Malay | NaN |
25 | 25 | Anak Bukit | 1 | Malay | NaN |
26 | 26 | Anak Patong | 1 | Malay | NaN |
27 | 27 | Anamalai | 0 | Indian | NaN |
28 | 28 | Anchorvale | 0 | Generic | marine theme |
29 | 29 | Anderson | 0 | British | NaN |
... | ... | ... | ... | ... | ... |
1721 | 1721 | Woodgrove | 0 | Generic | NaN |
1722 | 1722 | Woodland | 0 | Generic | NaN |
1723 | 1723 | Woodlands | 0 | Generic | NaN |
1724 | 1724 | Woodleigh | 0 | British | NaN |
1725 | 1725 | Woodsville | 0 | Generic | NaN |
1726 | 1726 | Woollerton | 0 | British | NaN |
1727 | 1727 | Worthing | 0 | British | NaN |
1728 | 1728 | Xilin | 0 | Chinese | NaN |
1729 | 1729 | Yan Kit | 0 | Chinese | NaN |
1730 | 1730 | Yarrow | 0 | British | NaN |
1731 | 1731 | Yarwood | 0 | British | NaN |
1732 | 1732 | Yasin | 1 | Malay | NaN |
1733 | 1733 | Yio Chu Kang | 0 | Chinese | NaN |
1734 | 1734 | Yishun | 0 | Chinese | NaN |
1735 | 1735 | Yong Siak | 0 | Chinese | NaN |
1736 | 1736 | York | 0 | British | NaN |
1737 | 1737 | Youngberg | 0 | British | NaN |
1738 | 1738 | Yuan Ching | 0 | Chinese | NaN |
1739 | 1739 | Yuk Tong | 0 | Chinese | NaN |
1740 | 1740 | Yung An | 0 | Chinese | NaN |
1741 | 1741 | Yung Ho | 0 | Chinese | NaN |
1742 | 1742 | Yung Kuang | 0 | Chinese | NaN |
1743 | 1743 | Yung Sheng | 0 | Chinese | NaN |
1744 | 1744 | Yunnan | 0 | Chinese | NaN |
1745 | 1745 | Zamrud | 1 | Malay | NaN |
1746 | 1746 | Zehnder | 0 | Other | Eurasian |
1747 | 1747 | Zion | 0 | Other | NaN |
1748 | 1748 | Zubir Said | 0 | Malay | NaN |
1749 | 1749 | kukoh | 1 | Malay | NaN |
1750 | 1750 | one-north Gateway | 0 | Generic | NaN |
1751 rows × 5 columns
In this step, we'll use about 10% of the data to mimic the process I actually used.
# let's pick a random 10% to train with
import random
random.seed(1965)
train_test_set = df.loc[random.sample(list(df.index), len(df) // 10)]
X = train_test_set['road_name']
y = train_test_set['classification']
list(zip(X, y))[::10]
[('Opal', 'Generic'), ('Club', 'Generic'), ('Minto', 'Other'), ('Woodlands', 'Generic'), ('Hai Sing', 'Chinese'), ('Batalong', 'Malay'), ('Hikayat', 'Malay'), ('Bassein', 'Other'), ('Mount Echo', 'Generic'), ('Kallang Pudding', 'Malay'), ('Republic', 'Generic'), ('Wan Tho', 'Chinese'), ('Rengkam', 'Malay'), ('Keong Saik', 'Chinese'), ('Sedap', 'Malay'), ('Stratton', 'British'), ('Seagull', 'Generic'), ('Manila', 'Other')]
You never actually train and test on the same data. So we'll split this dataset even further. scikit-learn provides a convenient function for this.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_true = train_test_split(X, y)
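By default, `train_test_split` holds out a quarter of the rows for testing. A minimal sketch with toy data (the names here are made up purely for illustration):

```python
from sklearn.model_selection import train_test_split

# Toy data just to show the default split proportions (test_size=0.25).
data = list(range(100))
labels = ['x'] * 100
d_train, d_test, l_train, l_test = train_test_split(data, labels, random_state=0)
print(len(d_train), len(d_test))  # 75 25
```

That 3:1 default split is why, out of our ~175 sampled rows, roughly three quarters end up in the training set.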
Deciding on the labels was actually one of the trickiest parts of the process. Something to bear in mind is that some of the streets could be classified in multiple ways. For example, is Queen Street "British" or "Generic"? In this case I chose "British" because it was specifically named after Queen Victoria. I tried to be consistent in my criteria, but up to ~5% of the roads might be arguable. Also, for some of the roads there was insufficient information, so I went with my gut feel about the orthotactics of the word (the letter patterns). These are the labels I finally settled on:
df.classification.value_counts()
Malay      614
British    518
Generic    255
Chinese    217
Other      119
Indian      28
dtype: int64
What we're doing is basically language classification. Often, people use n-grams as features for this. scikit-learn conveniently provides a function that counts n-grams for us.
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,4), analyzer='char')
# fit_transform for the training data
X_train_feats = vect.fit_transform(X_train)
# transform for the test data
# because we need to match the ngrams that were found in the training set
X_test_feats = vect.transform(X_test)
print(type(X_train_feats))
print(X_train_feats.shape)
print(X_test_feats.shape)
<class 'scipy.sparse.csr.csr_matrix'>
(131, 1410)
(44, 1410)
According to the scikit-learn algorithm cheat sheet, we should be starting out with Linear SVC.
from sklearn.svm import LinearSVC
clf = LinearSVC()
Use the classifier to fit a model based on the feature matrix `X_train_feats` and the label vector `y_train`.
model = clf.fit(X_train_feats, y_train)
Now that we have our model, we can use it to predict labels on a fresh test set.
y_predicted = model.predict(X_test_feats)
y_predicted
array(['Malay', 'Malay', 'British', 'Malay', 'British', 'British', 'British', 'British', 'British', 'British', 'Malay', 'Chinese', 'British', 'Chinese', 'British', 'Other', 'Generic', 'Malay', 'Malay', 'Chinese', 'British', 'British', 'Malay', 'British', 'British', 'Generic', 'Other', 'British', 'British', 'British', 'British', 'British', 'Malay', 'Generic', 'Malay', 'Generic', 'Malay', 'British', 'Malay', 'British', 'British', 'Malay', 'Malay', 'Generic'], dtype=object)
scikit-learn comes with a bunch of evaluation metrics. Which one should be chosen depends on what we're trying to minimise/maximise. In this case, we want to make as few errors as possible, so it makes sense to use accuracy as our metric.
$$ accuracy = \frac{\# correct}{\# classified} $$

from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_predicted)
0.59090909090909094
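Under the hood, `accuracy_score` is just the mean of elementwise matches between the true and predicted labels. A tiny sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (invented for illustration): 3 of 4 predictions are correct.
toy_true = np.array(['Malay', 'British', 'Malay', 'Chinese'])
toy_pred = np.array(['Malay', 'Malay', 'Malay', 'Chinese'])
print(accuracy_score(toy_true, toy_pred))  # 0.75
print(np.mean(toy_true == toy_pred))       # 0.75
```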
So we got about 59% accuracy on this particular split. Let's try a few more train/test splits to see whether this is typical.
def classify(X, y):
    # do the train-test split
    X_train, X_test, y_train, y_true = train_test_split(X, y)
    # get our features
    X_train_feats = vect.fit_transform(X_train)
    X_test_feats = vect.transform(X_test)
    # train our model
    model = clf.fit(X_train_feats, y_train)
    # predict labels on the test set
    y_predicted = model.predict(X_test_feats)
    # return the accuracy score obtained
    return accuracy_score(y_true, y_predicted)
scores = list()
num_expts = 100
for i in range(num_expts):
    score = classify(X, y)
    scores.append(score)
print(sum(scores) / num_expts)
0.551818181818
The accuracy we obtain with this set of features and this classifier is about 55%. This isn't completely terrible. With 6 categories, a completely random classifier should expect to get only 16.6% of them right. But 55% accuracy also means that I'd have to go through and correct every other label. How can we improve this?
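As a sanity check (not part of the original workflow), we can sketch two baselines directly from the `value_counts()` figures above: the uniform random guesser, and the stronger "always guess the most common class" strategy.

```python
# Baselines computed from the value_counts() figures above (1751 rows).
counts = {'Malay': 614, 'British': 518, 'Generic': 255,
          'Chinese': 217, 'Other': 119, 'Indian': 28}
total = sum(counts.values())                    # 1751
print(round(max(counts.values()) / total, 3))   # 0.351: always guess "Malay"
print(round(1 / len(counts), 3))                # 0.167: uniform random guess
```

So always guessing "Malay" would already score about 35%, which makes our 55% a more modest win than the comparison with random guessing suggests.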
There are a few ways that spring to mind: