I still remember the first time I saw IPython Notebook: it was at Taipei.py. At the time, I wasn't sure why it would be a good idea to write Python programs in a restricted environment inside a browser. I mean, I couldn't even use my favorite Vim commands! I tried installing it and wrote a few short programs, but ended up losing interest in the Notebook. On the other hand, I found the IPython interactive shell more convenient than the original Python shell, so I started using it more and more often whenever I needed to issue short Python commands.
My second encounter with IPython Notebook came through the course materials of CS231n. In that course, each assignment was an IPython Notebook. You could complete the code either in the Notebook itself or in standalone Python scripts, and the results could be evaluated immediately in the Notebook. I realized this was a fantastic way to share and communicate! There was also nbviewer, which made it easy to view all those Notebooks without having to set up a Python environment.
As I gained more experience with machine learning tasks in Python, I started to understand why IPython Notebook is so popular in the scientific computing community. I believe one of the reasons is that it provides a very simple way to record everything you do.
What has struck me most while recently studying data science is not the technology but the feeling that science != engineering; it is a completely different mindset and way of working from writing code to build a product. Opening ipython or rstudio feels like opening a lab notebook: you keep hypothesizing, verifying, and predicting. It really is quite different from software engineering.
— ihower (@ihower) August 31, 2015
For example, when doing data science, you need to manage not just the source code but also the data. Tasks such as data preprocessing, data cleaning, and feature extraction all require transformations of the data. Oftentimes, some of these transformations look like they will only ever be done once, and it is very tempting to just issue the command without recording what was done. This becomes a disaster when you later want to rerun the experiments with different settings for the early stages of the pipeline. Even if you do put the commands into Python scripts, it is still difficult to figure out later in which order and with which parameters those scripts should be run. On the other hand, if you write a single script that performs all the data transformations and analysis from the original data on every run, the running time may become unacceptable when dealing with big data. That's where IPython Notebook shines: it is a perfect notebook for recording everything.
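As an aside, one lightweight pattern for this (my own sketch, not part of the original workflow) is to cache expensive intermediate results to disk from a notebook cell, so that later cells can be rerun without repeating the early stages; the cache file name and the expensive_preprocessing function below are hypothetical placeholders.

import os
import pickle

CACHE_PATH = 'preprocessed.pkl'  # hypothetical cache file

def load_or_compute(path, compute_fn):
    # reuse the cached result when it exists; otherwise compute it once and save it
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

# e.g. data = load_or_compute(CACHE_PATH, lambda: expensive_preprocessing(raw_data))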
So let's get started with our tour of IPython Notebook. Some simple tasks will be demonstrated using several libraries, including numpy, scikit-learn, matplotlib, mpld3, wordcloud, and pandas.
In particular, matplotlib is a powerful plotting package for data visualization, and its close integration with IPython Notebook makes it even more useful.
First, we use the %matplotlib inline magic command to make matplotlib display plots directly inside the Notebook.
%matplotlib inline
To gain intuition about a model, it is often helpful to find the features with the largest weights. We will use the polarity dataset for the demonstration:
! wget http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
! tar xzf review_polarity.tar.gz
--2015-12-26 16:14:15--  http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 128.84.154.137
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|128.84.154.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘review_polarity.tar.gz’

100%[======================================>] 3,127,238    655KB/s   in 5.6s

2015-12-26 16:14:21 (543 KB/s) - ‘review_polarity.tar.gz’ saved [3127238/3127238]
First, we load the required modules:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
We use TfidfVectorizer to get a TF-IDF feature vector for each review. load_files reads the extracted txt_sentoken directory, which contains one subdirectory per class (pos and neg), and uses the directory names as labels:
sent_data = load_files('txt_sentoken')
tfidf_vec = TfidfVectorizer()
sent_X = tfidf_vec.fit_transform(sent_data.data)
sent_y = sent_data.target
LinearSVC is used to train a classifier for positive and negative sentiment.
lsvc = LinearSVC()
lsvc.fit(sent_X, sent_y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)
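The walkthrough does not evaluate the classifier before inspecting its weights; if you want a quick sanity check, here is a minimal sketch using cross_val_score (the module path is sklearn.model_selection in recent scikit-learn releases, while versions from this era used sklearn.cross_validation):

from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

# 5-fold cross-validated accuracy of the linear SVM on the TF-IDF features
scores = cross_val_score(LinearSVC(), sent_X, sent_y, cv=5)
print('accuracy: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))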
Finally, we show the most important features learned by the classifier.
def display_top_features(weights, names, top_n):
    # pick the top_n features with the largest absolute weights
    top_features = sorted(zip(weights, names), key=lambda x: abs(x[0]), reverse=True)[:top_n]
    top_weights = [x[0] for x in top_features]
    top_names = [x[1] for x in top_features]
    fig, ax = plt.subplots(figsize=(16, 8))
    ind = np.arange(top_n)
    bars = ax.bar(ind, top_weights, color='blue', edgecolor='black')
    # draw features with negative weights (negative sentiment) in red
    for bar, w in zip(bars, top_weights):
        if w < 0:
            bar.set_facecolor('red')
    # note: matplotlib >= 2.0 centers bars on ind by default, so set_xticks(ind) may align the labels better
    width = 0.30
    ax.set_xticks(ind + width)
    ax.set_xticklabels(top_names, rotation=45, fontsize=12)
    plt.show()
display_top_features(lsvc.coef_[0], tfidf_vec.get_feature_names(), 20)
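(In scikit-learn 1.0 and later, get_feature_names() has been replaced by get_feature_names_out(), so this call and the word-cloud one below may need to be adjusted on newer versions.)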
Word clouds are also an interesting way to show the relative importance of different words:
from wordcloud import WordCloud
def generate_word_cloud(weights, names):
    # note: newer wordcloud releases expect a dict here, e.g. dict(zip(names, weights))
    return WordCloud(width=350, height=250).generate_from_frequencies(zip(names, weights))

def display_word_cloud(weights, names):
    fig, ax = plt.subplots(1, 2, figsize=(28, 10))
    # split the weights into positive-sentiment and negative-sentiment words
    pos_weights = weights[weights > 0]
    pos_names = np.array(names)[weights > 0]
    neg_weights = np.abs(weights[weights < 0])
    neg_names = np.array(names)[weights < 0]
    lst = [('Positive', pos_weights, pos_names), ('Negative', neg_weights, neg_names)]
    for i, (label, weights, names) in enumerate(lst):
        wc = generate_word_cloud(weights, names)
        ax[i].imshow(wc)
        ax[i].set_axis_off()
        ax[i].set_title('{} words'.format(label), fontsize=24)
    plt.show()
display_word_cloud(lsvc.coef_[0], tfidf_vec.get_feature_names())
It's often difficult to make sense of high-dimensional data, so dimensionality reduction is commonly used to aid visualization. Here we will use t-SNE on the Iris flower data set. Additionally, we use mpld3 to produce figures that can be zoomed and panned.
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import mpld3
iris = load_iris()
def display_iris(data):
    X_tsne = TSNE(n_components=2, perplexity=20, learning_rate=50).fit_transform(data.data)
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    ax[0].scatter(X_tsne[:, 0], X_tsne[:, 1])
    ax[0].set_title('All instances', fontsize=14)
    ax[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=data.target)
    ax[1].set_title('All instances labeled with color', fontsize=14)
    return mpld3.display(fig)
display_iris(iris)
As we can see, t-SNE does quite well at separating data points of different classes even without knowing the labels. Let's try a more complicated example with the MNIST dataset of handwritten digits. We will also use PointLabelTooltip to display the labels as tooltips.
from sklearn.datasets import fetch_mldata
from sklearn.decomposition import PCA
mnist = fetch_mldata('MNIST original')
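# note: fetch_mldata was removed from newer scikit-learn releases; fetch_openml('mnist_784') is the usual replacement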
def display_mnist(data, n_samples):
    X, y = data.data / 255.0, data.target
    # downsample as the scikit-learn implementation of t-SNE is unable to handle too much data
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_train, y_train = X[indices[:n_samples]], y[indices[:n_samples]]
    X_tsne = TSNE(n_components=2, perplexity=30).fit_transform(X_train)
    X_pca = PCA(n_components=2).fit_transform(X_train)
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    points = ax[0].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_train)
    tooltip = mpld3.plugins.PointLabelTooltip(points, labels=y_train.tolist())
    mpld3.plugins.connect(fig, tooltip)
    ax[0].set_title('t-SNE')
    points = ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
    tooltip = mpld3.plugins.PointLabelTooltip(points, labels=y_train.tolist())
    mpld3.plugins.connect(fig, tooltip)
    ax[1].set_title('PCA')
    return mpld3.display(fig)
display_mnist(mnist, 1000)
If labels are available for the training data and your aim is to learn a projection, LDA can also be used.
from mpl_toolkits.mplot3d import Axes3D
from sklearn.lda import LDA
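# note: sklearn.lda was removed in newer scikit-learn releases; use: from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA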
def display_mnist_3d(data, n_samples):
    X, y = data.data / 255.0, data.target
    # downsample to keep the running time low, as in the t-SNE example above
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X_train, y_train = X[indices[:n_samples]], y[indices[:n_samples]]
    X_lda = LDA(n_components=3).fit_transform(X_train, y_train)
    fig, ax = plt.subplots(figsize=(10, 10), subplot_kw={'projection': '3d'})
    points = ax.scatter(X_lda[:, 0], X_lda[:, 1], X_lda[:, 2], c=y_train)
    ax.set_title('LDA')
    ax.set_xlim((-6, 6))
    ax.set_ylim((-6, 6))
    plt.show()
display_mnist_3d(mnist, 1000)
Pandas is quite useful for data analysis. Let's use the Meta Kaggle dataset to see how users are doing on the Kaggle website.
import pandas as pd
import sqlite3
After manually downloading the dataset, we extract the zipped file. There should be an output directory containing the files.
con = sqlite3.connect('output/database.sqlite')
kaggle_df = pd.read_sql_query('''
SELECT * FROM Submissions''', con)
Display some entries:
kaggle_df.head()
| | Id | SubmittedUserId | DateSubmitted | TeamId | PrivateScore | PublicScore | IsSelected | ScoreStatus | IsAfterDeadline | DateScored | ScoringDurationMilliseconds |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2180 | 647 | 2010-04-29 22:32:08 | 496 | 56.2139 | 55.7692 | False | 1 | False | | |
| 1 | 2181 | 619 | 2010-04-30 09:38:29 | 497 | 50 | 47.1154 | False | 1 | False | | |
| 2 | 2182 | 619 | 2010-04-30 09:48:50 | 497 | 65.6069 | 61.0577 | False | 1 | False | | |
| 3 | 2184 | 663 | 2010-05-01 11:02:52 | 499 | 50 | 47.1154 | False | 1 | False | | |
| 4 | 2185 | 673 | 2010-05-02 08:04:38 | 500 | 62.2832 | 61.0577 | False | 1 | False | | |
Now, we would like to analyse the submission times. First, we obtain the day of the week and the hour of the week for each submission.
print('There are {} submissions'.format(kaggle_df.shape[0]))
# convert time strings to DatetimeIndex
kaggle_df['timestamp'] = pd.to_datetime(kaggle_df['DateSubmitted'])
print('The earliest and latest submissions are on {} and {}'.format(kaggle_df['timestamp'].min(), kaggle_df['timestamp'].max()))
kaggle_df['weekday'] = kaggle_df['timestamp'].dt.weekday
kaggle_df['weekhr'] = kaggle_df['weekday'] * 24 + kaggle_df['timestamp'].dt.hour
There are 934345 submissions
The earliest and latest submissions are on 2010-04-29 22:32:08 and 2015-08-31 23:58:44.050000
import calendar
def display_kaggle(df):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    ax[0].set_title('submissions per weekday')
    # note: newer pandas uses .rename(lambda x: calendar.day_name[x]) to relabel the index
    df['weekday'].value_counts().sort_index().rename_axis(lambda x: calendar.day_name[x]).plot.bar(ax=ax[0])
    ax[1].set_title('submissions per hour of week')
    ax[1].set_xticks(np.linspace(0, 24*7, 8))
    df['weekhr'].value_counts().sort_index().plot(color='red', ax=ax[1])
    plt.show()
display_kaggle(kaggle_df)
Next, we try to cluster the users based on their submission patterns, to see whether different groups tend to submit at different times.
from collections import defaultdict
from sklearn.cluster import KMeans
def display_hr(df, n_clusters):
    # number of submissions per (user, hour of week) pair
    hrs_per_user = df[['SubmittedUserId', 'weekhr', 'Id']].groupby(['SubmittedUserId', 'weekhr']).count()
    total_per_user = hrs_per_user.sum(axis=0, level=0)  # newer pandas: hrs_per_user.groupby(level=0).sum()
    # fraction of each user's submissions falling in each hour of the week
    user_patterns = (hrs_per_user / total_per_user)['Id']
    vectors = defaultdict(lambda: np.zeros(24*7))
    for (u, hr), r in user_patterns.items():
        vectors[u][hr] = r
    X_hr = np.array(list(vectors.values()))
    y = KMeans(n_clusters=n_clusters, random_state=3).fit_predict(X_hr)
    for i in range(n_clusters):
        fig, ax = plt.subplots(figsize=(6, 6))
        indices = y == i
        X = X_hr[indices]
        ax.plot(np.arange(24*7), X.mean(axis=0))
        ax.set_xticks(np.linspace(0, 24*7, 8))
        ax.set_xlim((0, 24*7))
        ax.set_title('Cluster #{}, n = {}'.format(i, len(X)), fontsize=14)
        plt.show()
display_hr(kaggle_df, 9)
It seems that the users from Cluster#1 and Cluster#8 might indeed be active at different times. What do you think?
Finally, let's draw an XKCD-style plot with matplotlib! To be able to draw this, we need to install the Humor Sans font and clean the matplotlib font cache directory. To get the path of the cache, use:
import matplotlib
matplotlib.get_cachedir()
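If you prefer to clear the cache from Python rather than deleting the files by hand, a minimal sketch could look like the following (this assumes the cached font lists start with font, which may vary between matplotlib versions):

import glob
import os

# remove the cached font lists so the newly installed Humor Sans font gets picked up
for cache_file in glob.glob(os.path.join(matplotlib.get_cachedir(), 'font*')):
    os.remove(cache_file)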
As we use Python 3, additional packages are also required:
sudo apt-get install libffi-dev
pip3 install cairocffi
def xkcd():
    with plt.xkcd():
        fig, ax = plt.subplots()
        ax.spines['right'].set_color('none')
        ax.spines['top'].set_color('none')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_ylim([-1, 10])
        # productivity curve: slow ramp-up, frantic sprint, post-deadline crash
        data = np.zeros(100)
        data[:60] += np.linspace(-1, 0, 60)
        data[60:75] += np.arange(15)
        data[75:] -= np.ones(25)
        ax.annotate(
            'DEADLINE',
            xy=(71, 7), arrowprops=dict(arrowstyle='->'), xytext=(30, 2))
        ax.plot(data)
        ax.plot([72, 72], [-1, 15], 'k-', color='red')
        ax.set_xlabel('time')
        ax.set_ylabel('productivity')
        ax.set_title('productivity under a deadline')
        plt.show()
xkcd()