Here is where we start diving into the field of machine learning.
By the end of this section you will
In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks.
Machine Learning (ML) is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
In most ML applications, the data is stored in a 2D array of shape [n_samples x n_features], where the number of features is the same for each object, and each feature column refers to a related piece of information about each sample.
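As a concrete illustration of this layout, the following minimal sketch loads the iris dataset that ships with scikit-learn (it is used again later in this section) and inspects its shape. The variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # 2D array: one row per sample, one column per feature

n_samples, n_features = X.shape
print(n_samples, n_features)   # 150 4
print(iris.feature_names)      # what each feature column refers to
```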
Machine learning can be broken into two broad regimes: supervised learning and unsupervised learning. We’ll introduce these concepts here, and discuss them in more detail below.
Every algorithm is exposed in scikit-learn via an ''Estimator'' object. For instance, linear regression is implemented as:
%pylab inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
Estimator parameters: All the parameters of an estimator can be set when it is instantiated:
# Note: the normalize parameter was removed in scikit-learn 1.2
model = LinearRegression(normalize=True)
print(model.normalize)
True
print(model)
LinearRegression(copy_X=True, fit_intercept=True, normalize=True)
Estimated parameters: When an estimator is fitted to data, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object whose names end with an underscore:
x = np.array([0, 1, 2])
y = np.array([0, 1, 2])
_ = plt.plot(x, y, marker='o')
X = x[:, np.newaxis] # The input data for sklearn is 2D: (samples == 3 x features == 1)
X
array([[0], [1], [2]])
model.fit(X, y)
model.coef_
array([ 1.00000003])
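The distinction between the two kinds of parameters can be checked directly. The sketch below is illustrative only: it relies on the standard estimator behaviour that fitted attributes such as coef_ and intercept_ do not exist until fit has been called, and it uses fit_intercept (rather than the parameter shown above) purely as an example of a constructor parameter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)   # parameter set at instantiation
print(model.get_params()["fit_intercept"])     # constructor parameters can be inspected
print(hasattr(model, "coef_"))                 # nothing has been estimated yet

X = np.array([0, 1, 2])[:, np.newaxis]         # shape: (3 samples, 1 feature)
y = np.array([0, 1, 2])
model.fit(X, y)

# Estimated parameters carry a trailing underscore
print(model.coef_, model.intercept_)
print(model.predict([[3]]))                    # close to 3 for this perfectly linear data
```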
In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. A relatively simple example is predicting the species of an iris given a set of measurements of its flower. Some more complicated examples are:
What these tasks have in common is that there are one or more unknown quantities associated with the object which need to be determined from other observed quantities.
Supervised learning is further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.
Classification: K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, find the samples in your reference database whose features are closest to it and assign their predominant class.
Let's try it out on our iris classification problem:
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
print(iris.target_names[knn.predict([[3, 5, 4, 2]])])
['virginica']
# A plot of the sepal space and the prediction of the KNN
from helpers import plot_iris_knn
plot_iris_knn()
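To make the nearest-neighbor rule concrete, here is a minimal NumPy sketch of the prediction step. It is not how scikit-learn implements KNeighborsClassifier; the function name knn_predict and the choice of Euclidean distance are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets

def knn_predict(X_train, y_train, x_new, n_neighbors=1):
    """Predict the label of x_new by majority vote among its nearest neighbors."""
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the n_neighbors closest training samples
    nearest = np.argsort(distances)[:n_neighbors]
    # Predominant class among those neighbors (ties go to the smallest label)
    return np.bincount(y_train[nearest]).argmax()

iris = datasets.load_iris()
X, y = iris.data, iris.target

label = knn_predict(X, y, np.array([3, 5, 4, 2]), n_neighbors=1)
print(iris.target_names[label])
```

With n_neighbors=1 the vote reduces to copying the label of the single closest sample; larger values smooth the decision at the cost of blurring class boundaries.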
Regression: The simplest possible regression setting is the linear regression one:
# Create some simple data
import numpy as np
np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.normal(size=20)
# Fit a linear regression to it
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print("Model coefficient: %.5f, and intercept: %.5f" % (model.coef_[0], model.intercept_))
# Plot the data and the model prediction
X_test = np.linspace(0, 1, 100)[:, np.newaxis]
y_test = model.predict(X_test)
import pylab as pl
print(X.squeeze())
pl.plot(X.squeeze(), y, 'o')
pl.plot(X_test.squeeze(), y_test)
Model coefficient: 3.93491, and intercept: 1.46229
[ 0.5488135 0.71518937 0.60276338 0.54488318 0.4236548 0.64589411 0.43758721 0.891773 0.96366276 0.38344152 0.79172504 0.52889492 0.56804456 0.92559664 0.07103606 0.0871293 0.0202184 0.83261985 0.77815675 0.87001215]
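For a single feature, fitting a linear regression amounts to an ordinary least-squares problem. As a hedged sketch of what the estimator computes, the snippet below solves the same problem with np.linalg.lstsq on identically generated data and compares the result with LinearRegression. The names A, slope and intercept are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.random(size=(20, 1))
y = 3 * X.squeeze() + 2 + np.random.normal(size=20)

# Design matrix with a column of ones so that the intercept is estimated too
A = np.hstack([X, np.ones((X.shape[0], 1))])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

model = LinearRegression(fit_intercept=True).fit(X, y)
print(slope, intercept)
print(model.coef_[0], model.intercept_)   # should agree up to floating-point error
```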
Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. For example, in the iris data discussed above, we can use unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we’ll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:
Sometimes the two may even be combined: for example, unsupervised learning can be used to find useful features in heterogeneous data, and these features can then be used within a supervised framework.
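As a small sketch of such a combination, the pipeline below first learns a 2D PCA projection without using the labels and then trains a nearest-neighbor classifier on the projected features. The particular choices here (PCA, five neighbors, a 70/30 split with a fixed seed) are assumptions for illustration, and the imports assume a recent scikit-learn release that provides sklearn.model_selection.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Unsupervised step (PCA) feeding a supervised step (kNN)
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy: {0:.2f}".format(accuracy))
```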
An example: PCA for visualization. Principal Component Analysis (PCA) is a dimensionality reduction technique that finds the combinations of variables that explain the most variance.
Consider the iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 2 combinations of sepal and petal dimensions to visualize it:
X, y = iris.data, iris.target
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)
import pylab as pl
pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
print("Meaning of the 2 components:")
for component in pca.components_:
    print(" + ".join("%.3f x %s" % (value, name)
                     for value, name in zip(component, iris.feature_names)))
Reduced dataset shape: (150, 2)
Meaning of the 2 components:
0.362 x sepal length (cm) + -0.082 x sepal width (cm) + 0.857 x petal length (cm) + 0.359 x petal width (cm)
-0.657 x sepal length (cm) + -0.730 x sepal width (cm) + 0.176 x petal length (cm) + 0.075 x petal width (cm)
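One way to see where these components come from: the PCA directions are the right singular vectors of the mean-centered data matrix. The sketch below, which uses only np.linalg.svd, recovers them and compares them with scikit-learn's result up to sign, since the sign of a component is arbitrary. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centered = X - X.mean(axis=0)          # PCA operates on mean-centered data

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]
explained_variance_ratio = (S ** 2) / (S ** 2).sum()

pca = PCA(n_components=2).fit(X)
print(np.allclose(np.abs(components), np.abs(pca.components_), atol=1e-6))
print(explained_variance_ratio[:2])      # share of total variance per component
```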
Clustering: Clustering groups together observations that are homogeneous with respect to a given criterion, finding ''clusters'' in the data.
Note that these clusters will uncover relevant hidden structure in the data only if the criterion used highlights it.
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0) # Fixing the RNG in kmeans
k_means.fit(X_reduced)
y_pred = k_means.predict(X_reduced)
pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred)
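When labels happen to be available, as they are for iris, one can check how well the clusters match a known grouping. The sketch below uses the adjusted Rand index (sklearn.metrics.adjusted_rand_score) to compare the k-means clusters with the species labels. The explicit n_init=10 is an assumption made so that the code behaves the same across scikit-learn releases.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X_reduced = PCA(n_components=2).fit_transform(iris.data)

k_means = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = k_means.fit_predict(X_reduced)

# 1.0 means identical groupings (up to label names); values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, y_pred)
print("Adjusted Rand index: {0:.2f}".format(ari))
```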
%pylab inline
import pylab as pl
import numpy as np
# Some nice default configuration for plots
pl.rcParams['figure.figsize'] = 10, 7.5
pl.rcParams['axes.grid'] = True
pl.gray()
Outline of this section:
Let's start by implementing a canonical text classification example:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
twenty_train_small = load_files('../data/twenty_newsgroups/20news-bydate-train/',
                                categories=categories, encoding='latin-1')
twenty_test_small = load_files('../data/twenty_newsgroups/20news-bydate-test/',
                               categories=categories, encoding='latin-1')
# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_small.data)
y_train = twenty_train_small.target
# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
    classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_small.data)
y_test = twenty_test_small.target
print("Testing score: {0:.1f}%".format(
    classifier.score(X_test, y_test) * 100))
Training score: 95.1%
Testing score: 85.1%
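The vectorize-then-classify steps above can also be chained into a single object. The following is a minimal sketch on a tiny made-up corpus rather than the newsgroup files, so that it runs without the data directory. The documents, the labels and the name text_clf are invented for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for the newsgroup posts
train_docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "render the image with opengl shaders",
    "the graphics card draws polygons and textures",
]
train_labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]

# The vectorizer and the classifier are fitted together and applied together
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(train_docs, train_labels)

predictions = text_clf.predict(["a satellite in orbit", "opengl textures"])
print(predictions)
```

Bundling both steps guarantees that new documents are transformed with the vocabulary learned on the training set, which mirrors the fit_transform / transform split used above.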
Here is a workflow diagram summary of what happened previously:
Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split that is made available using the subset keyword argument:
ls -l ../data/
total 1744
drwxr-xr-x 3 marcelcaraciolo staff 102 Sep 26 16:54 labeled_faces_wild/
drwxr-xr-x 3 marcelcaraciolo staff 102 Sep 26 16:54 languages/
drwxr-xr-x 6 marcelcaraciolo staff 204 Aug 4 02:30 ml-1m/
-rw-r--r--@ 1 marcelcaraciolo staff 1047 Sep 29 23:46 movie_rating.csv
drwxr-xr-x 3 marcelcaraciolo staff 102 Sep 26 16:54 movie_reviews/
-rw-r--r-- 1 marcelcaraciolo staff 371839 Sep 29 20:27 movielens_test.csv
-rw-r--r-- 1 marcelcaraciolo staff 512859 Sep 29 20:27 movielens_train.csv
drwxr-xr-x 5 marcelcaraciolo staff 170 Sep 30 17:16 twenty_newsgroups/
ls -lh ../data/twenty_newsgroups/20news-bydate-train
total 0
drwxr-xr-x 482 marcelcaraciolo staff 16K Mar 18 2003 alt.atheism/
drwxr-xr-x 586 marcelcaraciolo staff 19K Mar 18 2003 comp.graphics/
drwxr-xr-x 593 marcelcaraciolo staff 20K Mar 18 2003 comp.os.ms-windows.misc/
drwxr-xr-x 592 marcelcaraciolo staff 20K Mar 18 2003 comp.sys.ibm.pc.hardware/
drwxr-xr-x 580 marcelcaraciolo staff 19K Mar 18 2003 comp.sys.mac.hardware/
drwxr-xr-x 595 marcelcaraciolo staff 20K Mar 18 2003 comp.windows.x/
drwxr-xr-x 587 marcelcaraciolo staff 19K Mar 18 2003 misc.forsale/
drwxr-xr-x 596 marcelcaraciolo staff 20K Mar 18 2003 rec.autos/
drwxr-xr-x 600 marcelcaraciolo staff 20K Mar 18 2003 rec.motorcycles/
drwxr-xr-x 599 marcelcaraciolo staff 20K Mar 18 2003 rec.sport.baseball/
drwxr-xr-x 602 marcelcaraciolo staff 20K Mar 18 2003 rec.sport.hockey/
drwxr-xr-x 597 marcelcaraciolo staff 20K Mar 18 2003 sci.crypt/
drwxr-xr-x 593 marcelcaraciolo staff 20K Mar 18 2003 sci.electronics/
drwxr-xr-x 596 marcelcaraciolo staff 20K Mar 18 2003 sci.med/
drwxr-xr-x 595 marcelcaraciolo staff 20K Mar 18 2003 sci.space/
drwxr-xr-x 601 marcelcaraciolo staff 20K Mar 18 2003 soc.religion.christian/
drwxr-xr-x 548 marcelcaraciolo staff 18K Mar 18 2003 talk.politics.guns/
drwxr-xr-x 566 marcelcaraciolo staff 19K Mar 18 2003 talk.politics.mideast/
drwxr-xr-x 467 marcelcaraciolo staff 16K Mar 18 2003 talk.politics.misc/
drwxr-xr-x 379 marcelcaraciolo staff 13K Mar 18 2003 talk.religion.misc/
ls -lh ../data/twenty_newsgroups/20news-bydate-train/alt.atheism/
total 4480
-rw-r--r-- 1 marcelcaraciolo staff 12K Mar 18 2003 49960
-rw-r--r-- 1 marcelcaraciolo staff 31K Mar 18 2003 51060
-rw-r--r-- 1 marcelcaraciolo staff 4.0K Mar 18 2003 51119
-rw-r--r-- 1 marcelcaraciolo staff 1.6K Mar 18 2003 51120
-rw-r--r-- 1 marcelcaraciolo staff 773B Mar 18 2003 51121
-rw-r--r-- 1 marcelcaraciolo staff 4.8K Mar 18 2003 51122
-rw-r--r-- 1 marcelcaraciolo staff 618B Mar 18 2003 51123
-rw-r--r-- 1 marcelcaraciolo staff 1.4K Mar 18 2003 51124
-rw-r--r-- 1 marcelcaraciolo staff 2.7K Mar 18 2003 51125
-rw-r--r-- 1 marcelcaraciolo staff 427B Mar 18 2003 51126
...
marcelcaraciolo staff 2.2K Mar 18 2003 53758 -rw-r--r-- 1 marcelcaraciolo staff 745B Mar 18 2003 53759 -rw-r--r-- 1 marcelcaraciolo staff 1.9K Mar 18 2003 53760 -rw-r--r-- 1 marcelcaraciolo staff 592B Mar 18 2003 53761 -rw-r--r-- 1 marcelcaraciolo staff 658B Mar 18 2003 53762 -rw-r--r-- 1 marcelcaraciolo staff 756B Mar 18 2003 53763 -rw-r--r-- 1 marcelcaraciolo staff 2.7K Mar 18 2003 53764 -rw-r--r-- 1 marcelcaraciolo staff 1.1K Mar 18 2003 53765 -rw-r--r-- 1 marcelcaraciolo staff 906B Mar 18 2003 53766 -rw-r--r-- 1 marcelcaraciolo staff 535B Mar 18 2003 53780 -rw-r--r-- 1 marcelcaraciolo staff 1.3K Mar 18 2003 53785 -rw-r--r-- 1 marcelcaraciolo staff 2.3K Mar 18 2003 54165 -rw-r--r-- 1 marcelcaraciolo staff 2.8K Mar 18 2003 54166 -rw-r--r-- 1 marcelcaraciolo staff 547B Mar 18 2003 54167 -rw-r--r-- 1 marcelcaraciolo staff 2.4K Mar 18 2003 54168 -rw-r--r-- 1 marcelcaraciolo staff 4.7K Mar 18 2003 54178 -rw-r--r-- 1 marcelcaraciolo staff 1.8K Mar 18 2003 54179 -rw-r--r-- 1 marcelcaraciolo staff 4.4K Mar 18 2003 54180 -rw-r--r-- 1 marcelcaraciolo staff 1.3K Mar 18 2003 54181 -rw-r--r-- 1 marcelcaraciolo staff 3.0K Mar 18 2003 54182 -rw-r--r-- 1 marcelcaraciolo staff 1.4K Mar 18 2003 54198 -rw-r--r-- 1 marcelcaraciolo staff 1.8K Mar 18 2003 54199 -rw-r--r-- 1 marcelcaraciolo staff 2.5K Mar 18 2003 54200 -rw-r--r-- 1 marcelcaraciolo staff 1.7K Mar 18 2003 54201 -rw-r--r-- 1 marcelcaraciolo staff 1.0K Mar 18 2003 54202 -rw-r--r-- 1 marcelcaraciolo staff 1.2K Mar 18 2003 54203 -rw-r--r-- 1 marcelcaraciolo staff 565B Mar 18 2003 54204 -rw-r--r-- 1 marcelcaraciolo staff 641B Mar 18 2003 54227 -rw-r--r-- 1 marcelcaraciolo staff 1.0K Mar 18 2003 54228 -rw-r--r-- 1 marcelcaraciolo staff 877B Mar 18 2003 54470 -rw-r--r-- 1 marcelcaraciolo staff 1.0K Mar 18 2003 54471 -rw-r--r-- 1 marcelcaraciolo staff 993B Mar 18 2003 54472 -rw-r--r-- 1 marcelcaraciolo staff 434B Mar 18 2003 54473
The load_files function can load text files from a two-level folder structure, assuming that the folder names represent the categories:
#print(load_files.__doc__)
# `encoding` replaces the `charset` parameter, which is deprecated as of scikit-learn 0.14
all_twenty_train = load_files('../data/twenty_newsgroups/20news-bydate-train/',
    encoding='latin-1', random_state=42)
all_twenty_test = load_files('../data/twenty_newsgroups/20news-bydate-test/',
    encoding='latin-1', random_state=42)
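To make the expected folder layout concrete, here is a minimal sketch that builds a throwaway two-level directory and loads it back. The category names, file names and texts are invented for the illustration; only load_files and the attributes of the object it returns come from scikit-learn.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a temporary layout: <root>/<category>/<document>.txt
# (categories and texts below are made up for this example).
root = tempfile.mkdtemp()
toy_docs = {
    'sci.space': ['The shuttle launch was delayed.', 'Orbit insertion went well.'],
    'rec.autos': ['My car needs new brake pads.'],
}
for category, texts in toy_docs.items():
    os.makedirs(os.path.join(root, category))
    for n, text in enumerate(texts):
        with open(os.path.join(root, category, 'doc%d.txt' % n), 'w') as f:
            f.write(text)

toy = load_files(root, encoding='latin-1', random_state=42)

print(toy.target_names)   # one category per sub-folder
print(len(toy.data))      # one decoded string per file
for text, label in zip(toy.data, toy.target):
    print(toy.target_names[label], '->', text)
```

Each entry of target is an index into target_names, which is how the class name of a document is recovered.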
all_target_names = all_twenty_train.target_names
all_target_names
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
all_twenty_train.target
array([12, 6, 9, ..., 9, 1, 12])
all_twenty_train.target.shape
(11314,)
all_twenty_test.target.shape
(7532,)
len(all_twenty_train.data)
11314
type(all_twenty_train.data[0])
unicode
def display_sample(i, dataset):
    print("Class name: " + dataset.target_names[dataset.target[i]])
    print("Text content:\n")
    print(dataset.data[i])
display_sample(0, all_twenty_train)
Class name: sci.electronics Text content: From: wtm@uhura.neoucom.edu (Bill Mayhew) Subject: Re: How to the disks copy protected. Organization: Northeastern Ohio Universities College of Medicine Lines: 23 Write a good manual to go with the software. The hassle of photocopying the manual is offset by simplicity of purchasing the package for only $15. Also, consider offering an inexpensive but attractive perc for registered users. For instance, a coffee mug. You could produce and mail the incentive for a couple of dollars, so consider pricing the product at $17.95. You're lucky if only 20% of the instances of your program in use are non-licensed users. The best approach is to estimate your loss and accomodate that into your price structure. Sure it hurts legitimate users, but too bad. Retailers have to charge off loss to shoplifters onto paying customers; the software industry is the same. Unless your product is exceptionally unique, using an ostensibly copy-proof disk will just send your customers to the competetion. -- Bill Mayhew NEOUCOM Computer Services Department Rootstown, OH 44272-9995 USA phone: 216-325-2511 wtm@uhura.neoucom.edu (140.220.1.1) 146.580: N8WED
display_sample(1, all_twenty_train)
Class name: misc.forsale Text content: From: andy@SAIL.Stanford.EDU (Andy Freeman) Subject: Re: Catalog of Hard-to-Find PC Enhancements (Repost) Organization: Computer Science Department, Stanford University. Lines: 33 >andy@SAIL.Stanford.EDU (Andy Freeman) writes: >> >In article <C5ELME.4z4@unix.portal.com> jdoll@shell.portal.com (Joe Doll) wr >> >> "The Catalog of Personal Computing Tools for Engineers and Scien- >> >> tists" lists hardware cards and application software packages for >> >> PC/XT/AT/PS/2 class machines. Focus is on engineering and scien- >> >> tific applications of PCs, such as data acquisition/control, >> >> design automation, and data analysis and presentation. >> > >> >> If you would like a free copy, reply with your (U. S. Postal) >> >> mailing address. >> >> Don't bother - it never comes. It's a cheap trick for building a >> mailing list to sell if my junk mail flow is any indication. >> >> -andy sent his address months ago > >Perhaps we can get Portal to nuke this weasal. I never received a >catalog either. If that person doesn't respond to a growing flame, then >we can assume that we'yall look forward to lotsa junk mail. I don't want him nuked, I want him to be honest. The junk mail has been much more interesting than the promised catalog. If I'd known what I was going to get, I wouldn't have hesitated. I wouldn't be surprised if there were other folks who looked at the ad and said "nope" but who would be very interested in the junk mail that results. Similarly, there are people who wanted the advertised catalog who aren't happy with the junk they got instead. The folks buying the mailing lists would prefer an honest ad, and so would the people reading it. -andy --
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB, assuming an 8-bit encoding (in this case, all chars can be encoded using the latin-1 charset), so that each character takes exactly one byte:
def text_size(text, charset='iso-8859-1'):
    # one byte per character in latin-1; 1e-6 converts bytes to megabytes
    return len(text.encode(charset)) * 1e-6
train_size_mb = sum(text_size(text) for text in all_twenty_train.data)
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)
print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
Training set size: 22 MB
Testing set size: 13 MB
If we only consider the small subset of 4 categories selected in the initial example:
train_small_size_mb = sum(text_size(text) for text in twenty_train_small.data)
test_small_size_mb = sum(text_size(text) for text in twenty_test_small.data)
print("Training set size: {0} MB".format(int(train_small_size_mb)))
print("Testing set size: {0} MB".format(int(test_small_size_mb)))
Training set size: 3 MB
Testing set size: 2 MB
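The 8-bit assumption matters because the number of characters equals the number of bytes only for single-byte encodings such as latin-1. The short sketch below uses an invented sample string to compare the character count with the encoded size in latin-1 and in UTF-8.

```python
# A made-up sample: 10 characters, two of them accented.
sample = u'na\xefve caf\xe9'

n_chars = len(sample)
n_bytes_latin1 = len(sample.encode('iso-8859-1'))   # one byte per character
n_bytes_utf8 = len(sample.encode('utf-8'))          # accented chars take two bytes

print(n_chars, n_bytes_latin1, n_bytes_utf8)
print("Size in MB (latin-1): {0}".format(n_bytes_latin1 * 1e-6))
```

For text that is mostly ASCII, as in these newsgroup posts, the two encodings give nearly the same size.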
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer()
TfidfVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None)
vectorizer = TfidfVectorizer(min_df=1)
%time X_train_small = vectorizer.fit_transform(twenty_train_small.data)
CPU times: user 1.09 s, sys: 65.3 ms, total: 1.15 s Wall time: 1.3 s
The result is not a numpy.array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array, but it does not store the zeros.
X_train_small
<2034x34118 sparse matrix of type '<type 'numpy.float64'>' with 323433 stored elements in Compressed Sparse Row format>
scipy.sparse matrices also have a shape attribute to access the dimensions:
n_samples, n_features = X_train_small.shape
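As a small illustration of what "does not store the zeros" means, the sketch below converts a tiny hand-written dense array (not taken from the newsgroups data) to Compressed Sparse Row format and inspects what is actually kept in memory.

```python
import numpy as np
from scipy import sparse

# A mostly-zero toy matrix, invented for this example.
dense = np.array([[0., 0., 3.],
                  [0., 0., 0.],
                  [1., 0., 2.]])

X_csr = sparse.csr_matrix(dense)

print(X_csr.shape)      # same (rows, columns) shape as the dense array
print(X_csr.nnz)        # number of explicitly stored (non-zero) elements
print(X_csr.data)       # the stored values, row by row
print(X_csr.toarray())  # converting back gives the original dense array
```

Only 3 of the 9 entries are stored here; for a text corpus with tens of thousands of features per document the saving is far larger.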
This dataset has around 2000 samples (the rows of the data matrix):
n_samples
2034
This is the same value as the number of strings in the original list of text documents:
len(twenty_train_small.data)
2034
The columns represent the individual token occurrences:
n_features
34118
This number is the size of the vocabulary extracted during fit, which the model stores in a Python dictionary:
type(vectorizer.vocabulary_)
dict
len(vectorizer.vocabulary_)
34118
The keys of the vocabulary_ attribute are also called feature names and can be accessed as a list of strings.
len(vectorizer.get_feature_names())
34118
Here are the first 10 elements (sorted in lexicographical order):
vectorizer.get_feature_names()[:10]
[u'00', u'000', u'0000', u'00000', u'000000', u'000005102000', u'000021', u'000062david42', u'0000vec', u'0001']
Let's have a look at the features from the middle:
vectorizer.get_feature_names()[n_features // 2:n_features // 2 + 10]
[u'inadequate', u'inala', u'inalienable', u'inane', u'inanimate', u'inapplicable', u'inappropriate', u'inappropriately', u'inaudible', u'inbreeding']
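The relation between the vocabulary dictionary and the columns of the matrix is easier to see on a corpus small enough to read. The three sentences below are invented; the sketch recovers the feature names by sorting the keys of vocabulary_ by column index, which avoids relying on an accessor method whose name differs between scikit-learn releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A made-up three-document corpus.
toy_corpus = ['the cat sat', 'the dog sat', 'the cat ran away']

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each token to the index of its column in X_toy.
vocab = toy_vectorizer.vocabulary_
feature_names = sorted(vocab, key=vocab.get)

print(X_toy.shape)       # (number of documents, size of the vocabulary)
print(feature_names)     # tokens in column order
print(vocab['cat'])      # column holding the weight of 'cat'
```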
Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two components of a Principal Component Analysis to get a feel for the data. Note that, in the scikit-learn version used here (0.14.1), the RandomizedPCA class can accept scipy.sparse matrices as input (as an alternative to numpy arrays):
from sklearn.decomposition import RandomizedPCA
%time X_train_small_pca = RandomizedPCA(n_components=2).fit_transform(X_train_small)
CPU times: user 164 ms, sys: 15.5 ms, total: 179 ms Wall time: 393 ms
/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.5-i386.egg/sklearn/decomposition/pca.py:512: DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead. DeprecationWarning)
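from itertools import cycle
y_train = twenty_train_small.target  # labels aligned with the rows of X_train_small_pca
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
# one scatter call per category so that each gets its own color and legend entry
for i, c in zip(np.unique(y_train), cycle(colors)):
    pl.scatter(X_train_small_pca[y_train == i, 0],
               X_train_small_pca[y_train == i, 1],
               c=c, label=twenty_train_small.target_names[i], alpha=0.5)
_ = pl.legend(loc='best')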
We can observe that there is a large overlap of the samples from different categories. This is to be expected, as the PCA linear projection maps data from a 34118-dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.
Still, we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, and computer graphics and space science overlap more with each other than they do with the religion or atheism newsgroups.
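Later scikit-learn releases recommend TruncatedSVD for projecting sparse matrices. The sketch below shows the call on a random sparse matrix that stands in for the TF-IDF data; the matrix and its dimensions are assumptions made for the example. Unlike PCA, TruncatedSVD does not center the data before the decomposition, so the two projections are related but not identical.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in for a TF-IDF matrix (50 documents, 200 features).
X_sparse = sparse.random(50, 200, density=0.05, format='csr', random_state=0)

svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X_sparse)   # dense array, one 2D point per document

print(X_2d.shape)
```

The resulting array can be passed to the same scatter-plot code as a PCA projection.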
We have previously extracted a vector representation of the training corpus and put it into a variable named X_train_small. To train a supervised model, in this case a classifier, we also need the target labels:
y_train_small = twenty_train_small.target
y_train_small.shape
(2034,)
y_train_small
array([1, 2, 2, ..., 2, 1, 1])
We can check that we have the same number of samples for the input data and the labels:
X_train_small.shape[0] == y_train_small.shape[0]
True
We can now train a classifier, for instance a Multinomial Naive Bayes classifier:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.1)
clf
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
clf.fit(X_train_small, y_train_small)
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which returns the rate of correct classification on the test set:
X_test_small = vectorizer.transform(twenty_test_small.data)
y_test_small = twenty_test_small.target
X_test_small.shape
(1353, 34118)
y_test_small.shape
(1353,)
clf.score(X_test_small, y_test_small)
0.89652623798965259
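To see what the score amounts to, the following sketch trains the same kind of classifier on a tiny invented corpus and compares score with the fraction of correct predictions computed by hand. All documents, labels and variable names here are made up; note that the test documents go through transform, not fit_transform, so that they share the vocabulary learned on the training documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: label 0 = space, label 1 = cars.
toy_train_docs = ['rocket launch orbit', 'orbit moon rocket',
                  'engine brakes car', 'car tires engine']
toy_train_labels = np.array([0, 0, 1, 1])
toy_test_docs = ['moon rocket', 'car brakes', 'orbit engine']
toy_test_labels = np.array([0, 1, 0])

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_train_docs)  # learns the vocabulary
X_toy_test = toy_vectorizer.transform(toy_test_docs)        # reuses it unchanged

toy_clf = MultinomialNB(alpha=0.1).fit(X_toy_train, toy_train_labels)

# Fraction of test documents whose predicted label matches the true label.
manual_accuracy = np.mean(toy_clf.predict(X_toy_test) == toy_test_labels)
print(manual_accuracy, toy_clf.score(X_toy_test, toy_test_labels))
```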
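We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time: the training score is clearly higher than the test score (overfitting), yet it still falls short of 1.0 (underfitting):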
clf.score(X_train_small, y_train_small)
0.99262536873156337
The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:
TfidfVectorizer()
TfidfVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None)
print(TfidfVectorizer.__doc__)
Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to CountVectorizer followed by TfidfTransformer.

Parameters
----------
input : string {'filename', 'file', 'content'}
    If filename, the sequence passed as an argument to fit is
    expected to be a list of filenames that need reading to fetch
    the raw content to analyze.

    If 'file', the sequence items must have 'read' method (file-like
    object) it is called to fetch the bytes in memory.

    Otherwise the input is expected to be the sequence strings or
    bytes items are expected to be analyzed directly.

encoding : string, 'utf-8' by default.
    If bytes or files are given to analyze, this encoding is used to
    decode.

decode_error : {'strict', 'ignore', 'replace'}
    Instruction on what to do if a byte sequence is given to analyze that
    contains characters not of the given `encoding`. By default, it is
    'strict', meaning that a UnicodeDecodeError will be raised. Other
    values are 'ignore' and 'replace'.

strip_accents : {'ascii', 'unicode', None}
    Remove accents during the preprocessing step.
    'ascii' is a fast method that only works on characters that have
    an direct ASCII mapping.
    'unicode' is a slightly slower method that works on any characters.
    None (default) does nothing.

analyzer : string, {'word', 'char'} or callable
    Whether the feature should be made of word or character n-grams.

    If a callable is passed it is used to extract the sequence of features
    out of the raw, unprocessed input.

preprocessor : callable or None (default)
    Override the preprocessing (string transformation) stage while
    preserving the tokenizing and n-grams generation steps.

tokenizer : callable or None (default)
    Override the string tokenization step while preserving the
    preprocessing and n-grams generation steps.

ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different
    n-grams to be extracted. All values of n such that min_n <= n <= max_n
    will be used.

stop_words : string {'english'}, list, or None (default)
    If a string, it is passed to _check_stop_list and the appropriate stop
    list is returned. 'english' is currently the only supported string
    value.

    If a list, that list is assumed to contain stop words, all of which
    will be removed from the resulting tokens.

    If None, no stop words will be used. max_df can be set to a value
    in the range [0.7, 1.0) to automatically detect and filter stop
    words based on intra corpus document frequency of terms.

lowercase : boolean, default True
    Convert all characters to lowercase befor tokenizing.

token_pattern : string
    Regular expression denoting what constitutes a "token", only used
    if `tokenize == 'word'`. The default regexp select tokens of 2
    or more letters characters (punctuation is completely ignored
    and always treated as a token separator).

max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default
    When building the vocabulary ignore terms that have a term frequency
    strictly higher than the given threshold (corpus specific stop words).
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, optional, 1 by default
    When building the vocabulary ignore terms that have a term frequency
    strictly lower than the given threshold.
    This value is also called cut-off in the literature.
    If float, the parameter represents a proportion of documents, integer
    absolute counts.
    This parameter is ignored if vocabulary is not None.

max_features : optional, None by default
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

vocabulary : Mapping or iterable, optional
    Either a Mapping (e.g., a dict) where keys are terms and values are
    indices in the feature matrix, or an iterable over terms. If not
    given, a vocabulary is determined from the input documents.

binary : boolean, False by default.
    If True, all non zero counts are set to 1. This is useful for discrete
    probabilistic models that model binary events rather than integer
    counts.

dtype : type, optional
    Type of the matrix returned by fit_transform() or transform().

norm : 'l1', 'l2' or None, optional
    Norm used to normalize term vectors. None for no normalization.

use_idf : boolean, optional
    Enable inverse-document-frequency reweighting.

smooth_idf : boolean, optional
    Smooth idf weights by adding one to document frequencies, as if an
    extra document was seen containing every term in the collection
    exactly once. Prevents zero divisions.

sublinear_tf : boolean, optional
    Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

See also
--------
CountVectorizer
    Tokenize the documents and count the occurrences of token and return
    them as a sparse matrix

TfidfTransformer
    Apply Term Frequency Inverse Document Frequency normalization to a
    sparse matrix of occurrence counts.
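The docstring states that the vectorizer is equivalent to counting tokens and then applying TF-IDF reweighting. As a rough sketch of what the default settings (use_idf, smooth_idf and the l2 norm) compute, the code below rebuilds the weights by hand on an invented corpus and compares them with the vectorizer output. The smoothed formula idf = ln((1 + n) / (1 + df)) + 1 is the one used by recent scikit-learn releases as far as we know; treat it as an assumption and rely on the comparison at the end rather than on the formula itself.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A made-up three-document corpus.
toy_corpus = ['the cat sat', 'the dog sat', 'the cat ran away']

# Step 1: raw token counts, one row per document.
counts = CountVectorizer().fit_transform(toy_corpus).toarray().astype(float)

# Step 2: smoothed inverse document frequency of each token.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

# Step 3: reweight the counts and scale each row to unit l2 norm.
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_corpus).toarray()
print(np.allclose(tfidf, reference))
```

Tokens that appear in every document, such as "the" here, receive the smallest idf and therefore the least weight.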
The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call vectorizer.build_analyzer() to get an instance of the text analyzer it uses to process the text:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
[u'love', u'scikit', u'learn', u'this', u'is', u'cool', u'python', u'lib']
You can notice that all the tokens are lowercase, that the single-letter word "I" was dropped, and that the hyphen is treated as a token separator ("scikit-learn" is split into two tokens). Let's change some of that default behavior:
analyzer = TfidfVectorizer(
    preprocessor=lambda text: text,    # disable lowercasing
    token_pattern=r'(?u)\b[\w-]+\b',   # treat hyphen as a letter and
                                       # do not exclude single letter tokens
).build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
[u'I', u'love', u'scikit-learn', u'this', u'is', u'a', u'cool', u'Python', u'lib']
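The effect of token_pattern can be reproduced with the standard re module alone. This is only an approximation of the analyzer (it skips options such as accent stripping and n-gram generation), but it shows why the two patterns tokenize the sentence differently.

```python
import re

text = "I love scikit-learn: this is a cool Python lib!"

# Default pattern: runs of at least two word characters.
default_pattern = re.compile(r'(?u)\b\w\w+\b')
# Custom pattern: one or more word characters or hyphens.
hyphen_pattern = re.compile(r'(?u)\b[\w-]+\b')

# The default analyzer lowercases the text before tokenizing.
default_tokens = default_pattern.findall(text.lower())
# The customized analyzer leaves the case untouched.
hyphen_tokens = hyphen_pattern.findall(text)

print(default_tokens)
print(hyphen_tokens)
```

Requiring two word characters is what drops "I" and "a" in the default case, and allowing the hyphen inside the character class is what keeps "scikit-learn" in one piece.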