%pylab inline
import pylab as pl
import numpy as np
# Some nice default configuration for plots
pl.rcParams['figure.figsize'] = 10, 7.5
pl.rcParams['axes.grid'] = True
pl.gray()
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline]. For more information, type 'help(pylab)'.
Let's start by implementing a canonical text classification example:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Load the text data
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
twenty_train_subset = load_files('datasets/20news-bydate-train/',
categories=categories, charset='latin-1')
twenty_test_subset = load_files('datasets/20news-bydate-test/',
categories=categories, charset='latin-1')
# Turn the text documents into vectors of word frequencies
vectorizer = TfidfVectorizer(min_df=2)
X_train = vectorizer.fit_transform(twenty_train_subset.data)
y_train = twenty_train_subset.target
# Fit a classifier on the training set
classifier = MultinomialNB().fit(X_train, y_train)
print("Training score: {0:.1f}%".format(
classifier.score(X_train, y_train) * 100))
# Evaluate the classifier on the testing set
X_test = vectorizer.transform(twenty_test_subset.data)
y_test = twenty_test_subset.target
print("Testing score: {0:.1f}%".format(
classifier.score(X_test, y_test) * 100))
Training score: 95.1%
Testing score: 85.1%
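The vectorize-then-classify steps above can also be chained into a single scikit-learn Pipeline object, so that the fitted vectorizer is automatically applied before prediction. Here is a minimal sketch on a tiny made-up corpus (the documents and labels below are invented for illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: class 0 is "space", class 1 is "graphics"
docs = [
    "the rocket launch was delayed by nasa",
    "orbital mechanics and the space shuttle",
    "rendering 3d graphics with opengl",
    "image textures and polygon shading",
]
labels = [0, 0, 1, 1]

# Chain vectorization and classification into one estimator
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(docs, labels)

# predict() vectorizes the raw text with the fitted vectorizer first
print(pipeline.predict(["nasa delayed the shuttle launch"])[0])  # 0
```

This avoids having to call `vectorizer.transform` by hand on the test documents, as we do above.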
Here is a workflow diagram summary of what happened previously:
from IPython.core.display import Image, display
display(Image(filename='figures/supervised_scikit_learn.png'))
Let's now decompose what we just did to understand and customize each step:
Let's explore the dataset loading utility without passing a list of categories: in this case we load the full 20 newsgroups dataset into memory. The source website for the 20 newsgroups dataset already provides a date-based train / test split that is made available through the subset keyword argument:
ls -l datasets/
total 187176
drwxr-xr-x  22 ogrisel  staff       748 Mar 18  2003 20news-bydate-test/
drwxr-xr-x  22 ogrisel  staff       748 Mar 18  2003 20news-bydate-train/
-rw-r--r--   1 ogrisel  staff  14464277 Jun 20 15:41 20news-bydate.tar.gz
drwxr-xr-x   7 ogrisel  staff       238 Jun 24 16:04 lfw_home/
drwxr-xr-x   4 ogrisel  staff       136 Jun 20 15:43 sentiment140/
-rw-r--r--   1 ogrisel  staff  81363704 Jun 20 15:43 trainingandtestdata.zip
ls -lh datasets/20news-bydate-train
total 0
drwxr-xr-x  482 ogrisel  staff  16K Mar 18  2003 alt.atheism/
drwxr-xr-x  586 ogrisel  staff  19K Mar 18  2003 comp.graphics/
drwxr-xr-x  593 ogrisel  staff  20K Mar 18  2003 comp.os.ms-windows.misc/
drwxr-xr-x  592 ogrisel  staff  20K Mar 18  2003 comp.sys.ibm.pc.hardware/
drwxr-xr-x  580 ogrisel  staff  19K Mar 18  2003 comp.sys.mac.hardware/
drwxr-xr-x  595 ogrisel  staff  20K Mar 18  2003 comp.windows.x/
drwxr-xr-x  587 ogrisel  staff  19K Mar 18  2003 misc.forsale/
drwxr-xr-x  596 ogrisel  staff  20K Mar 18  2003 rec.autos/
drwxr-xr-x  600 ogrisel  staff  20K Mar 18  2003 rec.motorcycles/
drwxr-xr-x  599 ogrisel  staff  20K Mar 18  2003 rec.sport.baseball/
drwxr-xr-x  602 ogrisel  staff  20K Mar 18  2003 rec.sport.hockey/
drwxr-xr-x  597 ogrisel  staff  20K Mar 18  2003 sci.crypt/
drwxr-xr-x  593 ogrisel  staff  20K Mar 18  2003 sci.electronics/
drwxr-xr-x  596 ogrisel  staff  20K Mar 18  2003 sci.med/
drwxr-xr-x  595 ogrisel  staff  20K Mar 18  2003 sci.space/
drwxr-xr-x  601 ogrisel  staff  20K Mar 18  2003 soc.religion.christian/
drwxr-xr-x  548 ogrisel  staff  18K Mar 18  2003 talk.politics.guns/
drwxr-xr-x  566 ogrisel  staff  19K Mar 18  2003 talk.politics.mideast/
drwxr-xr-x  467 ogrisel  staff  16K Mar 18  2003 talk.politics.misc/
drwxr-xr-x  379 ogrisel  staff  13K Mar 18  2003 talk.religion.misc/
ls -lh datasets/20news-bydate-train/alt.atheism/
total 4480
-rw-r--r--  1 ogrisel  staff   12K Mar 18  2003 49960
-rw-r--r--  1 ogrisel  staff   31K Mar 18  2003 51060
-rw-r--r--  1 ogrisel  staff  4.0K Mar 18  2003 51119
-rw-r--r--  1 ogrisel  staff  1.6K Mar 18  2003 51120
-rw-r--r--  1 ogrisel  staff  773B Mar 18  2003 51121
[... output truncated: one small text file per newsgroup message ...]
-rw-r--r--  1 ogrisel  staff  434B Mar 18  2003 54473
The load_files function can load text files from a two-level folder structure, using the folder names as category labels:
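As a minimal sketch of this convention, we can build a throwaway two-category folder structure in a temporary directory and load it back (the spam/ham categories and their contents are invented for illustration; note that recent scikit-learn versions name the decoding argument encoding rather than charset):

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny two-level structure:
#   <root>/spam/1.txt and <root>/ham/1.txt
root = tempfile.mkdtemp()
for category, text in [("spam", "buy now"), ("ham", "meeting at noon")]:
    os.makedirs(os.path.join(root, category))
    with open(os.path.join(root, category, "1.txt"), "w") as f:
        f.write(text)

# Folder names become target_names, file contents become data
dataset = load_files(root, encoding="latin-1")
print(sorted(dataset.target_names))  # ['ham', 'spam']
print(len(dataset.data))             # 2
```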
#print(load_files.__doc__)
all_twenty_train = load_files('datasets/20news-bydate-train/',
charset='latin-1', random_state=42)
all_twenty_test = load_files('datasets/20news-bydate-test/',
charset='latin-1', random_state=42)
all_target_names = all_twenty_train.target_names
all_target_names
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
all_twenty_train.target
array([12, 6, 9, ..., 9, 1, 12])
all_twenty_train.target.shape
(11314,)
all_twenty_test.target.shape
(7532,)
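The integer values in the target array are indices into target_names. A small sketch with made-up labels shows the mapping (the label values below are invented for illustration):

```python
import numpy as np

target_names = ["alt.atheism", "comp.graphics", "sci.space"]
target = np.array([2, 0, 1, 2])

# np.take maps each integer label back to its category name
names = np.take(target_names, target)
print(list(names))  # ['sci.space', 'alt.atheism', 'comp.graphics', 'sci.space']
```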
len(all_twenty_train.data)
11314
type(all_twenty_train.data[0])
unicode
def display_sample(i, dataset):
print("Class name: " + dataset.target_names[dataset.target[i]])
print("Text content:\n")
print(dataset.data[i])
display_sample(0, all_twenty_train)
Class name: sci.electronics
Text content:

From: wtm@uhura.neoucom.edu (Bill Mayhew)
Subject: Re: How to the disks copy protected.
Organization: Northeastern Ohio Universities College of Medicine
Lines: 23

Write a good manual to go with the software. The hassle of photocopying the manual is offset by simplicity of purchasing the package for only $15. Also, consider offering an inexpensive but attractive perc for registered users. For instance, a coffee mug. You could produce and mail the incentive for a couple of dollars, so consider pricing the product at $17.95.

You're lucky if only 20% of the instances of your program in use are non-licensed users. The best approach is to estimate your loss and accomodate that into your price structure. Sure it hurts legitimate users, but too bad. Retailers have to charge off loss to shoplifters onto paying customers; the software industry is the same.

Unless your product is exceptionally unique, using an ostensibly copy-proof disk will just send your customers to the competetion.

--
Bill Mayhew  NEOUCOM Computer Services Department
Rootstown, OH 44272-9995  USA  phone: 216-325-2511
wtm@uhura.neoucom.edu (140.220.1.1)  146.580: N8WED
display_sample(1, all_twenty_train)
Class name: misc.forsale
Text content:

From: andy@SAIL.Stanford.EDU (Andy Freeman)
Subject: Re: Catalog of Hard-to-Find PC Enhancements (Repost)
Organization: Computer Science Department, Stanford University.
Lines: 33

>andy@SAIL.Stanford.EDU (Andy Freeman) writes: >> >In article <C5ELME.4z4@unix.portal.com> jdoll@shell.portal.com (Joe Doll) wr >> >> "The Catalog of Personal Computing Tools for Engineers and Scien- >> >> tists" lists hardware cards and application software packages for >> >> PC/XT/AT/PS/2 class machines. Focus is on engineering and scien- >> >> tific applications of PCs, such as data acquisition/control, >> >> design automation, and data analysis and presentation. >> > >> >> If you would like a free copy, reply with your (U. S. Postal) >> >> mailing address. >> >> Don't bother - it never comes. It's a cheap trick for building a >> mailing list to sell if my junk mail flow is any indication. >> >> -andy sent his address months ago > >Perhaps we can get Portal to nuke this weasal. I never received a >catalog either. If that person doesn't respond to a growing flame, then >we can assume that we'yall look forward to lotsa junk mail.

I don't want him nuked, I want him to be honest. The junk mail has been much more interesting than the promised catalog. If I'd known what I was going to get, I wouldn't have hesitated. I wouldn't be surprised if there were other folks who looked at the ad and said "nope" but who would be very interested in the junk mail that results. Similarly, there are people who wanted the advertised catalog who aren't happy with the junk they got instead. The folks buying the mailing lists would prefer an honest ad, and so would the people reading it.

-andy --
Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8 bit encoding (in this case, all chars can be encoded using the latin-1 charset).
def text_size(text, charset='iso-8859-1'):
return len(text.encode(charset)) * 8 * 1e-6
train_size_mb = sum(text_size(text) for text in all_twenty_train.data)
test_size_mb = sum(text_size(text) for text in all_twenty_test.data)
print("Training set size: {0} MB".format(int(train_size_mb)))
print("Testing set size: {0} MB".format(int(test_size_mb)))
Training set size: 176 MB Testing set size: 110 MB
If we only consider the small subset of 4 categories selected in the initial example:
train_subset_size_mb = sum(text_size(text) for text in twenty_train_subset.data)
test_subset_size_mb = sum(text_size(text) for text in twenty_test_subset.data)
print("Training set size: {0} MB".format(int(train_subset_size_mb)))
print("Testing set size: {0} MB".format(int(test_subset_size_mb)))
Training set size: 31 MB Testing set size: 22 MB
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVectorizer()
TfidfVectorizer(analyzer='word', binary=False, charset='utf-8', charset_error='strict', dtype=<type 'long'>, input='content', lowercase=True, max_df=1.0, max_features=None, max_n=None, min_df=2, min_n=None, ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None)
vectorizer = TfidfVectorizer(min_df=1)
%time X_train = vectorizer.fit_transform(twenty_train_subset.data)
CPU times: user 1.87 s, sys: 0.08 s, total: 1.95 s Wall time: 1.91 s
The result is not a numpy.array but a scipy.sparse matrix. This data structure is quite similar to a 2D numpy array but it does not store the zeros.
X_train
<2034x34118 sparse matrix of type '<type 'numpy.float64'>' with 323433 stored elements in Compressed Sparse Row format>
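To see why sparse storage matters here, a minimal standalone sketch (independent of the newsgroups data) comparing a dense array with its CSR counterpart:

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0., 1.5, 0.],
                  [0., 0., 2.]])
sparse = csr_matrix(dense)

# only the non-zero entries are stored explicitly
print(sparse.nnz)            # 2 stored elements
print(sparse.shape)          # same dimensions as the dense array
print(np.allclose(sparse.toarray(), dense))  # round-trips back losslessly
```

With 2034 documents and 34118 features, a dense float64 matrix would need hundreds of MB while only about 320k entries are non-zero, which is why the vectorizer returns a CSR matrix.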
scipy.sparse matrices also have a shape attribute to access the dimensions:
n_samples, n_features = X_train.shape
This dataset has around 2000 samples (the rows of the data matrix):
n_samples
2034
This is the same value as the number of strings in the original list of text documents:
len(twenty_train_subset.data)
2034
The columns represent the individual token occurrences:
n_features
34118
This number is the size of the vocabulary of the model, which is extracted during fit and stored in a Python dictionary:
type(vectorizer.vocabulary_)
dict
len(vectorizer.vocabulary_)
34118
The keys of the vocabulary_
attribute are also called feature names and can be accessed as a list of strings.
len(vectorizer.get_feature_names())
34118
Here are the first 10 elements (sorted in lexicographical order):
vectorizer.get_feature_names()[:10]
[u'00', u'000', u'0000', u'00000', u'000000', u'000005102000', u'000021', u'000062david42', u'0000vec', u'0001']
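To make the vocabulary_ mapping concrete, here is a minimal sketch on a toy corpus (the two example strings are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the cat sat", "the cat sat on the mat"]
toy_vec = TfidfVectorizer()
X_toy = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the feature matrix
print(sorted(toy_vec.vocabulary_.keys()))  # ['cat', 'mat', 'on', 'sat', 'the']
print(X_toy.shape)                         # (2, 5): 2 documents, 5 tokens
```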
Let's have a look at the features from the middle:
vectorizer.get_feature_names()[n_features // 2:n_features // 2 + 10]
[u'inadequate', u'inala', u'inalienable', u'inane', u'inanimate', u'inapplicable', u'inappropriate', u'inappropriately', u'inaudible', u'inbreeding']
In addition to the text of the documents that has been vectorized, one also has access to the label information:
y_train = twenty_train_subset.target
target_names = twenty_train_subset.target_names
target_names
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
y_train.shape
(2034,)
y_train
array([1, 2, 2, ..., 2, 1, 1])
# We can check that we have the same number of samples for the input data and the labels:
X_train.shape[0] == y_train.shape[0]
True
Now that we have extracted a vector representation of the data, it's a good idea to project the data onto the first two components of a Principal Component Analysis to get a feel for the data. Note that the RandomizedPCA
class can accept scipy.sparse
matrices as input (as an alternative to numpy arrays):
from sklearn.decomposition import RandomizedPCA
%time X_train_pca = RandomizedPCA(n_components=2).fit_transform(X_train)
CPU times: user 0.09 s, sys: 0.02 s, total: 0.10 s Wall time: 0.10 s
from itertools import cycle
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']
for i, c in zip(np.unique(y_train), cycle(colors)):
pl.scatter(X_train_pca[y_train == i, 0],
X_train_pca[y_train == i, 1],
c=c, label=target_names[i], alpha=0.5)
_ = pl.legend(loc='best')
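In more recent versions of scikit-learn a similar sparse-friendly 2D projection can be obtained with TruncatedSVD; a minimal sketch on synthetic sparse data (the sizes below are made up for illustration):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
# synthetic sparse matrix standing in for a document-term matrix
X_synth = sparse_random(100, 50, density=0.05, format='csr', random_state=rng)
X_2d = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_synth)
print(X_2d.shape)  # (100, 2): ready for a 2D scatter plot
```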
We can observe that there is a large overlap of the samples from different categories. This is to be expected as the PCA linear projection projects data from a 34118 dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.
Still we can notice an interesting pattern: the newsgroups on religion and atheism occupy much the same region, while computer graphics and space science overlap with each other more than they do with the religion or atheism newsgroups.
We can now train a classifier, for instance a Multinomial Naive Bayes classifier, which is a fast baseline for text classification tasks:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.1)
clf
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
clf.fit(X_train, y_train)
MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
We can now evaluate the classifier on the testing set. Let's first use the built-in score method, which returns the rate of correct classification:
X_test = vectorizer.transform(twenty_test_subset.data)
y_test = twenty_test_subset.target
X_test.shape
(1353, 34118)
y_test.shape
(1353,)
clf.score(X_test, y_test)
0.89652623798965259
We can also compute the score on the training set and observe that the model is both overfitting and underfitting a bit at the same time:
clf.score(X_train, y_train)
0.99262536873156337
The text vectorizer has many parameters to customize its behavior, in particular how it extracts tokens:
TfidfVectorizer()
TfidfVectorizer(analyzer='word', binary=False, charset='utf-8', charset_error='strict', dtype=<type 'long'>, input='content', lowercase=True, max_df=1.0, max_features=None, max_n=None, min_df=2, min_n=None, ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True, vocabulary=None)
print(TfidfVectorizer.__doc__)
Convert a collection of raw documents to a matrix of TF-IDF features. Equivalent to CountVectorizer followed by TfidfTransformer. Parameters ---------- input : string {'filename', 'file', 'content'} If filename, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. If 'file', the sequence items must have 'read' method (file-like object) it is called to fetch the bytes in memory. Otherwise the input is expected to be the sequence strings or bytes items are expected to be analyzed directly. charset : string, 'utf-8' by default. If bytes or files are given to analyze, this charset is used to decode. charset_error : {'strict', 'ignore', 'replace'} Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given `charset`. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'. strip_accents : {'ascii', 'unicode', None} Remove accents during the preprocessing step. 'ascii' is a fast method that only works on characters that have an direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing. analyzer : string, {'word', 'char'} or callable Whether the feature should be made of word or character n-grams. If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input. preprocessor : callable or None (default) Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. tokenizer : callable or None (default) Override the string tokenization step while preserving the preprocessing and n-grams generation steps. ngram_range : tuple (min_n, max_n) The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. 
stop_words : string {'english'}, list, or None (default) If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms. lowercase : boolean, default True Convert all characters to lowercase befor tokenizing. token_pattern : string Regular expression denoting what constitutes a "token", only used if `tokenize == 'word'`. The default regexp select tokens of 2 or more letters characters (punctuation is completely ignored and always treated as a token separator). max_df : float in range [0.0, 1.0] or int, optional, 1.0 by default When building the vocabulary ignore terms that have a term frequency strictly higher than the given threshold (corpus specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None. min_df : float in range [0.0, 1.0] or int, optional, 2 by default When building the vocabulary ignore terms that have a term frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None. max_features : optional, None by default If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None. vocabulary : Mapping or iterable, optional Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. 
If not given, a vocabulary is determined from the input documents. binary : boolean, False by default. If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts. dtype : type, optional Type of the matrix returned by fit_transform() or transform(). norm : 'l1', 'l2' or None, optional Norm used to normalize term vectors. None for no normalization. use_idf : boolean, optional Enable inverse-document-frequency reweighting. smooth_idf : boolean, optional Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions. sublinear_tf : boolean, optional Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf). See also -------- CountVectorizer Tokenize the documents and count the occurrences of token and return them as a sparse matrix TfidfTransformer Apply Term Frequency Inverse Document Frequency normalization to a sparse matrix of occurrence counts.
The easiest way to introspect what the vectorizer is actually doing for a given set of parameters is to call the vectorizer.build_analyzer()
method to get an instance of the text analyzer it uses to process the text:
analyzer = TfidfVectorizer().build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
[u'love', u'scikit', u'learn', u'this', u'is', u'cool', u'python', u'lib']
You can notice that all the tokens are lowercase, that the single letter word "I" was dropped, and that the hyphen was treated as a token separator. Let's change some of that default behavior:
analyzer = TfidfVectorizer(
preprocessor=lambda text: text, # disable lowercasing
token_pattern=ur'(?u)\b[\w-]+\b', # treat hyphen as a letter
# do not exclude single letter tokens
).build_analyzer()
analyzer("I love scikit-learn: this is a cool Python lib!")
[u'I', u'love', u'scikit-learn', u'this', u'is', u'a', u'cool', u'Python', u'lib']
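It is also possible to pass a plain Python function as the tokenizer; a minimal sketch (the whitespace tokenizer below is made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def whitespace_tokenizer(text):
    # naive tokenizer: split on whitespace only, keep punctuation attached
    return text.split()

custom_vec = TfidfVectorizer(tokenizer=whitespace_tokenizer, lowercase=False)
custom_analyzer = custom_vec.build_analyzer()
tokens = custom_analyzer("I love scikit-learn")
print(tokens)  # ['I', 'love', 'scikit-learn']
```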
The analyzer name comes from the Lucene parlance: it wraps the sequential application of the preprocessing, tokenization and n-grams generation steps.
The analyzer system of scikit-learn is much more basic than Lucene's though.
Exercise:
Hint: the TfidfVectorizer
class can accept python functions to customize the preprocessor
, tokenizer
or analyzer
stages of the vectorizer.
type TfidfVectorizer()
alone in a cell to see the default values of the parameters
type print(TfidfVectorizer.__doc__)
to print the documentation of the constructor parameters, use the ?
suffix operator on any Python class or method to read the docstring, or even the ??
operator to read the source code.
Solution:
# %load solutions/05B_strip_headers.py
The feature extraction class has many options to customize its behavior:
print(TfidfVectorizer.__doc__)
In order to evaluate the impact of the parameters of the feature extraction, one can chain a configured feature extraction step and a linear classifier (as an alternative to the naive Bayes model):
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline((
('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),
('clf', PassiveAggressiveClassifier(C=1)),
))
Such a pipeline can then be used to evaluate the performance on the test set:
pipeline.fit(twenty_train_subset.data, twenty_train_subset.target)
print("Train score:")
print(pipeline.score(twenty_train_subset.data, twenty_train_subset.target))
print("Test score:")
print(pipeline.score(twenty_test_subset.data, twenty_test_subset.target))
Train score: 1.0 Test score: 0.888396156689
Let's collect info on the fitted components of the previously trained model:
vec_name, vec = pipeline.steps[0]
clf_name, clf = pipeline.steps[1]
feature_names = vec.get_feature_names()
feature_weights = clf.coef_
feature_weights.shape
(4, 34109)
By sorting the feature weights of the linear model and asking the vectorizer for the corresponding feature names, one can get a clue about what the model actually learned from the data:
def display_important_features(feature_names, target_names, weights, n_top=30):
for i, target_name in enumerate(target_names):
print("Class: " + target_name)
print("")
sorted_features_indices = weights[i].argsort()[::-1]
most_important = sorted_features_indices[:n_top]
print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
for j in most_important))
print("...")
least_important = sorted_features_indices[-n_top:]
print(", ".join("{0}: {1:.4f}".format(feature_names[j], weights[i, j])
for j in least_important))
print("")
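The argsort trick used above can be checked on a toy weight vector:

```python
import numpy as np

w = np.array([0.1, -2.0, 3.5, 0.0])
order = w.argsort()[::-1]     # indices sorted from largest to smallest weight
print(order.tolist())         # [2, 0, 3, 1]
print(w[order[:2]].tolist())  # the two largest weights: [3.5, 0.1]
```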
display_important_features(feature_names, target_names, feature_weights)
Class: alt.atheism atheism: 2.8369, atheists: 2.7697, keith: 2.6781, cobb: 2.1986, islamic: 1.7952, okcforum: 1.6646, caltech: 1.5838, rice: 1.5769, bobby: 1.5187, peace: 1.5151, freedom: 1.4775, wingate: 1.4733, tammy: 1.4702, enviroleague: 1.4619, atheist: 1.4277, psilink: 1.3985, rushdie: 1.3846, tek: 1.3809, jaeger: 1.3783, osrhe: 1.3591, bible: 1.3543, wwc: 1.3375, mangoe: 1.3324, perry: 1.3082, religion: 1.2733, benedikt: 1.2581, liar: 1.2288, lunatic: 1.2110, free: 1.2060, charley: 1.2006 ... good: -0.8709, dm: -0.8764, 10: -0.8786, brian: -0.8900, objective: -0.8986, deal: -0.9098, thanks: -0.9174, order: -0.9174, image: -0.9258, scic: -0.9314, force: -0.9314, useful: -0.9377, com: -0.9414, weiss: -0.9428, interested: -0.9465, use: -0.9525, buffalo: -0.9580, fbi: -0.9660, 2000: -0.9810, they: -1.0051, muhammad: -1.0165, out: -1.0520, kevin: -1.0545, org: -1.0908, morality: -1.1773, mail: -1.1945, graphics: -1.5805, christian: -1.6466, hudson: -1.6503, space: -1.8655 Class: comp.graphics graphics: 4.3650, image: 2.5319, tiff: 1.9232, file: 1.8831, animation: 1.8733, 3d: 1.7270, card: 1.7127, files: 1.6637, 42: 1.6542, 3do: 1.6326, points: 1.6154, code: 1.5795, computer: 1.5767, video: 1.5549, color: 1.5069, polygon: 1.5057, windows: 1.4597, comp: 1.4421, package: 1.3865, format: 1.3183, pc: 1.2518, email: 1.2262, cview: 1.2155, hi: 1.2004, 24: 1.1909, postscript: 1.1827, virtual: 1.1706, sphere: 1.1691, looking: 1.1613, images: 1.1561 ... 
astronomy: -0.9077, are: -0.9133, who: -0.9217, bill: -0.9354, atheism: -0.9397, org: -0.9404, christian: -0.9489, funding: -0.9494, that: -0.9597, by: -0.9654, solar: -0.9708, access: -0.9722, us: -0.9907, planets: -0.9992, cmu: -1.0507, moon: -1.0730, you: -1.0802, nasa: -1.0859, dgi: -1.1009, jennise: -1.1009, writes: -1.1152, was: -1.1369, beast: -1.1597, dc: -1.2858, he: -1.3806, orbit: -1.3853, edu: -1.4121, re: -1.4396, god: -1.6422, space: -3.5582 Class: sci.space space: 5.7627, orbit: 2.3450, dc: 2.0973, nasa: 2.0815, moon: 1.9315, launch: 1.8711, sci: 1.7931, alaska: 1.7344, solar: 1.6946, henry: 1.6384, pat: 1.5734, ether: 1.5178, nick: 1.4982, planets: 1.4155, dietz: 1.3681, cmu: 1.3530, aurora: 1.3106, nicho: 1.2958, funding: 1.2768, lunar: 1.2757, astronomy: 1.2595, flight: 1.2418, rockets: 1.2048, jennise: 1.1963, dgi: 1.1963, shuttle: 1.1652, spacecraft: 1.1631, sky: 1.1593, digex: 1.1247, rochester: 1.1080 ... any: -0.8163, computer: -0.8183, gaspra: -0.8261, bible: -0.8342, video: -0.8485, religion: -0.8640, format: -0.8682, fbi: -0.8720, com: -0.8725, card: -0.8737, cc: -0.8828, code: -0.8875, 24: -0.8883, library: -0.8904, sgi: -0.9208, halat: -0.9531, 3d: -0.9607, ___: -0.9630, points: -1.0150, tiff: -1.0278, color: -1.0560, keith: -1.0664, koresh: -1.1302, file: -1.1529, files: -1.1679, image: -1.3169, christian: -1.3767, animation: -1.4241, god: -1.7873, graphics: -2.5640 Class: talk.religion.misc christian: 3.0979, hudson: 1.8959, who: 1.8842, beast: 1.8652, fbi: 1.6698, mr: 1.6386, buffalo: 1.6148, 2000: 1.5694, abortion: 1.5172, church: 1.5061, koresh: 1.4853, weiss: 1.4829, morality: 1.4750, brian: 1.4736, order: 1.4545, frank: 1.4508, biblical: 1.4123, 666: 1.3742, thyagi: 1.3520, terrorist: 1.3306, christians: 1.3202, mormons: 1.2810, amdahl: 1.2641, blood: 1.2380, freenet: 1.2299, rosicrucian: 1.2122, mitre: 1.2032, christ: 1.1982, objective: 1.1635, love: 1.1519 ... 
file: -0.9489, saturn: -0.9516, university: -0.9569, on: -0.9592, ac: -0.9685, lunatic: -0.9820, for: -0.9882, orbit: -0.9893, some: -1.0031, anyone: -1.0355, uk: -1.0703, liar: -1.0715, ibm: -1.0965, wwc: -1.1029, thanks: -1.1200, freedom: -1.1455, nasa: -1.1951, free: -1.2008, thing: -1.2337, atheist: -1.2573, princeton: -1.2966, cobb: -1.3150, keith: -1.4660, caltech: -1.4869, graphics: -1.5331, edu: -1.5969, atheism: -1.7381, it: -1.7571, atheists: -1.9418, space: -2.2211
from sklearn.metrics import classification_report
predicted = pipeline.predict(twenty_test_subset.data)
print(classification_report(twenty_test_subset.target, predicted,
target_names=target_names))
precision recall f1-score support alt.atheism 0.87 0.78 0.83 319 comp.graphics 0.93 0.96 0.95 389 sci.space 0.95 0.95 0.95 394 talk.religion.misc 0.76 0.80 0.78 251 avg / total 0.89 0.89 0.89 1353
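As a reminder of what the precision and recall columns measure, a toy sketch (the labels below are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

# precision: of the 3 samples predicted as class 1, 2 really are class 1
print(precision_score(y_true_toy, y_pred_toy))  # 2/3
# recall: of the 3 true class 1 samples, 2 were retrieved
print(recall_score(y_true_toy, y_pred_toy))     # 2/3
```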
The confusion matrix summarizes which classes get confused with one another: by looking at the off-diagonal entries we can see, for instance, that articles about atheism have been wrongly classified as being about religion 57 times:
from sklearn.metrics import confusion_matrix
confusion_matrix(twenty_test_subset.target, predicted)
array([[250, 5, 7, 57], [ 2, 375, 6, 6], [ 2, 15, 375, 2], [ 32, 9, 8, 202]])
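The row/column convention of confusion_matrix can be checked on a toy example: row i, column j counts the samples of true class i that were predicted as class j:

```python
from sklearn.metrics import confusion_matrix

y_true_toy = [0, 0, 1, 1, 2]
y_pred_toy = [0, 1, 1, 1, 2]
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm.tolist())  # [[1, 1, 0], [0, 2, 0], [0, 0, 1]]
```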