クラスタリング¶

文書の類似度・関連性¶

文書間の類似度や関連性の問題を通して、クラスタリングの手法について学ぶ。
ここでは、bug-of-wordsと呼ばれる手法を使用する。単語の出現回数のみに注目した手法で、シンプルであるがロバストに文章を判定することが可能である。bug-of-words文書だけでなく画像などの分野にも応用されており、応用範囲の広い考え方でもある。

それでは、テキストデータをbug-of-wordsに変換してみよう。sklearnに含まれる、文書をワード単位のベクトルに変換するライブラリを使う。

In [12]:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

min_dfはCountVectorizerで定義される。
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
頻繁に使われていないワードを無視するときに使用する。
設定する値が整数の場合、その数より出現回数の小さい単語は無視する。値が0から1の間の場合、データセットの数に対する割合の閾値を示し、閾値より小さい出現頻度の単語を無視する。

In [13]:

print(vectorizer)

CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
        decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
        input=content, lowercase=True, max_df=1.0, max_features=None,
        min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
        vocabulary=None)

上記は、Scikitがデフォルトで選択するパラメータを表示している。
analyzer=wordは単語レベルで出現回数が数えられていることを示している。
token_patternは正規表現で記述されていて、単語の決定方法を定義する。

In [14]:

content = ["How to format my hard disk", "Hard disk format problems "]
X = vectorizer.fit_transform(content)
vectorizer.get_feature_names()

Out[14]:

[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']

In [15]:

print X.toarray()

[[1 1 1 1 1 0 1]
 [1 1 1 0 0 1 0]]

In [16]:

print X.toarray().transpose()

[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]

Xには文章に含まれる単語を表現したベクトルである。1列目が1つめ、2列目が2つめの文章のベクトルである。

疎なベクトルの扱い¶

In [17]:

print X

  (0, 0)	1
  (1, 0)	1
  (0, 1)	1
  (1, 1)	1
  (0, 2)	1
  (1, 2)	1
  (0, 3)	1
  (0, 4)	1
  (1, 5)	1
  (0, 6)	1

上記は変数Xをそのまま表示したものである。疎なベクトルを扱うため、出現した単語のみのデータを保持するように工夫されている。1列目が文章のインデックス、2列目が単語のインデックスを示している。
例えば、(0,0)は1つめの文章に1つめの単語が含まれていることを示す。また、2つめの文章に3つめの単語が含まれていない、ことがわかる。内部実装は coo_matrix (COOrdinate)

単語を数える¶

In [18]:

import os
import sys

import scipy as sp

sys.path.append('/Users/masai/Documents/BuildingMachineLearningSystemsWithPython/ch03')

from utils import DATA_DIR

TOY_DIR = os.path.join(DATA_DIR, "toy")

posts = [open(os.path.join(TOY_DIR, f)).read() for f in os.listdir(TOY_DIR)]

In [19]:

print posts

['This is a toy post about machine learning. Actually, it contains not much interesting stuff.', 'Imaging databases provide storage capabilities.', 'Most imaging databases save images permanently.\n', 'Imaging databases store data.', 'Imaging databases store data. Imaging databases store data. Imaging databases store data.']

osはos.pathモジュールを使ってファイルの入出力を行うために利用する。
sysを利用して、import先のファイルパスを追加する。
http://kannokanno.hatenablog.com/entry/20130503/1367571825

早速、上記でロードしたファイルの単語を数えてみましょう。

In [20]:

X_train = vectorizer.fit_transform(posts)

In [21]:

num_samples, num_features = X_train.shape

In [22]:

print("#samples: %d, #features: %d" % (num_samples,num_features))

#samples: 5, #features: 25

上記は、5つの文書と25個の単語が存在することを示す。

In [23]:

print(vectorizer.get_feature_names())

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'save', u'storage', u'store', u'stuff', u'this', u'toy']

vectorizerに新しい文書を追加¶

In [24]:

new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)

  (0, 5)	1
  (0, 7)	1

NumPyの配列で表現する¶

ベクトル化したデータは、toarray()メソッドを用いるとNumPyのndarrayで表現できる。
類似度などの計算に用いるベクトル表現は、こちらである。
ベクトルがスパースであることがわかる。

In [25]:

print(new_post_vec.toarray())

[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

類似度の計算¶

SciPyを用いてユークリッド距離を計算する。procedure dist_rawを定義する。scipyのprocedure normを利用する。

In [26]:

import numpy as np
import scipy as sp

def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())

linalg はベクトル delta の長さを得る。
new_postと他の文書との距離を計算して、最も類似した文書を探す

In [27]:

import sys
best_doc = None
best_dist = sys.maxint
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    print "=== Post %i with dist=%.2f: %s" %(i, d, post)
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.73: Imaging databases provide storage capabilities.
=== Post 2 with dist=2.00: Most imaging databases save images permanently.

=== Post 3 with dist=1.41: Imaging databases store data.
=== Post 4 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=1.41

sysは、pythonの整数型でサポートされる最大の整数を得るために使用する。
best_distは、最小のユークリッド距離を格納する。
best_iは、最小のユークリッド距離を記録した文書のindexを格納する。
文書の数でループするために、range(0, num_samples)を利用する。

In [31]:

print X_train.shape

(5, 25)

In [35]:

print X_train.toarray()

[[1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 1]
 [0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0]
 [0 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0]
 [0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]

上記からわかるように、X_trainは5行25列の行列である。
各行は文書に対応し、それぞれの文書に含まれる単語のインデックスを保持する。

In [32]:

print X_train

  (0, 0)	1
  (0, 1)	1
  (1, 2)	1
  (0, 3)	1
  (3, 4)	1
  (4, 4)	3
  (1, 5)	1
  (2, 5)	1
  (3, 5)	1
  (4, 5)	3
  (2, 6)	1
  (1, 7)	1
  (2, 7)	1
  (3, 7)	1
  (4, 7)	3
  (0, 8)	1
  (0, 9)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (2, 13)	1
  (0, 14)	1
  (0, 15)	1
  (2, 16)	1
  (0, 17)	1
  (1, 18)	1
  (2, 19)	1
  (1, 20)	1
  (3, 21)	1
  (4, 21)	3
  (0, 22)	1
  (0, 23)	1
  (0, 24)	1

0行目を取り出す

In [30]:

print X_train.getrow(0)

  (0, 0)	1
  (0, 1)	1
  (0, 3)	1
  (0, 8)	1
  (0, 9)	1
  (0, 10)	1
  (0, 11)	1
  (0, 12)	1
  (0, 14)	1
  (0, 15)	1
  (0, 17)	1
  (0, 22)	1
  (0, 23)	1
  (0, 24)	1

1行目を取り出す

In [60]:

print X_train.getrow(1)

  (0, 2)	1
  (0, 5)	1
  (0, 7)	1
  (0, 18)	1
  (0, 20)	1

1行目のベクトル表現

In [61]:

print X_train.getrow(1).toarray()

[[0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0]]

単語の出現回数ベクトルの正規化¶

出現回数を正規化：単語の出現回数によって類似度が増大しないようにする。
そのためには、距離を与える関数を変更する。
a / sp.linalg.norm(a) とすることにより、ベクトルaを正規化できる。

In [38]:

def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

In [39]:

import sys
best_doc = None
best_dist = sys.maxint
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print "=== Post %i with dist=%.2f: %s" %(i, d, post)
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.92: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

重要度の低い単語を取り除く¶

文章の分類に貢献しない stop wordと呼ばれる、様々な文章によく含まれる単語を対象外にする処理がよく行われる。
テキスト処理のための CountVectorizer には、stop word を登録する機能がある。

In [40]:

vectorizer = CountVectorizer(min_df=1, stop_words='english')

In [43]:

print vectorizer.get_stop_words()

frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'four', 'not', 'own', 'through', 'yourselves', 'fify', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'go', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])

english と指定するだけで、318個のstop wordsが登録される。

In [44]:

X_train = vectorizer.fit_transform(posts)

In [45]:

print X_train.shape

(5, 18)

単語の数が 27 から 18 個になる。つまり、先ほどまでは無駄があった、というわけだ。
改めて類似度を計算すると以下のようになる。

次にquery用のベクトルを生成する。vectorizerを使って、新しい文書new_postのみのベクトルを生成する。

In [47]:

new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)

  (0, 4)	1
  (0, 6)	1

クエリとデータベースを比較して文書間の類似度を計算する。

In [48]:

import sys
best_doc = None
best_dist = sys.maxint
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print "=== Post %i with dist=%.2f: %s" %(i, d, post)
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77

ステミング¶

意味的には同じ単語の語形変化をロバストに同一とみなす。
Natural Language Toolkit (NLTK) を利用する。

In [52]:

import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

In [58]:

X_train = vectorizer.fit_transform(posts)
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)

  (0, 4)	1
  (0, 5)	1

In [59]:

import sys
best_doc = None
best_dist = sys.maxint
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print "=== Post %i with dist=%.2f: %s" %(i, d, post)
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.63: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.63

stemmingを行うと、imagesとimagingが同一に扱われるため、Post2との距離が縮まった。

あの有名なTF-IDF(term frequency - inverse document frequency)を用いる¶

ある単語に対して、対象の文書中に出現した回数をカウントするのに加えて、その単語が他の文書でどれくらい出現するかをカウントして、その回数で割る。

= 特定の文書だけで現れやすい単語のレスポンスを高くする

In [66]:

import scipy as sp
import math
def tfidf(term, doc, docset):
    tf = float(doc.count(term))/sum(doc.count(w) for w in set(doc))
    idf = math.log(float(len(docset))/(len([doc for doc in docset if term in doc])))
    return tf * idf

docは対象の文章。単語を複数含む。
docsetは他の文章も含めた文章の集合。
以下に例を示す。

In [62]:

doc_a, doc_abb, doc_abc = ["a"],["a", "b", "b"],["a","b","c"]

In [63]:

D=[doc_a, doc_abb, doc_abc]

In [69]:

print(tfidf("a", doc_a, D))

0.0

dic_aの単語"a"についてtfidfは0。つまりその他の文書にも含まれているため、分類に役に立つ情報ではない。

In [70]:

print(tfidf("b", doc_abb, D))

0.270310072072

In [72]:

print(tfidf("c", doc_a, D))

0.0

In [73]:

print(tfidf("c", doc_abc, D))

0.366204096223

cはdoc_abcにとって、大変重要であることがわかる。

In [75]:

print(tfidf("b", doc_abb, D))

0.270310072072

In [77]:

print(tfidf(("b"), doc_abc, D))

0.135155036036

bはdoc_abbにとって、doc_abcの2倍重要であることがわかる。bはdoc_abbにdoc_abcの2倍含まれいる。

SciPyには、上記よりも複雑なことを考慮したTF-IDFの実装TfidfVectorizerが含まれている。

In [86]:

from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: (
            english_stemmer.stem(w) for w in analyzer(doc))
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')

このvectorizerを使って、類似度を算出すると以下のようになる。

In [87]:

X_train = vectorizer.fit_transform(posts)

In [91]:

X_train.shape

Out[91]:

(5, 17)

In [94]:

print X_train

  (0, 16)	0.353553390593
  (0, 15)	0.353553390593
  (0, 10)	0.353553390593
  (0, 8)	0.353553390593
  (0, 7)	0.353553390593
  (0, 6)	0.353553390593
  (0, 2)	0.353553390593
  (0, 0)	0.353553390593
  (1, 13)	0.524517221085
  (1, 11)	0.524517221085
  (1, 5)	0.295503853068
  (1, 4)	0.295503853068
  (1, 1)	0.524517221085
  (2, 12)	0.528000507961
  (2, 9)	0.528000507961
  (2, 5)	0.594932552269
  (2, 4)	0.297466276134
  (3, 14)	0.579747589583
  (3, 5)	0.4048366737
  (3, 4)	0.4048366737
  (3, 3)	0.579747589583
  (4, 14)	0.579747589583
  (4, 5)	0.4048366737
  (4, 4)	0.4048366737
  (4, 3)	0.579747589583

[文書index, 単語index]で表される各要素には、出現回数ではなくTF-IDF が格納されている。

In [92]:

new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)

  (0, 5)	0.707106781187
  (0, 4)	0.707106781187

In [93]:

import sys
best_doc = None
best_dist = sys.maxint
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print "=== Post %i with dist=%.2f: %s" %(i, d, post)
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f"%(best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.08: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.

=== Post 3 with dist=0.92: Imaging databases store data.
=== Post 4 with dist=0.92: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.86

bug of wordsの欠点¶

単語の関連性、コンテキストを考慮していない
否定的な意味を正しく捉えられない：ユニグラムではなく、バイグラムやトリグラムを1つのまとまりとしてカウントすると良い
タイプミスに対応できない

クラスタリング¶

クラスタリングには文書の内容を適切に表現した特徴ベクトルが必要であり、ここではbug of words表現を用いる。

クラスタリングの種類¶

flat clustering：クラスタの数を指定、類似性の高いデータがグルーピングされる
hierarchical clustering：クラスタの数を指定しないほうが計算効率が良い、クラスタ同士の類似関係によって階層構造を作成する

SciKItでは数多くのクラスタリングの手法を提供している。

KMeans¶

ここで用いる最もひろく使われているであろうクラスタリング手法。初期値の取り方に依存して失敗する場合があることを考慮したアルゴリズムもある。

クラスタリングの実施¶

KMeansライブラリをimportしたら、クラスタ中心の初期値を決定するためのrandam seedを生成する。
クラスタの数を3つとし、clusteringを可視化するためのplot_clusteringを作成する。

In [24]:

import os
import scipy as sp
from scipy.stats import norm
from matplotlib import pylab
from sklearn.cluster import KMeans
seed = 2
sp.random.seed(seed)  # to reproduce the data later on

num_clusters = 3
def plot_clustering(x, y, title, mx=None, ymax=None, xmin=None, km=None):
    pylab.figure(num=None, figsize=(8, 6))
    if km:
        pylab.scatter(x, y, s=50, c=km.predict(list(zip(x, y))))
    else:
        pylab.scatter(x, y, s=50)

    pylab.title(title)
    pylab.xlabel("Occurrence word 1")
    pylab.ylabel("Occurrence word 2")

    pylab.autoscale(tight=True)
    pylab.ylim(ymin=0, ymax=1)
    pylab.xlim(xmin=0, xmax=1)
    pylab.grid(True, linestyle='-', color='0.75')

    return pylab

正規分布に従った乱数を生成する。

In [2]:

xw1 = norm(loc=0.3, scale=.15).rvs(20)
yw1 = norm(loc=0.3, scale=.15).rvs(20)

xw2 = norm(loc=0.7, scale=.15).rvs(20)
yw2 = norm(loc=0.7, scale=.15).rvs(20)

xw3 = norm(loc=0.2, scale=.15).rvs(20)
yw3 = norm(loc=0.8, scale=.15).rvs(20)

In [3]:

print xw1

[ 0.23748632  0.29155998 -0.02042941  0.54604062  0.03098466  0.1737379
  0.37543221  0.11320679  0.14130717  0.16364886  0.38271811  0.6438312
  0.30623091  0.13231118  0.38085875  0.21057605  0.29713043  0.47625018
  0.18781936  0.30135379]

In [4]:

print yw1

[ 0.16828382  0.27653487  0.33848557  0.15168314  0.24917671  0.2645724
  0.20435175  0.12185816  0.08681742  0.27697572  0.25964146  0.63470502
 -0.06521514  0.31690898  0.35556668  0.50394508  0.37527858  0.17336794
  0.30000146  0.38135289]

2つの単語だけからなる文書の特徴ベクトルを、3つの正規分布モデルに従った乱数を使って擬似的に生成した。

In [5]:

x = sp.append(sp.append(xw1, xw2), xw3)
y = sp.append(sp.append(yw1, yw2), yw3)

ベクトルを結合する

In [6]:

print x

[ 0.23748632  0.29155998 -0.02042941  0.54604062  0.03098466  0.1737379
  0.37543221  0.11320679  0.14130717  0.16364886  0.38271811  0.6438312
  0.30623091  0.13231118  0.38085875  0.21057605  0.29713043  0.47625018
  0.18781936  0.30135379  0.65297377  0.81565176  0.4197864   0.9596777
  0.9201517   0.6496484   0.79170112  0.70719559  0.57562971  0.71315653
  0.85005488  0.64283612  0.64364959  0.68882939  0.76502445  0.89175688
  0.6047981   0.77625944  0.7324174   0.42120814  0.17157961  0.1884172
  0.32370545  0.38723194  0.13941616 -0.0076778   0.40508531  0.38268284
  0.1306992   0.25263327  0.25727994  0.28494132  0.2306312   0.41100444
 -0.06069393  0.35612359  0.2570708   0.16742971  0.37602972 -0.15154048]

In [7]:

print y

[ 0.16828382  0.27653487  0.33848557  0.15168314  0.24917671  0.2645724
  0.20435175  0.12185816  0.08681742  0.27697572  0.25964146  0.63470502
 -0.06521514  0.31690898  0.35556668  0.50394508  0.37527858  0.17336794
  0.30000146  0.38135289  0.63710253  0.68015067  0.69406446  0.74890052
  0.39395154  0.70693833  0.59834866  0.48408415  0.77864446  0.81029194
  0.60201246  0.82636844  0.64277253  0.70997335  0.53518916  0.93767306
  0.30108258  0.68628211  0.80426794  0.39498002  0.97422822  0.85791171
  0.63003001  0.86496388  0.75438703  1.18779423  1.07529991  0.86610348
  0.69211192  0.71248781  0.75124256  0.71596482  0.66466309  0.71135416
  0.75857308  0.72246742  0.69521151  0.66066621  1.18256574  0.57902401]

IPython Notebookで描画出来るように pyplot をimport し、pylab inluneを記述。

In [17]:

import matplotlib.pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib

In [18]:

i = 1
plot_clustering(x, y, "Vectors")

Out[18]:

<module 'matplotlib.pylab' from '/Users/masai/.pyenv/versions/anaconda-2.0.1/lib/python2.7/site-packages/matplotlib/pylab.pyc'>

In [30]:

i += 1

# 1 iteration ####################

mx, my = sp.meshgrid(sp.arange(0, 1, 0.001), sp.arange(0, 1, 0.001))

km = KMeans(init='random', n_clusters=num_clusters, verbose=1,
            n_init=1, max_iter=1,
            random_state=seed)
km.fit(sp.array(list(zip(x, y))))
plot_clustering(x, y, "Clustering iteration 1", km=km)
c1a, c1b, c1c = km.cluster_centers_
pylab.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
              marker='x', linewidth=2, s=100, color='black')

Initialization complete
Iteration  0, inertia 4.749

Out[30]:

<matplotlib.collections.PathCollection at 0x10c799e50>

In [33]:

i += 1

# 2 iterations ####################
km = KMeans(init='random', n_clusters=num_clusters, verbose=1,
            n_init=1, max_iter=2,
            random_state=seed)
km.fit(sp.array(list(zip(x, y))))

Z = km.predict(sp.c_[mx.ravel(), my.ravel()]).reshape(mx.shape)

plot_clustering(x, y, "Clustering iteration 2", km=km)
c2a, c2b, c2c = km.cluster_centers_
pylab.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
              marker='x', linewidth=2, s=100, color='black')
pylab.gca().add_patch(
    pylab.Arrow(c1a[0], c1a[1], c2a[0] - c1a[0], c2a[1] - c1a[1], width=0.1))
pylab.gca().add_patch(
    pylab.Arrow(c1b[0], c1b[1], c2b[0] - c1b[0], c2b[1] - c1b[1], width=0.1))
pylab.gca().add_patch(
    pylab.Arrow(c1c[0], c1c[1], c2c[0] - c1c[0], c2c[1] - c1c[1], width=0.1))

Initialization complete
Iteration  0, inertia 4.749
Iteration  1, inertia 3.379

Out[33]:

<matplotlib.patches.Arrow at 0x1141f6650>

In [35]:

i += 1

# 3 iterations ####################
km = KMeans(init='random', n_clusters=num_clusters, verbose=1,
            n_init=1, max_iter=10,
            random_state=seed)
km.fit(sp.array(list(zip(x, y))))

Z = km.predict(sp.c_[mx.ravel(), my.ravel()]).reshape(mx.shape)

plot_clustering(x, y, "Clustering iteration 10", km=km)
pylab.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
              marker='x', linewidth=2, s=100, color='black')

Initialization complete
Iteration  0, inertia 4.749
Iteration  1, inertia 3.379
Iteration  2, inertia 2.600
Iteration  3, inertia 2.497
Iteration  4, inertia 2.447
Converged at iteration 4

Out[35]:

<matplotlib.collections.PathCollection at 0x10274e310>

validationデータを用いた評価¶

validationデータ 20Newsgroups DataSet¶

http://qwone.com/~jason/20Newsgroups/ からダウンロードしても良いが、scikitのfetch_20newsgroups()で代わりにデータをロードすることができる。

必要なデータとscipyをimport

In [111]:

import sklearn.datasets
import scipy as sp

クエリ用の文書を定義

In [113]:

new_post = \
    """Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.
"""

20newsgroupsデータ・セットをロード

In [114]:

all_data = sklearn.datasets.fetch_20newsgroups(subset="all")
print("Number of total posts: %i" % len(all_data.filenames))

Number of total posts: 18846

categotyを指定してtrainデータを取得

In [115]:

groups = [
    'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware', 'comp.windows.x', 'sci.space']
train_data = sklearn.datasets.fetch_20newsgroups(subset="train",
                                                 categories=groups)
print("Number of training posts in tech groups:", len(train_data.filenames))

('Number of training posts in tech groups:', 3529)

In [116]:

labels = train_data.target
num_clusters=50

In [48]:

import nltk.stem

In [49]:

english_stemmer = nltk.stem.SnowballStemmer('english')

vectorizerを定義して、train_dataから特徴量をつくる

In [52]:

from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):

    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedTfidfVectorizer(min_df=10, max_df=0.5,
                                    stop_words='english', decode_error='ignore'
                                    )
vectorized = vectorizer.fit_transform(train_data.data)

In [58]:

num_samples, num_features = vectorized.shape
print("#samples: %d, #features: %d" % (num_samples, num_features))

#samples: 3529, #features: 4712

KMeansでクラスタリング

In [59]:

from sklearn.cluster import KMeans

In [60]:

km = KMeans(n_clusters=num_clusters, n_init=1, verbose=1, random_state=3)
clustered = km.fit(vectorized)
print("km.labels=%s" % km.labels_)

Initialization complete
Iteration  0, inertia 5686.053
Iteration  1, inertia 3164.888
Iteration  2, inertia 3132.208
Iteration  3, inertia 3111.713
Iteration  4, inertia 3098.584
Iteration  5, inertia 3092.191
Iteration  6, inertia 3087.277
Iteration  7, inertia 3084.100
Iteration  8, inertia 3082.800
Iteration  9, inertia 3082.234
Iteration 10, inertia 3081.949
Iteration 11, inertia 3081.843
Iteration 12, inertia 3081.791
Iteration 13, inertia 3081.752
Iteration 14, inertia 3081.660
Iteration 15, inertia 3081.617
Iteration 16, inertia 3081.589
Iteration 17, inertia 3081.571
Converged at iteration 17
km.labels=[48 23 31 ...,  6  2 22]

In [62]:

print("km.labels_.shape=%s" % km.labels_.shape)

km.labels_.shape=3529

クラスタリングの評価を行うことができるパッケージ metrics をインポートする。
クラスタリング手法を変更した場合、以下の数値を確認すると良い。

In [70]:

from sklearn import metrics
# Homogeneity は，比較対象となるクラスタリング結果におけるあるクラスタが，
# もう片方の正解となるクラスタリング結果におけるあるクラスタに属する要素のみを含んでいるかどうかの指標
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
# Completeness はあるクラスタリング結果におけるあるクラスタの要素全てが，比較対象となるクラスタリング結果
# において同じクラスタに属しているかどうか(同じ語義をもつ要素が1つのクラスタにどれだけまとめられているか)
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
# HomogeneityとCompletenessは1 つのクラスタに属すべき要素群がどの程度同じクラスタに分類されやすいかを検証しやすい

# V-measureはHomogeneityとCompletnessの調和平均
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))

Homogeneity: 0.445
Completeness: 0.231
V-measure: 0.304

他の評価指標も、metricsなら簡単に出力が可能。

In [74]:

# Adjusted Rand Index
# 調整Rand指標。2つのクラスタ間で分け方の類似度を示す指標。相関がない場合に0となるようにAdjustしたもの。
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels, km.labels_))

# Adjusted Mutual Information
# Mutual Informationは相互情報量とよばれる。文章にカテゴリが与えられている場合、単語とカテゴリの相関性の強さを数値化してものである。
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels, km.labels_))

# Silihouette Coefficient(シルエット係数)
# あるクラスタについて、クラスタに含まれる要素の凝集性と他のクラスタに含まれる要素の分離性のバランスを数値化したもの
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(vectorized, labels, sample_size=1000))

Adjusted Rand Index: 0.094
Adjusted Mutual Information: 0.223
Silhouette Coefficient: 0.006

クエリ

In [76]:

new_post

Out[76]:

"Disk drive problems. Hi, I have a problem with my hard disk.\nAfter 1 year it is working only sporadically now.\nI tried to format it, but now it doesn't boot any more.\nAny ideas? Thanks.\n"

In [80]:

# クエリのベクトル化
new_post_vec = vectorizer.transform([new_post])

# クラスタの推測
new_post_label = km.predict(new_post_vec)[0]

In [81]:

# 同じクラスタに所属する文書のインデックスを取得
similar_indices = (km.labels_ == new_post_label).nonzero()[0]

nonzeroはゼロでないインデックスを返す

In [78]:

print(similar_indices)

[  69  152  157  167  201  225  228  233  359  463  479  520  552  580  622
  676  779  882  884  917  939 1114 1253 1286 1486 1531 1752 1806 1809 1986
 2061 2249 2351 2412 2447 2493 2499 2510 2512 2600 2730 2800 2889 3080 3111
 3145 3146 3199 3202 3278 3285 3297 3310 3350 3437 3458]

nonzeroは以下のように、boolにも使用できる。Trueの場合のみ抽出することができる。

In [91]:

np.array([True, False, True, False, False]).nonzero()

Out[91]:

(array([0, 2]),)

In [94]:

np.array([[True, False, True], [True, True, False]]).nonzero()

Out[94]:

(array([0, 0, 1, 1]), array([0, 2, 0, 1]))

距離計算

In [98]:

sp.linalg.norm((new_post_vec - vectorized[0]).toarray())

Out[98]:

1.4142135623730951

In [100]:

similar = []
similar_index = []
for i in similar_indices:
    dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
    similar.append((dist, train_data.data[i]))
    similar_index.append((dist, i))

In [101]:

similar_index

Out[101]:

[(1.2613192374763227, 69),
 (1.3665586630895377, 152),
 (1.3230483856740909, 157),
 (1.160451955581586, 167),
 (1.1867577549899049, 201),
 (1.3601802039416646, 225),
 (1.3074714760135029, 228),
 (1.3475446115881746, 233),
 (1.1590572535586865, 359),
 (1.1503043264096682, 463),
 (1.3686339201024835, 479),
 (1.3395099562315893, 520),
 (1.3405036765703295, 552),
 (1.2554372724367662, 580),
 (1.3092623304321134, 622),
 (1.1539892053985115, 676),
 (1.3600124807383815, 779),
 (1.3072768986506598, 882),
 (1.3287675098520277, 884),
 (1.2632055766509744, 917),
 (1.1703009657776047, 939),
 (1.1063375279728889, 1114),
 (1.3306004461393541, 1253),
 (1.2005787014858564, 1286),
 (1.2666418605121041, 1486),
 (1.3162209344604823, 1531),
 (1.0378441731334072, 1752),
 (1.2793959084781283, 1806),
 (1.2635553685752812, 1809),
 (1.1716497460538475, 1986),
 (1.3383954097713806, 2061),
 (1.3440642926620332, 2249),
 (1.3328753056525837, 2351),
 (1.3233240874750281, 2412),
 (1.1657785603329736, 2447),
 (1.3457915508992047, 2493),
 (1.2634799816995834, 2499),
 (1.2571713698475409, 2510),
 (1.1213311970667454, 2512),
 (1.3077376701447476, 2600),
 (1.2854652478322826, 2730),
 (1.156361562019395, 2800),
 (1.2368000749072581, 2889),
 (1.2582262180197388, 3080),
 (1.2316979371219503, 3111),
 (1.0494693076510362, 3145),
 (1.3118266609870635, 3146),
 (1.3464766154690284, 3199),
 (1.3074012413334692, 3202),
 (1.3468022970298807, 3278),
 (1.1760804023305398, 3285),
 (1.3654831746826823, 3297),
 (1.3470251456011364, 3310),
 (1.2542702591505619, 3350),
 (1.1080887634209331, 3437),
 (1.1602070210357869, 3458)]

In [102]:

len(similar)

Out[102]:

ソートも簡単に行うことができる

In [103]:

similar = sorted(similar)

In [104]:

print("Count similar: %i" % len(similar))

Count similar: 56

In [106]:

# 最も類似
show_at_1 = similar[0]
# 中間
show_at_2 = similar[int(len(similar) / 10)]
# 最も類似していない
show_at_3 = similar[int(len(similar) / 2)]

In [109]:

print("== #1 most similar ==")
print(show_at_1)
print("== #2 ==")
print(show_at_2)
print("== #3 most different ==")
print(show_at_3)

== #1 most similar ==
(1.0378441731334072, u"From: Thomas Dachsel <GERTHD@mvs.sas.com>\nSubject: BOOT PROBLEM with IDE controller\nNntp-Posting-Host: sdcmvs.mvs.sas.com\nOrganization: SAS Institute Inc.\nLines: 25\n\nHi,\nI've got a Multi I/O card (IDE controller + serial/parallel\ninterface) and two floppy drives (5 1/4, 3 1/2) and a\nQuantum ProDrive 80AT connected to it.\nI was able to format the hard disk, but I could not boot from\nit. I can boot from drive A: (which disk drive does not matter)\nbut if I remove the disk from drive A and press the reset switch,\nthe LED of drive A: continues to glow, and the hard disk is\nnot accessed at all.\nI guess this must be a problem of either the Multi I/o card\nor floppy disk drive settings (jumper configuration?)\nDoes someone have any hint what could be the reason for it.\nPlease reply by email to GERTHD@MVS.SAS.COM\nThanks,\nThomas\n+-------------------------------------------------------------------+\n| Thomas Dachsel                                                    |\n| Internet: GERTHD@MVS.SAS.COM                                      |\n| Fidonet:  Thomas_Dachsel@camel.fido.de (2:247/40)                 |\n| Subnet:   dachsel@rnivh.rni.sub.org (UUCP in Germany, now active) |\n| Phone:    +49 6221 4150 (work), +49 6203 12274 (home)             |\n| Fax:      +49 6221 415101                                         |\n| Snail:    SAS Institute GmbH, P.O.Box 105307, D-W-6900 Heidelberg |\n| Tagline:  One bad sector can ruin a whole day...                  |\n+-------------------------------------------------------------------+\n")
== #2 ==
(1.1503043264096682, u'From: rpao@mts.mivj.ca.us (Roger C. Pao)\nSubject: Re: Booting from B drive\nOrganization: MicroTech Software\nLines: 34\n\nglang@slee01.srl.ford.com (Gordon Lang) writes:\n\n>David Weisberger (djweisbe@unix.amherst.edu) wrote:\n>: I have a 5 1/4" drive as drive A.  How can I make the system boot from\n>: my 3 1/2" B drive?  (Optimally, the computer would be able to boot\n>: from either A or B, checking them in order for a bootable disk.  But\n>: if I have to switch cables around and simply switch the drives so that\n>: it can\'t boot 5 1/4" disks, that\'s OK.  Also, boot_b won\'t do the trick\n>: for me.)\n>: \n>: Thanks,\n>:   Davebo\n>We had the same issue plague us for months on our Gateway.  I finally\n>got tired of it so I permanently interchanged the drives.  The only\n>reason I didn\'t do it in the first place was because I had several\n>bootable 5-1/4\'s and some 5-1/4 based install disks which expected\n>the A drive.  I order all new software (and upgrades) to be 3-1/2 and\n>the number of "stupid" install programs that can\'t handle an alternate\n>drive are declining with time - the ones I had are now upgraded.  And\n>as for the bootable 5-1/4\'s I just cut 3-1/2 replacements.\n\n>If switching the drives is not an option, you might be able to wire up\n>a drive switch to your computer chasis.  I haven\'t tried it but I think\n>it would work as long as it is wired carefully.\n\nI did this.  I use a relay (Radio Shack 4PDT) instead of a huge\nswitch.  This way, if the relay breaks, my drives will still work.\n\nIt works fine, but you may still need to change the CMOS before the\ndrive switch will work correctly for some programs.\n\nrp93\n-- \nRoger C. Pao  {gordius,bagdad}!mts!rpao, rpao@mts.mivj.ca.us\n')
== #3 most different ==
(1.2793959084781283, u'From: vg@volkmar.Stollmann.DE (Volkmar Grote)\nSubject: IBM PS/1 vs TEAC FD\nDistribution: world\nOrganization: Me? Organized?\nLines: 21\n\nHello,\n\nI already tried our national news group without success.\n\nI tried to replace a friend\'s original IBM floppy disk in his PS/1-PC\nwith a normal TEAC drive.\nI already identified the power supply on pins 3 (5V) and 6 (12V), shorted\npin 6 (5.25"/3.5" switch) and inserted pullup resistors (2K2) on pins\n8, 26, 28, 30, and 34.\nThe computer doesn\'t complain about a missing FD, but the FD\'s light\nstays on all the time. The drive spins up o.k. when I insert a disk,\nbut I can\'t access it.\nThe TEAC works fine in a normal PC.\n\nAre there any points I missed?\n\nThank you.\n\tVolkmar\n\n---\nVolkmar.Grote@Stollmann.DE\n')

ノイズに対する別の見方¶

クラスタリングを完璧にするよりも、ノイズの扱い方に注意せよ¶

zip関数はリストのタプルをタプルのリストにする
複数のシーケンスに対するループがシンプルに書ける

In [122]:

post_group = zip(train_data.data, train_data.target)

In [133]:

all = [(len(post[0]), post[0], train_data.target_names[post[1]])
       for post in post_group]
graphics = sorted([post for post in all if post[2] == 'comp.graphics'])
print(graphics[5:7])

[(245, u'From: SITUNAYA@IBM3090.BHAM.AC.UK\nSubject: test....(sorry)\nOrganization: The University of Birmingham, United Kingdom\nLines: 1\nNNTP-Posting-Host: ibm3090.bham.ac.uk\n\n==============================================================================\n', 'comp.graphics'), (250, u"From: pallis@server.uwindsor.ca (PALLIS  DIMITRIOS        )\nSubject: Re: Genoa Blitz 24 hits 1600x1200x256 NI !\nLines: 3\n\ni am sorry, but this genoa card does nothing that the ATI ultra plus 2mb\ncan't do, PLUS the ATI costs 330$US street price ....\n\n", 'comp.graphics')]

上記の2つの文書が、なぜcomp.graphicsに属するのか、分類の決めては見つからないので、単語化を行う。

In [141]:

noise_post = graphics[5][1]
print(noise_post)
analyzer = vectorizer.build_analyzer()
print(list(analyzer(noise_post)))

From: SITUNAYA@IBM3090.BHAM.AC.UK
Subject: test....(sorry)
Organization: The University of Birmingham, United Kingdom
Lines: 1
NNTP-Posting-Host: ibm3090.bham.ac.uk

==============================================================================

[u'situnaya', u'ibm3090', u'bham', u'ac', u'uk', u'subject', u'test', u'sorri', u'organ', u'univers', u'birmingham', u'unit', u'kingdom', u'line', u'nntp', u'post', u'host', u'ibm3090', u'bham', u'ac', u'uk']

In [142]:

noise_post_2 = graphics[6][1]
print(noise_post_2)
analyzer = vectorizer.build_analyzer()
print(list(analyzer(noise_post_2)))

From: pallis@server.uwindsor.ca (PALLIS  DIMITRIOS        )
Subject: Re: Genoa Blitz 24 hits 1600x1200x256 NI !
Lines: 3

i am sorry, but this genoa card does nothing that the ATI ultra plus 2mb
can't do, PLUS the ATI costs 330$US street price ....


[u'palli', u'server', u'uwindsor', u'ca', u'palli', u'dimitrio', u'subject', u'genoa', u'blitz', u'24', u'hit', u'1600x1200x256', u'ni', u'line', u'sorri', u'genoa', u'card', u'doe', u'ati', u'ultra', u'plus', u'2mb', u'plus', u'ati', u'cost', u'330', u'street', u'price']

さらにトークン化、小文字に変換、ストップワードを取り除いた結果を見てみる

In [145]:

useful = set(analyzer(noise_post)).intersection(vectorizer.get_feature_names())
print(sorted(useful))

[u'ac', u'birmingham', u'host', u'kingdom', u'nntp', u'sorri', u'test', u'uk', u'unit', u'univers']

In [146]:

useful_2 = set(analyzer(noise_post_2)).intersection(vectorizer.get_feature_names())
print(sorted(useful))

[u'ac', u'birmingham', u'host', u'kingdom', u'nntp', u'sorri', u'test', u'uk', u'unit', u'univers']

さらに、他の文書に多く現れる単語かどうか（分類に寄与するかどうか）を見るため、TF-IDFも見る必要がある。

In [147]:

for term in sorted(useful):
    print('IDF(%s)=%.2f' % (term,
                            vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))

IDF(ac)=3.51
IDF(birmingham)=6.77
IDF(host)=1.74
IDF(kingdom)=6.68
IDF(nntp)=1.77
IDF(sorri)=4.14
IDF(test)=3.83
IDF(uk)=3.70
IDF(unit)=4.42
IDF(univers)=1.91

上記例だと、birminghamとkingdomの2つの単語が、分類に大きく寄与していることがわかる。

まとめ¶

ここまで、文書の類似度やクラスタリングを学ぶため、以下の様な手法を確認した。

文書の特徴ベクトルの生成
特徴ベクトルの距離
クラスタリングの手法
metricsによるクラスタリングの評価
文書のノイズ

改善の余地がある項目も多い。

特徴ベクトルの数
特徴ベクトルの距離を変更する（cos類似度やピアソン相関係数、Jaccard係数など）
クラスタリング手法の変更(K-Means以外にも多くのクラスタリング手法が存在する)