This is the IPython notebook accompanying the article http://www.mathiasbernhard.ch/genealogy-part-ii/

To load the wiki entries from a pickle dump and start directly with the NLP/ML part, skip ahead to the pickle-loading cell below.
import pandas as pd
from bs4 import BeautifulSoup
import urllib
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
path = r'data/genealogy.csv'
df = pd.read_csv(path)
df.head()
| | Prefix | Name | Vorname | weitere Vornamen | Beruf | geboren | gestorben | Alter | URL |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Pythagoras | von Samos | NaN | Philosoph/Mathematik | -570 | -495 | 75 | http://en.wikipedia.org/wiki/Pythagoras |
| 1 | NaN | Platon | NaN | NaN | Philosoph/Mathematik | -427 | -347 | 80 | http://en.wikipedia.org/wiki/Plato |
| 2 | NaN | Aristoteles | NaN | NaN | Philosoph/Logik | -384 | -322 | 62 | http://en.wikipedia.org/wiki/Aristotle |
| 3 | NaN | Epicurus | NaN | NaN | Philosoph | -341 | -270 | 71 | http://en.wikipedia.org/wiki/Epicurus |
| 4 | NaN | Euklid | von Alexandria | NaN | Mathematik | -325 | -265 | 60 | http://en.wikipedia.org/wiki/Euclid |
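As a quick sanity check on the table semantics (an assumption, since the CSV columns are not documented): years BC appear to be stored as negative numbers, and Alter (age) should be the difference between gestorben (died) and geboren (born). In pandas this would be `(df.gestorben - df.geboren).equals(df.Alter)`; a standalone sketch with the first rows:

```python
# Hypothetical sanity check: Alter == gestorben - geboren,
# with BC years stored as negative numbers.
rows = [("Pythagoras", -570, -495, 75),
        ("Platon", -427, -347, 80),
        ("Aristoteles", -384, -322, 62)]
for name, geboren, gestorben, alter in rows:
    assert gestorben - geboren == alter, name
```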
from urllib.request import urlopen

def extract_article(url):
    # fetch the page and strip the markup, keeping only the article text
    site = urlopen(url)
    soup = BeautifulSoup(site, "html.parser")
    article = soup.find("div", "mw-body-content").get_text()
    return article
pythagoras = extract_article(df.URL[0])
print(pythagoras[760:1400])
Pythagoras of Samos (US /pɪˈθæɡərəs/;[1] UK /paɪˈθæɡərəs/;[2] Greek: Πυθαγόρας ὁ Σάμιος Pythagóras ho Sámios "Pythagoras the Samian", or simply Πυθαγόρας; Πυθαγόρης in Ionian Greek; c. 570 – c. 495 BC)[3][4] was an Ionian Greek philosopher, mathematician, and founder of the religious movement called Pythagoreanism. Most of the information about Pythagoras was written down centuries after he lived, so very little reliable information is known about him. He was born on the island of Samos, and might have travelled widely in his youth, visiting Egypt and other places seeking knowledge. Around 530 BC, he moved to Croton, in Magna Graec
names = [url.split("/")[-1] for url in df.URL]
names[:25]
['Pythagoras', 'Plato', 'Aristotle', 'Epicurus', 'Euclid', 'Archimedes', 'Lucretius', 'Vitruvius', 'Fibonacci', 'Dante_Alighieri', 'Filippo_Brunelleschi', 'Johannes_Gutenberg', 'Leon_Battista_Alberti', 'Donato_Bramante', 'Leonardo_da_Vinci', 'Albrecht_D%C3%BCrer', 'Sebastiano_Serlio', 'Michelangelo', 'Parmigianino', 'Giacomo_Barozzi_da_Vignola', 'Andrea_Palladio', 'Philibert_de_l%27Orme', 'Rafael_Bombelli', 'John_Napier', 'Francis_Bacon']
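Some of these URL slugs are still percent-encoded (e.g. 'Albrecht_D%C3%BCrer'). If readable labels are wanted for the plots later on, `urllib.parse.unquote` can decode them; a small sketch:

```python
from urllib.parse import unquote

# Percent-encoded URL slugs can be decoded into readable labels.
print(unquote('Albrecht_D%C3%BCrer'))    # Albrecht_Dürer
print(unquote('Philibert_de_l%27Orme'))  # Philibert_de_l'Orme
```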
wiki_entries = []
for url in df.URL:
wiki_entries.append(extract_article(url))
len(wiki_entries)
262
import pickle
with open('data/wiki_entries.pkl', 'wb') as f:
    pickle.dump(wiki_entries, f)
The file loaded in the next step can be downloaded from here: http://www.mathiasbernhard.ch/notebooks/wiki_entries.pkl.zip
with open('data/wiki_entries.pkl', 'rb') as f:
    pkl_entries = pickle.load(f)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(pkl_entries)
X_train_counts.shape
(262, 68851)
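What fit_transform does here can be seen on a two-document toy corpus (the documents below are illustrative, not from the notebook): each row of the sparse matrix is a document, each column counts one vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: two documents, one count column per vocabulary term.
docs = ["pythagoras was a greek philosopher",
        "euclid was a greek mathematician"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)
# the default tokenizer drops single-character tokens such as "a",
# leaving 6 vocabulary terms
print(counts.shape)  # (2, 6)
print(sorted(cv.vocabulary_))
```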
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
(262, 68851)
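The effect of the IDF reweighting can be checked on a toy corpus (again illustrative data, not the wiki corpus): a term that occurs in every document ends up with a lower weight than an equally frequent term that is specific to one document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# "geometry" occurs in every document, "painting" only in the first,
# so within document 0 "painting" gets the higher tf-idf weight.
docs = ["geometry painting", "geometry proof", "geometry sculpture"]
cv = CountVectorizer()
tfidf = TfidfTransformer().fit_transform(cv.fit_transform(docs))
geo, paint = cv.vocabulary_['geometry'], cv.vocabulary_['painting']
assert tfidf[0, paint] > tfidf[0, geo]
```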
t-SNE on TF-IDF vectors, as used for example by Andrej Karpathy and Alexander Fabisch.

rkeisler applies it to Guardian articles (https://github.com/rkeisler/tsne_guardian/blob/master/tsne_guardian.py): starting from the CountVectorizer output, he forces the vectors to unit length and runs t-SNE on a cosine distance matrix:

norm = np.sqrt(vecs.multiply(vecs).sum(1))
vecs = vecs.multiply(1./norm)
distance_matrix = sklearn.metrics.pairwise.pairwise_distances(vecs, metric='cosine')
model = TSNE(early_exaggeration=4)
pos = model.fit_transform(distance_matrix)

Additional parameters for TfidfVectorizer proposed by A. Karpathy:

min_df=2, stop_words='english', strip_accents='unicode', lowercase=True,
ngram_range=(1,2), norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True
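Both recipes amount to the same thing once the vectors have unit length: the cosine distance is then just one minus the dot product, which is also what makes Karpathy's -X·X.T shortcut further below work. A quick numerical check on random vectors (a sketch, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

rng = np.random.default_rng(0)
V = rng.random((5, 8))
# force vectors to unit length, as in the rkeisler recipe
Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
# for unit vectors, cosine distance == 1 - dot product
assert np.allclose(pairwise_distances(Vn, metric='cosine'), 1.0 - Vn @ Vn.T)
```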
from sklearn.feature_extraction.text import TfidfVectorizer
vectors = TfidfVectorizer().fit_transform(pkl_entries)
vectors.shape
(262, 68851)
from sklearn.manifold import TSNE
def plot_embedding(pos):
fig = plt.figure(figsize=(10, 10))
ax = plt.axes(frameon=False)
plt.setp(ax, xticks=(), yticks=())
plt.scatter(pos[:,0],pos[:,1], s=5, color='r')
for i, txt in enumerate(names):
plt.annotate(txt, (pos[i,0], pos[i,1]), fontsize=6)
from sklearn.decomposition import TruncatedSVD
X_reduced = TruncatedSVD(n_components=50, random_state=0).fit_transform(vectors)
X_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.181085
[t-SNE] Iteration 10: error = 19.5840120, gradient norm = 0.0329279
[t-SNE] Iteration 20: error = 17.0377197, gradient norm = 0.0752239
[t-SNE] Iteration 30: error = 16.1263235, gradient norm = 0.0850579
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 15.6700067, gradient norm = 0.0858924
[t-SNE] Iteration 50: error = 16.0684620, gradient norm = 0.0771126
[t-SNE] Iteration 60: error = 16.0344312, gradient norm = 0.0775531
[t-SNE] Iteration 64: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 64 iterations with early exaggeration: 16.027074
[t-SNE] Iteration 70: error = 1.9491604, gradient norm = 0.0199896
[t-SNE] Iteration 80: error = 1.7117975, gradient norm = 0.0209856
[t-SNE] Iteration 90: error = 1.8939256, gradient norm = 0.0254609
[t-SNE] Iteration 100: error = 2.2310724, gradient norm = 0.0347741
[t-SNE] Iteration 110: error = 2.4389540, gradient norm = 0.0383329
[t-SNE] Iteration 111: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 111 iterations: 2.467875
plot_embedding(X_embedded)
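The pipeline above first compresses the 68,851-dimensional TF-IDF vectors down to 50 dimensions with truncated SVD (LSA) before handing them to t-SNE, which copes badly with very high-dimensional input. The shape bookkeeping of that reduction step, on random stand-in data:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Stand-in for the (262, 68851) tf-idf matrix. TruncatedSVD also accepts
# sparse input directly, which is why it is used here instead of PCA.
M = np.random.default_rng(0).random((40, 300))
M_red = TruncatedSVD(n_components=5, random_state=0).fit_transform(M)
print(M_red.shape)  # (40, 5)
```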
vectorizer = TfidfVectorizer(min_df=2, stop_words = 'english',\
strip_accents = 'unicode', lowercase=True, ngram_range=(1,2),\
norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
X = vectorizer.fit_transform(pkl_entries)
D = -(X * X.T).todense() # negated cosine similarity: rows of X are l2-normalized, so X * X.T holds the cosine similarities
ak_embed = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(D)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.298632
[t-SNE] Iteration 10: error = 18.3174214, gradient norm = 0.0362003
[t-SNE] Iteration 20: error = 16.8947560, gradient norm = 0.0595788
[t-SNE] Iteration 30: error = 16.2785143, gradient norm = 0.1064113
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 17.4717802, gradient norm = 0.0666907
[t-SNE] Iteration 50: error = 16.8543602, gradient norm = 0.0710531
[t-SNE] Iteration 60: error = 17.3436406, gradient norm = 0.0682378
[t-SNE] Iteration 65: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 65 iterations with early exaggeration: 17.397422
[t-SNE] Iteration 70: error = 2.5222497, gradient norm = 0.0224481
[t-SNE] Iteration 80: error = 2.1122603, gradient norm = 0.0244114
[t-SNE] Iteration 90: error = 2.2152029, gradient norm = 0.0314139
[t-SNE] Iteration 100: error = 2.5744293, gradient norm = 0.0421679
[t-SNE] Iteration 110: error = 2.7956470, gradient norm = 0.0458885
[t-SNE] Iteration 113: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 113 iterations: 2.842739
plot_embedding(ak_embed)
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
vecs = X_train_counts
#force vectors to have unit length:
norm = np.sqrt(vecs.multiply(vecs).sum(1))
vecs = vecs.multiply(1./norm)
distance_matrix = pairwise_distances(vecs, metric='cosine')
model = TSNE(early_exaggeration=4)
rk_embed = model.fit_transform(distance_matrix)
plot_embedding(rk_embed)
plt.figure(figsize=(8,8))
plt.imshow(distance_matrix, cmap='coolwarm')
plt.figure(figsize=(8,8))
plt.imshow(-D, norm=LogNorm(), cmap='coolwarm')
ak_embed_pre = TSNE(n_components=2, perplexity=40, verbose=2, metric='precomputed').fit_transform(D)
[t-SNE] Computed conditional probabilities for sample 262 / 262
[t-SNE] Mean sigma: 0.141009
[t-SNE] Iteration 10: error = 19.2839620, gradient norm = 0.0330228
[t-SNE] Iteration 20: error = 17.6546334, gradient norm = 0.0499292
[t-SNE] Iteration 30: error = 16.7189882, gradient norm = 0.1086522
[t-SNE] Iteration 32: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Iteration 40: error = 16.8981159, gradient norm = 0.0836658
[t-SNE] Iteration 50: error = 17.2641179, gradient norm = 0.0736577
[t-SNE] Iteration 60: error = 17.2952214, gradient norm = 0.0740660
[t-SNE] Iteration 64: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 64 iterations with early exaggeration: 17.388337
[t-SNE] Iteration 70: error = 2.5358830, gradient norm = 0.0225712
[t-SNE] Iteration 80: error = 2.2483497, gradient norm = 0.0273856
[t-SNE] Iteration 90: error = 2.3097594, gradient norm = 0.0313441
[t-SNE] Iteration 100: error = 2.6407048, gradient norm = 0.0425939
[t-SNE] Iteration 110: error = 2.8024695, gradient norm = 0.0535090
[t-SNE] Iteration 114: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 114 iterations: 2.891214
plot_embedding(ak_embed_pre)
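One caveat on the precomputed variant: D = -X·X.T contains negative entries (it is a negated similarity, not a metric distance), and newer scikit-learn releases reject negative values when metric='precomputed'. Since the rows of X are l2-normalized, shifting by 1 turns D into the proper cosine distance; a small sketch of the fix:

```python
import numpy as np

# D = -(X @ X.T) is negated cosine similarity (rows of X are l2-normalized).
# 1 + D is then the standard cosine distance: nonnegative, zero diagonal.
sim = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, 0.5],
                [0.0, 0.5, 1.0]])
D = -sim
D_dist = 1.0 + D
assert (D_dist >= 0).all()
assert np.allclose(np.diag(D_dist), 0.0)
```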