Introduction¶

PyDSM is a lightweight Python 3 framework for building and exploring distributional semantic models with focus on extensibility and ease of use.

The IndexMatrix¶

At its core is the IndexMatrix class, which provides a sparse matrix representation with string representations of the word and column index. Much of its functionality is inspired by the Pandas DataFrame, but since the DataFrame performs poorly in sparse situations (SparseDataFrame converts too often to its dense representation), the IndexMatrix will do for now.

The DSMs¶

There are two models included in the package: A simple cooccurrence matrix model (CooccurrenceDSM), and the Random Indexing model. Building new models can be done by creating a class extending pydsm.model.DSM, and providing a _build method for building the DSM as a dictionary of dictionary of values.

Examples¶

Building a DSM¶

In [1]:

import pydsm
cooc = pydsm.build(pydsm.CooccurrenceDSM, corpus='wiki.20k', window_size=(2,2), language='en')
cooc

   98% [================================================================ ]

Total time: 149.80 s

Out[1]:

CooccurrenceDSM
Vocab size: 356916
[356916, 356916]  metius  synechism  hermandiederik  ...
igazság           0.0     0.0        0.0             ...
kformula          0.0     0.0        0.0             ...
synechism         0.0     0.0        0.0             ...
...               ...     ...        ...             ...

Apply weighting¶

In [2]:

import pydsm.weighting as weighting
ppmi = cooc.apply_weighting(weight_func=weighting.ppmi)
ppmi

Out[2]:

CooccurrenceDSM
Vocab size: 356916
[356916, 356916]  metius  synechism  hermandiederik  ...
igazság           0.0     0.0        0.0             ...
kformula          0.0     0.0        0.0             ...
synechism         0.0     0.0        0.0             ...
...               ...     ...        ...             ...

Finding nearest neighbors¶

In [3]:

import pydsm.similarity as similarity
ppmi.nearest_neighbors('moon', sim_func=similarity.cos)

Out[3]:

[356916, 1]  moon
moon         0.9999999999999996
moons        0.10031087979984758
dysnomia     0.08623286288348603
pluto        0.08572034915790229
lunar        0.08285183916068635
sailor       0.08276499255478217
menazil      0.07943614314690178
alkamar      0.07943614314690178
senshi       0.07882882415306644
mars         0.07846039338300181
...          ...

Compose¶

In [4]:

import pydsm.composition as composition
car_brand = ppmi.compose('black', 'white', comp_func=composition.multiplicative)
car_brand

Out[4]:

[1, 356916]  metius  synechism  hermandiederik  wendt  settheory  elderselect  kolkha  ...
black white  0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...

In [6]:

ppmi.nearest_neighbors(car_brand, sim_func=similarity.cos)

Out[6]:

[356916, 1]        black white
white              0.32082268788455864
black              0.3120890846417935
orangebrownorange  0.1150050121381072
superauto          0.11211844385043106
upperparts         0.10831323874421267
whitegraywhite     0.10710022284277554
blue               0.10413968267114806
red                0.1034924372826562
yellow             0.10085997033212568
nonhispanic        0.09767574768550827
...                ...

Selection¶

Selections can be made in several ways: by word index, integer index, slices, boolean numpy arrays.

In [7]:

ppmi['rock']  # Row selection

Out[7]:

[1, 356916]  metius  synechism  hermandiederik  wendt  settheory  elderselect  kolkha  ...
rock         0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...

In [8]:

ppmi[['correa', 'rock']]  # Multiple rows slection

Out[8]:

[2, 356916]  metius  synechism  hermandiederik  wendt  settheory  elderselect  kolkha  ...
correa       0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
rock         0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...

In [9]:

ppmi['good', 'bad']  # Value selection

Out[9]:

3.7391712801558179

In [10]:

ppmi[100:105]  # Rows 100 to 104

Out[10]:

[5, 356916]  metius  synechism  hermandiederik  wendt  settheory  elderselect  kolkha  ...
bond         0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
jails        0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
bilton       0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
kasama       0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
iceland      0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...

In [11]:

ppmi[:, ['good', 'bad']]  # Column selection

Out[11]:

[356916, 2]     good  bad
igazság         0.0   0.0
kformula        0.0   0.0
synechism       0.0   0.0
hermandiederik  0.0   0.0
settheory       0.0   0.0
elderselect     0.0   0.0
kolkha          0.0   0.0
chávezpernell   0.0   0.0
exoticism       0.0   0.0
nezek           0.0   0.0
...             ...   ...

In [12]:

ppmi[9000:9010, [0,4,5]]  # Mixing

Out[12]:

[10, 3]       metius  settheory  elderselect
skåne         0.0     0.0        0.0
tausha        0.0     0.0        0.0
delusions     0.0     0.0        0.0
maître        0.0     0.0        0.0
guichard      0.0     0.0        0.0
prig          0.0     0.0        0.0
lindigénat    0.0     0.0        0.0
wzo           0.0     0.0        0.0
manatees      0.0     0.0        0.0
yorkshiremen  0.0     0.0        0.0

Visualize¶

In [13]:

%matplotlib inline
import pydsm.visualization as visualization
visualization.hexbin(ppmi)

In [14]:

visualization.pcolormesh(ppmi)

In [15]:

visualization.heatmap(ppmi)

In [16]:

visualization.plot_vector(ppmi['model'])

IndexMatrix operations¶

The IndexMatrix is the core of PyDSM. You have already seen the selection functionality, which is a part of the IndexMatrix. Here are some common operations.

In [17]:

from pydsm import IndexMatrix
mat = ppmi.matrix
mat.sum(axis=1)  # Sum the matrix row-wise

Out[17]:

[356916, 1]
igazság         18.683354595341434
kformula        40.96849948072415
synechism       45.74842250518874
hermandiederik  43.97133364222774
settheory       25.393081207234648
elderselect     18.049973280566153
kolkha          55.663220171576725
chávezpernell   44.70748059925722
exoticism       118.33632351892375
nezek           48.6589983850974
...             ...

In [26]:

srted = mat.sum(axis=1).sort(ascending=False)  # Sum the matrix along the rows, and sort it.
srted

Out[26]:

[356916, 1]
and          161771.25962282284
the          93471.16046956484
in           92074.9645661122
of           89557.50082065431
a            87442.98215496707
by           77974.95155686601
as           77812.6773046294
is           76331.9051760016
or           73657.61317815434
with         67192.71160663095
...          ...

In [19]:

deleted = mat.delete(srted[:100], axis=0)  # Delete the 100 first rows in srted. 
deleted.sort(axis=0, key=IndexMatrix.sum, ascending=False)  # Sort the matrix according to to sum function. 
                                                            #Note that the hundred first rows are removed.

Out[19]:

[356816, 356916]  metius  synechism  hermandiederik  wendt  settheory  elderselect  kolkha  ...
under             0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
la                0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
along             0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
son               0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
several           0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
will              0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
if                0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
group             0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
among             0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
so                0.0     0.0        0.0             0.0    0.0        0.0          0.0     ...
...               ...     ...        ...             ...    ...        ...          ...     ...

In [20]:

std = deleted.std(axis=1)  # Return the rowwise standard deviation
std

Out[20]:

[356816, 1]
igazság         0.025172048090453064
kformula        0.03520336028415321
synechism       0.031798419122155516
hermandiederik  0.039468230971881554
settheory       0.023419946322277188
elderselect     0.019982330351386298
kolkha          0.043269778486696
chávezpernell   0.0374793532399878
exoticism       0.05115875151354828
nezek           0.03695091072200877
...             ...

In [21]:

std + 5

Out[21]:

[356816, 1]
igazság         5.025172048090453
kformula        5.035203360284153
synechism       5.031798419122156
hermandiederik  5.039468230971882
settheory       5.023419946322277
elderselect     5.019982330351386
kolkha          5.043269778486696
chávezpernell   5.037479353239988
exoticism       5.051158751513548
nezek           5.036950910722009
...             ...

In [22]:

std / 0.5

Out[22]:

[356816, 1]
igazság         0.05034409618090613
kformula        0.07040672056830642
synechism       0.06359683824431103
hermandiederik  0.07893646194376311
settheory       0.046839892644554376
elderselect     0.039964660702772596
kolkha          0.086539556973392
chávezpernell   0.0749587064799756
exoticism       0.10231750302709657
nezek           0.07390182144401754
...             ...

In [23]:

(std * std).sqrt()  # Multply elementwise by itself, and return the square root

Out[23]:

[356816, 1]
igazság         0.025172048090453064
kformula        0.03520336028415321
synechism       0.031798419122155516
hermandiederik  0.039468230971881554
settheory       0.023419946322277188
elderselect     0.019982330351386298
kolkha          0.043269778486696
chávezpernell   0.0374793532399878
exoticism       0.05115875151354828
nezek           0.03695091072200877
...             ...

In [33]:

srted.append(ppmi[:,'word'], axis=1)  # Append another matrix column-wise and match index row

Out[33]:

[356916, 2]                      word
and          161771.25962282284  0.0
the          93471.16046956484   0.826469869794669
in           92074.9645661122    0.0
of           89557.50082065431   0.0
a            87442.98215496707   0.49578159524844034
by           77974.95155686601   0.0
as           77812.6773046294    0.0
is           76331.9051760016    1.134261702619963
or           73657.61317815434   0.48493322980887776
with         67192.71160663095   0.0
...          ...                 ...