In [1]:
from collections import Counter,defaultdict
import random
import gzip
import textwrap
In [2]:
def make_trigrams(filename):

    #returns list of words in file
    with gzip.open(filename) as f: words = f.read().split()

    trigrams = defaultdict(list)
    
    bigram=tuple(words[:2])
    startwords=[bigram]
    
    for w in words[2:] + words[:2]:
    #keys of trigram dict are tuples, values are lists
        trigrams[bigram].append(w)
        if bigram[0].endswith('.') and bigram[1][0].isupper():
            startwords.append((bigram[1],w))
        bigram=(bigram[1],w)

    return trigrams,startwords
In [3]:
def random_text(trigrams, startwords, num_words=100):
     
    current_pair = random.choice(startwords)
    random_text = list(current_pair)
    
    # continue past num_words until ends in .    
    while len(random_text)< num_words or not random_text[-1].endswith('.'):
        next = random.choice(trigrams[current_pair])
        random_text.append(next)
        current_pair = (current_pair[1], next)
        # avoid long loops if too few periods in training text
        if len(random_text) > 2*num_words: random_text[-1] += '.'
        
    return textwrap.fill(' '.join(random_text))

The source text files for the below can be found here:

  1. sherlock.txt.gz
  2. oz.txt.gz
  3. di.txt.gz
  4. shakespeare.txt.gz
In [4]:
trigrams_sh,startwords_sh=make_trigrams('sherlock.txt.gz')
In [5]:
#most common bigram keys
sorted([(bi,len(trigrams_sh[bi])) for bi in trigrams_sh],key=lambda x: x[1],reverse=True)[:10]
Out[5]:
[(('of', 'the'), 700),
 (('in', 'the'), 479),
 (('to', 'the'), 297),
 (('I', 'have'), 247),
 (('that', 'I'), 245),
 (('at', 'the'), 224),
 (('upon', 'the'), 195),
 (('and', 'I'), 191),
 (('to', 'be'), 189),
 (('and', 'the'), 186)]
In [6]:
Counter(startwords_sh).most_common(10)  #starts of sentences
Out[6]:
[(('It', 'was'), 91),
 (('It', 'is'), 88),
 (('I', 'have'), 49),
 (('He', 'was'), 38),
 (('There', 'was'), 31),
 (('There', 'is'), 28),
 (('I', 'am'), 27),
 (('I', 'was'), 27),
 (('I', 'had'), 26),
 (('On', 'the'), 22)]
In [7]:
len(set(trigrams_sh[('of','the')])) #distinct continuations from "of the ..."
Out[7]:
548
In [25]:
print random_text(trigrams_sh,startwords_sh) #random sherlock
We both sprang in, and away we went to the ventilator and to carry out
my orders to the end of our depositors. One of the windows on either
side of the matter struck me that the affair of the garden. Perhaps
the villain was softened by the incident, as far as we entered he made
friends in the Museum itself during the last item. "Well," he said,
but at that deadly black shadow wavering down upon me, and Godfrey
Norton was evidently an important factor in the habit of standing on
it, which seemed quite exaggerated in its details that it had dropped
asleep, and indeed was nodding myself, when he was still, as ever,
deeply attracted by my friend's subtle powers of observation and
inference.
In [10]:
trigrams_di,startwords_di=make_trigrams('di.txt.gz')
trigrams_oz,startwords_oz=make_trigrams('oz.txt.gz')
In [30]:
print random_text(trigrams_oz,startwords_oz) #random Oz
Then the Tin Woodman swung his arm and managed to scramble to the King
of the Emerald City. But my action angered the Wicked Witch is Wicked
--tremendously Wicked -and ought to be a coward," said the Scarecrow.
"And I want him to give me courage?" asked the girl. "That will not
send me back to say to me." "This is the great Kansas prairies, with
Uncle Henry, who was weeping in a cheerful voice and went back to
Kansas unless you do not want to go any other country, be it ever so
much disappointed; and the country of the East? Dorothy was tired by
the brilliancy of the East blue was the happiest man on earth; but no
one has ever destroyed her before, so I suppose there never will be,
for they will fit me," she said sadly, "for Oz will send you back to
Kansas and Aunt Em had just started to cross the hill and set herself
to sleep.
In [27]:
print random_text(trigrams_di,startwords_di)  #random declaration of independence
But when a long time, after such dissolutions, to cause others to
encourage their migrations hither, and raising the conditions of new
Appropriations of Lands. He has dissolved Representative Houses
repeatedly, for opposing with manly firmness his invasions on the
Inhabitants of these States: For cutting off our Trade with all parts
of the United States of America, in General Congress, Assembled,
appealing to the separation. We hold these truths to be tried for
pretended offences: For abolishing the free System of English Laws in
a neighbouring Province, establishing therein an Arbitrary government,
and enlarging its Boundaries so as to them and the amount and payment
of their offices, and the State of Great Britain, is and ought to be
self-evident, that all political connection between them and the State
of Great Britain, is and ought to be Free and Independent States, they
have full Power to levy War, conclude Peace contract Alliances,
establish Commerce, and to assume among the powers of the good People
of these Oppressions We have appealed to their Acts of pretended
Legislation: For quartering large bodies of armed troops among us: For
protecting them, by a mock Trial from punishment for any Murders which
they should.
In [32]:
trigrams_oz_di= dict(trigrams_oz.items()+trigrams_di.items())
for bigram in set(trigrams_oz) & set (trigrams_di):
    trigrams_oz_di[bigram] = trigrams_oz[bigram] + trigrams_di[bigram]
startwords_oz_di = startwords_oz + startwords_di
    
print random_text(trigrams_oz_di,startwords_di)  #mashup of declaration and oz
He has abdicated Government here, by declaring us out of sight. The
Wicked Witch of the Mice had given up at last, he walked to the little
company set off upon the bank and gazed wistfully at the end of her
nose, and having said one of those great whirlwinds arose, mighty
enough to look at all afraid, although he was made of silk into proper
shape the girl they left a round, shining mark, as Dorothy was
thinking again, and these tears did not even bark in return. In this
country alone, and cannot love. I pray you to any place in order to
start, for she did not take her long walk in, for they immediately
took out their substance.
In [17]:
trigrams_sk,startwords_sk=make_trigrams('shakespeare.txt.gz')  
In [34]:
print random_text(trigrams_sk,startwords_sk) #random shakespeare
Enter servants Take hence this body, consecrate to thee, Thine by thy
waggon wheel Trot, like a rough colt; he knows you not. MENENIUS. I
think of that joy; Or in the earth. DOLABELLA. Most noble Antony, Let
not the double majesties This friendly treaty of our success. Some
dishonour we had all such wives, that the affrighted globe Should yawn
at alteration. EMILIA. [Within.] What, ho! Peace here; grace and
faults are loved of more value Than stamps in gold, like heathen gods,
Shone down the gate? VINCENTIO. Is Signior Lucentio that his fault I
should meet the King.
In [19]:
trigrams_sh_sk= dict(trigrams_sh.items()+trigrams_sk.items())
for bigram in set(trigrams_sh) & set (trigrams_sk):
    trigrams_sh_sk[bigram] = trigrams_sh[bigram] + trigrams_sk[bigram]
startwords_sh_sk = startwords_sh + startwords_sk
In [35]:
print random_text(trigrams_sh_sk,startwords_sh)  #mashup of shakespeare and sherlock
Oakshott, of Brixton Road, where she comes, and I'll pay thee for thy
ruler. SOMERSET. O monstrous villain! Re-enter BIONDELLO, with
LUCENTIO disguised as drawers FALSTAFF. Peace, good Doll! Do not you
Sir John and Robert, be ready here hard by with stiff unbowed knee,
Disdaining duty that I lack'd it. But look, the morn, And, having
sworn too hard-a-keeping oath, Study to break promise with me. [To
ARIEL] Thou shalt not choose but branch now. Since their more piquant
details have drawn a net Than amply to imbar their crooked tides
Usurp'd from you as much more, and ten miles of town, chemistry
eccentric, anatomy unsystematic, sensational literature and crime
records unique, violin-player, boxer, swordsman, lawyer, and self-
poisoner by cocaine and ambition, The soldier's pole is fall'n! Young
boys and wenches must I select from all; the rest TALBOT.

Below the simulation of this example:

In [2]:
from IPython.display import Image
Image(open('m7.png').read())
Out[2]:
In [49]:
mchain={0:{1:.5, 7:.5}, 1:{}, 2:{3:.5}, 3:{1:.5},
        4:{6:.8}, 5:{}, 6:{5:.3}, 7:{2:.2, 4:.4}}
In [50]:
T=array([[mchain[m][n] if n in mchain[m]
     else round(1.-sum(mchain[m].values()),4) if m==n
     else 0.
          for n in mchain ]  for m in mchain])
In [51]:
T
Out[51]:
array([[ 0. ,  0.5,  0. ,  0. ,  0. ,  0. ,  0. ,  0.5],
       [ 0. ,  1. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0.5,  0.5,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0.5,  0. ,  0.5,  0. ,  0. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0.2,  0. ,  0.8,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ,  1. ,  0. ,  0. ],
       [ 0. ,  0. ,  0. ,  0. ,  0. ,  0.3,  0.7,  0. ],
       [ 0. ,  0. ,  0.2,  0. ,  0.4,  0. ,  0. ,  0.4]])
In [21]:
sum(T,axis=1)
Out[21]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
In [22]:
T[7].cumsum()
Out[22]:
array([ 0. ,  0. ,  0.2,  0.2,  0.6,  0.6,  0.6,  1. ])
In [23]:
digitize(rand(10),T[7].cumsum())
Out[23]:
array([4, 7, 4, 7, 4, 7, 4, 2, 4, 7])
In [24]:
from collections import Counter
Counter(digitize(rand(100000),T[7].cumsum()))
Out[24]:
Counter({4: 40182, 7: 39752, 2: 20066})
In [25]:
n=len(T)
results={1:[],5:[]}
for t in xrange(100000): #trials
    m=0 #start at 0
    for k in xrange(1,10000):  #just much bigger than likely path
    #pull next state from probability distribution given by m'th row
        m=digitize([rand()],T[m].cumsum())[0]
        if m==5 or m==1:
            results[m].append(k)
            break
In [26]:
len(results[1]),len(results[5])
Out[26]:
(66670, 33330)
In [27]:
[round(m,3) for m in mean(results[1]),mean(results[5])]
Out[27]:
[2.41, 7.254]

Finally, some examples of powers of matrices

take nth power of matrix either with

linalg.matrix_power(T,n)

or

reduce(dot,[T]*n)

(gives same result, but linalg.matrix_power() is optimized so will be much faster on big matrices). The pylab inline option in the EPD python distribution seems to call in the linalg module by default, so it's possible to say just matrix_power(), as below.

In [52]:
around(matrix_power(T,30),3)
Out[52]:
array([[ 0.   ,  0.667,  0.   ,  0.   ,  0.   ,  0.333,  0.   ,  0.   ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ],
       [ 0.   ,  0.333,  0.   ,  0.   ,  0.   ,  0.667,  0.   ,  0.   ]])

Note that the first row shows that starting from state 0, ultimately state 1 is reached 2/3 of the time and state 5 is reached 1/3 of the time, as expected. From the next three rows we see that starting any of states 1,2,3 state 1 is always reached, and so on.

Finally noticed that T**n just takes the nth power of the entries of the matrix one by one, which is usually (but not always) not what is wanted:

In [53]:
around(T**30,3)
Out[53]:
array([[ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  1.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.001,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  1.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ,  0.   ]])

To see the effect of taking powers of a matrix, consider a random 4x4 matrix:

In [55]:
T=rand(4,4)  #random 4x4 matrix
T
Out[55]:
array([[ 0.37191866,  0.19514006,  0.59574904,  0.21311382],
       [ 0.22453282,  0.44280837,  0.32729572,  0.25502435],
       [ 0.06934085,  0.50455362,  0.09645712,  0.12621279],
       [ 0.47506432,  0.14128696,  0.3994524 ,  0.22011466]])
In [56]:
sum(T,axis=1)  #sum along the rows
Out[56]:
array([ 1.37592158,  1.24966127,  0.79656438,  1.23591834])
In [57]:
T /= sum(T,axis=1)[:,None]   #divide by the sums along the rows so they sum to 1
T
Out[57]:
array([[ 0.27030513,  0.14182499,  0.43298182,  0.15488806],
       [ 0.17967495,  0.35434272,  0.26190755,  0.20407478],
       [ 0.0870499 ,  0.63341223,  0.12109143,  0.15844644],
       [ 0.38438164,  0.11431739,  0.3232029 ,  0.17809806]])
In [58]:
sum(T,axis=1)  #verify
Out[58]:
array([ 1.,  1.,  1.,  1.])
In [59]:
T.dot(T)  # this gives T^2
Out[59]:
array([[ 0.19577441,  0.38055306,  0.2566729 ,  0.16699963],
       [ 0.21347519,  0.3402659 ,  0.26827334,  0.17798556],
       [ 0.20878324,  0.33160484,  0.26945995,  0.19015196],
       [ 0.22103271,  0.32010284,  0.29306977,  0.16579469]])
In [60]:
T.dot(T).dot(T) # this gives T^3
Out[60]:
array([[ 0.20782962,  0.34428263,  0.26949214,  0.17839562],
       [ 0.21060825,  0.34112132,  0.27156014,  0.17671029],
       [ 0.21256365,  0.33952935,  0.27133612,  0.17657088],
       [ 0.20650086,  0.34936126,  0.26861406,  0.17552382]])
In [61]:
T.dot(T).dot(T).dot(T) # this gives T^4
Out[61]:
array([[ 0.21006764,  0.34256281,  0.27044784,  0.17692171],
       [ 0.20978292,  0.34295394,  0.27052868,  0.17673446],
       [ 0.20995235,  0.34250933,  0.2708862 ,  0.17665212],
       [ 0.20944067,  0.34328946,  0.27016814,  0.17710173]])

We see that the rows are already converging after a few iterations (recall the largest eigenvalue of the matrix is always 1, and the convergence is governed by the size of the second eigenvalue, see these notes. The first row gives the probability of landing at the other states starting from the first state, the second row for starting at the second state, and so on. Equal rows means that the probabilities of where one ends up become independent of where one started (that's what's meant by the steady state distribution). After twelve iterations, the probabilities are all the same to eight significant digits:

In [62]:
matrix_power(T,12)
Out[62]:
array([[ 0.20982798,  0.34281091,  0.27054469,  0.17681641],
       [ 0.20982798,  0.34281091,  0.27054469,  0.17681641],
       [ 0.20982798,  0.34281091,  0.27054469,  0.17681641],
       [ 0.20982798,  0.34281091,  0.27054469,  0.17681641]])

From here it's possible to take any row as the stationary state v (corresponds to starting from a single state), or we could imagine starting from an equiprobability state w=[1./4,1./4,1./4,1./4] (gives the same result since all the rows are the same at this point):

In [67]:
w=ones(4)/4  #all components equal to 1/4.
v=w.dot(matrix_power(T,12))
v
Out[67]:
array([ 0.20982798,  0.34281091,  0.27054469,  0.17681641])
In [63]:
for n in range(20):  #see how little it changes from one iteration to the next
  print n,norm(matrix_power(T,n)-matrix_power(T,n-1))
0 25.1509253232
1 1.86295622688
2 0.543004876818
3 0.0616561134596
4 0.00908178780218
5 0.00110233172968
6 0.000208809494365
7 2.95128192228e-05
8 3.44895318113e-06
9 6.85237417189e-07
10 7.61096014459e-08
11 1.57132110039e-08
12 1.43155680143e-09
13 3.49483081912e-10
14 3.48567470089e-11
15 7.27357460449e-12
16 8.38966022033e-13
17 1.48519413403e-13
18 2.10786374484e-14
19 2.82984393717e-15
In [64]:
w=array([0.,1.,0.,0.]) # see the result of n steps starting from 2nd state
for n in range(6):
    print 'step',n,w
    w = w.dot(T)
step 0 [ 0.  1.  0.  0.]
step 1 [ 0.17967495  0.35434272  0.26190755  0.20407478]
step 2 [ 0.21347519  0.3402659   0.26827334  0.17798556]
step 3 [ 0.21060825  0.34112132  0.27156014  0.17671029]
step 4 [ 0.20978292  0.34295394  0.27052868  0.17673446]
step 5 [ 0.20980861  0.34283569  0.27053421  0.17682149]
In [65]:
w=rand(4) #eventually doesn't matter where started
w /= sum(w)  #normalize as probability
w
Out[65]:
array([ 0.19934442,  0.50442549,  0.00408142,  0.29214866])
In [66]:
for n in range(6):
    print 'step',n,w
    w = w.dot(T)
step 0 [ 0.19934442  0.50442549  0.00408142  0.29214866]
step 1 [ 0.25716831  0.24299442  0.31334288  0.18649439]
step 2 [ 0.21213541  0.34237096  0.27320994  0.17228369]
step 3 [ 0.20886236  0.3441523   0.27028629  0.17669905]
step 4 [ 0.20974038  0.34297218  0.27040869  0.17687875]
step 5 [ 0.2098454   0.34277661  0.27055268  0.17682531]
In [ ]: