In this exercise, we will use neural networks to estimate the conditional probabilities of text, character by character. For an interesting discussion of the topic, see the following blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
We will use the Keras library, adapting one of its examples.
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM  # keras.layers.core / .recurrent paths were removed in newer Keras
from keras.utils import get_file
import numpy as np
import scipy.stats as st
import random
import sys
First, we will use the same text as the original example:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
try:
    text = open(path).read().lower()
except UnicodeDecodeError:
    import codecs
    text = codecs.open(path, encoding='utf-8').read().lower()
print('Corpus length:', len(text))
Corpus length: 600893
print(text[:150])
preface supposing that truth is a woman--what then? is there not ground for suspecting that all philosophers, in so far as they have been dogmatists
Since the model operates at the character level, we first need to define the set of characters in the text:
chars = sorted(set(text))  # sorted so the char/index mapping is reproducible across runs
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
total chars: 57
chars
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ä', 'æ', 'é', 'ë']
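As an illustrative sketch (using a toy string rather than the Nietzsche corpus), the two dictionaries form an invertible character-to-index mapping, so any text over the alphabet can be encoded as integers and decoded back losslessly:

```python
# Toy example of the char <-> index mapping built above.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))              # unique characters, sorted for stable indices
to_idx = {c: i for i, c in enumerate(toy_chars)}
to_char = {i: c for i, c in enumerate(toy_chars)}

encoded = [to_idx[c] for c in toy_text]        # text -> list of integer indices
decoded = ''.join(to_char[i] for i in encoded) # indices -> text (lossless round trip)
print(decoded)
```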
The model is based on conditional probabilities between consecutive characters, so we need to feed it overlapping character sequences:
maxlen = 20
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
sentences.append(text[i: i + maxlen])
next_chars.append(text[i + maxlen])
print('num sequences:', len(sentences))
num sequences: 200291
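The sliding-window extraction above can be seen on a toy string (hypothetical values, not the corpus): every `step` characters we take a window of `maxlen` characters as input and the single character that follows it as the target.

```python
# Sliding-window extraction, as in the loop above, on a toy string.
toy_text = "the quick brown fox"
maxlen, step = 5, 3
windows, targets = [], []
for i in range(0, len(toy_text) - maxlen, step):
    windows.append(toy_text[i: i + maxlen])    # input: maxlen consecutive characters
    targets.append(toy_text[i + maxlen])       # target: the very next character
print(windows)
print(targets)
```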
print('Vectorizing...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)  # np.bool was removed in NumPy 1.24
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
Vectorizing...
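On a toy alphabet (hypothetical values), the one-hot vectorization above produces a boolean tensor of shape `(sequences, maxlen, alphabet)` for the inputs and `(sequences, alphabet)` for the targets, with exactly one `True` per character position:

```python
import numpy as np

# One-hot encode two toy "sentences" of length 3 over a 4-character alphabet.
toy_chars = ['a', 'b', 'c', 'd']
idx = {c: i for i, c in enumerate(toy_chars)}
sents, nexts = ['abc', 'bcd'], ['d', 'a']

X_toy = np.zeros((len(sents), 3, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(sents), len(toy_chars)), dtype=bool)
for i, s in enumerate(sents):
    for t, ch in enumerate(s):
        X_toy[i, t, idx[ch]] = 1   # one True per (sentence, position)
    y_toy[i, idx[nexts[i]]] = 1    # one True per sentence: the next character

print(X_toy.shape, y_toy.shape)
```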
print('Building the model...')
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
Building the model...
def sample(a, temperature=1.0):
    # Sample an index from a probability array, reweighted by temperature.
    a = np.exp(np.log(a) / temperature)
    a /= a.sum() + .001  # slight deflation guards against the sum exceeding 1 from float error
    try:
        sp = np.argmax(st.multinomial.rvs(1, a, 1))
        # sp = np.argmax(np.random.multinomial(1, a, 1))
    except ValueError as e:
        print(a[:-1].sum(), len(a), a)
        raise e
    return sp
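To see what temperature does, we can apply the same reweighting used in `sample` to a toy distribution (hypothetical probabilities, not model outputs): a low temperature concentrates mass on the most likely character, while a high temperature flattens the distribution toward uniform.

```python
import numpy as np

probs = np.array([0.5, 0.3, 0.15, 0.05])  # toy distribution over 4 characters

def reweight(p, temperature):
    # Same transformation as in sample(): exponentiate log-probs by 1/T, renormalize.
    w = np.exp(np.log(p) / temperature)
    return w / w.sum()

cold = reweight(probs, 0.2)  # low temperature: mass concentrates on the argmax
hot = reweight(probs, 1.2)   # high temperature: distribution flattens
print(cold.round(3))
print(hot.round(3))
```

This is why the `diversity` values below trade off between repetitive but "safe" text (0.2) and varied but noisier text (1.2).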
model.fit(X, y, batch_size=1024, epochs=5)
Epoch 1/5
196/196 [==============================] - 817s 4s/step - loss: 2.8547
Epoch 2/5
196/196 [==============================] - 902s 5s/step - loss: 2.2621
Epoch 3/5
196/196 [==============================] - 880s 4s/step - loss: 2.0057
Epoch 4/5
196/196 [==============================] - 958s 5s/step - loss: 1.8301
Epoch 5/5
196/196 [==============================] - 980s 5s/step - loss: 1.6950
<tensorflow.python.keras.callbacks.History at 0x7ff3cc7dc100>
start_index = random.randint(0, len(text) - maxlen - 1)
for diversity in [0.2, 0.5, 1.0, 1.2]:
    print()
    print('----- diversity:', diversity)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    for i in range(400):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
----- diversity: 0.2
----- Generating with seed: "ave piped to him far"
ave piped to him far44 44 4ne4s the 444--4ne 44444444 the 4ar44 of 4ne4s 4now 4ne44s 4nou4s 44 4n44 4444 4ne 44444 4now 4ne4s the 444--the 44444 4f the 44ti4n of the 444--4noth44g 4444 44 4ne 444--the 444--4n4444 4ne4s 4ne 44ough 4ne4s and 44444 4ne 4ut444 4n44 4ne4s 4ne4s and 444--the 4a4t of 4444444444444 4n44 the 44444 4444 44 44 4ne 4one44 44 4n4444444444 4ne4s 444 4444444444444 4ne 4444--the 444 44t of 4ne4s and

----- diversity: 0.5
----- Generating with seed: "ave piped to him far"
ave piped to him far many his prowed and the bart of other is the sance of the same the same of presist strunged the sciunce of the grom of the stranges and regarding to other and strong have and pleass and the 4man his one an will would the not one strong of the can of the strung the can of the post of the sensition, of the post of the and intellect and consting to the manning and presention of the great the onding

----- diversity: 1.0
----- Generating with seed: "ave piped to him far"
ave piped to him farte, hence and mlowure treins craricble dees 4o the grest in the reast the modere to be loge on utbuts, (in freed be my of accorspales "out impreberons not itseln and naces.; and there is obst-and cansing, of perhaps they cermand willd; with who or the obdict of whines, in other tod of ourmel senk in unhers- i kas the oness har obbercance sides of its obn oneple senvition of pood. 7] decn: vood=--t

----- diversity: 1.2
----- Generating with seed: "ave piped to him far"
ave piped to him farwing of denind:--withligging in"ither elep." enmortus someridicas. they than good and slywitht as intever, longuryers liktem, which he _eppecemang: at even qhivan. "up,--is gourd wordde are lay--ars and spinboud not le evenince recepsation to same by leng! un, in more plone as wicses thatcsucal usazing of the once, ons plingul xoder wifl thesef in midder tan icprainant andind to too, permineds: f
st.multinomial.rvs(1,[0.5,0.5,0],1)
array([[0, 1, 0]])