In this exercise, we will use neural networks to estimate the conditional probabilities of text, character by character. For an interesting discussion of the topic, see the following blog post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
We will use the Keras library, adapting one of its examples.
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM  # keras.layers.core / .recurrent paths were removed in newer Keras
from keras.utils import get_file
import numpy as np
import scipy.stats as st
import random
import sys
First, we will use the same text as the original example:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
try:
    text = open(path).read().lower()
except UnicodeDecodeError:
    import codecs
    text = codecs.open(path, encoding='utf-8').read().lower()
print('Corpus length:', len(text))
Corpus length: 600893
print(text[:150])
preface supposing that truth is a woman--what then? is there not ground for suspecting that all philosophers, in so far as they have been dogmatists
Since the model operates at the character level, we first need to define the set of characters in the text:
chars = sorted(set(text))  # sorted so the char/index mapping is reproducible across runs
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
total chars: 57
chars
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ä', 'æ', 'é', 'ë']
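As an illustrative sketch (using a toy string rather than the Nietzsche corpus), the two dictionaries form an invertible character-to-index mapping, so any text over the alphabet can be encoded as integers and decoded back losslessly:

```python
# Toy example of the char <-> index mapping built above.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))              # unique characters, sorted for stable indices
to_idx = {c: i for i, c in enumerate(toy_chars)}
to_char = {i: c for i, c in enumerate(toy_chars)}

encoded = [to_idx[c] for c in toy_text]        # text -> list of integer indices
decoded = ''.join(to_char[i] for i in encoded) # indices -> text (lossless round trip)
print(decoded)
```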
The model is based on conditional probabilities between consecutive characters, so we need to feed it overlapping character sequences:
maxlen = 20
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
sentences.append(text[i: i + maxlen])
next_chars.append(text[i + maxlen])
print('num sequences:', len(sentences))
num sequences: 200291
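The sliding-window extraction above can be seen on a toy string (hypothetical values, not the corpus): every `step` characters we take a window of `maxlen` characters as input and the single character that follows it as the target.

```python
# Sliding-window extraction, as in the loop above, on a toy string.
toy_text = "the quick brown fox"
maxlen, step = 5, 3
windows, targets = [], []
for i in range(0, len(toy_text) - maxlen, step):
    windows.append(toy_text[i: i + maxlen])    # input: maxlen consecutive characters
    targets.append(toy_text[i + maxlen])       # target: the very next character
print(windows)
print(targets)
```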
print('Vectorizing...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)  # np.bool was removed in NumPy 1.24
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
Vectorizing...
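On a toy alphabet (hypothetical values), the one-hot vectorization above produces a boolean tensor of shape `(sequences, maxlen, alphabet)` for the inputs and `(sequences, alphabet)` for the targets, with exactly one `True` per character position:

```python
import numpy as np

# One-hot encode two toy "sentences" of length 3 over a 4-character alphabet.
toy_chars = ['a', 'b', 'c', 'd']
idx = {c: i for i, c in enumerate(toy_chars)}
sents, nexts = ['abc', 'bcd'], ['d', 'a']

X_toy = np.zeros((len(sents), 3, len(toy_chars)), dtype=bool)
y_toy = np.zeros((len(sents), len(toy_chars)), dtype=bool)
for i, s in enumerate(sents):
    for t, ch in enumerate(s):
        X_toy[i, t, idx[ch]] = 1   # one True per (sentence, position)
    y_toy[i, idx[nexts[i]]] = 1    # one True per sentence: the next character

print(X_toy.shape, y_toy.shape)
```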
print('Building the model...')
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
Building the model...
def sample(a, temperature=1.0):
    # Sample an index from a probability array, reweighted by temperature.
    a = np.exp(np.log(a) / temperature)
    a /= a.sum() + .001  # slight deflation guards against the sum exceeding 1 from float error
    try:
        sp = np.argmax(st.multinomial.rvs(1, a, 1))
        # sp = np.argmax(np.random.multinomial(1, a, 1))
    except ValueError as e:
        print(a[:-1].sum(), len(a), a)
        raise e
    return sp
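To see what temperature does, we can apply the same reweighting used in `sample` to a toy distribution (hypothetical probabilities, not model outputs): a low temperature concentrates mass on the most likely character, while a high temperature flattens the distribution toward uniform.

```python
import numpy as np

probs = np.array([0.5, 0.3, 0.15, 0.05])  # toy distribution over 4 characters

def reweight(p, temperature):
    # Same transformation as in sample(): exponentiate log-probs by 1/T, renormalize.
    w = np.exp(np.log(p) / temperature)
    return w / w.sum()

cold = reweight(probs, 0.2)  # low temperature: mass concentrates on the argmax
hot = reweight(probs, 1.2)   # high temperature: distribution flattens
print(cold.round(3))
print(hot.round(3))
```

This is why the `diversity` values below trade off between repetitive but "safe" text (0.2) and varied but noisier text (1.2).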
model.fit(X, y, batch_size=1024, epochs=5)
Epoch 1/5
196/196 [==============================] - 817s 4s/step - loss: 2.8547
Epoch 2/5
196/196 [==============================] - 902s 5s/step - loss: 2.2621
Epoch 3/5
196/196 [==============================] - 880s 4s/step - loss: 2.0057
Epoch 4/5
196/196 [==============================] - 958s 5s/step - loss: 1.8301
Epoch 5/5
196/196 [==============================] - 980s 5s/step - loss: 1.6950
<tensorflow.python.keras.callbacks.History at 0x7ff3cc7dc100>
start_index = random.randint(0, len(text) - maxlen - 1)
for diversity in [0.2, 0.5, 1.0, 1.2]:
    print()
    print('----- diversity:', diversity)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    for i in range(400):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
----- diversity: 0.2
----- Generating with seed: "ave piped to him far"
ave piped to him far44 44 4ne4s the 444--4ne 44444444 the 4ar44 of 4ne4s 4now 4ne44s 4nou4s 44 4n44 4444 4ne 44444 4now 4ne4s the 444--the 44444 4f the 44ti4n of the 444--4noth44g 4444 44 4ne 444--the 444--4n4444 4ne4s 4ne 44ough 4ne4s and 44444 4ne 4ut444 4n44 4ne4s 4ne4s and 444--the 4a4t of 4444444444444 4n44 the 44444 4444 44 44 4ne 4one44 44 4n4444444444 4ne4s 444 4444444444444 4ne 4444--the 444 44t of 4ne4s and

----- diversity: 0.5
----- Generating with seed: "ave piped to him far"
ave piped to him far many his prowed and the bart of other is the sance of the same the same of presist strunged the sciunce of the grom of the stranges and regarding to other and strong have and pleass and the 4man his one an will would the not one strong of the can of the strung the can of the post of the sensition, of the post of the and intellect and consting to the manning and presention of the great the onding

----- diversity: 1.0
----- Generating with seed: "ave piped to him far"
ave piped to him farte, hence and mlowure treins craricble dees 4o the grest in the reast the modere to be loge on utbuts, (in freed be my of accorspales "out impreberons not itseln and naces.; and there is obst-and cansing, of perhaps they cermand willd; with who or the obdict of whines, in other tod of ourmel senk in unhers- i kas the oness har obbercance sides of its obn oneple senvition of pood. 7] decn: vood=--t

----- diversity: 1.2
----- Generating with seed: "ave piped to him far"
ave piped to him farwing of denind:--withligging in"ither elep." enmortus someridicas. they than good and slywitht as intever, longuryers liktem, which he _eppecemang: at even qhivan. "up,--is gourd wordde are lay--ars and spinboud not le evenince recepsation to same by leng! un, in more plone as wicses thatcsucal usazing of the once, ons plingul xoder wifl thesef in midder tan icprainant andind to too, permineds: f
st.multinomial.rvs(1,[0.5,0.5,0],1)
array([[0, 1, 0]])