import json
import os
import gensim
import pymysql
import nltk  # for sentence tokenization
from pymongo import MongoClient
from string import punctuation, digits
import bs4
from sklearn import mixture
import datetime
import pandas as pd
%pylab inline

def sphinx_query(index_name, query="", facet=None):
    """
    Search Sphinx index using a simple match via SphinxQL
    :param index_name: Name of the index to search on
    :param query: String with the query expression
    :param facet: Attribute name to facet by. Must be a list
    :return: JSON (array of objects)
    """
    try:
        assert index_name in ['mediacloud_articles', 'mediacloud_feeds', 'mediacloud_tweets']
    except AssertionError:
        return json.dumps({"error": "Bad index name: {}".format(index_name)})
    # Set up the Sphinxsearch SphinxQL connection
    sphinx_conn = pymysql.connect(host='200.20.164.152', port=9306)
    cursor = sphinx_conn.cursor(pymysql.cursors.DictCursor)
    if facet is None:
        cursor.execute("SELECT * from " + index_name + " WHERE MATCH(%(query)s) "
                       "LIMIT %(limit)s OPTION max_matches=%(limit)s",
                       {'query': query, 'limit': 100000})
    else:
        cursor.execute("SELECT * from " + index_name + " WHERE MATCH(%s) " +
                       " ".join(["FACET {}".format(f) for f in facet]), (query,))
    results = cursor.fetchall()
    cursor.close()
    return json.dumps(results)

consulta = 'Marina silva'
res = sphinx_query("mediacloud_articles", '"{}" '.format(consulta))
res = json.loads(res)

# keep only the entries that contain summaries
# also filter by date range
data_ini = datetime.datetime(2014, 8, 10, 0, 0, 0)
data_fim = datetime.datetime(2014, 8, 28, 0, 0, 0)
res = [d for d in res
       if data_ini < datetime.datetime.fromtimestamp(d['published']) <= data_fim]
print("{} results retained".format(len(res)))

res[:3]

ts = pd.Series(data=np.ones(len(res)),
               index=[datetime.datetime.fromtimestamp(d['published']) for d in res])
rts = ts.resample('h', how="sum")
rts.plot();
# rts.to_csv('{}+filha;filho;renata;miguel_porhora_noticias.csv'.format(consulta))

docs = [bs4.BeautifulSoup(d['summary'], 'html.parser').get_text() for d in res]
nltk.tokenize.sent_tokenize(docs[1])

# create a MongoDB collection with the sentences
client = MongoClient()
db = client.word2vec
db.drop_collection('frases')
frases = db.frases
for n, doc in enumerate(docs):
    frases.insert({'doc': n, 'frases': nltk.tokenize.sent_tokenize(doc)})

sw = nltk.corpus.stopwords.words('portuguese') + list(punctuation) + [
    'r', u'não', u'é', u'à', 'quarta', 'feira', u'até', u'já', ')', '(', '"', "'", '...',
    'nesta', 'leia', 'quinta', 'foto', u'terça', 'dub', 'diz', 'dia', u'está', 'sexta',
    u'\u2022', '']

def get_sentences():
    for doc in frases.find({}):
        for f in doc['frases']:
            yield [w.strip().strip(punctuation).strip(digits).lower()
                   for w in f.split() if w not in sw]

sentences = get_sentences()
model = gensim.models.Word2Vec(sentences, min_count=15, size=5000, workers=8)

model[u'campos']
model.most_similar(positive=[u'eduardo', u'marina'], negative=[u'aécio'], topn=10)
model.doesnt_match(['eduardo', 'marina', 'psb', 'avião', 'futebol'])
model.similarity('marina', 'psb')

scatter(model.syn0[:, 1], model.syn0[:, 2]);
len(model.vocab)
model.table
model.vocab

dpgmm = mixture.DPGMM(n_components=10, n_iter=5, covariance_type='diag')
dpgmm.fit(model.syn0)
dpgmm.converged_
plot(dpgmm.predict(model.syn0));
print(dpgmm.means_.shape)
plot(dpgmm.means_.T);

gmm = mixture.GMM(n_components=3, covariance_type='diag')
gmm.fit(model.syn0)
print(gmm.means_.shape)
plot(gmm.means_.T);
plot(gmm.predict(model.syn0));
gmm.converged_

mixture.GMM?
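# The DPGMM/GMM fits above assign each word vector (row of model.syn0) to a
# mixture component, but the component labels are never mapped back to
# vocabulary terms. A minimal sketch of that inspection step, assuming the
# old gensim API used here, where model.index2word[i] corresponds to
# model.syn0[i]; the variable names below are illustrative, not from the
# original notebook.
from collections import defaultdict

word_labels = gmm.predict(model.syn0)
clusters = defaultdict(list)
for word, label in zip(model.index2word, word_labels):
    clusters[label].append(word)

for label, words in sorted(clusters.items()):
    print(u"cluster {}: {}".format(label, u", ".join(words[:15])))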
from __future__ import print_function
from pprint import pprint
from time import time
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans

t0 = time()
hasher = HashingVectorizer(n_features=10000, stop_words=sw,
                           non_negative=True, norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
X = vectorizer.fit_transform(docs)
print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

n_clusters = 3
dpgmm = mixture.DPGMM(covariance_type='diag')
km = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1,
            verbose=True, n_jobs=8)
# dpgmm.fit(X)
labels = range(X.shape[0] - 1)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

k_means_labels = km.labels_
k_means_cluster_centers = km.cluster_centers_
k_means_labels_unique = np.unique(k_means_labels)

# KMeans
colors = ['#4EACC5', '#FF9C34', '#4E9A06']
for k, col in zip(range(n_clusters), colors):
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]
    # plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
         markeredgecolor='k', markersize=6)
title('KMeans')
# set_xticks(())
# set_yticks(())
pylab.text(-3.5, 1.8, 'inertia: %f' % (km.inertia_));

mixture.DPGMM?
X.shape
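# Note: the "Top terms per cluster" step above cannot actually list terms,
# because HashingVectorizer is not invertible (there is no feature -> term
# mapping for order_centroids to index into). A minimal sketch of the same
# inspection using TfidfVectorizer instead; this swap and the names below
# (tfidf, X_tfidf, km_tfidf) are assumptions, not part of the original
# pipeline.
tfidf = TfidfVectorizer(max_features=10000, stop_words=sw)
X_tfidf = tfidf.fit_transform(docs)
km_tfidf = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=100, n_init=1)
km_tfidf.fit(X_tfidf)

terms = tfidf.get_feature_names()
top_terms = km_tfidf.cluster_centers_.argsort()[:, ::-1]
for i in range(n_clusters):
    print("Cluster %d: %s" % (i, ", ".join(terms[ind] for ind in top_terms[i, :10])))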