#!/usr/bin/env python
# coding: utf-8

# The project I worked on this summer was to develop a method to algorithmically generate timelines around a given subject. A "subject" can be any topic - maybe a developing event such as the Sony hacks or the FIFA corruption scandal, or an ongoing news topic such as Donald Trump or the tax policy of presidential candidates. The goal is then to determine the key events over a specified time window that indicate new or significant developments in the story. The result is a retrospective look at how events unfolded.
#
# The data for this project was derived from the Twitter firehose, obtaining Tweets containing URLs that are related to the subject by a keyword search. (The timeline approach is not limited to data of this type; ultimately the only requirement is textual data with timestamps.)
#
# One signal that is available to help determine events is the number of links about a topic over time. This time series indicates when news coverage of a topic has peaked. However, although this signal can be used to determine when something happened, it does not indicate what actually happened. Fortunately, we have lexical information available as well - the words in the Tweet and in the "slug". The slug is the section of the URL that both identifies and describes a link and usually contains keywords selected to promote the link in search results (e.g. nytimes.com/politics/first-draft/2015/07/07/marco-rubio-attacks-higher-education-cartel-and-jabs-rivals).
#
# Using the lexical signal, we can represent an event as a collection of words that occur together often and around the same time. Ideally we want to identify events that are new and significant, so we need to balance uniqueness with frequency. To model this problem I represent the counts of word co-occurrence over time as a tensor and then do a [PARAFAC decomposition][1].
#
# This is a similar approach to [Latent Semantic Analysis][2], but we are also considering the temporal element. In LSA, we have a matrix of terms and documents that we can use to determine the latent factors. For this data, we have terms, documents, and time, but each document is only associated with a single timestamp, so simply adding temporal information would not provide additional benefit. Instead, we can treat each term as a "document", so that every term is represented by a matrix of context vectors over time. The tensor $\mathcal{X}$ is a $|W| \times |W| \times |T|$ tensor where $x_{ijt}$ is the number of times word $i$ and word $j$ appear together at time $t$.
#
# To prepare the data for use in tensor decomposition we need to filter and weight the terms in our data (a sketch of steps 1-5 follows the list):
# 1. Remove any non-English Tweets by running a language classifier on the Tweet text.
# 2. Extract words from the slug by splitting on '/', then splitting each component on '-' or '_' and taking the component with the most words, and finally removing anything non-alphabetic.
# 3. Stem, lowercase, and remove stopwords from the slug and Tweet.
# 4. Take the intersection of the slug words and the Tweet words. Assuming that the slug contains useful words (but occasionally numeric or hexadecimal identifiers) and the Tweet contains a description of the link, the intersection should identify the most important lexical information. Furthermore, this reduces the number of elements in the tensor and gives us a [sparse tensor that makes the decomposition feasible][3] in terms of runtime.
# 5. Ignore all words appearing fewer than two times.
# 6. Weight the counts using TF-IDF.
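#
# Below is a minimal sketch (Python 3) of preprocessing steps 1-5, assuming the raw data is a list of (Tweet text, URL) pairs. The `langdetect` classifier and NLTK's Porter stemmer and stopword list are stand-ins for whatever tools are actually used, and the helper names (`extract_slug_words`, `clean`, `document_terms`, `filter_rare_terms`) are hypothetical.

import re
from collections import Counter
from urllib.parse import urlparse

from langdetect import detect        # stand-in language classifier (assumption)
from nltk.corpus import stopwords    # requires nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words('english'))


def extract_slug_words(url):
    """Step 2: split the URL path on '/', split each component on '-' or '_',
    keep the component with the most words, and drop non-alphabetic tokens."""
    components = [re.split(r'[-_]', c) for c in urlparse(url).path.split('/') if c]
    if not components:
        return []
    best = max(components, key=len)
    return [w for w in best if w.isalpha()]


def clean(words):
    """Step 3: lowercase, stem, and remove stopwords."""
    return {STEMMER.stem(w.lower()) for w in words if w.lower() not in STOPWORDS}


def document_terms(tweet_text, url):
    """Steps 1-4: keep English Tweets and intersect slug words with Tweet words."""
    if detect(tweet_text) != 'en':                            # step 1
        return set()
    slug_words = clean(extract_slug_words(url))               # steps 2-3
    tweet_words = clean(re.findall(r'[A-Za-z]+', tweet_text))
    return slug_words & tweet_words                           # step 4


def filter_rare_terms(docs, min_count=2):
    """Step 5: drop words that appear fewer than `min_count` times overall."""
    counts = Counter(w for terms in docs for w in terms)
    return [{w for w in terms if counts[w] >= min_count} for terms in docs]
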
# For the TF-IDF weighting in step 6, each term is $x_{ijt} = \log(tf_{ijt}+1) \cdot idf_j$, where $tf_{ijt}$ is the raw co-occurrence count and $idf_j = \log\left(\frac{|W|}{|\{w_i \in W : \exists d \in D \text{ s.t. } w_i, w_j \in d\}|}\right)$, with $D$ the set of Tweets/articles. The reason for using a globally-weighted IDF rather than an IDF based only on daily occurrences is to promote the topics that are unique for the entire data set. Because of the TF-IDF weighting, the $|W| \times |W|$ slices of the tensor are not symmetric.
#
# The algorithm used is a [non-negative decomposition][4]. We use alternating least squares to find an approximation to the tensor $\mathcal{X}$:
#
# $$ \min_{A,B,C} \left\| \mathcal{X} - \sum^R_{r=1} \lambda_r \, A_r \circ B_r \circ C_r \right\|_F^2 $$
# $$ \text{s.t. } \forall i,j,k,r:\ A_{ir}, B_{jr}, C_{kr} \ge 0 $$
#
# The matrices $A$, $B$, and $C$ are initialized with [HOSVD][5].
#
# The resulting tensor decomposition in PARAFAC/CANDECOMP form consists of three matrices of element weights and a vector of topic weights. The matrix $A$ is a $|W| \times R$ matrix that indicates the importance of each "document" for each topic, and similarly the matrix $B$ is a $|W| \times R$ matrix indicating the importance of each word. The matrix $C$ is a $|T| \times R$ matrix that indicates the importance of each day for each topic. To rank the topics, for example when we only want to show the top $K$, we calculate the weight $\lambda_r = \|A_r\| \cdot \|B_r\| \cdot \|C_r\|$ for each topic $r$, where $A_r$, $B_r$, and $C_r$ are the $r$-th columns.
#
# Finally, to find an article most representative of a topic, we first identify the highest-weighted time for that topic, $wt_r = \arg\max_{1 \le t \le |T|} C_{tr}$. Then, for each article, we filter and process the text as above to obtain a query vector $q$. The cosine similarity between $q$ and the columns of the term-weight matrix $B$ is used to assign the article to a topic: $r_a = \arg\max_{1 \le r \le R} \frac{q^T B_r}{\|q\| \cdot \|B_r\|}$. If the time of the article matches the peak time $wt_{r_a}$ of its assigned topic, we consider it as a possible representative for that topic; otherwise it is ignored. Out of all the articles assigned to a topic, the most representative is the one with the highest cosine similarity. A sketch of the decomposition and article-selection steps follows the reference links below.
#
# [1]: https://en.wikipedia.org/wiki/Tensor_rank_decomposition
# [2]: https://en.wikipedia.org/wiki/Latent_semantic_analysis
# [3]: http://www.cs.rit.edu/~amt/linkanalysis06/accepted/21.pdf
# [4]: http://www.cc.gatech.edu/grads/b/bkkumar/papers/ntf_siamtm.pdf
# [5]: https://en.wikipedia.org/wiki/Higher-order_singular_value_decomposition
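#
# Below is a minimal sketch of the decomposition and article-selection steps. It assumes the weighted counts have been assembled into a dense NumPy array `X` of shape (|W|, |W|, |T|) and uses tensorly's `non_negative_parafac` with its built-in `'svd'` initialization as a stand-in for the non-negative ALS with HOSVD initialization described above (the exact update rules may differ from the linked paper). The helper names and the `articles` structure, a list of (day index, query vector) pairs, are hypothetical.

import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac


def decompose(X, R):
    """Non-negative PARAFAC of the |W| x |W| x |T| array X into R topics.
    tensorly's 'svd' init (per-mode SVD) stands in for the HOSVD initialization."""
    weights, factors = non_negative_parafac(tl.tensor(X), rank=R, init='svd', n_iter_max=200)
    A, B, C = [tl.to_numpy(f) for f in factors]  # A: |W| x R, B: |W| x R, C: |T| x R
    # Topic weights lambda_r = ||A_r|| * ||B_r|| * ||C_r||, computed from the column norms
    lam = (np.linalg.norm(A, axis=0)
           * np.linalg.norm(B, axis=0)
           * np.linalg.norm(C, axis=0))
    return A, B, C, lam


def top_topics(lam, K):
    """Indices of the K highest-weighted topics."""
    return np.argsort(lam)[::-1][:K]


def peak_time(C, r):
    """wt_r: the highest-weighted day for topic r."""
    return int(np.argmax(C[:, r]))


def assign_topic(q, B):
    """Assign a query vector q (built with the same preprocessing and TF-IDF
    weighting as above) to the topic whose column B_r is most cosine-similar."""
    sims = (q @ B) / (np.linalg.norm(q) * np.linalg.norm(B, axis=0) + 1e-12)
    r = int(np.argmax(sims))
    return r, float(sims[r])


def representative_articles(articles, B, C):
    """For each topic, keep the article with the highest similarity among those
    whose day matches the topic's peak day. `articles` is a (hypothetical) list
    of (day_index, query_vector) pairs."""
    best = {}
    for day, q in articles:
        r, sim = assign_topic(q, B)
        if day == peak_time(C, r) and sim > best.get(r, (None, -1.0))[1]:
            best[r] = (q, sim)
    return best


# Example usage (hypothetical sizes): decompose a 500 x 500 x 30 tensor into 10 topics
# and print the peak day of each of the top 5:
#   A, B, C, lam = decompose(X, R=10)
#   for r in top_topics(lam, K=5):
#       print(r, peak_time(C, r))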