CS194-16 Introduction to Data Science
NOTE: click near here to select this cell; Esc-Enter will get you into cell edit mode, Shift-Enter gets you back
Name: Please put your name
Student ID: Please put your student ID
This homework explores the use of synthetic (XML) and natural language parsing for data preparation. It comprises 3 parts:
We assume you have copied this HW notebook, the Stanford parser archive, and the reviews archive into the same directory. Unpack the latter two:
tar xvzf reviews.tar.gz
tar xvzf stanfordparser.tar.gz
and then copy the parser into /opt:
sudo mv StanfordParser /opt
Finally, if you haven't already done so, create a personal bin directory:
mkdir ~/bin
Scripts or links in that directory will then be on your path, which will be useful for running the Stanford parser (and other tools) later. The path is set in your login script, so to pick up the new bin directory you have to log out and log back in. In the top right-hand corner of the VM window you will find a gear-shaped icon; clicking it yields a drop-down menu with a logout option. Log out, and then log back in when you see the login screen.
We will be using Python's ElementTree API which you can read about here:
https://docs.python.org/2/library/xml.etree.elementtree.html
Start by loading some XML data.
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse('reviews/video/reviews.xml',parser)
By the way, the data is actually far from perfect XML. To see some of the defects, remove the "parser" argument from the last line so that it parses with the (default) strict parser instead. You will see it fail at an invalid character string somewhere in the file. You could fix that and find the next problem... but it's better to use an auto-recovering parser like the one above, as in the sketch below.
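A minimal sketch of the contrast (the exact error and where it occurs depend on the file):

from lxml import etree

# Default strict parser: raises XMLSyntaxError at the first defect.
try:
    etree.parse('reviews/video/reviews.xml')
except etree.XMLSyntaxError as e:
    print('strict parse failed: %s' % e)

# Recovering parser: skips past the defects and returns a usable tree.
tree = etree.parse('reviews/video/reviews.xml', etree.XMLParser(recover=True))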
TODO: What kinds of errors did you see in the file? (hit esc-Enter to edit this, then ctl-Enter to save)
Now let's look at the contents of this tree.
root = tree.getroot()
root
Nodes have a tag (a name), and possibly attributes (empty in this case)
root.tag
root.attrib
The children of the root node are accessible using square-bracket notation. They will be individual reviews. You can then examine each review node's children by adding additional square-bracket indices. Do this now and explore the parse tree. Compare with the file contents (use a text editor to see it).
root[2]
You can also use the colon notation to retrieve all the children of a node:
root[22][:]
Notice that same-named elements can occur multiple times, e.g. unique_id and product_type
The "contents" of a node are usually held in its text field, which you access like this:
root[2][0]
root[2][0].text
Now we can look at the contents of the other "unique_id" node:
root[2][1].text
The find() and findall() methods allow you to find one or (respectively) all of the children of a node with a particular tag.
root[10].find('product_name').text
root.findall('review')[:10]
root[3].find('review_text').text
Use the ElementTree methods to construct a DataFrame containing 11 columns corresponding to the 11 distinct child node types of each review node. Each row should represent a single review. For nodes that may be repeated, like "unique_id", include a list of the node values in that field.
TODO: What fraction of the XML review records have two "unique_id" nodes? What fraction have two "product_type" nodes?
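A minimal Python 2 sketch of one way to build the table and answer the question; for simplicity every field is stored as a list here (pandas fills any missing tags with NaN):

import pandas as pd

reviews = root.findall('review')

rows = []
for review in reviews:
    row = {}
    for child in review:
        # Collect repeated tags (e.g. unique_id) into a list per field.
        row.setdefault(child.tag, []).append(child.text)
    rows.append(row)

df = pd.DataFrame(rows)

# Fraction of reviews with exactly two unique_id / product_type nodes.
frac_ids = sum(len(r.get('unique_id', [])) == 2 for r in rows) / float(len(rows))
frac_types = sum(len(r.get('product_type', [])) == 2 for r in rows) / float(len(rows))
print('two unique_id: %.3f, two product_type: %.3f' % (frac_ids, frac_types))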
Finally, save the DataFrame as a CSV file (you can use a Pandas built-in to do this).
For the review text, you should create one file with a unique name per review containing only the review text. The names should be review_text#####.txt where ##### is the number of the review.
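A sketch of both steps, reusing the reviews list and df from above (the CSV filename is an arbitrary choice; the encode call assumes Python 2, matching the docs linked earlier):

df.to_csv('reviews.csv', index=False, encoding='utf-8')

for i, review in enumerate(reviews):
    # Some review_text nodes may be empty, hence the 'or'.
    text = review.find('review_text').text or ''
    with open('review_text%05d.txt' % i, 'w') as f:
        f.write(text.encode('utf-8'))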
In the preamble for this HW, you put the Stanford Parser in the /opt directory, and you also created a ~/bin directory. You can use these to put Stanford Parser commands in your path without having to add several new directories to your $PATH variable. There are three commands we will need initially.
Open a terminal window and create symlinks like this:
ln -s /opt/StanfordParser/lexparser.sh ~/bin/lexparser.sh
ln -s /opt/StanfordParser/lexparser-gui.sh ~/bin/lexparser-gui.sh
ln -s /opt/StanfordParser/dependencyviewer/dependencyviewer.sh ~/bin/dependencyviewer.sh
and then type:
lexparser-gui.sh
This brings up a GUI interface to the Stanford parser. To use it, click on "Load Parser" which brings up a file selection dialog. Navigate to
/opt/StanfordParser/stanford-parser-3.4.1-models.jar
and open it.
Then you will see a list of parsers to use. Select
englishPCFG.ser.gz
You're now ready to parse some text!
Click on "Load File" and navigate back to your HW3 directory (you'll have to go all the way up to "/", and down through "/home"). Load your review text file
review_text00000.txt
which will display the text with the first sentence highlighted. Now click on "Parse" which will bring up a graphical display of the parsed sentence.
TODO: Did the sentence parse correctly?
Parse the other sentences from this file. Notice that the yellow highlight is for standard sentences (broken at periods) but that some of these sentences are broken into sentence subparts.
This parse tree shows a standard (constituency) tree. Usually we will want to work with dependency trees. To view a dependency tree for the sentences in this file, do
dependencyviewer.sh -in review_text00000.txt
(note the extra "-in" option for this parser). This brings up a window with tabs for each of the sentences. Click through each sentence and contrast the dependency parse tree with the constituency tree in the other window.
Note: Both parsers consume quite a bit of memory so you may need to close the constituency tree viewer before starting the dependency viewer.
TODO: What are the root nodes for each sentence-like fragment in sentence 5?
The parser also contains scripts for parsing text into structured output. Now run
lexparser.sh review_text00000.txt
You will see both constituency and dependency tree output for each sentence. These formats are ad hoc, though, and not easy for a machine to work with. You can customize the parser startup script. In the main parser directory you will find a script:
/opt/StanfordParser/lexparser.sh
Make your own copy of this script in the same directory; say, call it:
/opt/StanfordParser/dependencyparser.sh
This file may not be executable, depending on how you copied it. To make sure it is, do:
chmod 755 dependencyparser.sh
in the Stanford Parser directory. Now open the script in an editor. It contains an invocation of the parser with the option
-outputFormat "penn,typedDependencies"
We won't need the Penn-format output, so you can remove "penn" from the options. However, we need XML output instead of the standard output. To do that, add this option:
-outputFormatOptions "xml"
after the -outputFormat option (yes, the names are confusing). Save the file.
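After these edits, the java invocation inside dependencyparser.sh should look roughly like the line below. This is a sketch; the memory flag and classpath details in your copy may differ:

java -mx150m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -outputFormat "typedDependencies" -outputFormatOptions "xml" \
  edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz $*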
Now from a terminal prompt, create a new symlink from your ~/bin directory to the dependencyparser.sh script. You should now be able to change to the directory containing your sentences and type:
dependencyparser.sh review_text00000.txt
You will see some diagnostic messages, and the XML data. The parser actually sends the XML only to stdout and the diagnostics to stderr. To get just the XML in a file you can do:
dependencyparser.sh review_text00000.txt > review_parsed00000.xml
Now write a bash script (or do it in Python, as sketched below, if you know how to invoke shell commands) to iterate over the input files and produce parsed copies, i.e. by replacing "00000" in the filenames above with a series of integer indices. HINT: the bash construct for integer iteration is
for i in `seq 0 xxx`; do ... ; done
and to get a fixed-length integer string in a file name do:
fname=`printf "review_text%05d.txt" $i`
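Alternatively, a minimal Python sketch using subprocess; it assumes dependencyparser.sh is on your path via the ~/bin symlink you just created:

import subprocess

for i in range(100):  # at least the first 100 reviews (see the note below)
    infile = 'review_text%05d.txt' % i
    outfile = 'review_parsed%05d.xml' % i
    with open(outfile, 'w') as out:
        # XML goes to stdout (captured here); diagnostics still go to stderr.
        subprocess.call(['dependencyparser.sh', infile], stdout=out)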
NOTE: Parsing is very time-consuming. You don't have to parse all the reviews, but do at least, say, the first 100.
TODO: Give the total file size (e.g. using "du" on the directory containing them) for the unparsed text files and the total for the parsed XML files.
Use the ElementTree API to read an XML dependency parse tree from the files that you just created.
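A sketch of reading one parsed file. It assumes the parser wraps each dependency in elements like <dep type="amod"><governor>...</governor><dependent>...</dependent></dep>; inspect one of your XML files and adjust the tag names if they differ. Note also that if a file contains several top-level elements (one per sentence) you may need the recovering parser or a wrapper root element:

from lxml import etree

parsed = etree.parse('review_parsed00000.xml', etree.XMLParser(recover=True))
for dep in parsed.getroot().iter('dep'):
    print('%s(%s, %s)' % (dep.get('type'),
                          dep.find('governor').text,
                          dep.find('dependent').text))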
Write a function to recognize targets and associated sentiment. The simplest pattern is to look for a word (a noun) with an amod modifier, e.g. X amod Y, where X is the governor and Y is the dependent. In the first sentence we have "remake" amod "bloated", for example:
"remake" amod "bloated"
You also have
"thriller" amod "quick"
"thriler" amod "clean"
"thriller" amod "near-perfect"
"thriller" amod "minimalist"
All of this is useful sentiment information that you can put in the table. There are also more complicated relationships, e.g. "remake" connects to the film, and "thriller" connects to "remake". So the sentiments attached to "thriller" could be inherited by "remake" and thence by "film" using those links (i.e. looking for patterns of three connected nodes). I want you to think about whether those are good patterns or not; you should look at more sample sentences to decide. The sentiment connection doesn't have to be perfect, i.e. a remake doesn't have to inherit the attributes of the original film. But to some extent it does, and putting more (even noisy) sentiment connections in the table gives you more data over which to look for patterns.
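A minimal sketch of the amod extractor, under the same XML-format assumption as above:

def amod_pairs(tree):
    # Return (target, sentiment) pairs: for each amod dependency the
    # governor is the target noun and the dependent the sentiment word.
    pairs = []
    for dep in tree.getroot().iter('dep'):
        if dep.get('type') == 'amod':
            pairs.append((dep.find('governor').text,
                          dep.find('dependent').text))
    return pairs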
Write one more function that finds a pattern of (target, sentiment words or phrases). This time, define your own pattern by looking through the dependency trees output from part 2.
Apply these two functions to each parsed sentence, and concatenate their outputs. Finally, concatenate the lists from all sentences. From the final list, construct a DataFrame with "target" and "sentiment" columns. In the space below, cut and paste the first 100 rows of this table (or fewer if you don't have 100 rows from all the sentences from part 2).
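A sketch of the final assembly, reusing amod_pairs from above; my_pattern_pairs is a hypothetical name standing in for your second, custom extractor:

import pandas as pd
from lxml import etree

all_pairs = []
for i in range(100):  # or however many files you parsed
    tree = etree.parse('review_parsed%05d.xml' % i, etree.XMLParser(recover=True))
    all_pairs += amod_pairs(tree) + my_pattern_pairs(tree)

sentiment_df = pd.DataFrame(all_pairs, columns=['target', 'sentiment'])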
Save this notebook and submit it here.
TODO: Put your analysis code here.
TODO: Put <=100 rows of your target/sentiment table below: