#!/usr/bin/env python
# coding: utf-8

# # Overview
#
# We're now switching focus away from Network Science (for a little bit) and beginning to think about _Natural Language Processing_ instead. In other words, today will be all about teaching your computer to "understand" text. This ties in nicely with our work on Wikipedia, since Wikipedia is a network of connected pieces of text. We've looked at the network so far - now, let's see if we can include the text. Today is about
#
# * Installing the _natural language toolkit_ (NLTK) package and learning the basics of how it works (Chapter 1)
# * Figuring out how to make NLTK work with other types of text (Chapter 2).

# > **_Video Lecture_**. Today is all about working with NLTK, so not much lecturing - you can get my perspective and a little pep-talk in the video below.

# In[2]:


from IPython.display import YouTubeVideo
YouTubeVideo("Ph0EHmFT3n4", width=800, height=450)


# # Installing and the basics
#
# > _Reading_
# > The reading for today is Natural Language Processing with Python, first edition (NLPP1e) Chapter 1, Sections 1.1, 1.2, 1.3. [It's free online](http://www.nltk.org/book_1ed/).
# >
# > * **Important**: Do not use the newest version of this book. Use the first edition. (The newest version is based on Python 3).
# > * **Important**: Seriously, remember that we're using the *first edition*.
# >
# > _Exercises_: NLPP1e Chapter 1.
# >
# > * First, install `nltk` if it isn't installed already (there are some tips below that I recommend checking out before installing).
# > * Second, work through chapter 1. The book is set up as a kind of tutorial with lots of examples for you to work through. I recommend you read the text with an open IPython Notebook and type out the examples that you see. ***It becomes much more fun if you add a few variations and see what happens***. Some of those examples might very well be due as assignments (see below the install tips), so those ones should definitely be in a `notebook`.
#
# ### NLTK Install tips
#
# Check to see if `nltk` is installed on your system by typing `import nltk` in a `notebook`. If it's not already installed, install it as part of _Anaconda_ by typing
#
#     conda install nltk
#
# at the command prompt. If you don't have the various corpora, you can download them using a command-line version of the downloader that runs in Python notebooks. In the IPython notebook, run the code
#
#     import nltk
#     nltk.download()
#
# Now you can hit `d` to download, then type "book" to fetch the collection needed for today's `nltk` session. Now that everything is up and running, let's get to the actual exercises.

# > _Exercises_: NLPP1e Chapter 1 (the stuff that might be due in an upcoming assignment).
# >
# > The following exercises from Chapter 1 are what might be due in an assignment later on.
# >
# > * Try out the `concordance` method, using another text and a word of your own choosing.
# > * Also try out the `similar` and `common_contexts` methods for a few of your own examples.
# > * Create your own version of a dispersion plot ("your own version" means another text and a different word).
# > * Explain in your own words what aspect of language _lexical diversity_ describes.
# > * Create frequency distributions for `text2`, including the cumulative frequency plot for the 75 most common words.
# > * What is a bigram? How does it relate to `collocations`? Explain in your own words.
# > * Work through exercises 2-12 in NLPP1e, section 1.8.
# > * Work through exercises 15, 17, 19, 22, 23, 26, 27, 28 in section 1.8.

# # Working with NLTK and other types of text
#
# So far, we've worked with text from Wikipedia. But that's not the only source of text in the universe. In fact, it's far from it. Chapter 2 in NLPP1e is all about getting access to nicely curated texts that you can find built into NLTK.
# >
# > _Reading_: NLPP1e Chapter 2.1 - 2.4.
# >
# > _Exercises_: NLPP1e Chapter 2.
# >
# > * Solve exercises 4, 8, 11, 15, 16, 17, 18 in NLPP1e, section 2.8. As always, I recommend you write up your solutions nicely in a `notebook`.
# > * Work through exercise 2.8.23 on Zipf's law. [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) connects to a property of the Barabasi-Albert networks. Which one? Take a look at [this article](http://www.hpl.hp.com/research/idl/papers/ranking/adamicglottometrics.pdf) and write a paragraph or two describing other important instances of power laws found on the internet.
# >

# In[ ]:
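

# To get started on the Zipf's law exercise, here is a minimal sketch of the rank-frequency table it is built on. It is not a full solution. The token list is a tiny made-up stand-in for a real corpus, and the helper name `rank_frequency` is hypothetical (it is not part of NLTK). Zipf's law says that a word's frequency is roughly proportional to 1/rank, so on a large text the product rank * count should stay roughly constant.

```python
from __future__ import print_function

from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(token.lower() for token in tokens)
    ranked = counts.most_common()  # (word, count) pairs, highest count first
    return [(rank, word, count, rank * count)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Toy text, an assumption for illustration only: far too small to show Zipf's law.
toy_text = ("the cat and the dog and the bird saw the cat "
            "chase the dog while the bird sang")
rows = rank_frequency(toy_text.split())

# Print the top of the table: rank, word, count, rank * count.
for rank, word, count, product in rows[:5]:
    print(rank, word, count, product)
```

# For the exercise itself, swap the toy tokens for the words of a real text, such as one of the NLTK corpora from Chapter 2, and plot count against rank on log-log axes. A roughly straight line is the signature of a power law.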