# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import urllib2 # module to read in HTML
import bs4 # BeautifulSoup: module to parse HTML and XML
import json # module to encode and decode JSON
import datetime as dt # module for manipulating dates and times
import pandas as pd
import numpy as np
Previously discussed:
urllib2 is a useful module for getting information about, and retrieving data from, the web. The function `urlopen()` opens a URL (similar to opening a file). The file-like object it returns has some of the same methods as a file object. For example, to read the entire HTML of the webpage into a single string, use the method `read()`. `readlines()` reads the text line by line and returns a list of lines, and `close()` closes the URL connection.
x = urllib2.urlopen("http://www.google.com")
htmlSource = x.read()
x.close()
type(htmlSource)
print htmlSource
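To try the file-like methods without a network connection, the sketch below uses `io.BytesIO` as a stand-in for the object returned by `urlopen()`. The stand-in and the HTML snippet are assumptions made for illustration; a real response behaves similarly for `read()`, `readlines()`, and `close()`, but it cannot be rewound with `seek()`.

```python
import io

# Hypothetical stand-in for the file-like object returned by urlopen()
fake_response = io.BytesIO(b"<html>\n<body>Hello</body>\n</html>\n")

# read() returns the whole payload as a single byte string
whole = fake_response.read()

# Rewind so the same data can be read again (only possible on the stand-in)
fake_response.seek(0)

# readlines() returns a list with one entry per line, newlines kept
lines = fake_response.readlines()

# close() releases the underlying resource
fake_response.close()
```

After `read()` has consumed the stream, a second `read()` returns an empty string, which is why the sketch rewinds before calling `readlines()`.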
Once you have the HTML source code, you have to parse it and clean it up.
BeautifulSoup is a really useful Python module for parsing HTML and XML files. Let's try a few examples.
For this section, we will be working with the HTML code from Reddit.
x = urllib2.urlopen("http://www.reddit.com") # Opens URLS
htmlSource = x.read()
x.close()
print htmlSource
### prettify()
Beautiful Soup gives us a `BeautifulSoup` object, which represents the document as a nested data structure. We can use the `prettify()` function to show the different levels of the HTML code.
soup = bs4.BeautifulSoup(htmlSource)
print soup.prettify()
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:
print soup.head.prettify()
A tag’s children are available in a list called `.contents`.
soup.head.contents
len(soup.head.contents)
# Extract first three elements from the list of contents
soup.head.contents[0:3]
Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
soup.head.children
for child in soup.head.children:
    print(child)
# print the title of reddit
soup.head.title
# print the string in the title
soup.head.title.string
The `.descendants` attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:
for child in soup.head.descendants:
    print child
If there’s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:
for string in soup.strings:
    print(repr(string))
These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:
for string in soup.stripped_strings:
    print(repr(string))
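The Reddit page is large, so the difference between these navigation attributes is easier to see on a tiny document. The sketch below is an illustration only: the HTML string is made up, and the parser is set to `"html.parser"` explicitly (an assumption; the cells above let Beautiful Soup pick a parser).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical toy document, not taken from Reddit
html = "<div><p>Hi <b>there</b></p><span>bye</span></div>"
soup = BeautifulSoup(html, "html.parser")

# .contents and .children only cover the direct children of <div>
direct_names = [child.name for child in soup.div.children]
n_contents = len(soup.div.contents)

# .descendants walks the whole subtree, including the text nodes
all_nodes = list(soup.div.descendants)
nested_tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]

# .stripped_strings yields only the text, with surrounding whitespace removed
texts = list(soup.div.stripped_strings)
```

Here `.children` sees only `<p>` and `<span>`, while `.descendants` also reaches the nested `<b>` tag and the three text nodes.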
You can access an element’s parent with the `.parent` attribute. In the Reddit page, the `<head>` tag is the parent of the `<title>` tag, and the `<title>` tag is in turn the parent of the string it contains:
soup.title
soup.title.string
soup.title.string.parent
Now, let's consider examples of different filters you can use to search this nested tree of HTML. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
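As a rough illustration of these filter types, the sketch below runs `find_all()` on a small hand-written list rather than on Reddit. The HTML, the class name `ext`, and the URLs are all made up for the example, and `"html.parser"` is chosen explicitly as an assumption.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical toy document used only to demonstrate the filters
html = """<ul>
<li><a class="ext" href="http://example.com/a">Alpha</a></li>
<li><a href="/local/b">Beta</a></li>
<li><b>Bold</b></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# Filter on the tag name
by_name = soup.find_all("a")

# Filter on the tag name plus an attribute (class_ avoids the Python keyword)
by_attr = soup.find_all("a", class_="ext")

# Filter with a regular expression matched against tag names
by_regex = soup.find_all(re.compile("^b"))

# Filter on an attribute value alone, again with a regular expression
by_href = soup.find_all(href=re.compile("^http"))
```

Each call returns a list-like result of matching tags, so the results can be indexed, counted with `len()`, or looped over.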
### find_all() to find all tags
One common task is extracting all the URLs found within a page's `<a>` tags:
# search for all <a> tags; returns a list
soup.find_all('a')
# your turn
# search for all the paragraph tags
# your turn
# search for all the table tags
Other arguments to the `.find_all()` function include `limit` and `text`. What do those do?
# your turn
# search for all the <a> tags and use the limit argument
# your turn
# What does using the text argument do?
### .get() to extract an attribute
soup.find_all('a')[1].get('href')
# your turn
# write a for loop printing all the links from reddit
# your turn
# write a for loop, but use a list comprehension this time
# show the first 5 elements
# your turn
# split the first url by "/"
Another common task is extracting all the text from a page:
print(soup.get_text())
a = {'a': 1, 'b':2}
s = json.dumps(a)
a2 = json.loads(s)
a # a dictionary
s # s is a string containing a in JSON encoding
a2 # after reading back, the keys are unicode strings
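The round trip above can be extended a little to show how Python values map onto JSON text. This is a minimal sketch with made-up data; the dictionary `record` and its keys are hypothetical and only loosely modeled on the match data used below.

```python
import json

# Hypothetical record mixing strings, numbers, a list, a boolean and None
record = {
    "team": "Germany",
    "goals": 7,
    "events": ["goal", "substitution-in"],
    "winner": True,
    "note": None,
}

# dumps() serializes the dictionary to a JSON string
encoded = json.dumps(record, sort_keys=True)

# loads() parses the JSON string back into Python objects
decoded = json.loads(encoded)
```

In the encoded string, `True` becomes `true` and `None` becomes `null`; decoding reverses the mapping, so `decoded` compares equal to `record`.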
The 2014 FIFA World Cup was held this summer in Brazil at several different venues. There was an API created for the World Cup that scraped current match results and output match data as JSON. Possible output includes events such as goals, substitutions, and cards. The actual matches are listed here in JSON.
url = "http://worldcup.sfg.io/matches"
data = urllib2.urlopen(url).read()
wc = json.loads(data.decode('utf-8'))
"Number of matches in 2014 World Cup: %i" % len(wc)
'Number of matches in 2014 World Cup: 64'
# Print keys in first match
gameIndex = 60
wc[gameIndex].keys()
[u'status', u'match_number', u'home_team', u'away_team', u'winner_code', u'winner', u'away_team_events', u'datetime', u'location', u'home_team_events']
wc[gameIndex]['status']
u'completed'
wc[gameIndex]['match_number']
61
wc[gameIndex]['away_team']
{u'code': u'GER', u'country': u'Germany', u'goals': 7}
wc[gameIndex]['away_team_events']
[{u'id': 1354, u'player': u'M\xdcller', u'time': u'11', u'type_of_event': u'goal'}, {u'id': 1355, u'player': u'Klose', u'time': u'23', u'type_of_event': u'goal'}, {u'id': 1356, u'player': u'Kroos', u'time': u'24', u'type_of_event': u'goal'}, {u'id': 1357, u'player': u'Kroos', u'time': u'26', u'type_of_event': u'goal'}, {u'id': 1358, u'player': u'Khedira', u'time': u'29', u'type_of_event': u'goal'}, {u'id': 1363, u'player': u'Hummels', u'time': u'46', u'type_of_event': u'substitution-out halftime'}, {u'id': 1364, u'player': u'Mertesacker', u'time': u'46', u'type_of_event': u'substitution-in halftime'}, {u'id': 1365, u'player': u'Klose', u'time': u'58', u'type_of_event': u'substitution-out'}, {u'id': 1366, u'player': u'Sch\xdcrrle', u'time': u'58', u'type_of_event': u'substitution-in'}, {u'id': 1370, u'player': u'Sch\xdcrrle', u'time': u'69', u'type_of_event': u'goal'}, {u'id': 1372, u'player': u'Draxler', u'time': u'76', u'type_of_event': u'substitution-in'}, {u'id': 1371, u'player': u'Khedira', u'time': u'76', u'type_of_event': u'substitution-out'}, {u'id': 1373, u'player': u'Sch\xdcrrle', u'time': u'79', u'type_of_event': u'goal'}]
wc[gameIndex]['home_team']
{u'code': u'BRA', u'country': u'Brazil', u'goals': 1}
This is the [Brazil v Germany (2014 FIFA World Cup)](http://en.wikipedia.org/wiki/Brazil_v_Germany_(2014_FIFA_World_Cup)) match on July 8, 2014, in which Germany became the highest-scoring team in World Cup tournament history. Germany led 5–0 at half time, with 4 goals scored in a span of 6 minutes, and subsequently brought the score up to 7–0 in the second half. Brazil scored a goal at the last minute, ending the match 7–1.
Print the team names and goals scored for each match
for elem in wc:
    print elem['home_team']['country'], elem['home_team']['goals'], elem['away_team']['country'], elem['away_team']['goals']
Brazil 3 Croatia 1
Mexico 1 Cameroon 0
Spain 1 Netherlands 5
Chile 3 Australia 1
Colombia 3 Greece 0
Ivory Coast 2 Japan 1
Uruguay 1 Costa Rica 3
England 1 Italy 2
Switzerland 2 Ecuador 1
France 3 Honduras 0
Argentina 2 Bosnia and Herzegovina 1
Iran 0 Nigeria 0
Germany 4 Portugal 0
Ghana 1 USA 2
Belgium 2 Algeria 1
Russia 1 Korea Republic 1
Brazil 0 Mexico 0
Cameroon 0 Croatia 4
Spain 0 Chile 2
Australia 2 Netherlands 3
Colombia 2 Ivory Coast 1
Japan 0 Greece 0
Uruguay 2 England 1
Italy 0 Costa Rica 1
Switzerland 2 France 5
Honduras 1 Ecuador 2
Argentina 1 Iran 0
Nigeria 1 Bosnia and Herzegovina 0
Germany 2 Ghana 2
USA 2 Portugal 2
Belgium 1 Russia 0
Korea Republic 2 Algeria 4
Cameroon 1 Brazil 4
Croatia 1 Mexico 3
Australia 0 Spain 3
Netherlands 2 Chile 0
Japan 1 Colombia 4
Greece 2 Ivory Coast 1
Italy 0 Uruguay 1
Costa Rica 0 England 0
Honduras 0 Switzerland 3
Ecuador 0 France 0
Nigeria 2 Argentina 3
Bosnia and Herzegovina 3 Iran 1
USA 0 Germany 1
Portugal 2 Ghana 1
Korea Republic 0 Belgium 1
Algeria 1 Russia 1
Brazil 1 Chile 1
Colombia 2 Uruguay 0
Netherlands 2 Mexico 1
Costa Rica 1 Greece 1
France 2 Nigeria 0
Germany 2 Algeria 1
Argentina 1 Switzerland 0
Belgium 2 USA 1
Brazil 2 Colombia 1
France 0 Germany 1
Netherlands 0 Costa Rica 0
Argentina 1 Belgium 0
Brazil 1 Germany 7
Netherlands 0 Argentina 0
Brazil 0 Netherlands 3
Germany 1 Argentina 0
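The same nested lookups can be used to summarize each match instead of just printing it. The sketch below works offline on three results copied from the output above; the helper `describe()` and the list name `sample` are hypothetical, and the "draw" label only compares the listed goals, so it does not account for knockout matches decided on penalties.

```python
# Three matches copied from the printed results, in the same nested
# structure as the entries of wc (only the keys used here are included)
sample = [
    {"home_team": {"country": "Brazil", "goals": 3},
     "away_team": {"country": "Croatia", "goals": 1}},
    {"home_team": {"country": "Iran", "goals": 0},
     "away_team": {"country": "Nigeria", "goals": 0}},
    {"home_team": {"country": "Brazil", "goals": 1},
     "away_team": {"country": "Germany", "goals": 7}},
]

def describe(match):
    """Return a one-line summary with the side that scored more goals."""
    home, away = match["home_team"], match["away_team"]
    if home["goals"] > away["goals"]:
        result = home["country"]
    elif home["goals"] < away["goals"]:
        result = away["country"]
    else:
        result = "draw"
    return "%s %d-%d %s (%s)" % (
        home["country"], home["goals"], away["goals"], away["country"], result)

summaries = [describe(m) for m in sample]

# Total goals across the sample, summing both sides of every match
total_goals = sum(m["home_team"]["goals"] + m["away_team"]["goals"]
                  for m in sample)
```

Passing `wc` instead of `sample` should work the same way, assuming every entry carries the `home_team` and `away_team` dictionaries shown earlier.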
data = pd.DataFrame(wc, columns = ['match_number', 'location', 'datetime', 'home_team', 'away_team', 'winner', 'home_team_events', 'away_team_events'])
data.head()
| | match_number | location | datetime | home_team | away_team | winner | home_team_events | away_team_events |
---|---|---|---|---|---|---|---|---|
0 | 1 | Arena de Sao Paulo | 2014-06-12T17:00:00.000-03:00 | {u'country': u'Brazil', u'code': u'BRA', u'goa... | {u'country': u'Croatia', u'code': u'CRO', u'go... | Brazil | [{u'type_of_event': u'goal-own', u'player': u'... | [{u'type_of_event': u'substitution-in', u'play... |
1 | 2 | Estadio das Dunas | 2014-06-13T13:00:00.000-03:00 | {u'country': u'Mexico', u'code': u'MEX', u'goa... | {u'country': u'Cameroon', u'code': u'CMR', u'g... | Mexico | [{u'type_of_event': u'yellow-card', u'player':... | [{u'type_of_event': u'substitution-in halftime... |
2 | 3 | Arena Fonte Nova | 2014-06-13T16:00:00.000-03:00 | {u'country': u'Spain', u'code': u'ESP', u'goal... | {u'country': u'Netherlands', u'code': u'NED', ... | Netherlands | [{u'type_of_event': u'goal-penalty', u'player'... | [{u'type_of_event': u'yellow-card', u'player':... |
3 | 4 | Arena Pantanal | 2014-06-13T19:00:00.000-03:00 | {u'country': u'Chile', u'code': u'CHI', u'goal... | {u'country': u'Australia', u'code': u'AUS', u'... | Chile | [{u'type_of_event': u'goal', u'player': u'Alex... | [{u'type_of_event': u'goal', u'player': u'Cahi... |
4 | 5 | Estadio Mineirao | 2014-06-14T13:00:00.000-03:00 | {u'country': u'Colombia', u'code': u'COL', u'g... | {u'country': u'Greece', u'code': u'GRE', u'goa... | Colombia | [{u'type_of_event': u'goal', u'player': u'P. A... | [{u'type_of_event': u'yellow-card', u'player':... |
Here we use the pandas `DatetimeIndex` to convert the `datetime` column into two separate columns: a date and a time for each match.
data['gameDate'] = pd.DatetimeIndex(data.datetime).date
data['gameTime'] = pd.DatetimeIndex(data.datetime).time
data.head()
| | match_number | location | datetime | home_team | away_team | winner | home_team_events | away_team_events | gameDate | gameTime |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Arena de Sao Paulo | 2014-06-12T17:00:00.000-03:00 | {u'country': u'Brazil', u'code': u'BRA', u'goa... | {u'country': u'Croatia', u'code': u'CRO', u'go... | Brazil | [{u'type_of_event': u'goal-own', u'player': u'... | [{u'type_of_event': u'substitution-in', u'play... | 2014-06-12 | 20:00:00 |
1 | 2 | Estadio das Dunas | 2014-06-13T13:00:00.000-03:00 | {u'country': u'Mexico', u'code': u'MEX', u'goa... | {u'country': u'Cameroon', u'code': u'CMR', u'g... | Mexico | [{u'type_of_event': u'yellow-card', u'player':... | [{u'type_of_event': u'substitution-in halftime... | 2014-06-13 | 16:00:00 |
2 | 3 | Arena Fonte Nova | 2014-06-13T16:00:00.000-03:00 | {u'country': u'Spain', u'code': u'ESP', u'goal... | {u'country': u'Netherlands', u'code': u'NED', ... | Netherlands | [{u'type_of_event': u'goal-penalty', u'player'... | [{u'type_of_event': u'yellow-card', u'player':... | 2014-06-13 | 19:00:00 |
3 | 4 | Arena Pantanal | 2014-06-13T19:00:00.000-03:00 | {u'country': u'Chile', u'code': u'CHI', u'goal... | {u'country': u'Australia', u'code': u'AUS', u'... | Chile | [{u'type_of_event': u'goal', u'player': u'Alex... | [{u'type_of_event': u'goal', u'player': u'Cahi... | 2014-06-13 | 22:00:00 |
4 | 5 | Estadio Mineirao | 2014-06-14T13:00:00.000-03:00 | {u'country': u'Colombia', u'code': u'COL', u'g... | {u'country': u'Greece', u'code': u'GRE', u'goa... | Colombia | [{u'type_of_event': u'goal', u'player': u'P. A... | [{u'type_of_event': u'yellow-card', u'player':... | 2014-06-14 | 16:00:00 |
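Note that the `gameTime` values in this output are three hours later than the local kickoff times in the `datetime` column (for example 20:00:00 against 17:00:00-03:00), which is consistent with the timestamps having been converted to UTC. The sketch below reproduces that arithmetic for the first match using only the standard library. It is written for Python 3.7 or later, unlike the Python 2 cells above, and is not a description of what pandas does internally.

```python
from datetime import datetime, timezone

# Timestamp of the first match, copied from the datetime column
raw = "2014-06-12T17:00:00.000-03:00"

# Parse the ISO 8601 string, keeping its -03:00 offset
local = datetime.fromisoformat(raw)

# Convert the same instant to UTC
as_utc = local.astimezone(timezone.utc)

# Split into date and time parts, as done for gameDate and gameTime
game_date = as_utc.date()
game_time = as_utc.time()
```

Here `local.time()` is 17:00:00 while `game_time` is 20:00:00, matching the first row of the table. To keep local kickoff times instead, the date and time would need to be taken from `local` rather than `as_utc`.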