Beautiful Soup tutorial¶

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Beautiful Soup Urls¶

Installation¶

Open a bash shell and run:

pip install beautifulsoup4

In [ ]:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

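# Name the parser explicitly; otherwise Beautiful Soup picks one and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')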
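To make "navigating and searching the parse tree" concrete, here is a minimal sketch of a few common lookups. It uses a shortened copy of the document above so that it runs on its own; the variable names `title_text`, `link_urls`, and `link2_text` are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

# A shortened copy of the document above, so this block runs on its own
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Navigate by tag name: the first <title> tag and its text
title_text = soup.title.string

# Search: every <a> tag, keeping only the href attribute
link_urls = [a.get('href') for a in soup.find_all('a')]

# Search by id attribute and extract the text inside the tag
link2_text = soup.find(id='link2').get_text()

print(title_text)
print(link_urls)
print(link2_text)
```

The same calls (`find_all`, `find`, `get`, `get_text`) should carry over to the pages fetched later in this tutorial.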

Web scraping disclaimer¶

Read a web site's terms of service before attempting to scrape its data. For example, the Tropicos terms of service forbid scraping that degrades the site's service.

Tropicos terms of service

If the site provides a web service or API, use that instead of trying to parse the interface designed for web browsers.

Taxonomic Name Resolution Service API

A recent example of the legalities around web scraping is the JSTOR/Aaron Swartz prosecution:

http://en.wikipedia.org/wiki/Aaron_Swartz#JSTOR

Web scraping with Beautiful Soup¶

Suppose we want to get a list of scientific names for a common name from FishBase.

FishBase search page

In [ ]:

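from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen

search_url = 'http://fishbase.se/search.php'
# Read the whole response once. Calling read(200) on the response and parsing
# it afterwards would drop the first 200 bytes of the page.
search_page = urlopen(search_url).read()

# Show only the first 200 bytes
print(search_page[:200])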

We can now use Beautiful Soup to parse and explore the search page.

In [ ]:

soup = BeautifulSoup(search_page, 'html.parser')

Explore the search page in Beautiful Soup, but avoid printing the full markup of the form tags; print only their text instead.

Bad:

print(soup.find_all('form'))

Good:

for form in soup.find_all('form'):
    print(form.text)

In [ ]:

The search page contains a form for querying FishBase by common name. The form sends a POST request to the relative URL

/ComNames/CommonNameSearchList.php

with the common name in the input field

CommonName

POST requests put the form data in the body of the HTTP request, so the data is not shown in the URL. GET requests append the form data to the URL as name/value pairs, so the user can bookmark the resulting page. Let's see if we can do a GET request with the CommonName query.

http://www.w3schools.com/tags/att_form_method.asp
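The difference between the two methods can be sketched with the standard library alone. The block below only builds the request objects and does not send anything over the network. The sample value `Tuna` and the variable names are illustrative, and whether the server accepts both methods is an assumption that has to be tested against the real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

action_url = 'http://fishbase.se/ComNames/CommonNameSearchList.php'
form_data = {'CommonName': 'Tuna'}  # sample query value

# GET: the form data becomes the query string of the URL
get_url = action_url + '?' + urlencode(form_data)
get_request = Request(get_url)

# POST: the same data travels as bytes in the request body; the URL is unchanged
post_request = Request(action_url, data=urlencode(form_data).encode('ascii'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Because the GET variant carries everything in the URL, it can be pasted into a browser or passed straight to `urlopen`.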

Print out all the form action URLs contained on the search page.
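One possible approach is sketched below on a small, hypothetical stand-in for the search page (the real page contains more forms and fields, and `sample_html` is made up for illustration). It collects the `action` attribute of every form and resolves relative URLs against the page address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the search page, not the real FishBase markup
sample_html = """
<form action="/ComNames/CommonNameSearchList.php" method="post">
  <input type="text" id="CommonName" name="CommonName">
</form>
<form method="get">
  <input type="text" name="q">
</form>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Tag.get() returns None when the attribute is missing, so skip those forms
actions = [form.get('action') for form in sample_soup.find_all('form')
           if form.get('action') is not None]

# Turn relative action URLs into absolute ones using the page's own URL
page_url = 'http://fishbase.se/search.php'
absolute_actions = [urljoin(page_url, action) for action in actions]

print(actions)
print(absolute_actions)
```

On the real page, the same list comprehension can be run against the `soup` object built from the fetched search page.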

In [ ]:

We want to get the results of a common name search. We can try an HTTP GET request to the URL in the form's action attribute, since the GET request syntax is visible at the bottom of the search result page:

http://fishbase.se/ComNames/CommonNameSearchList.php?resultPage=2&CommonName=Tuna

In [ ]:

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen

result_url = 'http://fishbase.se/ComNames/CommonNameSearchList.php?CommonName=Tuna'
result_page = urlopen(result_url)

soup = BeautifulSoup(result_page, 'html.parser')

There is a problem, however. Does this really give us the full list of names for Tuna?
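The `resultPage` parameter in the example URL above suggests that results may be split across several pages, in which case a single request would return only part of the list. A minimal sketch of building one URL per page follows. It assumes that pages are numbered consecutively from 1; the helper `result_page_url` and the page count of 3 are hypothetical, and the real number of pages would have to be read from the result page itself.

```python
from urllib.parse import urlencode

base_url = 'http://fishbase.se/ComNames/CommonNameSearchList.php'


def result_page_url(common_name, page):
    """Build the URL for one page of common name search results."""
    query = urlencode({'resultPage': page, 'CommonName': common_name})
    return base_url + '?' + query


# Hypothetical page count, used only to illustrate the loop
page_urls = [result_page_url('Tuna', page) for page in range(1, 4)]

for url in page_urls:
    print(url)
```

Each URL could then be fetched and parsed in turn, ideally with a pause between requests so that the scraping does not degrade the site's service.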

In [ ]: