Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Open a bash shell and run:
pip install beautifulsoup4
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
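Once the soup is built, the parse tree can be navigated and searched. The following is a minimal sketch, assuming the built-in `html.parser` backend and a shortened copy of the document above; the variable names (`title_text`, `links`, `names`, `tillie`) are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document, so this block runs on its own.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Navigate: attribute access returns the first matching tag.
title_text = soup.title.string

# Search: find_all returns every matching tag.
links = [a.get('href') for a in soup.find_all('a')]
names = [a.get_text() for a in soup.find_all('a', class_='sister')]

# Look up a single tag by its id attribute.
tillie = soup.find(id='link3')

print(title_text)
print(links)
print(names)
print(tillie['href'])
```

Attribute access such as `soup.title` is convenient for tags that occur once; `find_all` is the safer choice when a tag may appear several times.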
Read the terms of service for a web site before attempting to scrape data from it. For example, Tropicos forbids scraping that degrades the service of its web site.
If the site provides a web service or API, use that instead of trying to parse the interface designed for web browsers.
A recent example of the legal issues around web scraping is the JSTOR/Aaron Swartz prosecution.
Suppose we want to get a list of scientific names for a common name from FishBase.
import urllib2
search_url = 'http://fishbase.se/search.php'
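# Download the page once and keep the HTML as a string, so that
# printing a preview does not consume part of the response.
search_page = urllib2.urlopen(search_url).read()
print search_page[:200]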
We can now use Beautiful Soup to parse and explore the search page.
soup = BeautifulSoup(search_page, 'html.parser')
Explore the search page in Beautiful Soup but avoid printing form tag contents.
Bad:
print soup.find_all('form')
Good:
for form in soup.find_all('form'):
    print form.text
The search page contains a form for querying FishBase by common name. The form sends a POST request to the relative URL
/ComNames/CommonNameSearchList.php
with the common name in the field whose id is
CommonName
A POST request sends the form data inside the body of the HTTP request, so the data is not shown in the URL. A GET request appends the form data to the URL as name/value pairs, so the user can bookmark the page. Let's see whether we can do a GET request with the CommonName query.
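The difference between the two methods can be made concrete without touching the network. The sketch below uses only the standard library; the names `form_data`, `get_url`, and `post_body` are illustrative, the absolute URL is assumed from the FishBase host and the form's relative URL, and the try/except import is an assumption made so the block runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

base_url = 'http://fishbase.se/ComNames/CommonNameSearchList.php'
form_data = {'CommonName': 'Tuna'}

# Both methods encode the fields the same way, as name=value pairs.
encoded = urlencode(form_data)

# GET: the encoded pairs are appended to the URL after a '?'.
get_url = base_url + '?' + encoded

# POST: the URL stays bare and the encoded pairs travel in the request body.
post_url = base_url
post_body = encoded.encode('ascii')

print(get_url)
print(post_body)
```

Whether the server accepts the GET form of the request is up to the site; the encoding itself is the same in both cases.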
Print out all the form action URLs contained on the search page.
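One way to list the form actions is sketched below. To keep it runnable offline it parses a small hypothetical stand-in for the search page (`sample_html`, including the made-up `/other/Example.php` form) rather than the real FishBase markup, and it assumes the `html.parser` backend. Relative actions are resolved against the page URL with `urljoin`.

```python
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

search_url = 'http://fishbase.se/search.php'

# Hypothetical stand-in for the downloaded search page; the real markup differs.
sample_html = """
<form action="/ComNames/CommonNameSearchList.php" method="post">
  <input type="text" id="CommonName" name="CommonName">
</form>
<form action="/other/Example.php" method="get">
  <input type="text" name="q">
</form>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Collect each form's action attribute, resolved against the page URL.
# A form without an action attribute falls back to the page URL itself.
action_urls = [urljoin(search_url, form.get('action', ''))
               for form in soup.find_all('form')]

for url in action_urls:
    print(url)
```

On the real page, the same loop would run over the soup built from the downloaded HTML instead of `sample_html`.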
We want to get the results of a common name search. We can attempt an HTTP GET request to the URL in the form's action attribute, since we can see the GET request syntax at the bottom of the search result page:
http://fishbase.se/ComNames/CommonNameSearchList.php?resultPage=2&CommonName=Tuna
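The query string in this URL can be taken apart to see which parameters the results page accepts. This is a small sketch using only the standard library; the try/except import is an assumption made so it runs on both Python 2 and Python 3, and `parts` and `params` are illustrative names.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2

url = 'http://fishbase.se/ComNames/CommonNameSearchList.php?resultPage=2&CommonName=Tuna'

# Split the URL into scheme, host, path, and query string.
parts = urlparse(url)

# Decode the query string into a dict mapping each name to a list of values.
params = parse_qs(parts.query)

print(parts.path)
print(params)
```

The path matches the form's action, and `CommonName` carries the search term. The `resultPage` parameter appears to select which page of results is returned.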
result_url = 'http://fishbase.se/ComNames/CommonNameSearchList.php?CommonName=Tuna'
result_page = urllib2.urlopen(result_url)
soup = BeautifulSoup(result_page, 'html.parser')
There is a problem, however. Does this really give us the full list of names for Tuna?