Easy Scraping

Prerequisites:

  • Python 3
  • pip install -r requirements.txt

Useful trick in IPython notebook

In [1]:
import pprint
from IPython.core.display import HTML
In [2]:
HTML('Logo of Initium Lab: <img src="%s">' % 'http://initiumlab.com/favicon-32x32.png')
Out[2]:
Logo of Initium Lab:

A small hack to keep long output areas from collapsing into a scroll box

In [3]:
%%javascript
//IPython.OutputArea.auto_scroll_threshold = 9999;
IPython.OutputArea.prototype._should_scroll = function(){return false;}

Readability

We use a version ported to Python 3: https://github.com/hyperlinkapp/python-readability (already included in the requirements.txt file)

In [4]:
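from readability.readability import Document
import requests
# Download the front page as raw bytes
html = requests.get('http://initiumlab.com/').content
# Parse once and reuse the Document for both the body and the title
doc = Document(html)
readable_article = doc.summary()
readable_title = doc.short_title()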
In [5]:
print(readable_article)
<html><body><div><div class="post-body">

      
      

      
        
          <video controls="" poster="./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png"><br/>  <source src="./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/mp4"/><br/>  <source src="./blog/20150922-jackathon3-review/jackathon3-timelapse.webm" type="video/webm"/><br/>  Sorry, you browser does not support HTML5 video.<br/></video>

<p>The video is also available on <a href="https://youtu.be/zFeSh2W1_C8">YouTube</a> and <a href="http://v.youku.com/v_show/id_XMTM0MzM1MjEwMA==.html?from=y1.7-2">Youku</a>.</p>
<h2 id="What_did_we_do?">What did we do?</h2><p>Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.</p>
<p>This week, the goal for each participant is to read one the the <a href="http://www.kdnuggets.com/2015/09/free-data-science-books.html">60 Data Science books collected by KDnuggets</a> within 8 hours.<br/>Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.</p>
          
        
      
    </div>

    </div></body></html>
In [6]:
HTML(readable_article)
Out[6]:

The video is also available on YouTube and Youku.

What did we do?

Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.

This week, the goal for each participant is to read one the the 60 Data Science books collected by KDnuggets within 8 hours.
Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.

PyQuery

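The summary above keeps the relative URLs of the original page (the video poster and sources start with './'), so the video is missing when the HTML is rendered inside the notebook. Let's rewrite those URLs with PyQuery.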

In [7]:
import pyquery
r = pyquery.PyQuery(readable_article)
r('p')
Out[7]:
[<p>, <p>, <p>]
In [8]:
r('video').attr('poster')
Out[8]:
'./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'
In [9]:
r('video source').attr('src')
Out[9]:
'./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'
In [10]:
r('video').attr('poster', 'http://initiumlab.com/%s' % r('video').attr('poster'))
Out[10]:
[<video>]
In [11]:
r('video').attr('poster')
Out[11]:
'http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'
In [12]:
r('video source').attr('src', 'http://initiumlab.com/%s' % r('video source').attr('src'))
Out[12]:
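[<source>, <source>]

Note that attr() called with a value writes that value to every matched element, while attr() without a value reads only the first match. Both <source> tags therefore now point at the .mp4 file, as the second src in Out[14] shows. To keep the .webm source, rewrite each element's src individually, for example by looping over r('video source').items().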
In [13]:
r('video source').attr('src')
Out[13]:
'http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'
In [14]:
r.html()
Out[14]:
'<body><div><div class="post-body">\n\n      \n      \n\n      \n        \n          <video controls="" poster="http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png"><br/>  <source src="http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/mp4"/><br/>  <source src="http://initiumlab.com/./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/webm"/><br/>  Sorry, you browser does not support HTML5 video.<br/></video>\n\n<p>The video is also available on <a href="https://youtu.be/zFeSh2W1_C8">YouTube</a> and <a href="http://v.youku.com/v_show/id_XMTM0MzM1MjEwMA==.html?from=y1.7-2">Youku</a>.</p>\n<h2 id="What_did_we_do?">What did we do?</h2><p>Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.</p>\n<p>This week, the goal for each participant is to read one the the <a href="http://www.kdnuggets.com/2015/09/free-data-science-books.html">60 Data Science books collected by KDnuggets</a> within 8 hours.<br/>Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.</p>\n          \n        \n      \n    </div>\n\n    </div></body>'
In [15]:
%%javascript
//IPython.OutputArea.auto_scroll_threshold = 9999;
IPython.OutputArea.prototype._should_scroll = function(){return false;}
In [16]:
HTML(r.html())
Out[16]:

The video is also available on YouTube and Youku.

What did we do?

Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.

This week, the goal for each participant is to read one the the 60 Data Science books collected by KDnuggets within 8 hours.
Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.
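
The string formatting used in this section leaves a stray './' inside each rewritten URL. A minimal sketch of a more robust alternative, assuming the article was fetched from http://initiumlab.com/ as in cell In [4]: the standard library's urljoin resolves a relative path against a base URL and normalizes the dot segment. The helper name absolutize is made up for this illustration.

```python
from urllib.parse import urljoin

# Assumption: the article was fetched from this URL, as in cell In [4].
BASE_URL = 'http://initiumlab.com/'

def absolutize(url, base_url=BASE_URL):
    """Resolve a possibly relative URL against base_url (hypothetical helper)."""
    return urljoin(base_url, url)

# Relative URLs copied from the readability output above.
relative_urls = [
    './blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png',
    './blog/20150922-jackathon3-review/jackathon3-timelapse.mp4',
    './blog/20150922-jackathon3-review/jackathon3-timelapse.webm',
]
absolute_urls = [absolutize(u) for u in relative_urls]
for u in absolute_urls:
    print(u)
```

URLs that are already absolute, such as the YouTube link in the article, pass through unchanged, so the helper can be applied to every link without checking first.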

Scrapely

In [17]:
from scrapely import Scraper
s = Scraper()
In [18]:
help(s.train)
Help on method train in module scrapely:

train(url, data, encoding=None) method of scrapely.Scraper instance

In [19]:
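import os
from urllib import parse
def get_localhost_url(url):
    # Cache the page under tmp/ and return a URL served by the local notebook server
    filename = parse.quote_plus(url)
    fullpath = 'tmp/%s' % filename
    os.makedirs('tmp', exist_ok=True)  # make sure the cache directory exists
    html = requests.get(url).content
    with open(fullpath, 'wb') as f:    # close the file handle deterministically
        f.write(html)
    return 'http://localhost:8888/files/%s?download=1' % parse.quote_plus(fullpath)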
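To see what get_localhost_url returns without touching the network, the sketch below repeats only its string handling. The page URL is quoted once to form a flat file name, and the resulting path is quoted a second time when it is embedded in the local URL, so every '%' becomes '%25'. The function name local_url_for is hypothetical; port 8888 and the /files/ prefix are taken from the cell above.

```python
from urllib import parse

def local_url_for(url):
    """Mirror the path handling of get_localhost_url, minus the download and file write."""
    filename = parse.quote_plus(url)   # ':' and '/' are escaped, giving a flat file name
    fullpath = 'tmp/%s' % filename
    # Quoting again escapes the '%' signs introduced by the first pass.
    return 'http://localhost:8888/files/%s?download=1' % parse.quote_plus(fullpath)

page = 'http://initiumlab.com/blog/20150916-legco-eng/'
flat_name = parse.quote_plus(page)
local_url = local_url_for(page)
print(flat_name)
print(local_url)
```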
In [20]:
training_url = 'http://initiumlab.com/blog/20150916-legco-eng/'
training_data = {'title': 'Legco Matrix Brief (English)', 
                 'author': 'Initium Lab', 
                 'date': '2015-09-16'}
s.train(get_localhost_url(training_url), training_data)
In [21]:
testing_url = 'http://initiumlab.com/blog/20150901-data-journalism-for-the-blind/'
s.scrape(get_localhost_url(testing_url))
Out[21]:
[{'author': ['Andy Shu'],
  'date': ['\n            2015-09-01\n          '],
  'title': ['\n          \n          \n            \n              可視化火了 盲人怎麼辦\n            \n          \n        ']}]
In [22]:
testing_url = 'http://initiumlab.com/blog/20150922-jackathon3-review/'
s.scrape(get_localhost_url(testing_url))
Out[22]:
[{'author': ['Initium Lab'],
  'date': ['\n            2015-09-22\n          '],
  'title': ['\n          \n          \n            \n              Jackathon #3 -- Read a data science book in 8 hours\n            \n          \n        ']}]
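The scraped values carry the whitespace that surrounds them in the page markup. A minimal sketch of a cleanup step, assuming each record is a dict that maps field names to lists of strings, as in the outputs above; clean_record is a hypothetical helper, not part of Scrapely.

```python
def clean_record(record):
    """Strip surrounding whitespace from every value in one scraped record."""
    return {field: [value.strip() for value in values]
            for field, values in record.items()}

# Sample copied from Out[22] above.
scraped = [{'author': ['Initium Lab'],
            'date': ['\n            2015-09-22\n          '],
            'title': ['\n          \n          \n            \n              Jackathon #3 -- Read a data science book in 8 hours\n            \n          \n        ']}]

cleaned = [clean_record(record) for record in scraped]
print(cleaned)
```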
In [23]:
!ls -1
Easy Scraping.html
Easy Scraping.ipynb
README.md
requirements.txt
tmp
venv
In [24]:
a = !ls -1
In [25]:
a
Out[25]:
['Easy Scraping.html',
 'Easy Scraping.ipynb',
 'README.md',
 'requirements.txt',
 'tmp',
 'venv']
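The `a = !command` assignment only works inside IPython. A rough standard-library equivalent for a plain Python script is sketched below; run_lines is a hypothetical helper name, and the demo command runs the current Python interpreter so that the example does not depend on ls being installed.

```python
import subprocess
import sys

def run_lines(command):
    """Run a command given as an argument list and return its stdout as a list of lines."""
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout.splitlines()

# Demo command: print two names, one per line, the way `ls -1` would.
lines = run_lines([sys.executable, '-c', 'print("README.md"); print("tmp")'])
print(lines)
```

On a system that has ls, run_lines(['ls', '-1']) should yield the same entries as the IPython capture above.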
In [26]:
%%sh
http get 'http://httpbin.org/get' name==hupili at=='Hardcore scraping workshop!'
{
  "args": {
    "at": "Hardcore scraping workshop!", 
    "name": "hupili"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "HTTPie/0.9.2"
  }, 
  "origin": "118.140.67.2", 
  "url": "http://httpbin.org/get?at=Hardcore+scraping+workshop!&name=hupili"
}
In [27]:
%%sh
http get 'http://httpbin.org/get' name==hupili 'User-Agent: Arbitrarily name your user agent!'
{
  "args": {
    "name": "hupili"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Arbitrarily name your user agent!"
  }, 
  "origin": "118.140.67.2", 
  "url": "http://httpbin.org/get?name=hupili"
}

HTTPie request item syntax, from http --help:

      ':' HTTP headers:
          Referer:http://httpie.org  Cookie:foo=bar  User-Agent:bacon/1.0

      '==' URL parameters to be appended to the request URI:
          search==httpie

      '=' Data fields to be serialized into a JSON object (with --json, -j)
          or form data (with --form, -f):
          name=HTTPie  language=Python  description='CLI HTTP client'

      ':=' Non-string JSON data fields (only with --json, -j):
          awesome:=true  amount:=42  colors:='["red", "green", "blue"]'

      '@' Form file fields (only with --form, -f):
          cs@~/Documents/CV.pdf

      '=@' A data field like '=', but takes a file path and embeds its content:
           essay=@Documents/essay.txt

      ':=@' A raw JSON field like ':=', but takes a file path and embeds its content:
          package:=@./package.json

      You can use a backslash to escape a colliding separator in the field name:
          field-name-with\:colon=value
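
For comparison, here is a sketch of how the '==' and ':' items map onto a request built with Python's standard library. It combines the parameters from In [26] with the header from In [27], and it only constructs the request without sending it. The variable names are chosen for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# '==' items: URL parameters appended to the request URI
params = {'name': 'hupili', 'at': 'Hardcore scraping workshop!'}
# ':' items: HTTP headers
headers = {'User-Agent': 'Arbitrarily name your user agent!'}

# Build the request object only; nothing is sent over the network here.
request = Request('http://httpbin.org/get?' + urlencode(params), headers=headers)
print(request.get_method())
print(request.full_url)
print(request.header_items())
```

urlencode escapes the '!' as '%21' where HTTPie left it as is; both forms decode to the same parameter value.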
In [28]:
%%sh
http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' | head -n 5
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head profile="http://gmpg.org/xfn/11">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="generator" content="WordPress 3.8.11">
In [29]:
%%sh
http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\
pquery '.three_ul li strong a' -p text |\
head -n 5
An Introduction to Data Science
School of Data Handbook
Data Jujitsu: The Art of Turning Data into Product
The Data Science Handbook
The Data Analytics Handbook
In [30]:
%%sh
http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\
pquery '.three_ul li strong a' -p href |\
head -n 5
https://docs.google.com/file/d/0B6iefdnF22XQeVZDSkxjZ0Z5VUE/edit?pli=1
http://schoolofdata.org/handbook/
http://www.oreilly.com/data/free/data-jujitsu.csp
http://www.thedatasciencehandbook.com/#get-the-book
https://www.teamleada.com/handbook
In [31]:
%%sh
http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\
pquery '.three_ul li strong a' -f '"{text}",{href}' |\
head -n 5
"An Introduction to Data Science",https://docs.google.com/file/d/0B6iefdnF22XQeVZDSkxjZ0Z5VUE/edit?pli=1
"School of Data Handbook",http://schoolofdata.org/handbook/
"Data Jujitsu: The Art of Turning Data into Product",http://www.oreilly.com/data/free/data-jujitsu.csp
"The Data Science Handbook",http://www.thedatasciencehandbook.com/#get-the-book
"The Data Analytics Handbook",https://www.teamleada.com/handbook
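
The format string '"{text}",{href}' wraps each title in quotes, but presumably does not escape a quote character that appears inside a title. A minimal sketch of a safer route, assuming the titles and links have already been collected as pairs: Python's csv module handles quoting and escaping. The helper name to_csv, the column names, and the third row are made up for this illustration.

```python
import csv
import io

def to_csv(rows):
    """Serialize (title, url) pairs to CSV text with a header row (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(['title', 'url'])
    writer.writerows(rows)
    return buffer.getvalue()

rows = [
    ('School of Data Handbook', 'http://schoolofdata.org/handbook/'),
    ('Data Jujitsu: The Art of Turning Data into Product',
     'http://www.oreilly.com/data/free/data-jujitsu.csp'),
    # Hypothetical title, not from the page, to exercise quote and comma escaping.
    ('A "quoted", comma-laden title', 'http://example.com/book'),
]
csv_text = to_csv(rows)
print(csv_text)
```

Fields are quoted only when they need it, and embedded quotes are doubled, so the text can be read back with csv.reader without loss.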