# This is the input block -- a full-blown Python shell
print('Look: I will be shown on output block')
Look: I will be shown on output block
import pprint
from IPython.core.display import HTML
HTML('Logo of Initium Lab: <img src="%s">' % 'http://initiumlab.com/favicon-32x32.png')
# Display any HTML easily
my_html = '''
I'm going to show you:
<ul>
<li> PyReadability </li>
<li> PyQuery </li>
<li> ... </li>
</ul>
'''
HTML(my_html)
%%javascript
//IPython.OutputArea.auto_scroll_threshold = 9999;
IPython.OutputArea.prototype._should_scroll = function(){return false;}
# I'm going to insert some slides here
from IPython.core.display import Image
Hong Kong Legislative Council:
https://theinitium.com/article/20150812-hongkong-legcoanalysis/
http://legco.initiumlab.com/matrix
Hong Kong District Council Election:
https://theinitium.com/project/20151012-hk-district-council-elections/
https://theinitium.com/project/20151019-hk-district-council-elections-2/
https://theinitium.com/project/20151029-hk-district-council-elections-3/
Image('assets/venn-skillset.png')
Image('assets/workflow-highlight-data-collection.png')
print('screenshot from: https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp')
Image('assets/rgc-official-site.png')
screenshot from: https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp
print('e.g. # of Hong Kong v.s. Non Hong Kong studies (Social Science)')
print('(Just draft labeling! -- cite the figure at your own risk)')
Image('assets/hk-non-hk-studies-humanities.png')
e.g. # of Hong Kong v.s. Non Hong Kong studies (Social Science) (Just draft labeling! -- cite the figure at your own risk)
%%sh
ls -1
Easy Scraping.html Easy Scraping.ipynb README.md Scrape More with Less Codes.ipynb assets hosts output path-list.txt requirements.txt tmp venv
%%sh
curl -s 'http://initiumlab.com/' | head -n 8
<!doctype html> <html class="theme-next use-motion "> <head> <meta charset="UTF-8"/> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /> <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>
import requests
# Fetch the homepage; .content gives the raw response body as bytes
html = requests.get('http://initiumlab.com/').content
# Peek at the first 500 bytes to confirm we actually got the page
html[:500]
b'<!doctype html>\n<html class="theme-next use-motion ">\n<head>\n \n\n<meta charset="UTF-8"/>\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>\n\n\n<meta http-equiv="Cache-Control" content="no-transform" />\n<meta http-equiv="Cache-Control" content="no-siteapp" />\n\n\n\n\n\n\n <link rel="stylesheet" type="text/css" href="./vendors/fancybox/source/jquery.fancybox.css?v=2.1.5"/>\n\n\n\n <link href=\'//fonts.google'
Not an easy task, generally:
%%sh
curl -s 'http://initiumlab.com/' | grep title
<link rel="alternate" href="./blog/feed.xml" title="Initium Lab" type="application/atom+xml" /> <meta property="og:title" content="Initium Lab"> <meta name="twitter:title" content="Initium Lab"> <title> Initium Lab </title> <div class='subtitle' id="titleWorks"> <div class='subtitle' id='titleBlogs'> <h1 class="post-title" itemprop="name headline"> <a class="post-title-link" href="./blog/20151025-jackathon-no-5/" itemprop="url"> </h1> <!-- h1.post-title --> <h1 class="post-title" itemprop="name headline"> <a class="post-title-link" href="./blog/20151015-3d-infographic-user-testing/" itemprop="url"> </h1> <!-- h1.post-title --> <h1 class="post-title" itemprop="name headline"> <a class="post-title-link" href="./blog/20151005-read-journalism/" itemprop="url"> </h1> <!-- h1.post-title --> <h1 class="post-title" itemprop="name headline"> <a class="post-title-link" href="./blog/20150925-react-in-1-hour-cuhk/" itemprop="url"> </h1> <!-- h1.post-title --> <h1 class="post-title" itemprop="name headline"> <a class="post-title-link" href="./blog/20150922-jackathon3-review/" itemprop="url"> </h1> <!-- h1.post-title --> <h1 class="post-title" itemprop="name headline"> <a class="post-title-link" href="./blog/20150916-legco-eng/" itemprop="url"> </h1> <!-- h1.post-title --> var disqus_title = '';
%%sh
curl -s 'http://initiumlab.com/' | grep '<title'
<title> Initium Lab </title>
[Y]
items will be involved in this talk.
For a mature project, you usually loop between Download and Parse; e.g. scrapy
is a widely used framework.
Keywords of this demo:
Human-friendly command tool written in Python
%%sh
http get http://initiumlab.com | head -n 50 | tail -n 10
<meta property="og:type" content="website"> <meta property="og:title" content="Initium Lab"> <meta property="og:url" content="http://initiumlab.com/blog/index.html"> <meta property="og:site_name" content="Initium Lab"> <meta property="og:description" content="The Website of Initium Lab, the exploratory arm of Initium Media"> <meta name="twitter:card" content="summary"> <meta name="twitter:title" content="Initium Lab"> <meta name="twitter:description" content="The Website of Initium Lab, the exploratory arm of Initium Media">
No. Nearly seamless integration:
lines = !http get http://initiumlab.com
lines[0:10]
['<!doctype html>', '<html class="theme-next use-motion ">', '<head>', ' ', '', '<meta charset="UTF-8"/>', '<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />', '<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/>', '', '']
':' HTTP headers:
Referer:http://httpie.org Cookie:foo=bar User-Agent:bacon/1.0
'==' URL parameters to be appended to the request URI:
search==httpie
'=' Data fields to be serialized into a JSON object (with --json, -j)
or form data (with --form, -f):
name=HTTPie language=Python description='CLI HTTP client'
':=' Non-string JSON data fields (only with --json, -j):
awesome:=true amount:=42 colors:='["red", "green", "blue"]'
'@' Form file fields (only with --form, -f):
cs@~/Documents/CV.pdf
'=@' A data field like '=', but takes a file path and embeds its content:
essay=@Documents/essay.txt
':=@' A raw JSON field like ':=', but takes a file path and embeds its content:
package:=@./package.json
You can use a backslash to escape a colliding separator in the field name:
field-name-with\:colon=value
%%sh
http get 'http://httpbin.org/get' name==hupili at=='Scrape more with less codes!'
{ "args": { "at": "Scrape more with less codes!", "name": "hupili" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Host": "httpbin.org", "User-Agent": "HTTPie/0.9.2" }, "origin": "118.140.67.6", "url": "http://httpbin.org/get?at=Scrape+more+with+less+codes!&name=hupili" }
%%sh
http post 'http://httpbin.org/post' name==hupili at=='Scrape more with less codes!'
{ "args": { "at": "Scrape more with less codes!", "name": "hupili" }, "data": "", "files": {}, "form": {}, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "0", "Host": "httpbin.org", "User-Agent": "HTTPie/0.9.2" }, "json": null, "origin": "118.140.67.6", "url": "http://httpbin.org/post?at=Scrape+more+with+less+codes!&name=hupili" }
Caveats: --ignore-stdin
is required in IPython notebook
Not a problem in command-line env.
Related issues: https://github.com/jkbrzt/httpie/issues/150
%%sh
http --form --ignore-stdin post 'http://httpbin.org/post' name==hupili at='Scrape more with less codes!'
{ "args": { "name": "hupili" }, "data": "", "files": {}, "form": { "at": "Scrape more with less codes!" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "33", "Content-Type": "application/x-www-form-urlencoded; charset=utf-8", "Host": "httpbin.org", "User-Agent": "HTTPie/0.9.2" }, "json": null, "origin": "118.140.67.6", "url": "http://httpbin.org/post?name=hupili" }
Note: "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
%%sh
http --form --ignore-stdin post 'http://httpbin.org/post' name=hupili at='Scrape more with less codes!'
{ "args": {}, "data": "", "files": {}, "form": { "at": "Scrape more with less codes!", "name": "hupili" }, "headers": { "Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Content-Length": "45", "Content-Type": "application/x-www-form-urlencoded; charset=utf-8", "Host": "httpbin.org", "User-Agent": "HTTPie/0.9.2" }, "json": null, "origin": "118.140.67.6", "url": "http://httpbin.org/post" }
Note: "Content-Type": "application/json",
%%sh
http --ignore-stdin post 'http://httpbin.org/post' name=hupili at='Scrape more with less codes!'
{ "args": {}, "data": "{\"name\": \"hupili\", \"at\": \"Scrape more with less codes!\"}", "files": {}, "form": {}, "headers": { "Accept": "application/json", "Accept-Encoding": "gzip, deflate", "Content-Length": "56", "Content-Type": "application/json", "Host": "httpbin.org", "User-Agent": "HTTPie/0.9.2" }, "json": { "at": "Scrape more with less codes!", "name": "hupili" }, "origin": "118.140.67.6", "url": "http://httpbin.org/post" }
%%sh
http get https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp | head -n 10
<HTML> <HEAD> <script language="JavaScript" src="validation.js"></script> <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> <META name="GENERATOR" content="IBM WebSphere Studio"> <META http-equiv="Content-Style-Type" content="text/css">
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
http get https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp | grep 155
Needs to find out why it doesn't give us the links
Image('assets/rgc-search-network-trace.png')
Now use HTTPie to easily construct the query
%%sh
http post https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp | grep 155
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' | grep 155
<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=155">155</A>
html_lines = !http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1'
html_lines_with_a = list(filter(lambda l: '<A' in l, html_lines))
html_lines_with_a[:5]
['\t<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2">[Next Page]</A>', '\t<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=906">[Last Page]</A>', '\t<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2">2</A>', '\t<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=3">3</A>', '\t<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=4">4</A>']
A wrapper around pyQuery
-- a Python library that allows you to manipulate HTML in jQuery style.
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| grep 155
<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=155">155</A>
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| grep '<a'
Ignore case
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| grep -i '<a' | head -n 5
<A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2">[Next Page]</A> <A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=906">[Last Page]</A> <A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2">2</A> <A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=3">3</A> <A HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=4">4</A>
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| grep -i '<a' | grep -o 'HREF=".*"' | head -n 5
HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2" HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=906" HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2" HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=3" HREF="scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=4"
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| grep -i '<a' | grep -o 'HREF=".*"' | cut -d'"' -f2 | head -n 5
scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=906 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=3 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=4
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| pquery 'a' | head -n 5
{'tag': 'a', 'text': '[Next Page]', 'html': '[Next Page]', 'href': 'scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2'} {'tag': 'a', 'text': '[Last Page]', 'html': '[Last Page]', 'href': 'scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=906'} {'tag': 'a', 'text': '2', 'html': '2', 'href': 'scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2'} {'tag': 'a', 'text': '3', 'html': '3', 'href': 'scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=3'} {'tag': 'a', 'text': '4', 'html': '4', 'href': 'scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=4'}
Traceback (most recent call last): File "/Users/hupili/Dropbox/Desktop-iMAC-initium/project/workshop-easy-scraping/venv/bin/pquery", line 121, in <module> array_output(data) File "/Users/hupili/Dropbox/Desktop-iMAC-initium/project/workshop-easy-scraping/venv/bin/pquery", line 56, in array_output sys.stdout.write(str(i) + '\n') BrokenPipeError: [Errno 32] Broken pipe Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| pquery 'a' -p href | head -n 5
scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=906 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=2 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=3 scrrm00541.jsp?subject=&panel=&sScheme=1&mode=search&sStatus=&subject=&proj_id=&Old_proj_id=&proj_title=&isname=&ioname=&institution=&Year=&pages=4
Traceback (most recent call last): File "/Users/hupili/Dropbox/Desktop-iMAC-initium/project/workshop-easy-scraping/venv/bin/pquery", line 121, in <module> array_output(data) File "/Users/hupili/Dropbox/Desktop-iMAC-initium/project/workshop-easy-scraping/venv/bin/pquery", line 56, in array_output sys.stdout.write(str(i) + '\n') BrokenPipeError: [Errno 32] Broken pipe Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| pquery 'a' -p href | wc -l
907
Image('assets/rgc-index-list.png')
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| pquery "table td[align='right'] a" -p href | wc -l
907
Scrape the info of 60 data science books and visualise their connection: http://www.kdnuggets.com/2015/09/free-data-science-books.html
%%sh
http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\
pquery '.three_ul li strong a' -f '"{text}",{href}' |\
head -n 8
"An Introduction to Data Science",https://docs.google.com/file/d/0B6iefdnF22XQeVZDSkxjZ0Z5VUE/edit?pli=1 "School of Data Handbook",http://schoolofdata.org/handbook/ "Data Jujitsu: The Art of Turning Data into Product",http://www.oreilly.com/data/free/data-jujitsu.csp "The Data Science Handbook",http://www.thedatasciencehandbook.com/#get-the-book "The Data Analytics Handbook",https://www.teamleada.com/handbook "Data Driven: Creating a Data Culture",http://www.oreilly.com/data/free/data-driven.csp "Building Data Science Teams",http://www.oreilly.com/data/free/building-data-science-teams.csp "Understanding the Chief Data Officer",http://www.oreilly.com/data/free/files/understanding-chief-data-officer.pdf
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
Image('assets/data-science-books-graph.png')
%%sh
http --ignore-stdin --form post 'https://cerg1.ugc.edu.hk/cergprod/scrrm00541.jsp' 'mode=search' 'sScheme=1' \
| pquery "table td[align='right'] a" -p href > path-list.txt
Next, let's download them all
%%sh
tail -n 1 path-list.txt | xargs -I{} http "https://cerg1.ugc.edu.hk/cergprod/{}" \
| pquery 'table.styleTableContent' -p html | head -n 10
<tr class="styleTableHeader"> <td nowrap="nowrap" width="10%" align="center"><b>Project Number</b></td> <td nowrap="nowrap" width="50%" align="center"><b>Project Title</b></td> <td nowrap="nowrap" width="15%" align="center"><b>Principal Investigator</b></td> <td nowrap="nowrap" width="15%" align="center"><b>Status</b></td> </tr> <br/> <br/>
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
%%sh
tail -n 1 path-list.txt | xargs -I{} http "https://cerg1.ugc.edu.hk/cergprod/{}" \
| pquery 'table.styleTableContent' -p html | pquery 'td' -p text
Revolution, Commercialism and Chineseness: The Reception and Appropriation of the Socialist Opera Films in Captialist-Colonial Hong Kong, 1954-1966 Dr Hui, Kwok Wai On-going Age Differences in Cognitive Control and Daily Control Strategies and Emotional Experiences: Implications on Physical and Emotional Health Dr Hou, Wai Kai On-going The Chinese Healthcare Reform in Provincial Perspective: A Comparative Study of Fujian and Shanxi Dr He, Jingwei Alex On-going The identification, abundance and sources of microplastics in the fluvial, littoral and marine environments of Hong Kong Dr Fok, Lincoln On-going Decoding the Role and Efficacy of Verbal Imagery in the Teaching and Learning of Singing: Case Studies in Greater China towards a Holistic Approach Dr Chen, Ti Wei On-going Linguistic Analysis of Mid-20th Century Hong Kong Cantonese by Constructing an Annotated Spoken Corpus Dr Chin, Chi On On-going Chinese morality: When propriety is part of the picture, what does morality mean? Testing and extending moral theory to fit lay concepts of a Confucian moral system Dr Buchtel, Emma Ellen Kathrina On-going
%time page_lines = !tail -n 10 path-list.txt | xargs -I{} http "https://cerg1.ugc.edu.hk/cergprod/{}"
CPU times: user 84.1 ms, sys: 51.5 ms, total: 136 ms Wall time: 7.45 s
%time page_lines = !tail -n 10 path-list.txt | xargs -I{} -P5 http "https://cerg1.ugc.edu.hk/cergprod/{}"
CPU times: user 78.4 ms, sys: 47.4 ms, total: 126 ms Wall time: 4.36 s
A) My early dirty work: https://github.com/hupili/Lightweight-Distributing-Toolset
In Perl. 4 years ago. Do not use
B) GNU Parallel: http://www.gnu.org/software/parallel/
Written in Perl. Only need SSH access to remote (or local machine)
Cool, but...
C) PSSH: https://code.google.com/p/parallel-ssh/
%%file hosts
localhost
Overwriting hosts
%%sh
cat hosts
localhost
%%sh
pssh -h hosts -o output/ 'echo hello PSSH'
[1] 14:14:43 [SUCCESS] localhost
%%sh
ls output/
localhost
%%sh
cat output/localhost
hello PSSH
Easier for tabular data.
%%sh
tail -n 1 path-list.txt | xargs -I{} http "https://cerg1.ugc.edu.hk/cergprod/{}" \
| pquery 'table.styleTableContent' -p html | head -n 5
<tr class="styleTableHeader"> <td nowrap="nowrap" width="10%" align="center"><b>Project Number</b></td> <td nowrap="nowrap" width="50%" align="center"><b>Project Title</b></td> <td nowrap="nowrap" width="15%" align="center"><b>Principal Investigator</b></td>
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'> BrokenPipeError: [Errno 32] Broken pipe
table_html = !tail -n 1 path-list.txt | xargs -I{} http "https://cerg1.ugc.edu.hk/cergprod/{}" | pquery 'table.styleTableContent' -p html
import pandas as pd
df_projects = pd.read_html('<table>%s</table>' % '\n'.join(table_html))
df_projects[0]
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | Project Number | Project Title | Principal Investigator | Status |
1 | NaN | Revolution, Commercialism and Chineseness: The... | Dr Hui, Kwok Wai | On-going |
2 | NaN | Age Differences in Cognitive Control and Daily... | Dr Hou, Wai Kai | On-going |
3 | NaN | The Chinese Healthcare Reform in Provincial Pe... | Dr He, Jingwei Alex | On-going |
4 | NaN | The identification, abundance and sources of m... | Dr Fok, Lincoln | On-going |
5 | NaN | Decoding the Role and Efficacy of Verbal Image... | Dr Chen, Ti Wei | On-going |
6 | NaN | Linguistic Analysis of Mid-20th Century Hong K... | Dr Chin, Chi On | On-going |
7 | NaN | Chinese morality: When propriety is part of th... | Dr Buchtel, Emma Ellen Kathrina | On-going |
We use a version ported to Python3:
https://github.com/hyperlinkapp/python-readability
(already included in the requirements.txt
file)
from readability.readability import Document
import requests
# Download one blog post and run python-readability over it
html = requests.get('http://initiumlab.com/blog/20150922-jackathon3-review/').content
readable_article = Document(html).summary()  # extracted main-body HTML
readable_title = Document(html).short_title()  # cleaned-up page title
# Show the first 1000 characters of the extracted article HTML
print(readable_article[:1000])
<html><body><div><span itemprop="articleBody"><video controls="" poster="../../blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png"><br/> <source src="../../blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/mp4"><br/> <source src="../../blog/20150922-jackathon3-review/jackathon3-timelapse.webm" type="video/webm"><br/> Sorry, you browser does not support HTML5 video.<br/></source></source></video> <p>The video is also available on <a href="https://youtu.be/zFeSh2W1_C8" target="_blank" rel="external">YouTube</a> and <a href="http://v.youku.com/v_show/id_XMTM0MzM1MjEwMA==.html?from=y1.7-2" target="_blank" rel="external">Youku</a>.</p> <h2 id="What_did_we_do?">What did we do?</h2><p>Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.</p> <p>This wee
HTML(readable_article[:1000])
The video is also available on YouTube and Youku.
Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.
This wee
Let's fix the above URL problems
import pyquery
r = pyquery.PyQuery(readable_article)
r('p')
[<p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>, <p>]
r('video').attr('poster')
'../../blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'
r('video source').attr('src')
'../../blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'
r('video').attr('poster', 'http://initiumlab.com/%s' % r('video').attr('poster'))
[<video>]
r('video').attr('poster')
'http://initiumlab.com/../../blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'
r('video source').attr('src', 'http://initiumlab.com/%s' % r('video source').attr('src'))
[<source>, <source>]
r('video source').attr('src')
'http://initiumlab.com/../../blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'
r.html()[:1000]
'<body><div><span itemprop="articleBody"><video controls="" poster="http://initiumlab.com/../../blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png"><br/> <source src="http://initiumlab.com/../../blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/mp4"><br/> <source src="http://initiumlab.com/../../blog/20150922-jackathon3-review/jackathon3-timelapse.mp4" type="video/webm"><br/> Sorry, you browser does not support HTML5 video.<br/></source></source></video>\n\n<p>The video is also available on <a href="https://youtu.be/zFeSh2W1_C8" target="_blank" rel="external">YouTube</a> and <a href="http://v.youku.com/v_show/id_XMTM0MzM1MjEwMA==.html?from=y1.7-2" target="_blank" rel="external">Youku</a>.</p>\n<h2 id="What_did_we_do?">What did we do?</h2><p>Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining '
%%javascript
//IPython.OutputArea.auto_scroll_threshold = 9999;
IPython.OutputArea.prototype._should_scroll = function(){return false;}
HTML(r.html()[:1000])
from scrapely import Scraper
s = Scraper()
help(s.train)
Help on method train in module scrapely: train(url, data, encoding=None) method of scrapely.Scraper instance
from urllib import parse
def get_localhost_url(url):
    """Download *url* into tmp/ and return a localhost URL serving the copy.

    The page is cached under tmp/ with a percent-encoded filename so the
    Jupyter file server ("/files/...") can serve it back (e.g. to scrapely).

    Parameters: url -- the remote page to cache.
    Returns: a "http://localhost:8888/files/..." URL for the cached file.
    """
    filename = parse.quote_plus(url)
    fullpath = 'tmp/%s' % filename
    html = requests.get(url).content
    # Use a context manager so the file handle is flushed and closed
    # deterministically (the original open(...).write(...) leaked it).
    with open(fullpath, 'wb') as f:
        f.write(html)
    return 'http://localhost:8888/files/%s?download=1' % parse.quote_plus(fullpath)
training_url = 'http://initiumlab.com/blog/20150916-legco-eng/'
training_data = {'title': 'Legco Matrix Brief (English)',
'author': 'Initium Lab',
'date': '2015-09-16'}
s.train(get_localhost_url(training_url), training_data)
testing_url = 'http://initiumlab.com/blog/20150901-data-journalism-for-the-blind/'
s.scrape(get_localhost_url(testing_url))
[{'date': ['\n 2015-09-01\n '], 'title': ['\n \n \n \n 可視化火了 盲人怎麼辦\n \n \n ']}]
testing_url = 'http://initiumlab.com/blog/20150922-jackathon3-review/'
s.scrape(get_localhost_url(testing_url))
[{'author': ['Initium Lab'], 'date': ['\n 2015-09-22\n '], 'title': ['\n \n \n \n Jackathon #3 -- Read a data science book in 8 hours\n \n \n ']}]
testing_url = 'http://initiumlab.com/blog/20151015-3d-infographic-user-testing/'
s.scrape(get_localhost_url(testing_url))
[{'date': ['\n 2015-10-15\n '], 'title': ['\n \n \n \n Infographic for the Blind: We Tried 3D Printing That Almost Worked\n \n \n ']}]
blogs = !http get http://initiumlab.com/blog/ | pquery 'a.post-title-link' -p href
blogs
['../blog/20151025-jackathon-no-5/', '../blog/20151015-3d-infographic-user-testing/', '../blog/20151015-Facebook-Signal-Review/', '../blog/20151012-visualization-via-jobs/', '../blog/20151012-what-is-colour/', '../blog/20151005-read-journalism/', '../blog/20150930-google-sheets-explore/', '../blog/20150925-react-in-1-hour-cuhk/', '../blog/20150922-jackathon3-review/', '../blog/20150916-legco-eng/']
# Scrape every blog post with the trained scrapely scraper `s`;
# each s.scrape(...) returns a list of extracted records, so extend().
infos = []
for b in blogs:
    infos.extend(s.scrape(get_localhost_url('http://initiumlab.com/blog/' + b)))
# Display the collected records
infos
[{'author': ['Initium Lab'], 'date': ['\n 2015-10-25\n '], 'title': ['\n \n \n \n Jackathon #5 -- Read a journalism book in 8 hours\n \n \n ']}, {'date': ['\n 2015-10-15\n '], 'title': ['\n \n \n \n Infographic for the Blind: We Tried 3D Printing That Almost Worked\n \n \n ']}, {'date': ['\n 2015-10-15\n '], 'title': ['\n \n \n \n 在Facebook找新聞線索?FB Signal搶鮮試用\n \n \n ']}, {'date': ['\n 2015-10-12\n '], 'title': ['\n \n \n \n 一張圖讀懂喬布斯數據化妝術\n \n \n ']}, {'date': ['\n 2015-10-12\n '], 'title': ['\n \n \n \n 數據新聞人,今夜我們談色\n \n \n ']}, {'author': ['Initium Lab'], 'date': ['\n 2015-10-05\n '], 'title': ['\n \n \n \n Jackathon #5: Read Journalism\n \n \n ']}, {'author': ['Chao Tianyi'], 'date': ['\n 2015-09-30\n '], 'title': ['\n \n \n \n 整日做表沒思路?Google幫你開腦洞\n \n \n ']}, {'author': ['Initium Lab'], 'date': ['\n 2015-09-25\n '], 'title': ['\n \n \n \n React in One Hour\n \n \n ']}, {'author': ['Initium Lab'], 'date': ['\n 2015-09-22\n '], 'title': ['\n \n \n \n Jackathon #3 -- Read a data science book in 8 hours\n \n \n ']}, {'author': ['Initium Lab'], 'date': ['\n 2015-09-16\n '], 'title': ['\n \n \n \n Legco Matrix Brief (English)\n \n \n ']}]
import pandas as pd
# Tabulate the scraped records
df_blogs = pd.DataFrame(infos)
# Titles come back as single-element lists with whitespace padding -- unwrap and strip
df_blogs['title'] = df_blogs['title'].apply(lambda x: x[0].strip())
df_blogs
author | date | title | |
---|---|---|---|
0 | [Initium Lab] | [\n 2015-10-25\n ] | Jackathon #5 -- Read a journalism book in 8 hours |
1 | NaN | [\n 2015-10-15\n ] | Infographic for the Blind: We Tried 3D Printin... |
2 | NaN | [\n 2015-10-15\n ] | 在Facebook找新聞線索?FB Signal搶鮮試用 |
3 | NaN | [\n 2015-10-12\n ] | 一張圖讀懂喬布斯數據化妝術 |
4 | NaN | [\n 2015-10-12\n ] | 數據新聞人,今夜我們談色 |
5 | [Initium Lab] | [\n 2015-10-05\n ] | Jackathon #5: Read Journalism |
6 | [Chao Tianyi] | [\n 2015-09-30\n ] | 整日做表沒思路?Google幫你開腦洞 |
7 | [Initium Lab] | [\n 2015-09-25\n ] | React in One Hour |
8 | [Initium Lab] | [\n 2015-09-22\n ] | Jackathon #3 -- Read a data science book in 8 ... |
9 | [Initium Lab] | [\n 2015-09-16\n ] | Legco Matrix Brief (English) |
Theme: scrape more with less codes
Keywords: quick and dirty hacks
Environment:
Human friendly HTTP interface:
HTTPie
requests
Scale-out
xargs -P
pssh
Manual parse:
pQuery
pyQuery (for FE people)
pandas (useful for tabular data)
Automatic parse, in Python REPL:
PyReadability: extracts the main body of a page
scrapely: learns patterns from your labelling