ScrapyDo is a crochet-based blocking API for Scrapy. It allows using Scrapy as a library, mainly aimed at spider prototyping and data exploration in IPython notebooks.
In this notebook we show how to use scrapydo and how it helps to rapidly crawl and explore data. Our main premise is that we want to crawl the web as a means to analyze data, not as an end in itself.
The setup function must be called before any other function.
import scrapydo
scrapydo.setup()
fetch function and highlight helper
The fetch function returns a scrapy.Response object for a given URL.
response = scrapydo.fetch("http://httpbin.org/get?show_env=1")
response
<200 http://httpbin.org/get?show_env=1>
The highlight function is a helper to highlight text content using the pygments module. It is very useful for inspecting text content.
from scrapydo.utils import highlight
highlight(response.body, 'json')
{
  "args": {
    "show_env": "1"
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en",
    "Host": "httpbin.org",
    "Runscope-Service": "httpbin",
    "User-Agent": "Scrapy/1.0.1 (+http://scrapy.org)",
    "X-Forwarded-For": "181.114.87.105",
    "X-Real-Ip": "181.114.87.105"
  },
  "origin": "181.114.87.105",
  "url": "http://httpbin.org/get?show_env=1"
}
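Instead of highlighting the body, it can also be parsed directly with the standard json module. A minimal sketch, using a shortened stand-in string for the response body shown above:

```python
import json

# A shortened stand-in for the httpbin response body shown above
body = b'{"args": {"show_env": "1"}, "url": "http://httpbin.org/get?show_env=1"}'
data = json.loads(body.decode('utf-8'))
print(data['args']['show_env'])  # 1
```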
response = scrapydo.fetch("http://httpbin.org")
highlight(response.body[:300])
<!DOCTYPE html> <html> <head> <meta http-equiv='content-type' value='text/html;charset=utf8'> <meta name='generator' value='Ronn/v0.7.3 (http://github.com/rtomayko/ronn/tree/0.7.3)'> <title>httpbin(1): HTTP Client Testing Service</title> <style type='text/css' media='all'> /* style: man */
highlight(response.css('p').extract())
[u'<p>Freely hosted in <a href="http://httpbin.org">HTTP</a>, <a href="https://httpbin.org">HTTPS</a> & <a href="http://eu.httpbin.org/">EU</a> flavors by <a href="https://www.runscope.com/">Runscope</a></p>', u'<p>Testing an HTTP Library can become difficult sometimes. <a href="http://requestb.in">RequestBin</a> is fantastic for testing POST requests, but doesn\'t let you control the response. This exists to cover all kinds of HTTP scenarios. Additional endpoints are being considered.</p>', u'<p>All endpoint responses are JSON-encoded.</p>', u'<p>You can install httpbin as a library from PyPI and run it as a WSGI app. For example, using Gunicorn:</p>', u'<p>A <a href="https://www.runscope.com/community">Runscope Community Project</a>.</p>', u'<p>Originally created by <a href="http://kennethreitz.com/">Kenneth Reitz</a>.</p>', u'<p><a href="https://hurl.it">Hurl.it</a> - Make HTTP requests.</p>', u'<p><a href="http://requestb.in">RequestBin</a> - Inspect HTTP requests.</p>', u'<p><a href="http://python-requests.org" data-bare-link="true">http://python-requests.org</a></p>']
highlight(response.headers, 'python')
{'Access-Control-Allow-Credentials': ['true'], 'Access-Control-Allow-Origin': ['*'], 'Content-Type': ['text/html; charset=utf-8'], 'Date': ['Mon, 27 Jul 2015 04:27:22 GMT'], 'Server': ['nginx']}
crawl function or how to do spider-less crawling
Here we are going to show how to crawl a URL without defining a spider class, using only callback functions. This is very useful for quick crawling and data exploration.
# Some additional imports for our data exploration.
%matplotlib inline
import matplotlib.pylab as plt
import pandas as pd
import seaborn as sns
sns.set(context='poster', style='ticks')
We replicate the example at scrapy.org by defining two callback functions to crawl the website http://blog.scrapinghub.com. The function parse_blog(response) extracts the listing URLs and the function parse_titles(response) extracts the post titles from each listing page.
import scrapy
def parse_blog(response):
    for url in response.css('ul li a::attr("href")').re(r'/\d\d\d\d/\d\d/$'):
        yield scrapy.Request(response.urljoin(url), parse_titles)

def parse_titles(response):
    for post_title in response.css('div.entries > ul > li a::text').extract():
        yield {'title': post_title}
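The regular expression in parse_blog keeps only monthly archive paths such as /2015/07/. A quick standalone check, using plain re and a few hypothetical hrefs (no crawling involved):

```python
import re

# Same pattern as in parse_blog: match paths ending in /YYYY/MM/
ARCHIVE_RE = re.compile(r'/\d\d\d\d/\d\d/$')

candidates = [
    '/2015/07/',            # monthly archive -> matches
    '/2015/07/some-post/',  # individual post -> no match (pattern is end-anchored)
    '/about/',              # static page -> no match
]
matches = [url for url in candidates if ARCHIVE_RE.search(url)]
print(matches)  # ['/2015/07/']
```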
Once we have the callback functions for our target website, we simply call scrapydo.crawl:
items = scrapydo.crawl('http://blog.scrapinghub.com', parse_blog)
Now that we have our data, we can start doing the fun part! Here we show the post title length distribution.
df = pd.DataFrame(items)
df['length'] = df['title'].apply(len)
df[:5]
|   | title                                             | length |
|---|---------------------------------------------------|--------|
| 0 | EuroPython 2015 on                                | 18     |
| 1 | StartupChats Remote Working Q&A on                | 34     |
| 2 | PyCon Philippines 2015 on                         | 25     |
| 3 | Why MongoDB Is a Bad Choice for Storing Our Sc... | 59     |
| 4 | Introducing Crawlera, a Smart Page Downloader on  | 48     |
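As a sanity check, the length column can be reproduced on a tiny hand-made frame (titles copied from the first rows above; this snippet is illustrative and independent of the crawl):

```python
import pandas as pd

# Mini-frame mirroring the structure of the crawled items
df = pd.DataFrame([
    {'title': 'EuroPython 2015 on'},
    {'title': 'PyCon Philippines 2015 on'},
])
# Same derivation as above: character count of each title
df['length'] = df['title'].apply(len)
print(df['length'].tolist())  # [18, 25]
```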
ax = df['length'].plot(kind='hist', bins=11)
ax2 = df['length'].plot(kind='kde', secondary_y=True, ax=ax)
ax2.set(ylabel="density")
ax.set(title="Title length distribution", xlim=(10, 80), ylabel="posts", xlabel="length");
run_spider function and running spiders from an existing project
The previous section showed how to do quick crawls to retrieve data. In this section we show how to run spiders from existing Scrapy projects, which is useful for rapid spider prototyping as well as for analysing the data crawled by a given spider.
We use a modified dirbot project, which is already accessible through the PYTHONPATH.
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'dirbot.settings'
We want to see the logging output, just as the scrapy crawl command would show it, hence we set the log level to INFO.
import logging
logging.root.setLevel(logging.INFO)
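A narrower variant (an alternative sketch, not what the notebook does) is to raise verbosity only for Scrapy's own loggers, leaving the root logger untouched:

```python
import logging

# Scope the INFO level to the 'scrapy' logger hierarchy only,
# so other libraries keep the root logger's default level
logging.getLogger('scrapy').setLevel(logging.INFO)
print(logging.getLogger('scrapy').getEffectiveLevel() == logging.INFO)  # True
```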
The run_spider function allows running any spider class with custom settings.
from dirbot.spiders import dmoz
items = scrapydo.run_spider(dmoz.DmozSpider, settings={'CLOSESPIDER_ITEMCOUNT': 500})
INFO:scrapy.utils.log:Scrapy 1.0.1 started (bot: scrapybot)
INFO:scrapy.utils.log:Optional features available: ssl, http11
INFO:scrapy.utils.log:Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'CLOSESPIDER_ITEMCOUNT': 500, 'SPIDER_MODULES': ['dirbot.spiders'], 'NEWSPIDER_MODULE': 'dirbot.spiders'}
INFO:scrapy.middleware:Enabled extensions: CoreStats, TelnetConsole, LogStats, CloseSpider, SpiderState
INFO:scrapy.middleware:Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
INFO:scrapy.middleware:Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
INFO:scrapy.middleware:Enabled item pipelines: FilterWordsPipeline, DefaultFields
INFO:scrapy.core.engine:Spider opened
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO:twisted:TelnetConsole starting on 6023
INFO:scrapy.core.engine:Closing spider (closespider_itemcount)
INFO:scrapy.extensions.logstats:Crawled 703 pages (at 703 pages/min), scraped 521 items (at 521 items/min)
INFO:scrapy.statscollectors:Dumping Scrapy stats: {'downloader/request_bytes': 359046, 'downloader/request_count': 765, 'downloader/request_method_count/GET': 765, 'downloader/response_bytes': 4258383, 'downloader/response_count': 765, 'downloader/response_status_count/200': 704, 'downloader/response_status_count/302': 61, 'dupefilter/filtered': 7365, 'finish_reason': 'closespider_itemcount', 'finish_time': datetime.datetime(2015, 7, 27, 4, 28, 55, 693945), 'item_scraped_count': 521, 'log_count/INFO': 9, 'request_depth_max': 6, 'response_received_count': 704, 'scheduler/dequeued': 765, 'scheduler/dequeued/memory': 765, 'scheduler/enqueued': 837, 'scheduler/enqueued/memory': 837, 'start_time': datetime.datetime(2015, 7, 27, 4, 27, 53, 505040)}
INFO:scrapy.core.engine:Spider closed (closespider_itemcount)
INFO:twisted:(TCP Port 6023 Closed)
This way, there is less friction in using Scrapy to mine data from the web, and we can quickly start exploring our data.
highlight(items[:3], 'python')
[{'crawled': datetime.datetime(2015, 7, 27, 4, 27, 55, 80723), 'description': u'- A remote debugger and IDE that can also be used for local debugging.', 'name': u'Hap Python Remote Debugger', 'spider': 'dmoz', 'url': u'http://hapdebugger.sourceforge.net/'}, {'crawled': datetime.datetime(2015, 7, 27, 4, 27, 55, 86720), 'description': u'- An enhanced interactive Python shell with many features for object introspection, system shell access, and its own special command system for adding functionality when working interactively. [Open Source, LGPL]', 'name': u'IPython', 'spider': 'dmoz', 'url': u'http://ipython.scipy.org/'}, {'crawled': datetime.datetime(2015, 7, 27, 4, 27, 55, 87918), 'description': u'- An interactive, graphical Python shell written in Python using wxPython.', 'name': u'PyCrust - The Flakiest Python Shell', 'spider': 'dmoz', 'url': u'http://sourceforge.net/projects/pycrust/'}]
from urlparse import urlparse
dmoz_items = pd.DataFrame(items)
dmoz_items['domain'] = dmoz_items['url'].apply(lambda url: urlparse(url).netloc.replace('www.', ''))
ax = dmoz_items.groupby('domain').apply(len).sort(inplace=False)[-10:].plot(kind='bar')
ax.set(title="Top 10 domains")
plt.setp(ax.xaxis.get_majorticklabels(), rotation=30);
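The domain-normalization step above can be checked in isolation with the standard library (note: on Python 3 the import becomes from urllib.parse import urlparse). The example URLs here are hypothetical:

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def domain(url):
    # Same normalization as above: network location with 'www.' stripped
    return urlparse(url).netloc.replace('www.', '')

print(domain('http://www.example.org/path'))  # example.org
print(domain('https://sub.example.org/'))     # sub.example.org
```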