Although we're putting a lot of emphasis in WwOD on doing the basic task of counting files and bytes in Common Crawl, this notebook will show you how to look at the content of the files in Common Crawl.

Setup¶

In [1]:

# this key, secret access to aws-publicdatasets only -- created for WwOD 13 student usage

# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
#  conn=S3Connection(anon=True) will work instead of conn= S3Connection(KEY, SECRET) -- but there seems to be 
# a bug in how S3Connection gets pickled for anon=True -- so for now, just use the KEY, SECRET

KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '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'

In [2]:

# http://boto.s3.amazonaws.com/s3_tut.html

import boto
from boto.s3.connection import S3Connection

from itertools import islice

conn = S3Connection(KEY,SECRET)

# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
#conn=S3Connection(anon=True)

bucket = conn.get_bucket('aws-publicdatasets')

Integration with url index¶

Look at http://urlsearch.commoncrawl.org/

For example, let's look up ischool.berkeley.edu in the URL index:

http://urlsearch.commoncrawl.org/?q=ischool.berkeley.edu

You can also download the results as a json file

http://urlsearch.commoncrawl.org/download?q=edu.berkeley.ischool

which can be parsed:

In [3]:

import requests
import json
s = requests.get("http://urlsearch.commoncrawl.org/download?q=edu.berkeley.ischool")
data = [json.loads(row) for row in s.content.split("\n") if row]
print len(data)

In [4]:

# http://urlsearch.commoncrawl.org/page/1346876860493/1346901517112/422/320051/596
u = data[0]
u

Out[4]:

{u'arcFileDate': 1346901517112L,
 u'arcFileOffset': 320051,
 u'arcFileParition': 422,
 u'arcSourceSegmentId': 1346876860493L,
 u'compressedSize': 596,
 u'url': u'http://people.ischool.berkeley.edu/~rosario/papers.html'}

In [5]:

# form the urlsearch url from the information returned
urlsearch_url = "http://urlsearch.commoncrawl.org/page/{arcSourceSegmentId}/{arcFileDate}/{arcFileParition}/{arcFileOffset}/{compressedSize}".format(**u)
urlsearch_url

Out[5]:

'http://urlsearch.commoncrawl.org/page/1346876860493/1346901517112/422/320051/596'

In [6]:

# can also look up the corresponding arc.gz file in S3

!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/1346876860493/1346901517112_422.arc.gz 

2012-09-06 04:03 100067216   s3://aws-publicdatasets/common-crawl/parse-output/segment/1346876860493/1346901517112_422.arc.gz

Grabbing pieces of the .arc.gz files¶

I think it's possible to use the Python module warc to parse out the .arc.gz files but if we have the offset and size (provided by the URL index), we don't have to grab the entire file -- but just the piece we want.

Range specification in S3

http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html -> Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35

try:

bytes={offset}-499

OK -- the offset and compressed size can still be used even with gzip compression -- see https://github.com/trivio/common_crawl_index#retrieving-a-page

In [7]:

#https://github.com/trivio/common_crawl_index#retrieving-a-page

from StringIO import StringIO
from gzip import GzipFile


def arc_file(s3, bucket, info):

    bucket = s3.lookup(bucket)
    keyname = "/common-crawl/parse-output/segment/{arcSourceSegmentId}/{arcFileDate}_{arcFileParition}.arc.gz".format(**info)
    key = bucket.lookup(keyname)
    
    start = info['arcFileOffset']
    end = start + info['compressedSize'] - 1
    
    headers={'Range' : 'bytes={}-{}'.format(start, end)}
    
    chunk = StringIO(
         key.get_contents_as_string(headers=headers)
    )
    
    return GzipFile(fileobj=chunk).read()

In [8]:

Out[8]:

{u'arcFileDate': 1346901517112L,
 u'arcFileOffset': 320051,
 u'arcFileParition': 422,
 u'arcSourceSegmentId': 1346876860493L,
 u'compressedSize': 596,
 u'url': u'http://people.ischool.berkeley.edu/~rosario/papers.html'}

In [9]:

s = arc_file(conn, 'aws-publicdatasets', u)

In [10]:

# voila

print s

http://people.ischool.berkeley.edu/~rosario/papers.html 128.32.78.16 20120522225235 text/html 821
HTTP/1.1 200 OK
Date:Tue, 22 May 2012 22:53:48 GMT
Server:Apache/2.2.22 (Fedora)
Last-Modified:Mon, 08 Apr 2002 18:25:30 GMT
ETag:"5a1d0a0-208-39e213165da80"
Accept-Ranges:bytes
Content-Length:520
Connection:close
Content-Type:text/html; charset=UTF-8
x-commoncrawl-DetectedCharset:UTF-8

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html> <head>
<title>Barbara Rosario. Publications</title>
</head>


<frameset cols ="20%,*" frameborder="NO" border="0" framespacing="0"> 
  <frame src="navigation_research_papers.html" frameborder="NO" name="navigation">
  <frame src="papers_frame.html" frameborder="NO" name="view_window">
</frameset>
<noframes>
Sorry, this document can be viewed only with a frame-capable browser.
Back to the <a href="index.html">Home Page</a> 
</noframes>
</html>

Parsing the metadata and textdata files¶

Where to go next: tempting to go implement the index crawling described in:

http://commoncrawl.org/common-crawl-url-index/

index itself: The index itself is located public datasets bucket at s3://aws-publicdatasets/common-crawl/projects/url-index/url-index.1356128792.

Lot more to explore at https://github.com/trivio/common_crawl_index

metadata and text files¶

In [11]:

# example -- let's look at a the structure of a metadata file
# grab 'common-crawl/parse-output/segment/1346823845675/metadata-00000'

k = bucket.get_key('common-crawl/parse-output/segment/1346823845675/metadata-00000')
k.size

Out[11]:

41857708

public URLs -- don't need to generate signature: https://aws-publicdatasets.s3.amazonaws.com/common-crawl/parse-output/segment/1346823845675/metadata-00000

In [ ]:

# easiest way to get file into the local directory -- warning file size is 41857708

!wget https://aws-publicdatasets.s3.amazonaws.com/common-crawl/parse-output/segment/1346823845675/metadata-00000

In [ ]:

# alternative -- use boto to download to local file -- this method will be useful if you want grab content from S3
fp = open('metadata-00000', 'wb')
k.get_file(fp)
fp.close()

The metadata and textdata files in Common Crawl are Hadoop sequences files, specifically, https://github.com/matteobertozzi/Hadoop/blob/master/python-hadoop/examples/SequenceFileReader.py. To parse them, we will use library from https://github.com/matteobertozzi/Hadoop -- here are some installation instructions. (I've installed these libraries on the PiCloud rdhyee/Working_with_Open_Data environment

git clone git://github.com/matteobertozzi/Hadoop.git
cd Hadoop/python-hadoop
python setup.py install

In [12]:

import sys
import json
from hadoop.io import SequenceFile
from itertools import islice


def SequenceFileIterator(path):
    reader = SequenceFile.Reader(path)

    key_class = reader.getKeyClass()
    value_class = reader.getValueClass()

    key = key_class()
    value = value_class()

    position = reader.getPosition()

    while reader.next(key, value):
        yield (position, key.toString(), value.toString())
        position = reader.getPosition()

    reader.close()    
    

path = "metadata-00000"

# read parts of the metdata-0000 file 
for (i, (pos, k, v)) in enumerate(islice(SequenceFileIterator(path), 1)):
    v = json.loads(v)
    archiveInfo = v.get('archiveInfo', None)
    print i, k, archiveInfo
    print "metadata available:", v.keys()
    

0 http://www.museo-cb.com/museo-cb/audio-y-video/frequency-vhs/ {u'arcFileParition': 0, u'compressedSize': 9801, u'arcSourceSegmentId': 1346823845675L, u'arcFileDate': 1346864469604L, u'arcFileOffset': 157}
metadata available: [u'download_size', u'disposition', u'http_headers', u'charset_detector', u'content_len', u'attempt_time', u'http_result', u'content', u'archiveInfo', u'server_ip', u'text_simhash', u'charset_detected', u'parsed_as', u'mime_type', u'md5']