For us to learn: the boto library.

We start by writing a function to calculate stats on one given "valid segment" in the Common Crawl. Then we'll learn how to calculate the stats for all valid segments and aggregate the data.
Although, strictly speaking, you can do all the work directly on PiCloud (where I'm handling the dependencies), you'll likely want to get PiCloud, boto, and s3cmd set up locally. See the Day 19 notes and the Day 16 PiCloud intro for a refresher. One big reason for working locally is that you get charged for the time your PiCloud notebook server is running -- and when you are thinking, it's nice not to have to worry about the clock (even if it is only $0.05/hour for a running c1 PiCloud instance).
Also ask for help if you are having problems.
To see what's in your PiCloud bucket:
import cloud
cloud.bucket.list()
[u'notebook/Day_02_class_starter.ipynb', u'notebook/Day_02_completed.ipynb', u'notebook/Day_04_completed.ipynb', u'notebook/Day_04_starter.ipynb', u'notebook/Day_05_plotting.ipynb', u'notebook/Day_07_array_len_and_multiply.ipynb', u'notebook/Day_08_basemap_globe_example.ipynb', u'notebook/Day_08_completed.ipynb', u'notebook/Day_08_freebase_intro.ipynb', u'notebook/Day_08_starter.ipynb', u'notebook/Day_10_A_fixed_width_parsing_completed.ipynb', u'notebook/Day_10_freebase_cursor_completed.ipynb', u'notebook/Day_10_requests_lxml.ipynb', u'notebook/Day_14_PfDA_revisited.ipynb', u'notebook/Day_14_PfDA_starter.ipynb', u'notebook/Day_14_basemap_redux.ipynb', u'notebook/Day_14_date_time.ipynb', u'notebook/Day_15_Sample_Python_Questions.ipynb', u'notebook/Day_16_PiCloud_intro.ipynb', u'notebook/Day_17_Midterm.ipynb', u'notebook/Day_17_Midterm_with_Key.ipynb', u'notebook/Day_18_Common_Crawl.ipynb', u'notebook/Day_19_CC_etc.ipynb', u'notebook/Day_20_CommonCrawl.ipynb', u'notebook/Primer.ipynb', u'notebook/basemap_example.ipynb', u'notebook/notebook_javascript_examples.ipynb', u'notebook/vtk_example.ipynb']
# http://docs.picloud.com/moduledoc.html#module-cloud.bucket
import os

# only if we're not running on PiCloud...
if not os.path.exists('/home/picloud/notebook'):
    pass
    # normally I keep this line commented out to prevent accidental copying if I run the notebook through.
    cloud.bucket.put('Day_20_CommonCrawl_Starter.ipynb', 'notebook/Day_20_CommonCrawl_Starter.ipynb')
import os

if not os.path.exists('/home/picloud/notebook'):
    pass
    # normally I keep this line commented out to prevent accidental copying if I run the notebook through.
    # note the new local name -- to make it less likely to overwrite something I'm doing locally.
    #cloud.bucket.get('notebook/Day_20_CommonCrawl_Starter.ipynb', 'Day_20_CommonCrawl_Starter_from_picloud.ipynb')
Warning: I don't think you'll immediately see the notebook changes reflected in an already running PiCloud notebook server -- at least, that was my experience.
There are other ways to interact with PiCloud -- using picloud ssh-info and scp -- see SSH into a job and some rough notes. The following code shows how to use picloud ssh-info JID to get the right ssh and scp commands.
You can read off the job id for your PiCloud notebook server from the upper right corner of https://www.picloud.com/accounts/notebook/:
import re

# put the job id of your notebook server after ssh-info
NOTEBOOK_SERVER_RUNNING = False
NOTEBOOK_SERVER_JID = 501

def to_picloud(nb_name):
    scp_to_command = "scp -q -i {identity} -P {port} {nb_name} {username}@{address}:/home/picloud/notebook/".format(nb_name=nb_name, **ssh_info_output)
    return scp_to_command

if NOTEBOOK_SERVER_RUNNING:
    ssh_info_output = !picloud ssh-info $NOTEBOOK_SERVER_JID
    ssh_info_output = dict(zip(*[filter(None, re.split("\s+", l)) for l in ssh_info_output]))
    #print ssh_info_output
    ssh_command = "ssh -q -i {identity} {username}@{address} -p {port}".format(**ssh_info_output)
    print ssh_command
    print to_picloud("Day_20_CommonCrawl_Starter.ipynb")
    # you can even run the scp command from within IPython Notebook -- uncomment the following lines
    # scp_command = to_picloud("Day_20_CommonCrawl_Starter.ipynb")
    # ! $scp_command
Running scp to the live notebook server machine will actually update the notebooks.
It's good to review Dave Lester's talk: http://www.slideshare.net/davelester/introduction-to-common-crawl
If you need a general intro to Common Crawl, watch the Common Crawl video.
The Common Crawl data structure is documented at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set. To quote the docs:
The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set:
http://aws.amazon.com/datasets/41740
The data set is divided into three major subsets: the two archived crawls and the current crawl.
The two archived crawl data sets are stored in folders organized by the year, month, date, and hour the content was crawled. For example:
s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
The current crawl data set is stored in the "parse-output" folder in a similar manner to how Nutch stores archives. Crawl data is stored in a "segments" subfolder, then in a folder that starts with the UNIX timestamp of crawl start time. For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
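As a small illustration of this layout (a sketch added here, not part of the quoted docs), the segment folder name in a parse-output path is a UNIX timestamp in milliseconds, and it's the piece you need to build a listing prefix for that segment:

import datetime

# example path from the current crawl (parse-output)
path = "common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz"

# the 4th path component is the segment id -- a UNIX timestamp in milliseconds
seg_id = path.split("/")[3]
print seg_id, datetime.datetime.fromtimestamp(float(seg_id) / 1000.)

# the prefix you would hand to bucket.list() to enumerate everything in that segment
prefix = "common-crawl/parse-output/segment/{0}/".format(seg_id)
print prefix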
# this key, secret access to aws-publicdatasets only -- created for WwOD 13 student usage
# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
# conn=S3Connection(anon=True) will work instead of conn= S3Connection(KEY, SECRET) -- but there seems to be
# a bug in how S3Connection gets pickled for anon=True -- so for now, just use the KEY, SECRET
KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '8dVCRIWhboKMiJxgs1exIh6eMCG13B+gp/bf5bsl'
You can use this key/secret pair to configure both boto and s3cmd.
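For example, a minimal sketch of both configurations (the ~/.s3cfg field names are standard s3cmd settings, not anything specific to this course):

# boto: pass the key pair directly when opening the connection
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)

# s3cmd: run `s3cmd --configure` and paste the key/secret when prompted,
# or set access_key / secret_key under [default] in ~/.s3cfg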
# s3cmd installed in custom PiCloud environment -- and maybe in your local environment too
# confirm s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
# doc for s3cmd: http://s3tools.org/s3cmd
!s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
2012-01-05 19:19 100001092 s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
# looking at parse-output itself
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output
                       DIR   s3://aws-publicdatasets/common-crawl/parse-output-test/
                       DIR   s3://aws-publicdatasets/common-crawl/parse-output/
2012-09-04 05:03         0   s3://aws-publicdatasets/common-crawl/parse-output-test_$folder$
2012-11-09 11:28         0   s3://aws-publicdatasets/common-crawl/parse-output_$folder$
# looking at what is contained by parse-output "folder"
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
                       DIR   s3://aws-publicdatasets/common-crawl/parse-output/checkpoint_staging/
                       DIR   s3://aws-publicdatasets/common-crawl/parse-output/checkpoints/
                       DIR   s3://aws-publicdatasets/common-crawl/parse-output/segment/
                       DIR   s3://aws-publicdatasets/common-crawl/parse-output/valid_segments2/
2012-10-17 00:11         0   s3://aws-publicdatasets/common-crawl/parse-output/checkpoint_staging_$folder$
2012-11-09 00:10         0   s3://aws-publicdatasets/common-crawl/parse-output/checkpoints_$folder$
2012-09-05 05:13         0   s3://aws-publicdatasets/common-crawl/parse-output/segment_$folder$
2012-11-09 11:28      2478   s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
2012-09-05 05:13         0   s3://aws-publicdatasets/common-crawl/parse-output/valid_segments2_$folder$
2012-07-09 15:07         0   s3://aws-publicdatasets/common-crawl/parse-output/valid_segments_$folder$
There is a list of "valid segments" -- the segments that are part of the current crawl -- in
s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
Let's download it and study it.
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
2012-11-09 11:28 2478 s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
# we can download it:
!s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
ERROR: Parameter problem: File ./valid_segments.txt already exists. Use either of --force / --continue / --skip-existing or give it a new name.
!head valid_segments.txt
1346823845675
1346823846036
1346823846039
1346823846110
1346823846125
1346823846150
1346823846176
1346876860445
1346876860454
1346876860467
# http://boto.s3.amazonaws.com/s3_tut.html
import boto
from boto.s3.connection import S3Connection
from itertools import islice
conn = S3Connection(KEY,SECRET)
# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
#conn=S3Connection(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
for key in islice(bucket.list(prefix="common-crawl/parse-output/", delimiter="/"), None):
    print key.name.encode('utf-8')
common-crawl/parse-output/checkpoint_staging_$folder$
common-crawl/parse-output/checkpoints_$folder$
common-crawl/parse-output/segment_$folder$
common-crawl/parse-output/valid_segments.txt
common-crawl/parse-output/valid_segments2_$folder$
common-crawl/parse-output/valid_segments_$folder$
common-crawl/parse-output/checkpoint_staging/
common-crawl/parse-output/checkpoints/
common-crawl/parse-output/segment/
common-crawl/parse-output/valid_segments2/
# get valid_segments
# https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
import boto
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
k = bucket.get_key("common-crawl/parse-output/valid_segments.txt")
s = k.get_contents_as_string()
valid_segments = filter(None, s.split("\n"))
print len(valid_segments), valid_segments[0]
177 1346823845675
# valid_segments are Unix timestamps (in ms) -- confirm current crawl is from 2012
import datetime
datetime.datetime.fromtimestamp(float(valid_segments[0])/1000.)
datetime.datetime(2012, 9, 4, 22, 44, 5, 675000)
As of the time of this writing (April 4, 2013), there are 177 valid segments in the current crawl. Now it's time to figure out how to write a Python function called segment_stats that takes a segment id and an optional stop parameter (the max number of keys to iterate through), of the form

def segment_stats(seg_id, stop=None):
    pass   # YOUR EXERCISE TO FILL IN

and returns a dict with 2 keys:

count, holding the number of keys inside the given valid segment
size, holding the total number of bytes held in those keys

each broken down by file type (there are 3 major types):

arc.gz for the gzipped ARC files holding the crawled content
textData- files holding the extracted text
metadata- files holding the page metadata

For example:
segment_stats('1346823845675', None)
should return:
{'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377},
 'size': {'arc.gz': 967409519222,
          'metadata': 187079951008,
          'success': 0,
          'textData': 129994977292}}
Since it can take 10-50 seconds or so to retrieve all the keys in a valid segment, it's worth limiting the listing to, say, the first 10 keys to get a feel for what you can do with a key. Run the following:
from itertools import islice
import boto
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
for key in islice(bucket.list(prefix="common-crawl/parse-output/segment/1346823845675/", delimiter="/"), 10):
    print key.name.encode('utf-8')
common-crawl/parse-output/segment/1346823845675/1346864466526_10.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864469604_0.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864469638_1.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864471290_4.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864477152_29.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864479613_6.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864480261_2.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864480936_5.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864484063_39.arc.gz
common-crawl/parse-output/segment/1346823845675/1346864484163_3.arc.gz
# WARNING -- this might take a bit of time to run -- run it to see how long it takes you to get all the keys
# in this segment. The time depends on where you are running this code.
%time all_files = list(islice(bucket.list(prefix="common-crawl/parse-output/segment/1346823845675/", delimiter="/"),None))
print len(all_files), all_files[0]
CPU times: user 3.64 s, sys: 0.16 s, total: 3.79 s
Wall time: 46.83 s
20659 <Key: aws-publicdatasets,common-crawl/parse-output/segment/1346823845675/1346864466526_10.arc.gz>
But it's useful now to have all_files holding all the keys under the segment 1346823845675. Note, for example, that from each key (a boto.s3.key.Key object) you can get the file's name and size:
# http://boto.readthedocs.org/en/latest/ref/s3.html#module-boto.s3.key
file0 = all_files[0]
type(file0), file0.name, file0.size
(boto.s3.key.Key, u'common-crawl/parse-output/segment/1346823845675/1346864466526_10.arc.gz', 100011998)
import boto
from boto.s3.connection import S3Connection

# this key, secret gives access to aws-publicdatasets only -- created for WwOD 13 student usage
KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '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'

from itertools import islice
from pandas import DataFrame

conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')

# you might find this conversion function between a DataFrame and a list of regular dicts useful
# https://gist.github.com/mikedewar/1486027#comment-804797
def df_to_dictlist(df):
    return [{k: df.values[i][v] for v, k in enumerate(df.columns)} for i in range(len(df))]

# classify a Common Crawl key name by file type
def cc_file_type(path):
    fname = path.split("/")[-1]
    if fname[-7:] == '.arc.gz':
        return 'arc.gz'
    elif fname[:9] == 'textData-':
        return 'textData'
    elif fname[:9] == 'metadata-':
        return 'metadata'
    elif fname == '_SUCCESS':
        return 'success'
    else:
        return 'other'

def segment_stats(seg_id, stop=None):
    # FILL IN WITH YOUR CODE
    return {
        'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377},
        'size': {'arc.gz': 967409519222,
                 'metadata': 187079951008,
                 'success': 0,
                 'textData': 129994977292}
    }
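If you want to check your work, here is one possible way to fill in segment_stats -- a sketch of one solution, not the official answer key. It reuses the KEY, SECRET, and cc_file_type defined above, and opens its own S3 connection inside the function so the same code could later be shipped off to PiCloud as a job:

from collections import defaultdict
from itertools import islice
from boto.s3.connection import S3Connection

def segment_stats(seg_id, stop=None):
    """Count keys and total bytes, per file type, for one valid segment."""
    conn = S3Connection(KEY, SECRET)
    bucket = conn.get_bucket('aws-publicdatasets')
    prefix = "common-crawl/parse-output/segment/{0}/".format(seg_id)
    count = defaultdict(int)
    size = defaultdict(int)
    # stop=None walks every key in the segment; an integer caps the iteration
    for key in islice(bucket.list(prefix=prefix, delimiter="/"), stop):
        ftype = cc_file_type(key.name)
        count[ftype] += 1
        size[ftype] += key.size
    return {'count': dict(count), 'size': dict(size)}

With a working segment_stats, the natural next step -- the aggregation promised at the top of these notes -- would be to map it over valid_segments (for instance with cloud.map(segment_stats, valid_segments) on PiCloud) and collect the per-segment dicts into a DataFrame.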