For us to learn: the boto library.
This notebook duplicates some of Day_20_CommonCrawl_Starter.
For moving files between your computer and PiCloud, look at Day_20_Moving_files_to_PiCloud.ipynb.
For understanding the actual content of the files in Common Crawl, we'll look at Day_21_CommonCrawl_Content.ipynb.
It's good to review Dave Lester's talk: http://www.slideshare.net/davelester/introduction-to-common-crawl
If you need a general intro to Common Crawl, watch the Common Crawl video.
The Common Crawl data structure is documented at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set. To quote the docs:
The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set:
http://aws.amazon.com/datasets/41740
The data set is divided into three major subsets: two archived crawl data sets and the current crawl data set.
The two archived crawl data sets are stored in folders organized by the year, month, date, and hour the content was crawled. For example:
s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
The current crawl data set is stored in the "parse-output" folder in a similar manner to how Nutch stores archives. Crawl data is stored in a "segments" subfolder, then in a folder that starts with the UNIX timestamp of crawl start time. For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
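To make that layout concrete, here's a small sketch (the function name is made up) that splits an archived-crawl key into its date components:

# a minimal sketch: pull the year/month/day/hour folders out of an archived-crawl key
def parse_archived_key(key):
    parts = key.split("/")
    # .../crawl-002/YYYY/MM/DD/HH/<timestamp>_<n>.arc.gz
    return dict(zip(("year", "month", "day", "hour", "filename"), parts[-5:]))

parse_archived_key("common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz")
# -> {'year': '2010', 'month': '01', 'day': '06', 'hour': '10', 'filename': '1262847572760_10.arc.gz'}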
# this key/secret pair gives access to aws-publicdatasets only -- created for WwOD 13 student usage
# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
# conn=S3Connection(anon=True) will work instead of conn= S3Connection(KEY, SECRET) -- but there seems to be
# a bug in how S3Connection gets pickled for anon=True -- so for now, just use the KEY, SECRET
KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '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'
You can use this key/secret pair to configure both boto and s3cmd.
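For s3cmd, running s3cmd --configure will prompt for these values and write them to ~/.s3cfg; the relevant entries in that file look roughly like this (a sketch of the two key settings, not a complete config):

[default]
access_key = AKIAJH2FD7572FCTVSSQ
secret_key = 8dVCRIWhboKMiJxgs1exIh6eMCG13B+gp/bf5bsl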
# s3cmd installed in custom PiCloud environment -- and maybe in your local environment too
# confirm that s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz exists
# doc for s3cmd: http://s3tools.org/s3cmd
!s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
2012-01-05 19:19 100001092 s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
# looking at parse-output itself
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output
# looking at what is contained by parse-output "folder"
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
There is a list of "valid segments" in
s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
-- the segments that are part of the current crawl. Let's download it and study it.
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
# we can download it:
!s3cmd get --force s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
!head valid_segments.txt
# http://boto.s3.amazonaws.com/s3_tut.html
import boto
from boto.s3.connection import S3Connection
from itertools import islice
conn = S3Connection(KEY, SECRET)
# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
#conn=S3Connection(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
for key in islice(bucket.list(prefix="common-crawl/parse-output/", delimiter="/"),None):
print key.name.encode('utf-8')
common-crawl/parse-output/checkpoint_staging_$folder$
common-crawl/parse-output/checkpoints_$folder$
common-crawl/parse-output/segment_$folder$
common-crawl/parse-output/valid_segments.txt
common-crawl/parse-output/valid_segments2_$folder$
common-crawl/parse-output/valid_segments_$folder$
common-crawl/parse-output/checkpoint_staging/
common-crawl/parse-output/checkpoints/
common-crawl/parse-output/segment/
common-crawl/parse-output/valid_segments2/
# get valid_segments
# https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
import boto
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
k = bucket.get_key("common-crawl/parse-output/valid_segments.txt")
s = k.get_contents_as_string()
valid_segments = filter(None, s.split("\n"))
print len(valid_segments), valid_segments[0]
# valid_segments are Unix timestamps (in ms) -- confirm current crawl is from 2012
import datetime
datetime.datetime.fromtimestamp(float(valid_segments[0])/1000.)
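As a quick check, a small sketch (reusing valid_segments from above) that tallies the calendar years across all segment ids:

# tally the years of all segment timestamps -- they should fall in 2012
from collections import Counter
Counter(datetime.datetime.fromtimestamp(float(s)/1000.).year for s in valid_segments)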
As of this writing (April 4, 2013), there are 177 valid segments in the current crawl. Now it's time to figure out how to write a Python function called segment_stats that takes a segment id and an optional stop parameter (the maximum number of keys to iterate through), of the form:
def segment_stats(seg_id, stop=None):
    # YOUR EXERCISE TO FILL IN
    pass
and returns a dict with 2 keys:

count, holding the number of keys inside the given valid segment
size, holding the total number of bytes held in those keys

each broken down by file type. There are 3 major types: arc.gz (the gzipped ARC files holding the crawled content), textData- (extracted text), and metadata- (crawl metadata); there is also a _SUCCESS marker file. For example:
segment_stats('1346823845675', None)
should return:
{
'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377},
'size': {'arc.gz': 967409519222,
'metadata': 187079951008,
'success': 0,
'textData': 129994977292}
}
Since it can take 10-50 seconds or so to retrieve all the keys in a valid segment, it's worth limiting yourself to, say, the first 10 keys to get a feel for what you can do with a key. Run the following:
from itertools import islice
import boto
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
for key in islice(bucket.list(prefix="common-crawl/parse-output/segment/1346823845675/", delimiter="/"),10):
print key.name.encode('utf-8')
# WARNING -- this might take a bit of time to run -- run it to see how long it takes you to get all the keys
# in this segment. (The time depends on where you are running this code.)
%time all_files = list(islice(bucket.list(prefix="common-crawl/parse-output/segment/1346823845675/", delimiter="/"),None))
print len(all_files), all_files[0]
It's useful now to have all_files holding all the keys under segment 1346823845675.
Note, for example, that you can get each file's name and size -- each item is a key object (boto.s3.key.Key).
# http://boto.readthedocs.org/en/latest/ref/s3.html#module-boto.s3.key
file0 = all_files[0]
type(file0), file0.name, file0.size
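As a quick sketch of what those attributes let you do (reusing all_files from above), preview a few names and total up some sizes:

# preview the first few keys and total the bytes in the first ten
for f in all_files[:3]:
    print f.name, f.size
print "first 10 keys:", sum(f.size for f in all_files[:10]), "bytes"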
import boto
from boto.s3.connection import S3Connection
# this key/secret pair gives access to aws-publicdatasets only -- created for WwOD 13 student usage
KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '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'
from itertools import islice
from pandas import DataFrame
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
# you might find this conversion function between DataFrame and a list of a regular dict useful
#https://gist.github.com/mikedewar/1486027#comment-804797
def df_to_dictlist(df):
    return [{k: df.values[i][v] for v, k in enumerate(df.columns)} for i in range(len(df))]
def cc_file_type(path):
    fname = path.split("/")[-1]
    if fname[-7:] == '.arc.gz':
        return 'arc.gz'
    elif fname[:9] == 'textData-':
        return 'textData'
    elif fname[:9] == 'metadata-':
        return 'metadata'
    elif fname == '_SUCCESS':
        return 'success'
    else:
        return 'other'
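# for example, using the current-crawl key from earlier:
#   cc_file_type("common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz") -> 'arc.gz'
# and a segment's _SUCCESS marker file -> 'success'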
def segment_stats(seg_id, stop=None):
    # FILL IN
    return None
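If you want to check your work against the expected output above, here is one possible implementation -- a sketch, not the only way to do it (it assumes bucket and cc_file_type from the cell above):

# one possible solution sketch for the exercise
from itertools import islice
from collections import Counter

def segment_stats_sample(seg_id, stop=None):
    counts, sizes = Counter(), Counter()
    prefix = "common-crawl/parse-output/segment/%s/" % seg_id
    for key in islice(bucket.list(prefix=prefix, delimiter="/"), stop):
        ftype = cc_file_type(key.name)
        counts[ftype] += 1
        sizes[ftype] += key.size
    return {'count': dict(counts), 'size': dict(sizes)}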
# recall the first segment -- let's work on that segment
valid_segments[0]
'1346823845675'
# look at how long it takes to run locally
%time segment_stats(valid_segments[0], None)
CPU times: user 3.94 s, sys: 0.15 s, total: 4.09 s
Wall time: 37.76 s
{'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377}, 'size': {'arc.gz': 967409519222, 'metadata': 187079951008, 'success': 0, 'textData': 129994977292}}
# here's how to run it on PiCloud
# Prerequisite: http://docs.picloud.com/primer.html <--- READ THIS AND STUDY TO REFRESH YOUR MEMORY
import cloud
jid = cloud.call(segment_stats, '1346823845675', None, _env='/rdhyee/Working_with_Open_Data')
# pull up status -- refresh until done
cloud.status(jid)
'processing'
# this will block until job is done or errors out
cloud.join(jid)
# get your result
cloud.result(jid)
{'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377}, 'size': {'arc.gz': 967409519222, 'metadata': 187079951008, 'success': 0, 'textData': 129994977292}}
# get some basic info
cloud.info(jid)
{708: {'runtime': 27.319, 'status': 'done', 'stderr': None, 'stdout': None}}
# get some specific info
cloud.info(jid, info_requested=['created', 'finished', 'runtime', 'cputime'])
{708: {'cputime.system': 1.6, 'cputime.user': 6.075, 'created': datetime.datetime(2013, 4, 9, 18, 31, 26), 'finished': datetime.datetime(2013, 4, 9, 18, 31, 56), 'runtime': 27.319}}
I had to retry 2 jobs
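Retries like this can be scripted. Here's a minimal sketch (the helper name is made up) that resubmits any job whose status comes back as 'error':

# hypothetical helper: resubmit failed PiCloud jobs, returning (seg_id, new_jid) pairs
def retry_failed(seg_ids, jids, env='/rdhyee/Working_with_Open_Data'):
    retries = []
    for (seg_id, status) in zip(seg_ids, cloud.status(list(jids))):
        if status == 'error':
            retries.append((seg_id, cloud.call(segment_stats, seg_id, None, _env=env)))
    return retries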
# now tally everything noting the retries -- might be worth writing this generally
# THIS CODE REFERS SPECIFICALLY TO RAYMOND YEE'S JOBS -- REPLACE WITH YOUR OWN IDS
import cloud
from itertools import izip, ifilter, chain
from matplotlib import pyplot as plt
valid_segments
segment_jids = xrange(319, 496)
retries_seg_ids = ['1346876860789', '1350433106986']
retries_jids = xrange(496, 498)
tally = list(ifilter(lambda x: x[2] == 'done',
                     izip(chain(valid_segments, retries_seg_ids),
                          chain(segment_jids, retries_jids),
                          cloud.status(list(chain(segment_jids, retries_jids))))))
result = cloud.result([jid for (seg_id, jid, status) in tally])
# http://docs.picloud.com/moduledoc.html#module-cloud
jobs_info = cloud.info(list(islice(chain(segment_jids, retries_jids), None)),
                       info_requested=['created', 'finished', 'runtime', 'cputime'])
started = [{'jid':k, 'time':v['finished'] - datetime.timedelta(seconds=v['runtime']), 'count': 1} for (k,v) in jobs_info.items()]
finished = [{'jid':k, 'time':v['finished'], 'count': -1} for (k,v) in jobs_info.items()]
df = DataFrame(started + finished)
exclude_n = 4
plt.plot(df.sort_index(by='time')['time'][:-exclude_n], df.sort_index(by='time')['count'].cumsum()[:-exclude_n])
[<matplotlib.lines.Line2D at 0x7ffb550>]
# maybe use pickle to serialize results -- first confirm they survive a round trip
import pickle
s = pickle.loads(pickle.dumps(dict(zip([seg_id for (seg_id, jid, status) in tally], result))))
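To actually persist the results between sessions, a small sketch writing the tally to disk and reading it back (the filename is arbitrary):

# save the per-segment results to a file and reload them
results_by_segment = dict(zip([seg_id for (seg_id, jid, status) in tally], result))
with open("segment_stats.pickle", "wb") as f:
    pickle.dump(results_by_segment, f)
with open("segment_stats.pickle", "rb") as f:
    results_by_segment = pickle.load(f)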
You can also run the jobs locally using cloud.mp:
# http://docs.picloud.com/cloud_cloudmp.html
USE_LOCAL = False
if USE_LOCAL:
    CLOUD = cloud.mp
else:
    CLOUD = cloud
# try setting n_tasks to something less than the number of segments to test out the code
n_tasks = len(valid_segments)
jids = CLOUD.map(segment_stats, valid_segments[:n_tasks], [None]*n_tasks, _env='/rdhyee/Working_with_Open_Data')
jids
xrange(886, 1063)
CLOUD.status(jids)
['done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done']
from itertools import izip
from collections import Counter
file_counter = Counter()
byte_counter = Counter()
problems = []
for (i, (seg_id, result)) in enumerate(izip(valid_segments[:n_tasks], CLOUD.iresult(jids))):
    try:
        file_counter.update(result['count'])
        byte_counter.update(result['size'])
        print i, seg_id, byte_counter['arc.gz']
    except Exception as e:
        print i, e
        problems.append((seg_id, e))
0 1346823845675 967409519222
1 1346823846036 3553203682764
2 1346823846039 3820954411597
3 1346823846110 4029974708066
4 1346823846125 4938023527469
...
175 1350433107105 71103070723002
176 1350433107106 71106384571350
jobs_info = CLOUD.info(jids,
                       info_requested=['created', 'finished', 'runtime', 'cputime'])
# plot the number of cores running vs. time
started = [{'jid':k, 'time':v['finished'] - datetime.timedelta(seconds=v['runtime']), 'count': 1} for (k,v) in jobs_info.items()]
finished = [{'jid':k, 'time':v['finished'], 'count': -1} for (k,v) in jobs_info.items()]
df = DataFrame(started + finished)
plt.plot(df.sort_index(by='time')['time'], df.sort_index(by='time')['count'].cumsum())
[<matplotlib.lines.Line2D at 0x7f36e30>]
byte_counter
Counter({'arc.gz': 71106384571350, 'metadata': 11010558690874, 'textData': 6978342039325, 'other': 1626219, 'success': 0})
# http://stackoverflow.com/a/1823101/7782
import locale
locale.setlocale(locale.LC_ALL, 'en_US')
locale.format("%d", byte_counter['arc.gz'], grouping=True)
'71,106,384,571,350'
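For a more readable figure, a quick sketch converting the per-type byte totals to decimal terabytes (divide by 2**40 instead for binary TiB):

# express the byte totals in decimal terabytes
for (ftype, nbytes) in byte_counter.most_common():
    print "%-8s %10.2f TB" % (ftype, nbytes / 1e12)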