For us to learn: the boto library.
This notebook duplicates some of Day_20_CommonCrawl_Starter.
For moving files between your computer and PiCloud, look at Day_20_Moving_files_to_PiCloud.ipynb.
For understanding the actual content of the files in Common Crawl, we'll look at Day_21_CommonCrawl_Content.ipynb.
It's good to review Dave Lester's talk: http://www.slideshare.net/davelester/introduction-to-common-crawl
If you need a general intro to Common Crawl, watch the Common Crawl video.
The Common Crawl data structure is documented at https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set. To quote the docs:
The entire Common Crawl data set is stored on Amazon S3 as a Public Data Set:
http://aws.amazon.com/datasets/41740
The data set is divided into three major subsets: two archived crawl data sets and the current crawl data set.
The two archived crawl data sets are stored in folders organized by the year, month, date, and hour the content was crawled. For example:
s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
The current crawl data set is stored in the "parse-output" folder in a similar manner to how Nutch stores archives. Crawl data is stored in a "segments" subfolder, then in a folder that starts with the UNIX timestamp of crawl start time. For example:
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
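To make that layout concrete, here's a small sketch (the function name is made up) that splits an archived-crawl key into its date components:

# a minimal sketch: pull the year/month/day/hour folders out of an archived-crawl key
def parse_archived_key(key):
    parts = key.split("/")
    # .../crawl-002/YYYY/MM/DD/HH/<timestamp>_<n>.arc.gz
    return dict(zip(("year", "month", "day", "hour", "filename"), parts[-5:]))

parse_archived_key("common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz")
# -> {'year': '2010', 'month': '01', 'day': '06', 'hour': '10', 'filename': '1262847572760_10.arc.gz'}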
# this key/secret pair gives access to aws-publicdatasets only -- created for WwOD 13 student usage
# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
# conn=S3Connection(anon=True) will work instead of conn= S3Connection(KEY, SECRET) -- but there seems to be
# a bug in how S3Connection gets pickled for anon=True -- so for now, just use the KEY, SECRET
KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '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'
You can use this key/secret pair to configure both boto and s3cmd.
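For s3cmd, running s3cmd --configure will prompt for these values and write them to ~/.s3cfg; the relevant entries in that file look roughly like this (a sketch of the two key settings, not a complete config):

[default]
access_key = AKIAJH2FD7572FCTVSSQ
secret_key = 8dVCRIWhboKMiJxgs1exIh6eMCG13B+gp/bf5bsl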
# s3cmd installed in custom PiCloud environment -- and maybe in your local environment too
# confirm that s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz exists
# doc for s3cmd: http://s3tools.org/s3cmd
!s3cmd ls s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
2012-01-05 19:19 100001092 s3://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/10/1262847572760_10.arc.gz
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
# looking at parse-output itself
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output
# looking at what is contained by parse-output "folder"
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/
There is a list of "valid segments" in
s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
-- the segments that are part of the current crawl. Let's download it and study it.
!s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
# we can download it:
!s3cmd get --force s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
!head valid_segments.txt
# http://boto.s3.amazonaws.com/s3_tut.html
import boto
from boto.s3.connection import S3Connection
from itertools import islice
conn = S3Connection(KEY, SECRET)
# turns out there is an anonymous mode in boto for public data sets:
# https://github.com/keiw/common_crawl_index/commit/ad341d0a41a828f260c9c08419dadff0dac6cf5b#L0R33
#conn=S3Connection(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
for key in islice(bucket.list(prefix="common-crawl/parse-output/", delimiter="/"),None):
print key.name.encode('utf-8')
common-crawl/parse-output/checkpoint_staging_$folder$
common-crawl/parse-output/checkpoints_$folder$
common-crawl/parse-output/segment_$folder$
common-crawl/parse-output/valid_segments.txt
common-crawl/parse-output/valid_segments2_$folder$
common-crawl/parse-output/valid_segments_$folder$
common-crawl/parse-output/checkpoint_staging/
common-crawl/parse-output/checkpoints/
common-crawl/parse-output/segment/
common-crawl/parse-output/valid_segments2/
# get valid_segments
# https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
import boto
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
k = bucket.get_key("common-crawl/parse-output/valid_segments.txt")
s = k.get_contents_as_string()
valid_segments = filter(None, s.split("\n"))
print len(valid_segments), valid_segments[0]
# valid_segments are Unix timestamps (in ms) -- confirm current crawl is from 2012
import datetime
datetime.datetime.fromtimestamp(float(valid_segments[0])/1000.)
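As a quick check, a small sketch (reusing valid_segments from above) that tallies the calendar years across all segment ids:

# tally the years of all segment timestamps -- they should fall in 2012
from collections import Counter
Counter(datetime.datetime.fromtimestamp(float(s)/1000.).year for s in valid_segments)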
As of this writing (April 4, 2013), there are 177 valid segments in the current crawl. Now it's time to figure out how to write a Python function called segment_stats that takes a segment id and an optional stop parameter (the maximum number of keys to iterate through), of the form:
def segment_stats(seg_id, stop=None):
    # YOUR EXERCISE TO FILL IN
    pass
and returns a dict with 2 keys:

count, holding the number of keys inside the given valid segment
size, holding the total number of bytes held in those keys

each broken down by file type. There are 3 major types: arc.gz (the gzipped ARC files holding the crawled content), textData- (extracted text), and metadata- (crawl metadata); there is also a _SUCCESS marker file. For example:
segment_stats('1346823845675', None)
should return:
{
'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377},
'size': {'arc.gz': 967409519222,
'metadata': 187079951008,
'success': 0,
'textData': 129994977292}
}
Since it can take 10-50 seconds or so to retrieve all the keys in a valid segment, it's worth limiting yourself to, say, the first 10 keys to get a feel for what you can do with a key. Run the following:
from itertools import islice
import boto
from boto.s3.connection import S3Connection
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
for key in islice(bucket.list(prefix="common-crawl/parse-output/segment/1346823845675/", delimiter="/"),10):
print key.name.encode('utf-8')
# WARNING -- this might take a bit of time to run -- run it to see how long it takes you to get all the keys
# in this segment. (The time depends on where you are running this code.)
%time all_files = list(islice(bucket.list(prefix="common-crawl/parse-output/segment/1346823845675/", delimiter="/"),None))
print len(all_files), all_files[0]
It's useful now to have all_files holding all the keys under segment 1346823845675.
Note, for example, that you can get each file's name and size -- each item is a key object (boto.s3.key.Key).
# http://boto.readthedocs.org/en/latest/ref/s3.html#module-boto.s3.key
file0 = all_files[0]
type(file0), file0.name, file0.size
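As a quick sketch of what those attributes let you do (reusing all_files from above), preview a few names and total up some sizes:

# preview the first few keys and total the bytes in the first ten
for f in all_files[:3]:
    print f.name, f.size
print "first 10 keys:", sum(f.size for f in all_files[:10]), "bytes"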
import boto
from boto.s3.connection import S3Connection
# this key/secret pair gives access to aws-publicdatasets only -- created for WwOD 13 student usage
KEY = 'AKIAJH2FD7572FCTVSSQ'
SECRET = '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'
from itertools import islice
from pandas import DataFrame
conn = S3Connection(KEY, SECRET)
bucket = conn.get_bucket('aws-publicdatasets')
# you might find this conversion function between DataFrame and a list of a regular dict useful
#https://gist.github.com/mikedewar/1486027#comment-804797
def df_to_dictlist(df):
    return [{k: df.values[i][v] for v, k in enumerate(df.columns)} for i in range(len(df))]
def cc_file_type(path):
    fname = path.split("/")[-1]
    if fname[-7:] == '.arc.gz':
        return 'arc.gz'
    elif fname[:9] == 'textData-':
        return 'textData'
    elif fname[:9] == 'metadata-':
        return 'metadata'
    elif fname == '_SUCCESS':
        return 'success'
    else:
        return 'other'
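# for example, using the current-crawl key from earlier:
#   cc_file_type("common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz") -> 'arc.gz'
# and a segment's _SUCCESS marker file -> 'success'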
def segment_stats(seg_id, stop=None):
    # FILL IN
    return None
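If you want to check your work against the expected output above, here is one possible implementation -- a sketch, not the only way to do it (it assumes bucket and cc_file_type from the cell above):

# one possible solution sketch for the exercise
from itertools import islice
from collections import Counter

def segment_stats_sample(seg_id, stop=None):
    counts, sizes = Counter(), Counter()
    prefix = "common-crawl/parse-output/segment/%s/" % seg_id
    for key in islice(bucket.list(prefix=prefix, delimiter="/"), stop):
        ftype = cc_file_type(key.name)
        counts[ftype] += 1
        sizes[ftype] += key.size
    return {'count': dict(counts), 'size': dict(sizes)}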
# recall the first segment -- let's work on that segment
valid_segments[0]
'1346823845675'
# look at how long it takes to run locally
%time segment_stats(valid_segments[0], None)
CPU times: user 3.94 s, sys: 0.15 s, total: 4.09 s
Wall time: 37.76 s
{'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377}, 'size': {'arc.gz': 967409519222, 'metadata': 187079951008, 'success': 0, 'textData': 129994977292}}
# here's how to run it on PiCloud
# Prerequisite: http://docs.picloud.com/primer.html <--- READ THIS AND STUDY TO REFRESH YOUR MEMORY
import cloud
jid = cloud.call(segment_stats, '1346823845675', None, _env='/rdhyee/Working_with_Open_Data')
# pull up status -- refresh until done
cloud.status(jid)
'processing'
# this will block until job is done or errors out
cloud.join(jid)
# get your result
cloud.result(jid)
{'count': {'arc.gz': 11904, 'metadata': 4377, 'success': 1, 'textData': 4377}, 'size': {'arc.gz': 967409519222, 'metadata': 187079951008, 'success': 0, 'textData': 129994977292}}
# get some basic info
cloud.info(jid)
{708: {'runtime': 27.319, 'status': 'done', 'stderr': None, 'stdout': None}}
# get some specific info
cloud.info(jid, info_requested=['created', 'finished', 'runtime', 'cputime'])
{708: {'cputime.system': 1.6, 'cputime.user': 6.075, 'created': datetime.datetime(2013, 4, 9, 18, 31, 26), 'finished': datetime.datetime(2013, 4, 9, 18, 31, 56), 'runtime': 27.319}}
I had to retry 2 jobs
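Retries like this can be scripted. Here's a minimal sketch (the helper name is made up) that resubmits any job whose status comes back as 'error':

# hypothetical helper: resubmit failed PiCloud jobs, returning (seg_id, new_jid) pairs
def retry_failed(seg_ids, jids, env='/rdhyee/Working_with_Open_Data'):
    retries = []
    for (seg_id, status) in zip(seg_ids, cloud.status(list(jids))):
        if status == 'error':
            retries.append((seg_id, cloud.call(segment_stats, seg_id, None, _env=env)))
    return retries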
# now tally everything noting the retries -- might be worth writing this generally
# THIS CODE REFERS SPECIFICALLY TO RAYMOND YEE'S JOBS -- REPLACE WITH YOUR OWN IDS
import cloud
from itertools import izip, ifilter, chain
from matplotlib import pyplot as plt
valid_segments
segment_jids = xrange(319, 496)
retries_seg_ids = ['1346876860789', '1350433106986']
retries_jids = xrange(496, 498)
tally = list(ifilter(lambda x: x[2] == 'done',
                     izip(chain(valid_segments, retries_seg_ids),
                          chain(segment_jids, retries_jids),
                          cloud.status(list(chain(segment_jids, retries_jids))))))
result = cloud.result([jid for (seg_id, jid, status) in tally])
# http://docs.picloud.com/moduledoc.html#module-cloud
jobs_info = cloud.info(list(islice(chain(segment_jids, retries_jids), None)),
                       info_requested=['created', 'finished', 'runtime', 'cputime'])
started = [{'jid':k, 'time':v['finished'] - datetime.timedelta(seconds=v['runtime']), 'count': 1} for (k,v) in jobs_info.items()]
finished = [{'jid':k, 'time':v['finished'], 'count': -1} for (k,v) in jobs_info.items()]
df = DataFrame(started + finished)
exclude_n = 4
plt.plot(df.sort_index(by='time')['time'][:-exclude_n], df.sort_index(by='time')['count'].cumsum()[:-exclude_n])
[<matplotlib.lines.Line2D at 0x7ffb550>]
# maybe use pickle to serialize results -- first confirm they survive a round trip
import pickle
s = pickle.loads(pickle.dumps(dict(zip([seg_id for (seg_id, jid, status) in tally], result))))
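To actually persist the results between sessions, a small sketch writing the tally to disk and reading it back (the filename is arbitrary):

# save the per-segment results to a file and reload them
results_by_segment = dict(zip([seg_id for (seg_id, jid, status) in tally], result))
with open("segment_stats.pickle", "wb") as f:
    pickle.dump(results_by_segment, f)
with open("segment_stats.pickle", "rb") as f:
    results_by_segment = pickle.load(f)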
You can also run the jobs locally using cloud.mp:
# http://docs.picloud.com/cloud_cloudmp.html
USE_LOCAL = False
if USE_LOCAL:
    CLOUD = cloud.mp
else:
    CLOUD = cloud
# try setting n_tasks to something less than the number of segments to test out the code
n_tasks = len(valid_segments)
jids = CLOUD.map(segment_stats, valid_segments[:n_tasks], [None]*n_tasks, _env='/rdhyee/Working_with_Open_Data')
jids
xrange(886, 1063)
CLOUD.status(jids)
['done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done', 'done']
from itertools import izip
from collections import Counter
file_counter = Counter()
byte_counter = Counter()
problems = []
for (i, (seg_id, result)) in enumerate(izip(valid_segments[:n_tasks], CLOUD.iresult(jids))):
    try:
        file_counter.update(result['count'])
        byte_counter.update(result['size'])
        print i, seg_id, byte_counter['arc.gz']
    except Exception as e:
        print i, e
        problems.append((seg_id, e))
0 1346823845675 967409519222
1 1346823846036 3553203682764
2 1346823846039 3820954411597
3 1346823846110 4029974708066
4 1346823846125 4938023527469
...
175 1350433107105 71103070723002
176 1350433107106 71106384571350
jobs_info = CLOUD.info(jids,
                       info_requested=['created', 'finished', 'runtime', 'cputime'])
# plot the number of cores running vs. time
started = [{'jid':k, 'time':v['finished'] - datetime.timedelta(seconds=v['runtime']), 'count': 1} for (k,v) in jobs_info.items()]
finished = [{'jid':k, 'time':v['finished'], 'count': -1} for (k,v) in jobs_info.items()]
df = DataFrame(started + finished)
plt.plot(df.sort_index(by='time')['time'], df.sort_index(by='time')['count'].cumsum())
[<matplotlib.lines.Line2D at 0x7f36e30>]
byte_counter
Counter({'arc.gz': 71106384571350, 'metadata': 11010558690874, 'textData': 6978342039325, 'other': 1626219, 'success': 0})
# http://stackoverflow.com/a/1823101/7782
import locale
locale.setlocale(locale.LC_ALL, 'en_US')
locale.format("%d", byte_counter['arc.gz'], grouping=True)
'71,106,384,571,350'
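For a more readable figure, a quick sketch converting the per-type byte totals to decimal terabytes (divide by 2**40 instead for binary TiB):

# express the byte totals in decimal terabytes
for (ftype, nbytes) in byte_counter.most_common():
    print "%-8s %10.2f TB" % (ftype, nbytes / 1e12)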