# Configuring a cluster for this demo

This demo uses slightly newer versions of several packages than those shipped on the Spark 0.8.0 AMI: a Spark 0.9.0 snapshot, a newer version of the spark-ec2 scripts, Python 2.7, and the latest version of IPython.

I'm using a snapshot version of Spark because I wanted to include the patch that fixes SPARK-669.

Here are some notes on how I configured my cluster. Note: these notes are slightly incomplete; I'll finish them later.

## Launching the Cluster

• Launch a cluster with 20 m1.xlarge instances:

 ./spark-ec2 -i ~/.ssh/berkeley-laptop.pem -k berkeley-laptop -s 20 -t m1.xlarge launch meetup


## Updating Spark components

• Back up the current Spark version:

cp -r ~/spark/conf/ ~/sparkconf-backup
mv ~/spark ~/spark-old

• Clone and build Spark:

git clone https://github.com/apache/incubator-spark.git spark
cd spark
sbt/sbt assembly
rm -rf ~/spark/conf
cp -r ~/sparkconf-backup ~/spark/conf

• Deploy the updated Spark: since copy-dir doesn't have a --delete option yet, edit it and add that flag to the rsync command. Then, run

~/spark-ec2/copy-dir ~/spark

• TODO: Update spark-ec2

## Updating Python and IPython:

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
easy_install-2.7 pip
pip-2.7 install ipython[all]
pip-2.7 install requests numpy
yum install -y freetype-devel libpng-devel
pip-2.7 install matplotlib


To have PySpark use this Python version, add the line

export PYSPARK_PYTHON=python2.7


to spark-env.sh and re-sync it across the workers.

## Restarting Spark

Now that we've upgraded everything, restart Spark:

~/spark/bin/stop-all.sh
~/spark/bin/start-all.sh


## Configuring IPython Notebook for Spark

Follow these instructions to configure IPython notebook. Before running ipython notebook, make sure to source spark-env.sh so that its settings are used.

## Copying the Wikipedia data from S3

Edit ~/ephemeral-hdfs/conf/core-site.xml to include your S3 credentials (the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties).

Then run ~/ephemeral-hdfs/bin/start-all.sh to start Hadoop, followed by

~/ephemeral-hdfs/bin/hadoop distcp s3n://wiki-traffic/wiki-dump/text_with_title /wikitext


to copy the Wikipedia dump from S3.

## Misc. notes:

• Reduce SPARK_MEM slightly to leave some extra memory for the OS.

# IPython Notebook Configuration

In [1]:
from IPython.display import HTML
import requests
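MASTER_HOSTNAME = requests.get("http://169.254.169.254/latest/meta-data/public-hostname").text
# Assumption: the standalone master runs on this host and listens on port 7077,
# which matches the URL printed in the "Connecting to Spark" section.
CLUSTER_URL = "spark://%s:7077" % MASTER_HOSTNAME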
%matplotlib inline


## Helper Functions

In [2]:
def wiki_table(data):
    from types import StringTypes
    html = ["<table>"]
    for row in data:
        # Treat a bare string as a single-column row.
        if isinstance(row, StringTypes):
            row = [row]
        html.append('<tr>')
        # The first column is the page title, linked to its Wikipedia article.
        html.append('<td><a href="http://en.wikipedia.org/wiki/%s" target="_BLANK">%s</a></td>' % (row[0], row[0]))
        for col in row[1:]:
            html.append("<td>%s</td>" % str(col))
        html.append('</tr>')
    html.append('</table>')
    return HTML(''.join(html))
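
To see what the helper produces without a running notebook, here is a minimal sketch of just the string-building part. The name `wiki_table_html` is hypothetical (it is not in the original notebook), and it checks for `str` rather than `StringTypes` so that it also runs on Python 3; on Python 2 that check would miss `unicode` rows.

```python
def wiki_table_html(data):
    """Build the same HTML string as wiki_table, without wrapping it in HTML()."""
    html = ["<table>"]
    for row in data:
        # A bare string is treated as a one-column row.
        if isinstance(row, str):
            row = [row]
        html.append("<tr>")
        # First column: page title linked to its Wikipedia article.
        html.append(
            '<td><a href="http://en.wikipedia.org/wiki/%s" target="_BLANK">%s</a></td>'
            % (row[0], row[0])
        )
        # Remaining columns are rendered as plain cells.
        for col in row[1:]:
            html.append("<td>%s</td>" % str(col))
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)


print(wiki_table_html([("Berkeley", 42)]))
print(wiki_table_html(["Spark"]))
```

Each row becomes one `<tr>`; only the first cell is a link, so the function works both for lists of titles and for lists of (title, value, ...) tuples.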


# Connecting to Spark

In [3]:
print CLUSTER_URL

from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'ipython-notebook')

spark://ec2-174-129-88-152.compute-1.amazonaws.com:7077



Let's test it out by creating a simple RDD:

In [4]:
data = sc.parallelize(range(1000), 10)
data

Out[4]:
<pyspark.rdd.RDD at 0x384ac90>

In [5]:
data.count()

Out[5]:
1000

In [6]:
data.take(10)

Out[6]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [10]:
help(data)

Help on RDD in module pyspark.rdd object:

class RDD(__builtin__.object)
|  A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
|  Represents an immutable, partitioned collection of elements that can be
|  operated on in parallel.
|
|  Methods defined here:
|
|  __add__(self, other)
|      Return the union of this RDD and another one.
|
|      >>> rdd = sc.parallelize([1, 1, 2, 3])
|      >>> (rdd + rdd).collect()
|      [1, 1, 2, 3, 1, 1, 2, 3]
|
|  __init__(self, jrdd, ctx)
|
|  cache(self)
|      Persist this RDD with the default storage level (C{MEMORY_ONLY}).
|
|  cartesian(self, other)
|      Return the Cartesian product of this RDD and another one, that is, the
|      RDD of all pairs of elements C{(a, b)} where C{a} is in C{self} and
|      C{b} is in C{other}.
|
|      >>> rdd = sc.parallelize([1, 2])
|      >>> sorted(rdd.cartesian(rdd).collect())
|      [(1, 1), (1, 2), (2, 1), (2, 2)]
|
|  checkpoint(self)
|      Mark this RDD for checkpointing. It will be saved to a file inside the
|      checkpoint directory set with L{SparkContext.setCheckpointDir()} and
|      all references to its parent RDDs will be removed. This function must
|      be called before any job has been executed on this RDD. It is strongly
|      recommended that this RDD is persisted in memory, otherwise saving it
|      on a file will require recomputation.
|
|  cogroup(self, other, numPartitions=None)
|      For each key k in C{self} or C{other}, return a resulting RDD that
|      contains a tuple with the list of values for that key in C{self} as well
|      as C{other}.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 4)])
|      >>> y = sc.parallelize([("a", 2)])
|      >>> sorted(x.cogroup(y).collect())
|      [('a', ([1], [2])), ('b', ([4], []))]
|
|  collect(self)
|      Return a list that contains all of the elements in this RDD.
|
|  collectAsMap(self)
|      Return the key-value pairs in this RDD to the master as a dictionary.
|
|      >>> m = sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
|      >>> m[1]
|      2
|      >>> m[3]
|      4
|
|  combineByKey(self, createCombiner, mergeValue, mergeCombiners, numPartitions=None)
|      Generic function to combine the elements for each key using a custom
|      set of aggregation functions.
|
|      Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined
|      type" C.  Note that V and C can be different -- for example, one might
|      group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
|
|      Users provide three functions:
|
|          - C{createCombiner}, which turns a V into a C (e.g., creates
|            a one-element list)
|          - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
|            a list)
|          - C{mergeCombiners}, to combine two C's into a single one.
|
|      In addition, users can control the partitioning of the output RDD.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
|      >>> def f(x): return x
|      >>> def add(a, b): return a + str(b)
|      >>> sorted(x.combineByKey(str, add, add).collect())
|      [('a', '11'), ('b', '1')]
|
|  count(self)
|      Return the number of elements in this RDD.
|
|      >>> sc.parallelize([2, 3, 4]).count()
|      3
|
|  countByKey(self)
|      Count the number of elements for each key, and return the result to the
|      master as a dictionary.
|
|      >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
|      >>> sorted(rdd.countByKey().items())
|      [('a', 2), ('b', 1)]
|
|  countByValue(self)
|      Return the count of each unique value in this RDD as a dictionary of
|      (value, count) pairs.
|
|      >>> sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items())
|      [(1, 2), (2, 3)]
|
|  distinct(self)
|      Return a new RDD containing the distinct elements in this RDD.
|
|      >>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
|      [1, 2, 3]
|
|  filter(self, f)
|      Return a new RDD containing only the elements that satisfy a predicate.
|
|      >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
|      >>> rdd.filter(lambda x: x % 2 == 0).collect()
|      [2, 4]
|
|  first(self)
|      Return the first element in this RDD.
|
|      >>> sc.parallelize([2, 3, 4]).first()
|      2
|
|  flatMap(self, f, preservesPartitioning=False)
|      Return a new RDD by first applying a function to all elements of this
|      RDD, and then flattening the results.
|
|      >>> rdd = sc.parallelize([2, 3, 4])
|      >>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
|      [1, 1, 1, 2, 2, 3]
|      >>> sorted(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())
|      [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
|
|  flatMapValues(self, f)
|      Pass each value in the key-value pair RDD through a flatMap function
|      without changing the keys; this also retains the original RDD's
|      partitioning.
|
|  fold(self, zeroValue, op)
|      Aggregate the elements of each partition, and then the results for all
|      the partitions, using a given associative function and a neutral "zero
|      value."
|
|      The function C{op(t1, t2)} is allowed to modify C{t1} and return it
|      as its result value to avoid object allocation; however, it should not
|      modify C{t2}.
|
|      >>> from operator import add
|      >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
|      15
|
|  foreach(self, f)
|      Applies a function to all elements of this RDD.
|
|      >>> def f(x): print x
|      >>> sc.parallelize([1, 2, 3, 4, 5]).foreach(f)
|
|  getCheckpointFile(self)
|      Gets the name of the file to which this RDD was checkpointed
|
|  glom(self)
|      Return an RDD created by coalescing all elements within each partition
|      into a list.
|
|      >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
|      >>> sorted(rdd.glom().collect())
|      [[1, 2], [3, 4]]
|
|  groupBy(self, f, numPartitions=None)
|      Return an RDD of grouped items.
|
|      >>> rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
|      >>> result = rdd.groupBy(lambda x: x % 2).collect()
|      >>> sorted([(x, sorted(y)) for (x, y) in result])
|      [(0, [2, 8]), (1, [1, 1, 3, 5])]
|
|  groupByKey(self, numPartitions=None)
|      Group the values for each key in the RDD into a single sequence.
|      Hash-partitions the resulting RDD with into numPartitions partitions.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
|      >>> sorted(x.groupByKey().collect())
|      [('a', [1, 1]), ('b', [1])]
|
|  groupWith(self, other)
|      Alias for cogroup.
|
|  isCheckpointed(self)
|      Return whether this RDD has been checkpointed or not
|
|  join(self, other, numPartitions=None)
|      Return an RDD containing all pairs of elements with matching keys in
|      C{self} and C{other}.
|
|      Each pair of elements will be returned as a (k, (v1, v2)) tuple, where
|      (k, v1) is in C{self} and (k, v2) is in C{other}.
|
|      Performs a hash join across the cluster.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 4)])
|      >>> y = sc.parallelize([("a", 2), ("a", 3)])
|      >>> sorted(x.join(y).collect())
|      [('a', (1, 2)), ('a', (1, 3))]
|
|  keyBy(self, f)
|      Creates tuples of the elements in this RDD by applying C{f}.
|
|      >>> x = sc.parallelize(range(0,3)).keyBy(lambda x: x*x)
|      >>> y = sc.parallelize(zip(range(0,5), range(0,5)))
|      >>> sorted(x.cogroup(y).collect())
|      [(0, ([0], [0])), (1, ([1], [1])), (2, ([], [2])), (3, ([], [3])), (4, ([2], [4]))]
|
|  leftOuterJoin(self, other, numPartitions=None)
|      Perform a left outer join of C{self} and C{other}.
|
|      For each element (k, v) in C{self}, the resulting RDD will either
|      contain all pairs (k, (v, w)) for w in C{other}, or the pair
|      (k, (v, None)) if no elements in other have key k.
|
|      Hash-partitions the resulting RDD into the given number of partitions.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 4)])
|      >>> y = sc.parallelize([("a", 2)])
|      >>> sorted(x.leftOuterJoin(y).collect())
|      [('a', (1, 2)), ('b', (4, None))]
|
|  map(self, f, preservesPartitioning=False)
|      Return a new RDD containing the distinct elements in this RDD.
|
|  mapPartitions(self, f, preservesPartitioning=False)
|      Return a new RDD by applying a function to each partition of this RDD.
|
|      >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
|      >>> def f(iterator): yield sum(iterator)
|      >>> rdd.mapPartitions(f).collect()
|      [3, 7]
|
|  mapPartitionsWithSplit(self, f, preservesPartitioning=False)
|      Return a new RDD by applying a function to each partition of this RDD,
|      while tracking the index of the original partition.
|
|      >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
|      >>> def f(splitIndex, iterator): yield splitIndex
|      >>> rdd.mapPartitionsWithSplit(f).sum()
|      6
|
|  mapValues(self, f)
|      Pass each value in the key-value pair RDD through a map function
|      without changing the keys; this also retains the original RDD's
|      partitioning.
|
|  mean(self)
|      Compute the mean of this RDD's elements.
|
|      >>> sc.parallelize([1, 2, 3]).mean()
|      2.0
|
|  partitionBy(self, numPartitions, partitionFunc=<built-in function hash>)
|      Return a copy of the RDD partitioned using the specified partitioner.
|
|      >>> pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1]).map(lambda x: (x, x))
|      >>> sets = pairs.partitionBy(2).glom().collect()
|      >>> set(sets[0]).intersection(set(sets[1]))
|      set([])
|
|  persist(self, storageLevel)
|      Set this RDD's storage level to persist its values across operations after the first time
|      it is computed. This can only be used to assign a new storage level if the RDD does not
|      have a storage level set yet.
|
|  pipe(self, command, env={})
|      Return an RDD created by piping elements to a forked external process.
|
|      >>> sc.parallelize([1, 2, 3]).pipe('cat').collect()
|      ['1', '2', '3']
|
|  reduce(self, f)
|      Reduces the elements of this RDD using the specified commutative and
|      associative binary operator.
|
|      >>> from operator import add
|      >>> sc.parallelize([1, 2, 3, 4, 5]).reduce(add)
|      15
|      >>> sc.parallelize((2 for _ in range(10))).map(lambda x: 1).cache().reduce(add)
|      10
|
|  reduceByKey(self, func, numPartitions=None)
|      Merge the values for each key using an associative reduce function.
|
|      This will also perform the merging locally on each mapper before
|      sending results to a reducer, similarly to a "combiner" in MapReduce.
|
|      Output will be hash-partitioned with C{numPartitions} partitions, or
|      the default parallelism level if C{numPartitions} is not specified.
|
|      >>> from operator import add
|      >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
|      >>> sorted(rdd.reduceByKey(add).collect())
|      [('a', 2), ('b', 1)]
|
|  reduceByKeyLocally(self, func)
|      Merge the values for each key using an associative reduce function, but
|      return the results immediately to the master as a dictionary.
|
|      This will also perform the merging locally on each mapper before
|      sending results to a reducer, similarly to a "combiner" in MapReduce.
|
|      >>> from operator import add
|      >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
|      >>> sorted(rdd.reduceByKeyLocally(add).items())
|      [('a', 2), ('b', 1)]
|
|  rightOuterJoin(self, other, numPartitions=None)
|      Perform a right outer join of C{self} and C{other}.
|
|      For each element (k, w) in C{other}, the resulting RDD will either
|      contain all pairs (k, (v, w)) for v in this, or the pair (k, (None, w))
|      if no elements in C{self} have key k.
|
|      Hash-partitions the resulting RDD into the given number of partitions.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 4)])
|      >>> y = sc.parallelize([("a", 2)])
|      >>> sorted(y.rightOuterJoin(x).collect())
|      [('a', (2, 1)), ('b', (None, 4))]
|
|  sample(self, withReplacement, fraction, seed)
|      Return a sampled subset of this RDD (relies on numpy and falls back
|      on default random generator if numpy is unavailable).
|
|      >>> sc.parallelize(range(0, 100)).sample(False, 0.1, 2).collect() #doctest: +SKIP
|      [2, 3, 20, 21, 24, 41, 42, 66, 67, 89, 90, 98]
|
|  sampleStdev(self)
|      Compute the sample standard deviation of this RDD's elements (which corrects for bias in
|      estimating the standard deviation by dividing by N-1 instead of N).
|
|      >>> sc.parallelize([1, 2, 3]).sampleStdev()
|      1.0
|
|  sampleVariance(self)
|      Compute the sample variance of this RDD's elements (which corrects for bias in
|      estimating the variance by dividing by N-1 instead of N).
|
|      >>> sc.parallelize([1, 2, 3]).sampleVariance()
|      1.0
|
|  saveAsTextFile(self, path)
|      Save this RDD as a text file, using string representations of elements.
|
|      >>> tempFile = NamedTemporaryFile(delete=True)
|      >>> tempFile.close()
|      >>> sc.parallelize(range(10)).saveAsTextFile(tempFile.name)
|      >>> from fileinput import input
|      >>> from glob import glob
|      >>> ''.join(sorted(input(glob(tempFile.name + "/part-0000*"))))
|      '0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n'
|
|  stats(self)
|      Return a L{StatCounter} object that captures the mean, variance
|      and count of the RDD's elements in one operation.
|
|  stdev(self)
|      Compute the standard deviation of this RDD's elements.
|
|      >>> sc.parallelize([1, 2, 3]).stdev()
|      0.816...
|
|  subtract(self, other, numPartitions=None)
|      Return each value in C{self} that is not contained in C{other}.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
|      >>> y = sc.parallelize([("a", 3), ("c", None)])
|      >>> sorted(x.subtract(y).collect())
|      [('a', 1), ('b', 4), ('b', 5)]
|
|  subtractByKey(self, other, numPartitions=None)
|      Return each (key, value) pair in C{self} that has no pair with matching key
|      in C{other}.
|
|      >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
|      >>> y = sc.parallelize([("a", 3), ("c", None)])
|      >>> sorted(x.subtractByKey(y).collect())
|      [('b', 4), ('b', 5)]
|
|  sum(self)
|      Add up the elements in this RDD.
|
|      >>> sc.parallelize([1.0, 2.0, 3.0]).sum()
|      6.0
|
|  take(self, num)
|      Take the first num elements of the RDD.
|
|      This currently scans the partitions *one by one*, so it will be slow if
|      a lot of partitions are required. In that case, use L{collect} to get
|      the whole RDD instead.
|
|      >>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
|      [2, 3]
|      >>> sc.parallelize([2, 3, 4, 5, 6]).take(10)
|      [2, 3, 4, 5, 6]
|
|  takeSample(self, withReplacement, num, seed)
|      Return a fixed-size sampled subset of this RDD (currently requires numpy).
|
|      >>> sc.parallelize(range(0, 10)).takeSample(True, 10, 1) #doctest: +SKIP
|      [4, 2, 1, 8, 2, 7, 0, 4, 1, 4]
|
|  union(self, other)
|      Return the union of this RDD and another one.
|
|      >>> rdd = sc.parallelize([1, 1, 2, 3])
|      >>> rdd.union(rdd).collect()
|      [1, 1, 2, 3, 1, 1, 2, 3]
|
|  unpersist(self)
|      Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
|
|  variance(self)
|      Compute the variance of this RDD's elements.
|
|      >>> sc.parallelize([1, 2, 3]).variance()
|      0.666...
|
|  ----------------------------------------------------------------------
|  Data descriptors defined here:
|
|  __dict__
|      dictionary for instance variables (if defined)
|
|  __weakref__
|      list of weak references to the object (if defined)
|
|  context
|      The L{SparkContext} that this RDD was created on.


In [11]:
data.map(lambda x: str(x)).take(10)

Out[11]:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [12]:
data.reduce(lambda x, y: x + y)

Out[12]:
499500
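
The result is easy to verify without a cluster. The sketch below runs the same reduction locally with `functools.reduce` and compares it against the closed form for the sum 0 + 1 + ... + (n - 1), which is n(n - 1)/2. The names `local_data`, `total`, and `closed_form` are only for illustration.

```python
from functools import reduce

# Same elements as sc.parallelize(range(1000), 10), but held locally.
local_data = range(1000)

# Same binary operator that was passed to data.reduce() above.
total = reduce(lambda x, y: x + y, local_data)

# Closed form for 0 + 1 + ... + (n - 1).
n = 1000
closed_form = n * (n - 1) // 2

print(total == closed_form)
```

Because addition is commutative and associative, the order in which Spark combines per-partition results does not affect the total.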


In [13]:
wiki = sc.textFile("/wikitext")

In [14]:
berkeley_pages = wiki.filter(lambda x: "Berkeley" in x)

In [15]:
%time berkeley_pages.count()

CPU times: user 156 ms, sys: 12 ms, total: 168 ms
Wall time: 23.2 s


Out[15]:
80535


# Spark's Web Interfaces

In [16]:
HTML(('<a href="http://%s:8080" target="_BLANK">Master Web UI</a>' % MASTER_HOSTNAME))

Out[16]:
In []:
HTML(('<a href="http://%s:5080/ganglia" target="_BLANK">Ganglia UI</a>' % MASTER_HOSTNAME))


# Caching

In [17]:
wiki.cache()

Out[17]:
<pyspark.rdd.RDD at 0x3860f90>

In [18]:
%time berkeley_pages.count()

CPU times: user 120 ms, sys: 56 ms, total: 176 ms
Wall time: 21 s


Out[18]:
80535

In [19]:
%time berkeley_pages.count()

CPU times: user 100 ms, sys: 48 ms, total: 148 ms
Wall time: 9.26 s


Out[19]:
80535
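
The timings above fit the fact that `cache()` is lazy: it only marks the RDD for caching, so the first action afterwards (In [18]) still reads the input while filling the cache, and only later actions (In [19]) benefit. The following toy sketch illustrates that behavior in plain Python. `LazyCachedDataset` is a hypothetical stand-in, not PySpark code, and it ignores partitioning and memory limits.

```python
class LazyCachedDataset(object):
    """Toy stand-in for an RDD; illustrates lazy caching only."""

    def __init__(self, load):
        self._load = load              # function that produces the records
        self._cache_requested = False
        self._cached = None
        self.loads = 0                 # how many times the source was read

    def cache(self):
        # Only records the request; nothing is computed here.
        self._cache_requested = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached        # served from the cache
        self.loads += 1
        records = list(self._load())   # the "expensive" read
        if self._cache_requested:
            self._cached = records     # materialized by the first action
        return records

    def count(self, predicate):
        return sum(1 for r in self._records() if predicate(r))


pages = LazyCachedDataset(
    lambda: ["Berkeley, California", "Cape Colony", "UC Berkeley"]
)
pages.cache()
after_cache_call = pages.loads                   # cache() alone reads nothing
first = pages.count(lambda x: "Berkeley" in x)   # reads the source, fills the cache
second = pages.count(lambda x: "Berkeley" in x)  # reuses the cached records
print(after_cache_call, first, second, pages.loads)
```

Even with the data cached, the second count is not instantaneous in the notebook, since the filter still has to scan every cached record.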


# Extracting Page Titles

In [20]:
berkeley_pages.first()

Out[20]:
u'History_of_Cape_Colony_from_1806_to_1870\t{{CapeColony}}\\n{{POV|date=December 2007}}\\nThe \'\'\'history of Cape Colony from 1806 to 1870\'\'\' spans the period of the history of [[Cape Colony]] during the [[Cape Frontier Wars]], also called the Kaffir Wars, which lasted from 1811 to 1858. The wars were fought between the [[Europe]]an [[colonist]]s and the native [[Xhosa]] who rebelled against continuing European rule. The Cape Colony was the first European colony in [[South Africa]], which was initially controlled by the [[Netherlands|Dutch]] but subsequently invaded and taken over by the [[United Kingdom|British]]. After war broke out again, a British force was sent once more to the [[Cape of Good Hope|Cape]]. After a battle in January 1806 on the shores of [[Table Bay]], the Dutch garrison of [[Cape Castle]] surrendered to the British under [[David Baird|Sir David Baird]], and in 1814, the colony was ceded outright by the Netherlands to the British crown. At that time, the colony extended to the [[mountain]]s in front of the vast [[Highveld|central plateau]], then called "Bushmansland", and had an area of about 194,000 [[square kilometre]]s and a population of some 60,000, of whom 27,000 were [[Whites|white]], 17,000 free [[Khoikhoi]] (Hottentots), and the rest [[slavery|slave]]s. These slaves were mostly imported black people and [[Cape Malays|Malays]].\\n\\n==First and second frontier wars==\\n{{Main|Xhosa Wars}}\\nThe first of several wars with the [[Xhosa]] had already been fought by the time that the Cape Colony had been ceded to the [[United Kingdom]]. The Xhosa that crossed the colonial frontier had been expelled from the district between the [[Sundays River]] and [[Fish River, Eastern Cape|Great Fish River]] known as the [[Zuurveld]], which became a neutral ground of sorts. For some time before 1811, the Xhosa had taken possession of the neutral ground and attacked the colonists. 
In order to expel them from the Zuurveld, [[John Graham (Albany)|Colonel John Graham]] took the area with a mixed-race army in December 1811, and finally the Xhosa were driven beyond the Fish River. On the site of Colonel Graham\u2019s headquarters arose a town bearing his name: Graham\'s Town, subsequently becoming [[Grahamstown]].\\n\\nA difficulty between the Cape Colony government and the Xhosa arose in 1817, the immediate cause of which was an attempt by the colonial authorities to enforce the restitution of some stolen cattle. On [[22 April]] [[1817]], led by a prophet-chief named [[Makana]], they attacked Graham\u2019s Town, then held by a handful of white troops. Help arrived in time and the enemy were beaten back. It was then agreed that the land between the Fish and the [[Keiskamma River|Keiskamma]] rivers should be neutral territory.\\n\\n==1820 settlers==\\nThe war of 1817\u201319 led to the first wave of [[immigration]] of English settlers of any considerable scale, an event with far-reaching consequences. The then governor, [[Lord Charles Henry Somerset|Lord Charles Somerset]], whose treaty arrangements with the Xhosa chiefs had proved untenable, desired to erect a barrier against the Xhosa by having white colonists settle in the border region. In 1820, upon the advice of Lord Somerset, parliament voted to spend [[British pound|\xa3]]50,000 to promote migration to the Cape, prompting 4,000 British people to emigrate. These immigrants, who are now known as the [[1820 Settlers]], formed the [[Albany settlement]], later [[Port Elizabeth]], and made [[Grahamstown]] their headquarters. Intended primarily as a measure to secure the safety of the frontier, and regarded by the [[Government of the United Kingdom|British government]] chiefly as a way of finding employment for a few thousand of the unemployed in Britain.  Yet, the emigration scheme accomplished something with more far reaching implications than its authors had intended. 
The new settlers, drawn from every part of the [[United Kingdom]] and from almost every grade of society, retained strong loyalty to Britain.  In the course of time, they formed a counterpoint to the Dutch colonists. \\n\\nThe arrival of these immigrants also introduced the [[English language]] to the Cape.  English language ordinances were issued for the first time in 1825, and in 1827, its use was extended to the conduct of judicial proceedings. Dutch was not, however, ousted, and the colonists became largely bilingual.\\n\\n==Dutch hostility to British rule==\\nAlthough the colony was prosperous, many Dutch farmers were as dissatisfied with British rule as they had been with that of the Dutch East India Company, though their grievances were not the same. In 1792, [[Moravian Church|Moravian]] [[Mission (Christian)|missions]] had been established for the benefit of the Khoikhoi, and in 1799, the [[London Missionary Society]] began to try to [[convert]] both the Khoikhoi and the Xhosa. The championship of Khoikhoi grievances by the missionaries caused much dissatisfaction among the majority of the colonists, whose conservative views temporarily prevailed, for in 1812, an ordinance was issued which gave magistrates the power to bind Khoikhoi children as apprentices under conditions little different from those of [[slavery]].  In the meantime, the movement for the [[abolition of slavery]] was gaining strength in England, and the missionaries appealed at length, from the colonists to Britain. \\n\\nAn incident, which occurred from 1815 to 1816, did much to make the Dutch frontiersmen permanently hostile to the British.  A farmer named Bezuidenhout refused to obey a summons issued to him after a complaint from Khoikhoi was registered.  He fired on the party sent to arrest him, and was killed by the return fire.  
This caused a miniature rebellion, and in its suppression five ringleaders were publicly hanged by the British at [[Slagter\'s Nek]] where they had originally sworn to expel "the English tyrants." The resentment caused by the hanging of these men was deepened by the circumstances of the execution, for the scaffold on which the rebels simultaneously were hanged broke from their united weight and the men were hanged one by one afterwards. The deeply religious Dutch frontiersmen believed the collapsing scaffold to be an [[act of God]]. An ordinance passed in 1827 abolished the old Dutch "\'\'[[landdrost]]\'\'" and "\'\'heemraden\'\'" courts, instead substituting [[resident magistrate]]s.  The ordinance further stipulated that all legal proceedings be henceforth conducted in English. \\n\\nAs a result of the championing of the missionaries, a subsequent ordinance in 1828 granted equal rights with white people to the [[Khoikhoi]] and other free coloured people.  Another ordinance in 1830 imposed heavy penalties for harsh treatment of slaves, and finally the [[abolitionism|emancipation]] of slaves was proclaimed in 1834.  Each of these ordinances drew further ire from the Dutch farmers towards the government.  Moreover, the inadequate compensation awarded to slave-owners, and the suspicions engendered by the method of payment, caused much resentment, and in 1835 the trend where farmers trekked into unknown country in order to escape from a disliked government recommenced. Emigration beyond the colonial border had in fact been continuous for 150 years, but it now took on larger proportions.\\n\\n==Third cape frontier war==\\n[[Image:Eastern Frontier, Cape of Good Hope, ca 1835.png|thumb|right|180px|The Eastern Frontier, ca 1835]]\\nOn the eastern border, further trouble arose between the government and the [[Xhosa]], towards whom the policy of the Cape government was marked by much vacillation. 
On [[11 December]] [[1834]], a government commando party killed a chief of high rank, incensing the Xhosa: an army of 10,000 men, led by [[Macomo]], a brother of the chief who had been killed, swept across the frontier, pillaged and burned the homesteads and killed all who resisted. Among the worst sufferers was a colony of freed Khoikhoi who, in 1829, had been settled in the [[Kat River]] valley by the British authorities. There were few available soldiers in the colony, but the governor, [[Benjamin d\'Urban|Sir Benjamin d\'Urban]] acted quickly and all available forces were mustered under [[Harry Smith (army)|Colonel Sir Harry Smith]], who reached Graham\u2019s Town on [[6 January]] [[1835]], six days after news of the uprising had reached Cape Town.  The British fought the Xhosa for nine months until hostilities were ended on [[17 September]] 1836 with the signing of a new peace treaty, by which all the country as far as the [[River Kei]] was acknowledged to be British, and its inhabitants declared British subjects. A site for the seat of government was selected and named [[King William\u2019s Town]].\\n\\n==Great Trek==\\n[[Image:MapoftherouteoftheGreakTrek.jpg|right|180px|thumb|Map of the route of the [[Great Trek]]]]\\nThe British government did not approve of the actions of Sir Benjamin d\'Urban, and the British Secretary for the Colonies, [[Lord Glenelg]], declared in a letter to the [[British monarchy|King]] that "the great evil of the Cape Colony consists in its magnitude" and demanded that the boundary be moved back to the [[Fish River, Eastern Cape|Fish River]]. He also eventually had d\'Urban dismissed from office in 1837.  
"The [[Kaffir (Historical usage in southern Africa)|Kaffirs]]," in Lord Glenelg\'s dispatch of [[26 December]], "had an ample justification for war; they had to resent, and endeavoured justly, though impotently, to avenge a series of encroachments.\u201d  This attitude towards the Xhosa was one of the many reasons given by the [[Trek Boer]]s for leaving the Cape Colony. The [[Great Trek]], as it is called, lasted from 1836 to 1840.  The trekkers (Boers), numbering around 7,000, founded communities with a [[republic]]an form of government beyond the [[Orange River|Orange]] and [[Vaal River|Vaal]] rivers, and in [[KwaZulu-Natal Province|Natal]], where they had been preceded, however, by British emigrants. From this time on, Cape Colony ceased to be the only European community in South Africa, though it was the most predominant for many years.\\n\\nConsiderable trouble was caused by the emigrant Boers on either side of the Orange River, where the Boers, the [[Basuto]]s, other native tribes, Bushmen, and [[Griqua]]s fought for superiority, while the Cape government endeavoured to protect the rights of the natives.  On the advice of the [[missionary|missionaries]], who exercised great influence on all non-Dutch people, a number of the native states were recognised and subsidised by the Cape government with the objective of creating peace on the northern frontier. The first "Treaty States" to be recognised was [[Griqualand West]] of the Griqua people.  Subsequent states were recognised between 1843 and 1844.  While the northern frontier became more secure, the state of the eastern frontier was deplorable, with the government either unable or unwilling to protect farmers from the Xhosa.\\n\\nElsewhere, however, the colony was making progress. The change from slave to free labour proved to be advantageous to the farmers in the western provinces.  
An efficient [[education]] system, owing its inception to [[John Herschel|Sir John Herschel]], an [[astronomer]] who lived in Cape Colony from 1834 to 1838, was adopted. Road Boards were established and proved to be very effective in constructing new roads.  A new stable industry, [[sheep]]raising, was added to the original set of [[wheat]]growing, [[cattle|cattle rearing]], and [[wine|wine making]]. By 1846, [[wool]] became the country\'s most valuable export. A [[legislative council]] was established in 1835, giving the colonists a share in the government.\\n\\n==War of the Axe==\\nAnother war with the Xhosa, known as the [[War of the Axe]], broke out in 1846, when a Khoikhoi escort who had been [[manacle]]d to a Xhosa [[thief]] was [[murder]]ed while transporting the man to Graham\u2019s Town to be tried for stealing an axe. A party of Xhosa attacked and killed the escort. The surrender of the murderer was refused, and war was declared in March 1846. The [[Ngqika]]s were the chief tribe engaged in the war, assisted by the Tambukies. The Xhosa were defeated on 7 June 1846 by [[Henry Somerset (British Army officer)|General Somerset]] on the [[Gwangu]], a few miles from [[Fort Peddie]].  However, the war continued until [[Sandili]], the chief of the Ngqika, surrendered.  Other chiefs gradually followed this action, and by the beginning of 1848, the Xhosa had been completely subdued after twenty-one months of fighting.\\n\\n==Extension of British sovereignty==\\n[[Image:Sir Harry Smith.gif|thumb|right|170px|Sir Harry Smith]]\\nIn December 1847, or what was to be the last month of the War of the Axe, [[Harry Smith (army)|Sir Harry Smith]] reached Cape Town by boat to become the new governor of the colony.  He reversed Glenelg\'s policy soon after arrival.  
A proclamation he issued on [[17 December]], 1847, extended the borders of the colony northwards to the Orange river and eastward to the [[Keiskamma river]], and at a meeting of the Xhosa chiefs on [[23 December]], 1847, Sir Harry announced the annexation of the land between the Keiskamma and the [[Kei River]]s to the British crown, thus re-absorbing the territory abandoned by Lord Glenelg. The land was not, however, incorporated into the Cape Colony, but instead made a crown dependency under the name of [[British Kaffraria]]. For a time, the Xhosa accepted the new government in British Kaifaria since they were mainly left alone as the governor had other serious matters to contend with, including the assertion of British authority over the Boers beyond the Orange river, and the establishment of amicable relations with the [[Transvaal|Transvaal Boers]].\\n\\n==Convict agitation and granting of a constitution==\\nA crisis arose in the colony over a proposal to make the Cape Colony a [[convict|convict station]].  A circular written in 1848 by the third [[Earl Grey]], then colonial secretary, was sent to the governor of the Cape, as well as other colonial governors, asking them to ascertain the feelings of the colonists regarding the reception of a certain class of convicts.  The Earl intended to send [[Ireland|Irish]] peasants who had been driven to crime by the [[Irish Potato Famine|famine]] of 1845 to South Africa.  Due to a misunderstanding, a boat named the \'\'Neptune\'\' was sent to the Cape Colony before the colonists\' opinion had been received.  The boat had 289 convicts on board, among whom was the famous Irish rebel [[John Mitchel]], and his colleagues. When the news that this vessel was on her way reached the Cape, people became violently excited and established an anti-convict association whose members bound themselves to cease from all interaction of any kind with persons in any way associated "with the landing, supplying or employing convicts".  
Sir Harry Smith, confronted with violent public agitation, agreed not to allow the convicts to land when the \'\'Neptune\'\' arrived in Simon\'s Bay on [[19 September]] [[1849]], but to keep them on board the ship until he received orders to send them elsewhere.  When the home government became aware of the state of affairs, orders were sent directing the \'\'Neptune\'\' to proceed to [[Tasmania]], and it did so after staying in Simon\u2019s Bay for five months. The agitation did not fade away without further achievements, as it led to another movement that intended to obtain a free, representative government for the colony.  The British government granted this concession, which had been previously promised by Lord Grey, and a constitution was established in 1854 of almost unprecedented liberality.\\n\\n==Eighth frontier war of 1850-1853==\\nThe anti-convict move had scarcely ended when the colony was once again involved in a war. The Xhosa bitterly resented their loss of independence, and had secretly prepared to renew their struggle ever since the last war. Sir Harry Smith, informed of the increasingly threatening attitude of the natives, went to the border region and summoned Sandili and the other chiefs for a meeting. Sandili refused obedience, after which the governor declared him deposed from his chieftanship at an assembly of other chiefs in October 1850, and appointed an English magistrate named Mr Brownlee to be temporary chief of the Ngqika tribe. It seems that the governor believed that he would be able to prevent a war and that Sandili could be arrested without armed resistance. [[George Mackinnon|Colonel George Mackinnon]], who had been sent out with a small army with the goal of securing the chief, was attacked on [[24 December]], 1850, in a narrow gorge by a large number of Xhosa, and compelled to retreat after some loss of men. This small battle prompted a general rising among the whole Ngqika tribe. 
Settlers in military villages that had been established along the border, were caught in a surprise attack after they had gathered to celebrate [[Christmas|Christmas day]].  Many of them were killed, and their houses set on fire. \\n\\nOther setbacks followed in quick succession. The greater part of the Xhosa police deserted, many of them leaving with their arms. Emboldened by their initial success, the Xhosa surrounded and attacked [[Fort Cox]] with immense force, where the governor was stationed with a small number of soldiers. More than one unsuccessful attempt was made to kill Sir Harry, and he needed to find a way to escape. At the head of 150 mounted riflemen, accompanied by Colonel Mackinnon, he galloped out of the fort, and rode to King William\u2019s Town through heavy enemy fire \u2014 a distance of 12 miles (19 km). \\n\\nMeanwhile, a new enemy appeared. Some 900 of the Kat river Khoikhoi, who had in former wars been firm allies of the British, joined their former enemies: the Xhosa. They were not without justification. They complained that while serving as soldiers in former wars \u2014 the Cape Mounted Rifles consisted largely of Khoikhois \u2014 they had not received the same treatment as others serving in defence of the colony, that they got no compensation for the losses they had sustained, and that they were in various ways made to feel they were a wronged and injured race. A secret alliance was formed with the Xhosa to take up arms in order to remove the Europeans and establish a Khoikhoi republic. Within a fortnight of the attack on Colonel Mackinnon, the Kat river Khoikhoi were also in arms. Their revolt was followed by that of the Khoikhoi at other missionary stations, and some of the Khoikhoi of the Cape Mounted Rifles followed their example, including some of the very men who had escorted the governor from Fort Cox. 
But many of the Khoikhoi remained loyal, and the [[Fingo]] likewise sided with the British.\\n\\nAfter the confusion caused by the surprise attack had subsided, Sir Harry Smith and his force turned the tide of war against the Xhosa. The [[Amatola Mountains]] were stormed, and [[Sarhili]], the highest ranking chief, who had been secretly assisting the Ngqika all along, was severely punished. In April 1852, Sir Harry Smith was recalled by Earl Grey, who accused him \u2014 unjustly, in the opinion of the Duke of Wellington \u2014 of a want of energy and judgement in conducting the war; he was succeeded by Lieutenant-General Cathcart. Sarhili was again attacked and reduced to submission. The Amatolas were finally cleared of Xhosa, and small forts were erected to prevent their reoccupation. \\n\\nThe British commanders were hampered throughout by their insufficient equipment, and it was not until March 1853 that the largest of the Frontier wars was brought to an end after the loss several hundred British soldiers. Shortly afterwards, British Kaffraria was made a [[crown colony]]. The Khoikhoi settlement at Kat River remained, but the Khoikhoi power within the colony was crushed.\\n\\n==Xhosa cattle-killing movement and famine==<!-- This section is linked from [[Cattle-killing movement]] -->\\nThe Xhosa tribes gave the colony few problems after the war. This was due, in large measure, to an extraordinary [[delusion]] which arose among the Xhosa in 1856, and led in 1857 to the death of some 50,000 people. This incident is one of the most remarkable instances of misplaced faith recorded in history. The Xhosa had not accepted their defeat in 1853 as decisive and were preparing to renew their struggle with the Europeans. \\n\\nIn 1854, a disease spread through the cattle of the Xhosa. It was believed to have spread from cattle owned by the Settlers. Widespread cattle deaths resulted, and the Xhosa believed that the deaths were caused by ubuthi, or witchcraft. 
In April, 1856 two girls one being nongqawuse went to scare birds out of the fields. When she returned, she told her uncle Mhlakaza that she had met three spirits at the bushes, and that they had told her that all cattle should be slaughtered, and their crops destroyed. On the day following the destruction, the dead Xhosa would return and help expel the whites. The ancestors would bring cattle with them to replace those that had been killed. Mhlakaza believed the prophecy, and repeated it to the chief [[Sarhili]].\\n\\nSarhili ordered the commands of the spirits to be obeyed. At first, the Xhosa were ordered to destroy their fat cattle. Nongqawuse, standing in the river where the spirits had first appeared, heard unearthly noises, interpreted by her uncle as orders to kill more and more cattle. At length, the spirits commanded that not an animal of all their herds was to remain alive, and every grain of corn was to be destroyed. If that were done, on a given date, myriads of cattle more beautiful than those destroyed would issue from the earth, while great fields of corn, ripe and ready for harvest, would instantly appear. The dead would rise, trouble and sickness vanish, and youth and beauty come to all alike. Unbelievers and the hated white man would on that day perish. \\n\\nThe people heard and obeyed. Sarhili is believed by many people to have been the instigator of the prophecies. Certainly some of the principal chiefs believed that they were acting simply in preparation for a last struggle with the Europeans, their plan being to throw the whole Xhosa nation fully armed and famished upon the colony. Belief in the prophecy was bolstered by the death of Lieutenant-General [[George Cathcart|Cathcart]] in the [[Crimean War]] in 1854. 
His death was interpreted as being due to intervention by the ancestors.\\n\\nThere were those who neither believed the predictions nor looked for success in war, but destroyed their last particle of food in unquestioning obedience to their chief\u2019s command. Either in faith that reached the [[sublime (philosophy)|sublime]], or in obedience equally great, vast numbers of the people acted. Great [[kraal]]s were also prepared for the promised cattle, and huge skin sacks to hold the milk that was soon to be more plentiful than water. At length the day dawned which, according to the prophecies, was to usher in the terrestrial paradise. The sun rose and sank, but the expected miracle did not come to pass. The chiefs who had planned to hurl the famished warriors upon the colony had committed an incredible blunder in neglecting to call the nation together under pretext of witnessing the resurrection. They realised their error too late, and attempted to fix the situation by changing the resurrection to another day, but blank despair had taken the place of hope and faith, and it was only as starving supplicants that the Xhosa sought the British. \\n\\nSir George Grey, governor of the Cape at the time ordered the European settlers not to help the Xhosa unless they entered labour contracts with the settlers who owned land in the area. In their extreme famine, many of the Xhosa turned to [[cannibalism]], and one instance of parents eating their own child is authenticated. Among the survivors was the girl Nongqawuse; however, her uncle perished. A vivid narrative of the whole incident is found in G. M. Theal\u2019s \'\'History and Geography of South Africa\'\' (3rd edition, London, 1878). 
The depopulated country was afterwards peopled by European settlers, among whom were members of the German legion which had served with the British army in the [[Crimea]], and some, 2000 industrious North German emigrants, who proved a valuable acquisition to the colony.\\n\\nHistorians now view this movement as a [[millennialist]] response both directly to a lung disease spreading among Xhosa cattle at the time, and less directly to the stress to Xhosa society caused by the continuing loss of their territory and autonomy. At least one historian has also suggested that it can be seen as a rebellion against the upper classes of Xhosa society, which used cattle as a means of consolidating wealth and political power, and which had lost respect as they failed to hold back white expansion.\\n\\n==Sir George Grey\u2019s governorship==\\n<!-- Deleted image removed: [[Image:SirGeorgeGrey.jpg|thumb|right|180px|Sir George Grey]] -->\\n[[George Edward Grey|Sir George Grey]] became governor of the Cape Colony in 1854, and the development of the colony owes much to his administration. In his opinion, policy imposed upon the colony by the home government\'s policy of not governing beyond the Orange River was mistaken, and in 1858 he proposed a scheme for a [[confederation]] that would include all of South Africa, however it was rejected by Britain. Sir George kept open a British road through [[Bechuanaland]] to the far interior, gaining the support of the missionaries Moffat and [[David Livingstone]]. Sir George also attempted for the first time, missionary effort apart, to educate the Xhosa and to firmly establish British authority among them, which the self-destruction of the Xhosa rendered easy. Beyond the Kei River, the natives were left to their own devices.\\n\\nSir George Grey left the Cape in 1861. 
During his governorship the resources of the colony had increased with the opening of the [[copper]] mines in [[Little Namaqualand]], the [[mohair]] wool industry had been established and Natal made a separate colony. The opening, in November 1863, of the [[railway]] from [[Cape Town]] to [[Wellington, South Africa|Wellington]], and the construction in 1860 of the great breakwater in [[Table Bay]], long needed on that perilous coast, marked the beginning in the colony of [[public works]] on a large scale. They were the more-or-less direct result of the granting to the colony of a large share in its own government. \\n\\nThe province of British Kaffraria was incorporated into the colony in 1865, under the title of the Electoral Divisions of King William\u2019s Town and [[East London, South Africa|East London]]. The transfer was marked by the removal of the prohibition of the sale of [[alcohol]]ic beverages to the natives, and the free trade in intoxicants which followed had most deplorable results among the Xhosa tribes. A severe drought, affecting almost the entire colony for several years, caused great economic depression, and many farmers suffered severely. It was at this period in 1869 that [[ostrich]]-farming was successfully established as a separate [[industry]].\\n\\nWhether by or against the wish of the home government, the limits of British authority continued to extend. The [[Basotho]], who dwelt in the upper valleys of the Orange River, had subsisted under a semi-protectorate of the British government from 1843 to 1854; but having been left to their own resources on the abandonment of the Orange sovereignty, they fell into a long exhaustive warfare with the Boers of the [[Orange Free State]]. On the urgent petition of their chief [[Moshesh]], they were proclaimed British subjects in 1868, and their territory became part of the Cape Colony in 1871 (see [[Basutoland]]). 
In the same year, the southeastern part of [[Bechuanaland]] was annexed to Britain under the title of Griqualand West. This annexation was a consequence of the discovery there of rich [[diamond]] mines, an event which was destined to have far-reaching results.\\n\\n==References==\\n* \'\'The Migant Farmer in the History of the Cape Colony\'\'.P.J. Van Der Merwe, Roger B. Beck. [[Ohio University Press]]. [[1 January]] [[1995]]. 333 pages. ISBN 0-8214-1090-3.\\n* \'\'History of the Boers in South Africa; Or, the Wanderings and Wars of the Emigrant Farmers from Their Leaving the Cape Colony to the Acknowledgment of Their Independence by Great Britain\'\'. George McCall Theal. Greenwood Press. [[28 February]] [[1970]]. 392 pages. ISBN 0-8371-1661-9.\\n* \'\'Status and Respectability in the Cape Colony, 1750\u20131870 : A Tragedy of Manners\'\'. Robert Ross, David Anderson. [[Cambridge University Press]]. [[1 July]] [[1999]]. 220 pages. ISBN 0-521-62122-4.\\n* \'\'The War of the Axe, 1847: Correspondence between the governor of the Cape Colony, Sir Henry Pottinger, and the commander of the British forces at the Cape, Sir George Berkeley, and others\'\'. Basil Alexander Le Cordeur. Brenthurst Press. 1981. 287 pages. ISBN 0-909079-14-5.\\n* \'\'Blood Ground: Colonialism, Missions, and the Contest for Christianity in the Cape Colony and Britain, 1799\u20131853\'\'. Elizabeth Elbourne. McGill-Queen\'s University Press. December 2002. 560 pages. ISBN 0-7735-2229-8.\\n* \'\'Recession and its aftermath: The Cape Colony in the eighteen eighties\'\'. Alan Mabin. University of the Witwatersrand, African Studies Institute. 1983. 27 pages. 
ASIN B0007B2MXA.\\n\\n==External links==\\n* [http://www.encyclopedia.com/html/section/CapeProv_History.asp Cape Colony History on Encyclopedia.com]\\n* [http://www.britannica.com/eb/article?tocId=44059 Encyclop\xe6dia Britannica Cape Colony]\\n\\n[[Category:History of colonialism]]\\n[[Category:History of South Africa]]\\n[[Category:19th century in Africa]] \\n\\n[[af:Geskiedenis van die Kaapkolonie vanaf 1806 tot 1870]]\\n[[de:Geschichte der Kapkolonie (1806\u20131870)]]'

In [21]:
for title in berkeley_pages.map(lambda x: x.split("\t")[0]).take(20):
    print title

History_of_Cape_Colony_from_1806_to_1870
Cattle-killing_movement
Outlanders_(anime)
Outlanders_(anime
Outlanders_(manga)
Komodo_dragon
Varanus_komodoensis
Kimodo_Dragon
Comodo_dragon
Komodo_monitor
Komodo_Monitor
Komodo_Island_Monitor
Komodo_dragons
Kimodo_dragon
Komodos
Komodo_Dragon
Buaja_durat
Biawak_raksasa
Kodomo_dragon
Komodo_dragon_fact_sheet



# Which article has the most incoming links?

In [22]:
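import re
def extract_links(raw_text):
    # Each record is "title<TAB>wikitext"; scan the wikitext for [[...]] links.
    for link in re.findall(r"\[\[([^\]]*)\]\]", raw_text.split("\t")[1]):
        splits = link.split("|")
        if not splits[0].startswith("Image:"):
            yield splits[-1]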
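To see what the link extraction does on a single record, here is a minimal sketch in plain Python (no Spark needed). The sample record and the name `extract_links_local` are made up for illustration, and the pattern assumes links are written in the usual double-bracket wikilink form:

```python
import re

# Hypothetical record in the "title<TAB>wikitext" layout used by this notebook.
sample_record = ("Komodo_dragon\tThe [[Komodo dragon]] lives on "
                 "[[Komodo (island)|Komodo]]. [[Image:Komodo.jpg|thumb|A dragon]]")

def extract_links_local(raw_text):
    """Yield the last '|'-separated piece of each [[...]] link, skipping images."""
    for link in re.findall(r"\[\[([^\]]*)\]\]", raw_text.split("\t")[1]):
        splits = link.split("|")
        if not splits[0].startswith("Image:"):
            yield splits[-1]

links = list(extract_links_local(sample_record))
print(links)
```

Note that for a piped link such as `[[Komodo (island)|Komodo]]`, the last piece is the display label rather than the link target, so taking `splits[0]` instead may be preferable if targets are what you want to count.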

In [23]:
linked_pages = wiki.flatMap(extract_links)

In [24]:
wiki_table(linked_pages.take(20))

In []:
linked_pages.count()

In [25]:
number_of_inlinks = linked_pages.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).cache()

In [26]:
%time inlink_frequencies = number_of_inlinks.map(lambda x: x[1]).countByValue()

CPU times: user 196 ms, sys: 52 ms, total: 248 ms
Wall time: 48.1 s
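The two steps above (count the links per target, then count how many targets share each link count) can be mimicked locally with `collections.Counter`. This is only a sketch on a tiny made-up list, not how Spark executes the job:

```python
from collections import Counter

# Hypothetical link targets, standing in for linked_pages.
links = ["A", "B", "A", "C", "A", "B", "D"]

# Local analogue of map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y):
# the number of inlinks per target.
inlinks = Counter(links)

# Local analogue of map(lambda x: x[1]).countByValue():
# how many targets have each inlink count.
frequencies = Counter(inlinks.values())
print(dict(frequencies))
```

The result is a small dictionary keyed by inlink count, which is why it is safe to bring back to the driver even when the number of linked pages is large.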


In [27]:
import pylab


#### Why are there so many pages that are linked only once?

In [29]:
linked_only_once = number_of_inlinks.filter(lambda (title, count): count == 1).map(lambda x: x[0])

In [30]:
wiki_table(linked_only_once.take(20))


Many of these pages don't even exist! Wikilinks may point to pages that were never created, so let's filter those out and see whether the histogram changes:

In [31]:
page_titles = wiki.map(lambda x: x.split("\t")[0]).cache()

In [32]:
number_of_inlinks_for_existing_pages = number_of_inlinks. \
    leftOuterJoin(page_titles.map(lambda x: (x, True))). \
    filter(lambda (title, (count, exists)): exists is not None). \
    map(lambda (title, (count, exists)): (title, count))
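The join-then-filter pattern above can be sketched locally on plain Python data. The titles and counts here are made up; the point is only that a left outer join keeps every key from the left side and pairs it with `None` when the right side has no match:

```python
# Hypothetical inlink counts and existing page titles.
inlink_counts = {"Komodo_dragon": 12, "Kimodo_Dragon": 1, "No_such_page": 1}
existing_titles = {"Komodo_dragon", "Kimodo_Dragon"}

# Left outer join: every left key survives; the right value is None when absent.
joined = [(title, (count, True if title in existing_titles else None))
          for title, count in sorted(inlink_counts.items())]

# Keep only rows whose right side was found, then drop the marker.
existing_counts = [(title, count)
                   for title, (count, exists) in joined
                   if exists is not None]
print(existing_counts)
```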

In [33]:
%time inlink_frequencies_for_existing_pages = number_of_inlinks_for_existing_pages.map(lambda x: x[1]).countByValue()

CPU times: user 212 ms, sys: 36 ms, total: 248 ms
Wall time: 28.4 s


In [34]:
pylab.hist(inlink_frequencies_for_existing_pages.keys(), bins=range(50), weights=inlink_frequencies_for_existing_pages.values());

In [35]:
num_pages = page_titles.count()
print "Existing pages: %i\nMissing pages: %i" % (num_pages, num_linked_pages - num_pages)

Existing pages: 6222620
Missing pages: 10166483


In [36]:
outlink_counts = wiki.map(lambda x: (x.split("\t")[0], len(list(extract_links(x)))))

In [37]:
%time outlink_count_frequencies = outlink_counts.map(lambda x: x[1]).countByValue()

CPU times: user 196 ms, sys: 40 ms, total: 236 ms
Wall time: 25.6 s


In [38]:
pylab.hist(outlink_count_frequencies.keys(), bins=range(0, 400, 4), weights=outlink_count_frequencies.values());
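The histogram calls pass the distinct counts as the data and their frequencies as `weights`, so each distinct count adds its frequency to the bin it falls in, and counts outside the bin range are dropped. Here is a rough pure-Python sketch of that binning, assuming NumPy's convention that every bin is half-open except the last, which includes its right edge. The function name and sample data are hypothetical:

```python
def weighted_histogram(frequencies, edges):
    """Sum the weights of the values falling into each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    last = len(edges) - 2
    for value, weight in frequencies.items():
        for i in range(len(edges) - 1):
            in_bin = edges[i] <= value < edges[i + 1]
            # The final bin also includes its right edge.
            on_last_edge = (i == last and value == edges[-1])
            if in_bin or on_last_edge:
                counts[i] += weight
                break
    return counts

# Hypothetical frequency table: {outlink count: number of pages with that count}.
frequencies = {0: 5, 3: 2, 4: 7, 8: 1, 500: 9}
edges = list(range(0, 12, 4))  # [0, 4, 8]
histogram = weighted_histogram(frequencies, edges)
print(histogram)
```

The entry for 500 falls outside the bin edges and is ignored, which is the same reason the plots above only show pages up to the upper limit of `bins`.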


prolific_linkers = outlink_counts.filter(lambda (title, count): count > 400).collectAsMap()

len(prolific_linkers)

115260