In this notebook we will introduce two different ways of getting data into the basic Spark data structure, the Resilient Distributed Dataset or RDD. An RDD is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
The reference book for these and other Spark related topics is Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.
The KDD Cup 1999 competition dataset is described in detail here.
In this notebook we will use the reduced dataset (10 percent) provided for the KDD Cup 1999, containing nearly half million network interactions. The file is provided as a Gzip file that we will download locally.
import urllib
f = urllib.urlretrieve ("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")
The most common way of creating an RDD is to load it from a file. Notice that Spark's textFile
can handle compressed files directly.
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)
Now we have our data file loaded into the raw_data
RDD.
Without getting into Spark transformations and actions, the most basic thing we can do to check that we got our RDD contents right is to count()
the number of lines loaded from the file into the RDD.
raw_data.count()
494021
We can also check the first few entries in our data.
raw_data.take(5)
[u'0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.', u'0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']
In the following notebooks, we will use this raw data to learn about the different Spark transformations and actions.
parallelize
¶Another way of creating an RDD is to parallelize an already existing list.
a = range(100)
data = sc.parallelize(a)
As we did before, we can count()
the number of elements in the RDD.
data.count()
100
As before, we can access the first few elements on our RDD.
data.take(5)
[0, 1, 2, 3, 4]