Note: this HowTo assumes that you are using a cluster provided by the AMPLab team, where port 8888 has already been opened. If you have spun up your own cluster using their AMI, you need to go to the Security Groups tab and open that port for traffic. See here for full details.
You can run the IPython Notebook interface as a more friendly way to interact with your AMPCamp EC2 cluster. The detailed instructions on how to run a public IPython Notebook Server are here, but the basics are:
Create a certificate file for your cluster by typing at the command line:
cd /root
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
Let's make sure there's a default IPython profile ready for us to use:
ipython profile create default
You will need a hashed password next, which you can create with the following (note that these two lines are a single command; copy and paste the whole thing in one shot):
python -c "from IPython.lib import passwd; print passwd()" \
> /root/.ipython/profile_default/nbpasswd.txt
Verify the password file has a string like sha1:16a8a30fb9b6:82c0... in it (your actual value will differ). If you don't get this, repeat the prior step:
cat /root/.ipython/profile_default/nbpasswd.txt
sha1:16a8a30fb9b6:82c030d3989b0069b9ed603822949a954a2beb21
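If you would rather not eyeball the file, the check can be scripted. The following is a minimal sketch that only tests the shape of the string. It assumes the hash always has the algorithm:salt:digest layout shown above, and the helper name looks_like_password_hash is made up for this example (it is not part of IPython):

```python
import re

# Assumed layout, based on the sample above: an algorithm name, a hex salt
# and a hex digest, joined by colons.
HASH_PATTERN = re.compile(r"^[a-z0-9]+:[0-9a-f]+:[0-9a-f]+$")


def looks_like_password_hash(text):
    """Return True if `text` has the algorithm:salt:digest shape."""
    # strip() drops the trailing newline that the shell redirect leaves behind.
    return HASH_PATTERN.match(text.strip()) is not None
```

On the cluster you could call it as looks_like_password_hash(open('/root/.ipython/profile_default/nbpasswd.txt').read()). An empty file or a Python traceback will fail the check, which is the signal to repeat the previous step.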
Put the following into the file /root/.ipython/profile_default/ipython_notebook_config.py:
# Configuration file for ipython-notebook.
c = get_config()
# Notebook config
c.NotebookApp.certfile = u'/root/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888
PWDFILE="/root/.ipython/profile_default/nbpasswd.txt"
c.NotebookApp.password = open(PWDFILE).read().strip()
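This configuration reads two files created in the earlier steps, and the server will not start properly if either one is missing. As a quick sanity check before launching, something like the sketch below can list whichever file is absent. The names missing_files and REQUIRED are invented for this example; the paths are the ones used in the configuration above:

```python
import os


def missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# Files the notebook configuration above depends on.
REQUIRED = [
    "/root/mycert.pem",
    "/root/.ipython/profile_default/nbpasswd.txt",
]
```

Calling missing_files(REQUIRED) on the cluster should give back an empty list if the certificate and password steps both succeeded.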
Put the following into the file /root/.ipython/profile_default/startup/00-pyspark-setup.py:
# Configure the necessary Spark environment
import os
os.environ['SPARK_HOME'] = '/root/spark/'
# And Python path
import sys
sys.path.insert(0, '/root/spark/python')
# Detect the PySpark URL
CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()
That's it! You can now start the notebook server by typing the following command:
ipython notebook
Note: I strongly recommend you do this inside a screen or tmux session so it's persistent. This will let it survive cleanly if you lose your connection to your cluster.
You can then connect to the server via https://[YOUR INSTANCE URL HERE]:8888. Once you type your password, you should be able to start running code!
Warning: the URL for your notebook must start with https, not http.
The startup file above (00-pyspark-setup.py) creates a variable called CLUSTER_URL, which you can use to create your SparkContext:
print CLUSTER_URL
spark://ec2-50-16-173-245.compute-1.amazonaws.com:7077
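If the cluster-url file is empty or malformed, the context creation will fail with a less obvious error, so it can help to check the value first. Here is a small sketch that splits a URL of the form shown above into host and port. The function name split_master_url is hypothetical, and it assumes the URL is always spark://host:port:

```python
def split_master_url(url):
    """Split 'spark://host:port' into (host, port); raise ValueError otherwise."""
    prefix = "spark://"
    if not url.startswith(prefix):
        raise ValueError("not a spark:// URL: %r" % url)
    # rpartition splits on the last colon, which separates host from port.
    host, sep, port = url[len(prefix):].rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("expected spark://host:port, got %r" % url)
    return host, int(port)
```

In the notebook, split_master_url(CLUSTER_URL) should hand back your master's hostname together with the port, 7077 in the output above.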
Now let's create the context:
from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'pyspark')
And test it by creating a trivial RDD:
sc.parallelize([1,2,3])
<pyspark.rdd.RDD at 0x1e16d90>
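Getting an RDD object back only shows that the context accepted the data; no job has run on the workers yet. To exercise the cluster end to end you need an action such as collect(). A minimal sketch, wrapped in a function so the context is passed in explicitly (the name smoke_test is made up here):

```python
def smoke_test(sc):
    """Square a three-element RDD on the cluster and collect the result."""
    rdd = sc.parallelize([1, 2, 3])
    # map() is lazy; collect() is the action that actually triggers the job.
    return rdd.map(lambda x: x * x).collect()
```

With the context created above, smoke_test(sc) should return [1, 4, 9] once the workers have processed the job.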
Because of how PySpark works, the context created above will hog all your cluster resources for as long as it is alive. Once you are done with this tutorial and want to do other work, remember to shut this notebook down from the notebook dashboard so that you free the cluster.