Setting up the IPython Notebook with PySpark on AMPCamp EC2 clusters

Note: this HowTo assumes you are using a cluster provided by the AMPLab team, where port 8888 has already been opened. If you have spun up your own cluster using their AMI, you need to go to the Security Groups tab of the EC2 console and open that port for traffic. See here for full details.

You can run the IPython Notebook interface as a more friendly way to interact with your AMPCamp EC2 cluster. The detailed instructions on how to run a public IPython Notebook Server are here, but the basics are:

  • Create a certificate file for your cluster by typing at the command line:

      cd /root
      openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
    
  • Let's make sure there's a default IPython profile ready for us to use:

      ipython profile create default
    
  • You will need a hashed password next, which you can create with the following (note these two lines are a single command; copy and paste the whole thing in one shot):

      python -c "from IPython.lib import passwd; print passwd()" \
          > /root/.ipython/profile_default/nbpasswd.txt
    
  • Verify the password file has a string like sha1:16a8a30fb9b6:82c0... in it (your actual value will differ). If you don't get this, repeat the prior step:

      cat /root/.ipython/profile_default/nbpasswd.txt
      sha1:16a8a30fb9b6:82c030d3989b0069b9ed603822949a954a2beb21
    
  • Put the following into the file /root/.ipython/profile_default/ipython_notebook_config.py:

      # Configuration file for ipython-notebook.
      c = get_config()
    
      # Notebook config
      c.NotebookApp.certfile = u'/root/mycert.pem'
      c.NotebookApp.ip = '*'
      c.NotebookApp.open_browser = False
      # It is a good idea to put it on a known, fixed port
      c.NotebookApp.port = 8888
    
      PWDFILE="/root/.ipython/profile_default/nbpasswd.txt"
      c.NotebookApp.password = open(PWDFILE).read().strip()
    
  • Put the following into the file /root/.ipython/profile_default/startup/00-pyspark-setup.py:

      # Configure the necessary Spark environment
      import os
      os.environ['SPARK_HOME'] = '/root/spark/'
    
      # And Python path
      import sys
      sys.path.insert(0, '/root/spark/python')
    
      # Detect the PySpark URL
      CLUSTER_URL = open('/root/spark-ec2/cluster-url').read().strip()
    
  • That's it! You can now start the notebook server by typing the following command:

      ipython notebook
    

Note: I strongly recommend you do this inside a screen or tmux session so it's persistent. This will let it survive cleanly if you lose your connection to your cluster.

You can then connect to the server via https://[YOUR INSTANCE URL HERE]:8888. Once you type your password, you should be able to start running code!

Warning: the URL for your notebook must start with https, not http.
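
If the login page rejects your password, it can help to check the hash in nbpasswd.txt by hand. The sketch below assumes the stored string has the form algorithm:salt:digest, with the digest computed over the password followed by the salt, which is my understanding of how IPython.lib.passwd builds it. The name check_password is a hypothetical helper for illustration, not part of IPython.

```python
import hashlib


def check_password(hashed, passphrase):
    """Return True if `passphrase` matches an 'algorithm:salt:digest' string.

    Assumption: digest = algorithm(passphrase + salt), hex-encoded.
    """
    try:
        algorithm, salt, digest = hashed.strip().split(':', 2)
    except ValueError:
        # Not three colon-separated fields, so this is not a valid hash line.
        return False
    try:
        h = hashlib.new(algorithm)
    except ValueError:
        # Unknown hash algorithm name.
        return False
    h.update(passphrase.encode('utf-8') + salt.encode('ascii'))
    return h.hexdigest() == digest


# Example use on the cluster (path taken from the steps above):
#   stored = open('/root/.ipython/profile_default/nbpasswd.txt').read()
#   check_password(stored, 'the password you typed')
```

IPython ships its own version of this check, IPython.lib.security.passwd_check, which is the better choice when your installed version provides it.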

The startup file above (00-pyspark-setup.py) creates a variable called CLUSTER_URL which you can use to create your SparkContext:

In [1]:
print CLUSTER_URL
spark://ec2-50-16-173-245.compute-1.amazonaws.com:7077

Now let's create the context:

In [2]:
from pyspark import SparkContext
sc = SparkContext(CLUSTER_URL, 'pyspark')

And test it by creating a trivial RDD:

In [3]:
sc.parallelize([1,2,3])
Out[3]:
<pyspark.rdd.RDD at 0x1e16d90>
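
Creating an RDD does not by itself run anything on the workers; Spark only launches a job when you call an action. To confirm that the cluster really executes work, you can run a small computation. This is a minimal sketch, assuming the sc object created above; sum_of_squares is a hypothetical helper name, and n is assumed to be at least 1 because reduce fails on an empty RDD.

```python
from operator import add


def sum_of_squares(sc, n):
    """Sum the squares of 0..n-1 using the SparkContext `sc` (n >= 1)."""
    rdd = sc.parallelize(range(n))      # describes the dataset, no job yet
    squares = rdd.map(lambda x: x * x)  # transformation, still no job
    return squares.reduce(add)          # action: this runs on the workers


# In the notebook, with the `sc` created above:
#   sum_of_squares(sc, 1000)
```

If this call returns a number, the notebook, the master and the workers are all talking to each other.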

WARNING: Shut down this tutorial when you are done with it!

Because of how PySpark works, the above context will hog all your cluster resources. If you are going to do new work and are done with this tutorial, remember to shut it down from the dashboard so you free the cluster for other work.
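
If you would rather not claim the whole cluster in the first place, one option is to cap the number of cores the context may use. The sketch below only builds the settings; it assumes a PySpark version that provides SparkConf and a standalone-mode cluster that honors the spark.cores.max property. The helper name capped_spark_settings and the value 4 are illustrative choices of mine, not part of the AMPCamp setup.

```python
def capped_spark_settings(cluster_url, app_name, max_cores):
    """Build (key, value) pairs that limit the cores an application claims."""
    if max_cores < 1:
        raise ValueError('max_cores must be at least 1')
    return [
        ('spark.master', cluster_url),
        ('spark.app.name', app_name),
        ('spark.cores.max', str(max_cores)),
    ]


# Usage inside the notebook (assumes SparkConf is available):
#   from pyspark import SparkConf, SparkContext
#   conf = SparkConf().setAll(capped_spark_settings(CLUSTER_URL, 'pyspark', 4))
#   sc = SparkContext(conf=conf)
#   ...
#   sc.stop()  # releases the cluster resources held by this context
```

Calling sc.stop() from a notebook cell is another way to free the cluster without shutting down the whole notebook.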