# Installation Instructions (for non-VM users)¶

• Install IPython and IPython Notebook from http://ipython.org/install.html. In Ubuntu, you can do sudo apt-get install ipython ipython-notebook. The version of IPython needs to be 1.0 or later.
• Once IPython 1.0+ and Julia 0.2 is installed, you can type julia in the command line to start the Julia REPL, and install IJulia with: Pkg.add("IJulia").
• Continue the installation with Pkg.add("PyPlot") to install PyPlot, which is a plotting package for Julia based on Python's Matplotlib
• If the commands above returns an error, you may need to run Pkg.update(), then retry it.
• To start the IJulia interface, run ipython notebook --profile julia (a window will open in your web browser).
• Execute the code below block-by-block. For example, there's a good chance the code in part n will not run without first running the code in part a.

More comprehensive installation instructions can be found at https://github.com/JuliaLang/IJulia.jl, as well as https://github.com/stevengj/PyPlot.jl. If you are not using the recommended VM, stackoverflow and google will be your best friends.

# Installation Instructions (for saasbook VM users)¶

(borrowed heavily from the installation instructions in CS194-16: Introduction to Data Science http://amplab.github.io/datascience-sp14/setup.html)

Note: While you should also be able to set up a similar environment on your Mac or PC without needing a Virtual Machine, the course staff will not support such configurations - so you’re on your own if you choose to go that route!

• Follow the instructions available here to install VirtualBox, and the virtual machine image. The virtual machine image already includes most of the software necessary to run the code. We will install extra packages below.
• Start up the machine. Enter ‘saasbook’ as the password.
• Launch a terminal. (Third icon in the launcher on the left.)
• Run sudo bash setup.bash. Make sure you are in the same directory as setup.bash before executing this command. Enter the same password again to install lots of packages.
• Grab a coffee or something - it will take a few minutes to build/install these components. Also if you see warnings etc. on the screen, don’t worry that is expected.
• Once IPython 1.0+ and Julia 0.2 is installed, you can type julia in the command line to start the Julia REPL, and install IJulia with: Pkg.add("IJulia").
• Continue the installation with Pkg.add("PyPlot") to install PyPlot, which is a plotting package for Julia based on Python's Matplotlib.
• If the commands above returns an error, you may need to run Pkg.update(), then retry it.
• To start the IJulia interface, run ipython notebook --profile julia (a window will open in your web browser).
• Execute the code below block-by-block. For example, there's a good chance the code in part n will not run without first running the code in part a.

# Tips and Tricks¶

• If you are new to Julia, I recommend downloading and playing around with the language. The syntax is a mix of the best of Python and Matlab, and is very easy to get used to. Speaking of Matlab, Julia is 1-indexed instead of 0-indexed like Python and other languages
• I recommend http://learnxinyminutes.com/docs/julia/ as the quick tutorial to learn most of Julia's basic features
• Use a semicolon at the end of a statement to suppress the output/printing
• To make the plot small enough and fit into the writeup, use the figure() command. In particular, the code below used figure(figsize=(5, 3)) to indicate that the size of the plot should be 5x3 inches
• PyPlot is the porting of matplotlib.pyplot to Julia. If there are naming conflicts (like the function hist() which plots a histogram), use the full function call plt.hist()
In [ ]:
# Import the plotting package for Julia
using PyPlot

In [4]:
# Declaring some global variables for multiple uses
N = 1000
sequence = [2, 4, 10, 50, 100, 500, 10000, 100000]
Qs = [0.4, 0.1, 0.25, 0.5]
ks = [2, 10, 50, 100, 500, 10000, 100000];


# Part a¶

In [8]:
# Returns the number of heads in n flips
# rand() generates a random number between 0 and 1, that number is then rounded to either 0 or 1.
# By summing them up, we essentially are counting the number of heads
flip_coin(n) = sum(round(rand(n)))

Out[8]:
flip_coin (generic function with 1 method)
In [14]:
heads = flip_coin(N)
figure(figsize=(5, 3))
title("1000 coin tosses");


# Part b¶

We simply call flip_coin m times. The result will be a list of the number of heads in m runs

In [9]:
# Simulates a flip of n coins m times
flip_runs(n, m) = [flip_coin(n) for i = 1:m]

Out[9]:
flip_runs (generic function with 1 method)
In [17]:
figure(figsize=(5, 3))
plt.hist(flip_runs(N, 1000))
title("Heads frequency in 1000 runs of 1000 coin tosses");


# Part c¶

• We repeat part b for every number in the sequence
• for i = sequence is Julia's syntax for Python's for i in list
• $i is the Julia's way to do String concatenation by evaluating i, much like Perl's In [21]: for i = sequence figure(figsize=(5, 3)) plt.hist(flip_runs(i, 1000)) title("Heads frequency in 1000 runs of$i coin tosses");
end


# Part d¶

We shift the center of the horizontal axis to the origin by subtracting half the number of tosses from the number of heads

In [10]:
scaled_run(n, m) = [flip_coin(n) - n/2 for i = 1:m]

Out[10]:
scaled_run (generic function with 1 method)
In [11]:
for i = sequence
figure(figsize=(5, 3))
plt.hist(scaled_run(i, 1000));
end


# Part e¶

We use a fixed horizontal axis for all plots in this part

In [12]:
for i = sequence
figure(figsize=(5, 3))
plt.hist(scaled_run(i, 1000));
xlim(-500, 500)
end


# Part f¶

To normalize, the leftmost point corresponds to tossing 0 heads and all tails, whereas the rightmost point corresponds to tossing all i heads and 0 tails

In [13]:
for i = sequence
figure(figsize=(5, 3))
plt.hist(flip_runs(i, 1000));
xlim(0, i)
end


# Part g¶

Comments on the three parts above:

• Part d: As the number of tosses increases, the distribution moves closer to a normal curve
• Part e: Since we are working on the same scale, the histogram gets bigger as the number of tosses increases, since the number of heads will also increase
• Part f: The curve becomes tighter and tends toward the middle as the number of tosses increases. This makes sense because as we toss more coins, the chance of getting all heads or all tails decreases significantly comparing to say, when we toss 2 or 4 coins.

# Part h¶

• We first define a function that generates a random number, either -1 or 1 based on the result of rand()
• We then proceed to append the y-values to the list one-by-one. push! is Julia's syntax for appending an element to an array and modifying the first argument
In [26]:
rand_one() = rand() < 0.5 ? -1 : 1

# An n-step random walk starts at 0 by default
function path(n, walk=[0])
for i = 1:n
push!(walk, rand_one() + walk[end])
end
return walk
end

Out[26]:
path (generic function with 2 methods)
In [28]:
figure(figsize=(5, 3))
title("20 random walks of 1000 coin flips")
for _ = 1:20
plot(path(1000))
end


# Part i¶

• Instead of (n, k) where n is the number of tosses and k is the sum so far, we use (n, k/n)
• Every value is normalized between -1 and 1
• Like in the earlier plots, as the number of tosses increase, the deviation is smaller
In [30]:
figure(figsize=(5, 3))
title("100 normalized random walks of 1000 coin flips")
for _ = 1:100
plot([k/n for (n, k) = enumerate(path(1000))])
end


# Part j¶

• Plot how frequently the fraction of heads is less than q
• The optimization trick we're using here is based on the fact that a run of length 1000 contains a run of length 500 as a prefix
• Hence, we can use earlier values that we computed for smaller number of coin tosses in computing the number of heads in larger number of coin tosses
In [33]:
identity(x) = x

# runs and tosses are the number of runs and tosses respectively
# q is the threshold which we're interested in finding the fraction of heads smaller than
# func is the function we want to apply to the vertical axis values
# For this part, that function should do nothing (identity). For next part, it's the log function
function multi_toss_runs(runs, tosses, q, func)
runs_table = Dict()
for k = 1:tosses
runs_table[k] = Float64[]
end
for _ = 1:runs
for k = 1:tosses
end
end

# Computes the frequency of the fraction of heads is less than q
# Code explanation: for every run whose fraction of heads is less than q, give that run a 1, otherwise 0
# Then sum up all the values to get the number of runs that satisfies this requirement, and divides
# it by the total number of runs. Repeat for each of the k tosses
return [func(sum([heads/k <= q ? 1 : 0 for heads = runs_table[k]]) / runs) for k = 1:tosses]
end

Out[33]:
multi_toss_runs (generic function with 1 method)
In [35]:
figure(figsize=(5, 3))
for q = Qs
plot(multi_toss_runs(10000, 1000, q, identity));
end


# Part k¶

When you increase the number of tosses, it gets less and less likely that the fraction of heads will be smaller than a small threshold (like 0.4 and below)

In [36]:
figure(figsize=(5, 3))
for q = Qs
plot(multi_toss_runs(10000, 1000, q, log));
end


# Part l¶

• We sort the ratio to clearly indicate the fraction of m runs in which R <= q
• We also return the quartile values (25th, 50th, 75th) for part n
In [6]:
function q_curve(m, k)
res = [flip_coin(k) / k for _ = 1:m]
sort!(res)
first_quartile = res[m * 0.25]
second_quartile = res[m * 0.5]
third_quartile = res[m * 0.75]
return res, first_quartile, second_quartile, third_quartile
end

Out[6]:
q_curve (generic function with 1 method)
In [41]:
figure(figsize=(5, 3))
title("Ratio of heads as a function of q")
curve, _, _, _ = q_curve(10000, 1000) # don't care about the quartile values here yet
plot(curve, 1:10000);


# Part m¶

• The curve is more straight as k increases, i.e. there's less deviations across runs
In [10]:
firsts, seconds, thirds = Float64[], Float64[], Float64[]

figure(figsize=(5, 3))
title("Ratio of heads in k tosses as a function of q")
for k = ks
curve, first_quartile, second_quartile, third_quartile = q_curve(10000, k)
plot(curve, 1:10000)

# Save the quartile values
push!(firsts, first_quartile)
push!(seconds, second_quartile)
push!(thirds, third_quartile)
end


# Part n¶

As k increases, the quartile values becomes closer and closer to 0.5. This suggests that the heads/tails distribution tend toward 0.5/0.5

In [11]:
# Ignore k = 2, so we start indexing from 2. Recall that Julia is 1-indexed
function plot_quartiles()
plot(ks[2:], firsts[2:]);
plot(ks[2:], seconds[2:]);
plot(ks[2:], thirds[2:]);
end

Out[11]:
plot_quartiles (generic function with 1 method)
In [11]:
figure(figsize=(5, 3))
title("q quartile markers as a function of k")
plot_quartiles();


# Part o¶

• The log-log plot seems to offer the most insight since it's a rather straight line comparing to the other plots
In [12]:
# Computes the gap between the 0.75 marker and the 0.25 marker
gap = thirds - firsts

function plot_quartiles(plot_function)
plot_function(ks[2:], gap[2:]);
end

Out[12]:
plot_quartiles (generic function with 2 methods)
In [12]:
figure(figsize=(5, 3))
title("Linear-linear plot")
plot_quartiles(plot);

figure(figsize=(5, 3))
title("Log-linear plot")
plot_quartiles(semilogx);

figure(figsize=(5, 3))
title("Linear-log plot")
plot_quartiles(semilogy);

figure(figsize=(5, 3))
title("Log-log plot")
plot_quartiles(loglog);


# Part p¶

• The distance between the 0.75 marker and the 0.25 marker is just twice the distance between the 0.75 and the 0.5 marker, since they are symmetric
• We observe that the slope of the line in the log-log scale is approximately 0.5 (it moves two powers of x before changing one power of y), which implies that in a normal scale, $y \sim \frac{1}{\sqrt{x}}$
• Hence, if we want to scale the gap, we first subtract 0.5 to remove the symmetry, scale by the square root of k, and then add 0.5 back to the final result to adjust for the subtraction earlier
• Most points on all of the q curves now align perfectly to each other. Let's translate the above to code!
In [50]:
function q_curve_norm(m, k)
res = [(flip_coin(k)/k - 0.5) * sqrt(k) + 0.5 for _ = 1:m]
sort!(res)
return res
end

Out[50]:
q_curve_norm (generic function with 1 method)
In [51]:
figure(figsize=(5, 3))
title("Scaled ratio of heads in k tosses as a function of q")
for k = ks
plot(q_curve_norm(10000, k), 1:10000)
end