In this notebook, we showcase the IPython Notebook as a suitable environment for modeling and data analysis in Synthetic Biology. With this framework, you will be able both to (i) develop your ideas (pure data analysis, data visualization) and (ii) create a portable, shareable document that stores not only code, but also your reasoning throughout your research. This is a super-fast introduction to the language and its major points, meant to get anyone started with their own projects.
This is intended as supplementary material for the iGEM 2015 competition. You can contact our team (Team USP-Brasil) if you would like a live demonstration!
This is our Facebook page: https://www.facebook.com/brasilusp?fref=ts
If needed, the following references might be useful:
This notebook is free software; you are free to use it the way you'd like. Visit our Wiki or find all source code here.
In every programming language, the "hello world" program is usually a small piece of code that prints "Hello world" on the screen. This is a common way to get a first glimpse of how the language you're planning to learn looks.
print 'Hello world'
Hello world
As you can see, whenever you want to print anything on your screen, simply use the print keyword.
Disclaimer: in case of an error here, check whether you are using Python 3 instead of Python 2. In Python 3, print is a function and should be called as print('Hello world') instead.
To import packages or libraries, simply use the import keyword. For convenience, you can create an alias for a library name using the as keyword. Below are a few examples.
Plotting:
%matplotlib inline
import pylab as pl
Fast numerical calculations:
from math import *
import numpy as np
Pretty data sheets:
import pandas as pd
In case you experience errors in the previous commands, simply try installing the packages being imported. Under Windows, you can use the Anaconda package manager for this purpose.
In most Linux distributions' package managers, Python packages are usually named python-[PACKAGE-NAME]. For instance, numpy is usually python-numpy.
Example of creating variables:
X = 1
Y = 2
Z = X + Y
print Z
3
Be careful with division of whole numbers (integers). In Python 2, dividing two integers truncates the result (in Python 3, / always returns a float and // performs integer division):
print X/Y
0
If you type the following, you are defining X as the whole number 1:
X = 1
print type(X)
<type 'int'>
However, if you want X as the real number 1, then use:
X = 1.
print type(X)
<type 'float'>
Additionally, if you have X defined as the whole number 1 and want it as the real number 1.0, use the float function as below:
X = 1
Y = 2
XX = float( X )
YY = float( Y )
print 'Integer division: ', X/Y
print 'Float division: ', XX/YY
Integer division:  0
Float division:  0.5
We can similarly create lists of variables:
X = [1,2,3,4]
Y = [4.2,5.6,6,'A']
print X
print X+Y
[1, 2, 3, 4]
[1, 2, 3, 4, 4.2, 5.6, 6, 'A']
Did you notice anything strange? Yep, X+Y is not the sum of the "vectors" X and Y, but their concatenation (juxtaposition). That is because lists are not vectors:
type(X)
list
Also, notice that lists are simply a group of variables, regardless of their type. In Y, for instance, we have whole numbers (6), real numbers (4.2, 5.6) and a character ('A'). In a moment, we'll get to "vectors".
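To see the difference concretely, here is a minimal sketch contrasting list concatenation with numpy's elementwise addition (using the numpy alias introduced above, written in Python 3 syntax):

```python
import numpy as np

X = [1, 2, 3, 4]
Y = [5, 6, 7, 8]

# "+" on lists concatenates them
concat = X + Y

# "+" on numpy arrays adds element by element
vec_sum = np.array(X) + np.array(Y)

print(concat)   # [1, 2, 3, 4, 5, 6, 7, 8]
print(vec_sum)  # [ 6  8 10 12]
```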
There are, however, more efficient ways to create and manipulate lists and vectors. For instance, to create a list that starts at 3 and goes up to 10, you can use the range function.
print 'My range: ', range(3, 10+1)
print 'Type of range: ', type(range(3, 10+1))
My range:  [3, 4, 5, 6, 7, 8, 9, 10]
Type of range:  <type 'list'>
Notice that we used 10+1 instead of 10 in range. The reason is simple: range(x0, xf) produces xf - x0 elements, starting at x0 and stopping at xf - 1. In our case, that is 11 - 3 = 8 elements, starting at 3 and stopping at 10. Bottom line: if you want range to stop at a number n, use range(x0, n+1).
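A quick sanity check of this rule (wrapped in list() so it also displays under Python 3, where range is lazy):

```python
# range(x0, xf) yields xf - x0 elements: x0, x0+1, ..., xf-1
r = list(range(3, 10 + 1))
print(r)       # [3, 4, 5, 6, 7, 8, 9, 10]
print(len(r))  # 8, i.e. 11 - 3
```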
To manipulate vectors, arrays and matrices, you should use numpy. It implements all matrix operations for you and has an easy-to-understand interface.
Check below how we create a vector from 0 to 20, with an interval of 0.5.
X = np.arange(0.,20.,0.5)
print 'Type of X: ', type(X)
print X
Type of X: <type 'numpy.ndarray'> [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5 10. 10.5 11. 11.5 12. 12.5 13. 13.5 14. 14.5 15. 15.5 16. 16.5 17. 17.5 18. 18.5 19. 19.5]
Notice that we are calling the function np.arange to create this vector (whose type is different from a list). The actual function is called arange, while the np prefix tells Python that this function lives inside the numpy library (whose alias is np).
We can also create vectors filled with (pseudo) random numbers, distributed under virtually any probability distribution. Let's use a normal distribution and draw 20/0.5 = 40 random numbers. To do so, we will call random.standard_normal from numpy:
Y = np.random.standard_normal(40)
print Y
[-0.58864086 0.49596875 -0.47822746 1.09337724 -0.10421426 -0.40074705 0.25858911 0.49787258 0.11296725 1.64549488 -0.22218368 -0.28222512 0.56122107 -0.24665776 0.86385487 0.03167906 0.19530497 0.3376739 1.44820725 -0.97188816 0.19079391 -0.82470083 -1.59494952 1.40015987 0.40768965 -1.28038193 1.04337321 0.82884378 -0.04808435 0.1565237 1.09977507 0.5318647 0.06631362 0.48611276 -1.07302408 0.22283512 -1.0729849 0.36147196 -0.18017443 -1.04193127]
In Python, every variable you create is an object (as in object-oriented languages). Objects have attributes (also called fields): additional variables attached to the object. For instance, every numpy array has an attribute that shows its dimensions. How many elements do X and Y have?
print X.shape, Y.shape
(40,) (40,)
This shows us that X and Y have only one dimension (i.e., are vectors) and each have 40 elements.
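For arrays with more than one dimension, shape reports one size per axis; a small sketch:

```python
import numpy as np

# a 3x5 matrix of zeros: shape gives (rows, columns)
M = np.zeros((3, 5))
print(M.shape)  # (3, 5)

# a plain vector has a one-element shape tuple
v = np.arange(4)
print(v.shape)  # (4,)
```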
Objects also have methods: functions that perform some task on the object itself. Since they are functions, you must call them using parentheses to pass their arguments. For instance, each numpy array has a method called mean(), which evaluates the average value of the elements in the vector. Let's give it a try!
print X.mean()
print Y.mean()
9.75 0.0981738159879
It is easy to check that these are the actual average values of the elements of each of our vectors.
Evidently, we usually store values in vectors or matrices because we want to plot them. The pylab library can be used to quickly construct beautiful plots:
pl.plot(X, Y)
pl.title('This is a title')
pl.xlabel('X vector')
pl.ylabel('Y vector')
pl.show()
Let's now style this plot a bit:
pl.plot(X, Y, 'rs--')
pl.show()
The trick is in the 'rs--' format string: r stands for red, s for square markers, and -- for a dashed line.
You can check here more examples of how to style your plots: http://matplotlib.org/examples/pylab_examples/line_styles.html
Let's calculate some properties of Y.
print 'Average value: ', Y.mean()
print 'Variance: ', Y.var()
print 'Stand. deviation: ', sqrt(Y.var())
print 'Histogram:'
pl.hist(Y, 50, normed=1, histtype='stepfilled');
pl.show()
Average value:  0.0981738159879
Variance:  0.586172226151
Stand. deviation:  0.765618851748
Histogram:
Of course, the above histogram doesn't look good. This is simply due to poor sampling; nevertheless, constructing histograms really is that simple: one function call from the pylab library. In the next sections, we will use a known dataset and illustrate it properly.
You can use \t to add tabs and line everything up. You can also use \n as linebreak.
print 'Average value: \t\t', Y.mean()
print 'Variance: \t\t', Y.var()
print 'Stand. deviation: \t', sqrt(Y.var())
Average value: 		0.0981738159879
Variance: 		0.586172226151
Stand. deviation: 	0.765618851748
print Y[:2]
print Y[:5]
print Y[:8]
[-0.58864086  0.49596875]
[-0.58864086  0.49596875 -0.47822746  1.09337724 -0.10421426]
[-0.58864086  0.49596875 -0.47822746  1.09337724 -0.10421426 -0.40074705  0.25858911  0.49787258]
for y in Y[:5]:
    print 'Squared: ', y**2
Squared:  0.346498065971
Squared:  0.24598499844
Squared:  0.228701501748
Squared:  1.19547378855
Squared:  0.0108606109889
We next introduce how to define a function. Suppose you want to define a Hill function, which reads
$$F_{k,n}(x) = \frac{1}{1+\dfrac{k^n}{x^n}},$$where $k$ and $n$ are constants you can adjust to fit your data. In our specific case, $k$ is the concentration of ligand to which half of the binding sites are occupied.
Functions are "rules" that take something as input, and return something (else) as output. See below a very simple function defined:
def f(x):
    a = x + 1
    return a
print f(1), f(2), f(3)
2 3 4
In our example above, x is our input, and the function returns x + 1. We call x the argument of our function f. Without x Python cannot evaluate f.
Next, let's implement the Hill function.
def Hill(X, K = 1.0, n = 2.0):
    return 1./(1. + (K/X)**(n))
As you can see, the function has three arguments: X, K and n.
Notice that K and n are defined differently. Whenever you define an argument this way, you are telling Python that the user does not need to provide it explicitly, in which case the given default values are used instead. Thus, we can simply call
Hill(1.5)
0.6923076923076923
The above command means we are evaluating our Hill function with X = 1.5, K = 1.0 and n = 2.0.
It is not possible to omit X, since X does not have a default value in the function definition. Thus, you should only give default values to arguments that are not strictly necessary for the function (i.e., what we usually call parameters).
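A minimal sketch of how default values behave, re-defining the same Hill function in Python 3 syntax:

```python
def Hill(X, K=1.0, n=2.0):
    # K and n fall back to their defaults when omitted
    return 1. / (1. + (K / X) ** n)

a = Hill(1.5)         # K = 1.0, n = 2.0 (defaults)
b = Hill(1.5, n=2.0)  # same result, n given explicitly
print(a, b)

try:
    Hill()            # X has no default: this raises TypeError
except TypeError:
    print('X is required')
```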
One powerful feature of the numpy package is that we can perform calculations on several values at once. In the code below, we provide Hill() with a numpy vector X, and the function is evaluated at each point separately. This can be especially useful on some occasions.
X = np.arange(0.,5.1,1.0)
print 'X: ', X
HX = Hill(X)
print '\nResult of Hill: ', HX
X:  [ 0.  1.  2.  3.  4.  5.]
Result of Hill:  [ 0.  0.5  0.8  0.9  0.94117647  0.96153846]
You can try yourself and evaluate Hill() in some of the points of X and check if that is correct.
Notice that we used np.arange this time. This is numpy's equivalent of the range function we studied before!
And plot it...
pl.plot(X, HX)
[<matplotlib.lines.Line2D at 0x7f6762894350>]
To improve the quality of the curve (make it smoother), we can increase the number of points in X. Nevertheless, we don't want to change the interval. Therefore, we will only change the step from 1.0 to 0.1, ending up with ten times more points.
X = np.arange(0.,5.,0.1)
pl.plot(X, Hill(X))
[<matplotlib.lines.Line2D at 0x7f6762888590>]
You may experience a division-by-zero warning or error. That happens because X contains a zero element, and the Hill function gets evaluated at zero. To avoid this, we will slightly change our function:
def Hill_complete(X, K = 1.0, n = 2.0):
    XX = X.copy()         # copy X so the original array is not modified
    XX[ X == 0 ] = 1e-20
    return 1./(1. + (K/XX)**(n))
As you can see, whenever an element of the argument is zero, we replace it with a very small real number. Since the function is continuous, it will "look nice" near zero.
X = np.arange(0.,5.,0.1)
pl.plot(X, Hill_complete(X, n=4.))
[<matplotlib.lines.Line2D at 0x7f676271ec50>]
What exactly do the $K$ and $n$ parameters represent? Let's use our function to briefly investigate these two parameters and learn how to manipulate functions in Python.
Let's start with $n$ below.
# We start re-defining our x-axis
X = np.arange(0.05,5.,0.01)
for n in [0.25, 1.0, 3.0, 15.]:
    pl.plot(X, Hill_complete(X, n=n))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
As you can see, as $n$ grows, the rising part of the curve becomes steeper and steeper. Depending on the context, this parameter is linked to the cooperativity among the parts being modeled. When $n\to\infty$, we get a Heaviside step function, as shown below (with $n = 10^6$).
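We can also check this limit numerically; a small sketch (np.errstate silences the harmless overflow warning caused by the huge exponent):

```python
import numpy as np

def hill(x, K=1.0, n=2.0):
    # basic Hill function; assumes x > 0
    return 1. / (1. + (K / x) ** n)

x = np.array([0.5, 0.9, 1.1, 2.0])
with np.errstate(over='ignore'):
    step = hill(x, n=1e7)
print(step)  # approximately [0, 0, 1, 1]: a step located at x = K
```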
pl.plot(X, Hill_complete(X, n=1000000.))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
Let's try changing $K$. In this context, $K$ means the concentration of the ligand where half of the binding sites are occupied.
for k in [0.25, 0.5, 1.0, 2.0, 3.0, 4.0]:
    pl.plot(X, Hill_complete(X, n=4., K=k))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
To understand it visually, we can show with a dashed line where $K$ is in the graph.
for k in [0.25, 0.5, 1.0, 2.0, 3.0, 4.0]:
    pl.plot(X, Hill_complete(X, n=4., K=k))
    pl.plot([k,k], [-0.2, 1.2], '--', color=(0.5, 0.5, 0.5))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
Hill function is broadly used to represent gene expression when the promoter positively impacts the production of mRNA. However, there are promoters that inhibit the gene transcription. In this case, it is common to use the Hill Function, but inverting the variable: $$F_{k,n}(x) = \frac{1}{1+\dfrac{x^n}{k^n}}.$$ To get more experience with Python programming, let's redefine our function to quickly inspect this new function.
def repressorHill(X, K = 1.0, n = 2.0):
    XX = X.copy()         # copy X so the original array is not modified
    XX[ X == 0 ] = 1e-20
    return 1./(1. + (XX/K)**(n))
X = np.arange(0.05,5.,0.01)
pl.plot(X, repressorHill(X, n=2.))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
As you can see, this new function decreases smoothly with its argument. With $n\to\infty$, it becomes another step function, but with the opposite behavior.
for n in [0.25, 1.0, 3.0, 15.]:
    pl.plot(X, repressorHill(X, n=n))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
pl.plot(X, repressorHill(X, n=1000000.))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
As for the $K$ parameter, the same holds.
for k in [0.25, 0.5, 1.0, 2.0, 3.0, 4.0]:
    pl.plot(X, repressorHill(X, n=4., K=k))
    pl.plot([k,k], [-0.2, 1.2], '--', color=(0.5, 0.5, 0.5))
pl.ylim(-0.1,1.1)
(-0.1, 1.1)
Next, we will create a DataFrame, which will store all the data we want to work on. Since datasets are often large, we don't want to print the whole dataset on the screen, but rather get a feeling of what's in a particular DataFrame. For this purpose, you can use the head() method as below; Python will then print only the first 5 rows of your DataFrame.
df = pd.DataFrame({
'K = 0.5' : Hill_complete(X, n=4., K=0.5),
'K = 1.0' : Hill_complete(X, n=4., K=1.0),
'K = 1.5' : Hill_complete(X, n=4., K=1.5),
'K = 2.0' : Hill_complete(X, n=4., K=2.0)
})
df.head()
|   | K = 0.5 | K = 1.0 | K = 1.5 | K = 2.0 |
|---|---|---|---|---|
| 0 | 0.000100 | 0.000006 | 0.000001 | 1.600000e-07 |
| 1 | 0.000207 | 0.000013 | 0.000003 | 3.317759e-07 |
| 2 | 0.000384 | 0.000024 | 0.000005 | 6.146556e-07 |
| 3 | 0.000655 | 0.000041 | 0.000008 | 1.048575e-06 |
| 4 | 0.001049 | 0.000066 | 0.000013 | 1.679613e-06 |
We can easily plot it:
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f6762393090>
We see here the average value of each curve:
df.mean()
K = 0.5    0.896931
K = 1.0    0.785243
K = 1.5    0.675235
K = 2.0    0.468682
dtype: float64
We now evaluate the variance of each curve.
df.var()
K = 0.5    0.064562
K = 1.0    0.113308
K = 1.5    0.138143
K = 2.0    0.129073
dtype: float64
Exporting to CSV (Excel compatible).
df.to_csv('Test.csv')
You can now check whether a Test.csv file was saved. If you have LibreOffice Calc (or its equivalent, MS Excel), you'll be able to see our data in the form of a spreadsheet. This can be a powerful way to export your data and share it with collaborators.
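To check the round trip, you can read the file back with pandas' read_csv; a small sketch using made-up values (the column names A and B are hypothetical):

```python
import pandas as pd

# write a tiny frame to disk, then read it back
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})
df.to_csv('Test.csv')

# index_col=0 restores the saved index column
df2 = pd.read_csv('Test.csv', index_col=0)
print(df2)
```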
Although pandas has a lot more to offer, especially in terms of data analysis, let's stop here. Try downloading this notebook and playing with it, and make sure these first concepts are well understood.