#!/usr/bin/env python
# coding: utf-8

# ## [Python Programming for Biologists, Tel-Aviv University / 0411-3122 / Spring 2015](http://py4life.github.io/TAU2015/)
# # Homework 6

# In[31]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt

# # 1) Time-series analysis
#
# In this exercise we will do a simple time-series analysis.
# The data is taken from an experiment that measures the growth of bacteria (_E. coli_) in a [96-well microplate](http://upload.wikimedia.org/wikipedia/commons/0/07/Microplate_with_reagents.jpg). The growth is measured in OD (optical density) over time in seconds.
#
# The data file is in CSV format (comma-separated values). The first row in the file is the time of each measurement. The next 96 rows are the OD values in each well at each time point.

# **a)** Start by loading the data using the `loadtxt` function in `numpy`. Note that in order to load a CSV file you must give it the proper `delimiter` argument (see [lecture 7](http://nbviewer.ipython.org/github/Py4Life/TAU2015/blob/master/lecture7.ipynb)).
# After you load the data, put the first row of the data in a variable called `t` (for time) and the rest of the rows in a variable called `OD`. Make sure (`assert`) that `OD` has 96 rows and that the number of columns in `OD` is equal to the length of `t`.

# In[2]:

data = np.loadtxt('bacterial_growth.csv', delimiter=',')
print(data.shape)
print(data[0, :5])
print(data[1:5, :5])

# In[3]:

t = data[0, :]
OD = data[1:, :]
assert OD.shape[0] == 96
assert len(t) == OD.shape[1]

# **b)** Plot all the growth curves - one per well, i.e. one per row in the data. Matplotlib will assign each line you plot a different color. Note that Matplotlib expects the lengths of `x` and `y` to be equal, but the length of `OD` is 96. To fix this you can [_transpose_](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.T.html) `OD`.
# Don't forget to label the x and y axes.
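# A quick sanity check of the transpose trick on a toy array (the shapes below are made up for illustration): transposing swaps the axes, so the first dimension matches the x values passed to `plot`.

```python
import numpy as np

# pretend we have 3 wells measured at 4 time points
toy_OD = np.arange(12).reshape(3, 4)
toy_t = np.arange(4)
# plot(toy_t, toy_OD) would fail: len(toy_t) != toy_OD.shape[0]
# after transposing, rows correspond to time points, as plot expects
assert toy_OD.T.shape == (len(toy_t), 3)
```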
# In[4]:

plt.plot(t, OD.T)
plt.xlabel('time')
plt.ylabel('OD');

# **c)** Now we want to present an aggregated version of the previous plot. In the next plot we will plot the mean and SEM (standard error of the mean) of the OD values across the wells at each time point. We will present the result as a line with errorbars, where the length of the errorbars is given by the SEM. Reminders:
# - SEM is calculated as the standard deviation divided by the square root of the number of samples.
# - To plot an errorbar plot you can use Matplotlib's [errorbar function](http://matplotlib.org/1.2.1/examples/pylab_examples/errorbar_demo.html).
# - To calculate the mean and SEM have a look at [lecture 7](http://nbviewer.ipython.org/github/Py4Life/TAU2015/blob/master/lecture7.ipynb).
# - Use `assert` to make sure you get what you expect: because we want to do a mean over all the wells we expect the result of the mean to have the same length as `t`.

# In[5]:

mean_OD = OD.mean(axis=0)
assert len(mean_OD) == len(t)
std_OD = OD.std(axis=0, ddof=1)
assert len(std_OD) == len(t)
# the number of samples at each time point is the number of wells (rows of OD),
# so divide by sqrt(OD.shape[0]), not by the number of time points
sem_OD = std_OD / np.sqrt(OD.shape[0])
assert (sem_OD < std_OD).all()

# In[6]:

plt.errorbar(t, mean_OD, sem_OD)
plt.xlabel('time')
plt.ylabel('OD');

# **d)** Finally, we want to check the distributions of the maximum and minimum OD values in each well (row of data). To do this, we will calculate the maximum and minimum OD over time in each well and plot two histograms, one for the maximum OD and one for the minimum OD.
#
# - There are several ways to plot two plots in the same figure. We recommend [`subplots`](http://matplotlib.org/examples/pylab_examples/subplots_demo.html).
# - To plot the histogram you could calculate the histogram and plot it with a `bar` plot, but a better, easier way to do this is with the [`hist`](http://matplotlib.org/examples/pylab_examples/histogram_demo_extended.html) function. Make sure you use enough bins to make the plot interesting, but not too many.
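# The SEM formula at work on a toy array (made-up values, 3 "wells" at 2 time points): the standard deviation across samples is divided by the square root of the number of samples.

```python
import numpy as np

# made-up measurements: 3 samples (rows), 2 time points (columns)
samples = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])
std = samples.std(axis=0, ddof=1)       # 2.0 at each time point
sem = std / np.sqrt(samples.shape[0])   # 2 / sqrt(3) at each time point
assert np.allclose(sem, 2 / np.sqrt(3))
```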
# - Use `assert` to make sure you aggregated on the right axis: check the length of the aggregation result against your expectation.

# In[7]:

max_OD = OD.max(axis=1)
min_OD = OD.min(axis=1)

# In[8]:

fig, ax = plt.subplots(1, 2)
ax[0].hist(max_OD, bins=75)
ax[0].set_xlabel('Max OD')
ax[0].set_ylabel('Frequency')
ax[1].hist(min_OD, bins=75)
ax[1].set_xlabel('Min OD')
ax[1].set_ylabel('Frequency');

# ## 2) Split-apply-combine
#
# In this question we will learn how to use the very useful [split-apply-combine](http://nbviewer.ipython.org/github/ResearchComputing/Meetup-Fall-2013/blob/master/python/lecture_12_pandas_groupby.ipynb) paradigm in _Pandas_ and how to use it to create sophisticated plots with very little coding.
#
# Start by reading this [blog post](http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/) by Brian Connelly,
# which describes the data and functions we will use.

# In[6]:

get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('ticks')

# **0)** Start by loading the dataset from this URL: <http://bconnelly.net/wp-content/uploads/2013/10/TradeoffData.csv>.
#
# > This data set contains the fitness of a flocculated strain of _Escherichia coli_ relative to a non-flocculated strain when grown alone in either spatially-structured (dish) or spatially-unstructured (tube) environments.
#
# Use the `read_csv` function in `pandas`, which can open CSV files from the local filesystem using a filename or from a remote resource using a URL. After reading the file into a variable called `data` (in your research, consider using a less generic name for the `DataFrame` variable), view it by calling the method `head` to see the first few rows in `data`.
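# The split-apply-combine idea in miniature, on a hypothetical two-column frame: `groupby` splits the rows by `Treatment`, the aggregation function is applied to each piece, and the results are combined into one object.

```python
import pandas as pd

# toy frame mimicking the dataset's columns (values are made up)
toy = pd.DataFrame({'Treatment': ['Dish', 'Dish', 'Tube', 'Tube'],
                    'RelativeFitness': [1.0, 2.0, 3.0, 5.0]})
# split by Treatment, apply mean to each group, combine into a Series
means = toy.groupby('Treatment')['RelativeFitness'].mean()
# means is indexed by Treatment: Dish -> 1.5, Tube -> 4.0
```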
# In[2]:

data = pd.read_csv('http://bconnelly.net/wp-content/uploads/2013/10/TradeoffData.csv')
data.head()

# **a)** First, group the data by the `Treatment` variable and call the `describe` method on the grouped data to see a textual summary of the `RelativeFitness` distribution in each `Treatment` (`Dish` and `Tube`).

# In[3]:

bytreatment = data.groupby('Treatment')
bytreatment.describe()

# **b)** Next, we want to plot a summary of the distribution of `RelativeFitness` in each of the `Treatment`s.
# Here we aim at getting a plot of the mean or median of the `RelativeFitness` together with some measure of the variance in the data. This can be achieved with a _boxplot_, a _violinplot_, a _whiskerplot_, or a regular plot with errorbars.
#
# So - plot **either** a boxplot, a violinplot, or a plot with errorbars of the data. A boxplot will show the median, quartiles and outliers; the violinplot will show the entire distribution of values; the errorbar plot will show the mean and the standard deviation.
#
# Here are some references to get you started:
#
# - [Violin Plots: A Box Plot-Density Trace Synergism](http://www.sci.utah.edu/~kpotter/Library/Papers/hintze:1998:VPDT/)
# - [Violin plots](http://stanford.edu/~mwaskom/software/seaborn/examples/violinplots.html)
# - [factor plots](http://stanford.edu/~mwaskom/software/seaborn/examples/grouped_boxplot.html)
# - [Visualizing distributions of data](http://stanford.edu/~mwaskom/software/seaborn/tutorial/plotting_distributions.html)
# - [boxplot with matplotlib](http://matplotlib.org/examples/pylab_examples/boxplot_demo.html)
# - [violinplot with matplotlib](http://matplotlib.org/examples/statistics/violinplot_demo.html)

# In[7]:

data.boxplot('RelativeFitness', 'Treatment')
plt.ylabel('Relative Fitness');

# In[8]:

sns.boxplot(data.RelativeFitness, data.Treatment, widths=0.2);

# In[9]:

sns.violinplot(data.RelativeFitness, data.Treatment);

# In[10]:

sns.factorplot(x='Treatment', y='RelativeFitness', hue='Treatment', data=data);

# **c)** We now want to check if the variance between `Group`s in the same `Treatment` is large and if the `Treatment` had the same effect on all `Group`s.
#
# Do a new grouping, this time by both `Group` and `Treatment`, and print the result of the `describe` method.

# In[11]:

bygroup_treatment = data.groupby(['Group', 'Treatment'])
bygroup_treatment.describe()

# **d)** Now use the `sns.FacetGrid` function to create a faceted plot of the distributions of `RelativeFitness`. Each facet should be similar to the plot you made in **(b)** (but you are free to choose a different plot type if you want to practice it!). Facet on either column (`col`) or row (`row`) to make a wide or long plot.
#
# Create two figures - in the first, facet according to `Treatment` and group by `Group`; in the second, vice versa: facet by `Group` and group by `Treatment`.
#
# For clarity and bonus points, use the `hue` argument of `FacetGrid` and set it to the same variable you facet by.
#
# Note that you may have to set the value of the argument `size` in `FacetGrid` to a number larger than the default 3.

# In[13]:

g = sns.FacetGrid(data, col='Treatment', hue='Treatment', size=5)
g.map(sns.boxplot, 'RelativeFitness', 'Group')
sns.despine()

# In[14]:

g = sns.FacetGrid(data, col='Treatment', hue='Treatment', size=5)
g.map(sns.violinplot, 'RelativeFitness', 'Group')
sns.despine()

# In[15]:

sns.set_palette('Paired')
g = sns.FacetGrid(data, col='Group', col_wrap=4, hue='Group', size=3)
g.map(sns.violinplot, 'RelativeFitness', 'Treatment')
sns.despine()

# **e)** Finally, we want to save a file with the mean and standard deviation of `RelativeFitness` for each of the `Group`s and `Treatment`s. Use the `aggregate` method of the `DataFrameGroupBy` object created by `groupby` and give it the names of the required functions - `np.mean` for the mean and `np.std` for the standard deviation.
# Save the result to a CSV file using the `to_csv` method of the `DataFrame` object created by `aggregate`.

# In[51]:

agg = bygroup_treatment['RelativeFitness'].aggregate([np.mean, np.std])
agg.head()

# In[25]:

agg.to_csv("agg.csv")
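# A miniature sketch of `aggregate` with multiple functions (toy data, not the real dataset): the result has one column per function, named after it, and one row per group.

```python
import numpy as np
import pandas as pd

# toy frame with made-up values, mimicking the Group/RelativeFitness columns
toy = pd.DataFrame({'Group': ['A', 'A', 'B', 'B'],
                    'RelativeFitness': [1.0, 3.0, 2.0, 6.0]})
summary = toy.groupby('Group')['RelativeFitness'].aggregate([np.mean, np.std])
# summary has columns 'mean' and 'std', one row per Group;
# summary.to_csv('toy_agg.csv') would write it out as in part (e)
```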