#!/usr/bin/env python
# coding: utf-8

# [![Py4Life](https://raw.githubusercontent.com/Py4Life/TAU2015/gh-pages/img/Py4Life-logo-small.png)](http://py4life.github.io/TAU2015/)
# 
# ## Exam - 4.8.2015 - Solution
# 
# ### Tel-Aviv University / 0411-3122 / Spring 2015

# ## 1) GC content
# 
# In molecular biology and genetics, [GC-content](http://en.wikipedia.org/wiki/GC-content) is the percentage of bases on a DNA molecule that are either G or C (from a possibility of four different ones, also including A and T).
# 
# In this question we will want to calculate the GC content of a DNA sequence in a way that will allow us to see changes in GC content along the sequence. 
# 
# For this we will write a function that goes over a sequence and at each position `k` in the sequence it calculates the GC content of a small overlapping subseqeunce starting at position `k`. We call this subsequence a _window_ and so we are calculating the GC content using a sliding window.
# 
# **a)** Write a function `calc_GC_content`. The arguments are `seq`, a string of the characters `AGCT` representing a DNA sequence; and `win_size`, an int specifiying the number of bases in the sliding window. The function will return a `list` of `float`s (each between 0 and 1). In this `list`, the `float` in index `k` will represent the GC-content of a subsequence of `seq` of length `win_size` starting at `k`.

# In[36]:


def calc_GC_content(seq, win_size=10):
    gc_values = []
    for i in range(len(seq) - win_size):
        sub_seq = seq[i:i+win_size]
        num_GC = sub_seq.count('G') + sub_seq.count('C')
        value = num_GC/win_size
        gc_values.append(value)
    return gc_values


# In[37]:


dna_seq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'
gc_values = calc_GC_content( dna_seq )
assert gc_values[:10] == [0.5, 0.5, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6]


# **b)** Now open the file `lec4_files/camelus.fasta`, read it as a text file, and save the last line which has the __camelus cytochrome b_ sequence into a variable called `dna_seq`. Then calculate the GC content by calling `calc_GC_content` with `dna_seq` and a windows size of 100.

# In[38]:


with open("lec4_files/camelus.fasta") as f:
    dna_seq = f.readlines()[-1]
gc_values = calc_GC_content(dna_seq, win_size=100)
assert gc_values[:10] == [0.37, 0.38, 0.39, 0.38, 0.39, 0.38, 0.39, 0.39, 0.39, 0.39]


# **c)** Next, plot the GC-content of the _camelus cytochrome b_ sequence. Don't forget axis labels!

# In[39]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import seaborn as sns


# In[40]:


plt.plot(gc_values)
plt.xlabel('DNA position')
plt.ylabel('GC content (%)');


# ## 2) Percent of coding regions
# In this question we will examin the ratio of coding and non-coding regions in different nucleotide sequences.
# 
# __a)__ The strings in `mitochondrial_acc` represent GenBank accessions of complete mitochondrial genomes. Use the list to fetch each record from GB, like we did in [lecture 6](http://nbviewer.ipython.org/github/Py4Life/TAU2015/blob/master/lecture6.ipynb), and store them in a list, as `SeqRecord` objects.

# In[1]:


from Bio import SeqIO
from Bio import Entrez
import csv
Entrez.email = 'A.N.Other@example.com'


# In[2]:


mitochondrial_acc = ['AJ489607','U03843','Y12025','AP008824','AB080276','AB079613']
mitochondrial_records = []
for acc in mitochondrial_acc:
    entrez_obj = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=acc)
    record = SeqIO.read(entrez_obj, 'gb')
    mitochondrial_records.append(record)

assert len(mitochondrial_records) == 6


# __b)__ Write a function that receives a `SeqRecord` object and returns the fraction of coding DNA in this record. Coding regions are the features of the record with type `gene`. Note that there could be many separate coding regions within a record. The function should return a `float`, representing the number of nucleotides belonging to coding regions. For example, if the whole sequence is 10,000 bp long, and there are 3 coding regions, each of 1,000 bp, than the function should return `0.3`.

# In[6]:


def coding_fraction(gb_record):
    coding = 0
    for feat in gb_record.features:           
        if feat.type == 'gene':
            coding += len(feat)
    return coding / len(record.seq)

assert 0.899 < coding_fraction(mitochondrial_records[0]) < 0.9

for x in range(6):
    print(coding_fraction(mitochondrial_records[x]))


# __c)__ Use the records from section _a_ and the function from section _b_ to compare the fraction of coding DNA between the mitochondrial sequences of different organisms.
# 
# Write the results to a CSV file (`Q2b.csv`). The file should have two columns: `organism` and `fraction_coding`, and each row should represent the result for one organism (one record).  
# You can use the provided code to plot the results and check your answer.

# In[4]:


with open('Q2b.csv', 'w', newline='') as fo:
    writer = csv.writer(fo)
    writer.writerow(['organism','fraction_coding'])
    for rec in mitochondrial_records:           
        organism = rec.annotations['organism']
        rec_coding_fraction = coding_fraction(rec)
        writer.writerow([organism,rec_coding_fraction])


# In[13]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Q2b.csv")
df.plot(x='organism',y='fraction_coding',kind="bar", legend=False)
plt.xlabel('organism')
plt.ylabel('Coding fraction')


# ## 3) Body Temperature
# 
# In this analysis we will compare the body temperature of animals to check if indeed there is such a thing as [warm-blooded](http://en.wikipedia.org/wiki/Warm-blooded) and cold-blooded animals.
# 
# For this analysis we will load the [AnAge](http://genomics.senescence.info/download.html#anage) dataset that we used in [lecture 7](http://nbviewer.ipython.org/github/Py4Life/TAU2015/blob/master/lecture7.ipynb).
# 
# First, import the neccesary libraries:

# In[9]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import scipy.constants
import pandas as pd
import seaborn as sns
import urllib
import zipfile


# **a)** Get the zip file containing the data, extract it and read the data to a `DataFrame`. We are interested in the `Temperature (K)` column.

# In[44]:


urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', 'anage_dataset.zip')


# In[45]:


with zipfile.ZipFile('anage_dataset.zip') as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)
assert data.shape == (4212, 31)


# **b)** The temperatures are in Kelvin degrees and we like Celsius degrees, so use `scipy.constants.K2C` to transform the temperature to Celsius and save the result in a new column:

# In[46]:


data['Temperature (C)'] = scipy.constants.K2C(data['Temperature (K)'])
assert data.shape == (4212, 32)


# **c)** Plot a histogram of the temperature (in Celsius). Don't forget to use meaningful `bins`.

# In[47]:


bins = np.linspace(0, 45, 45)
plt.hist(data['Temperature (C)'], bins=bins, lw=0)
plt.xlabel("Temperature (C)")
plt.ylabel("# Species");


# **d)** Clean the data frame from rows with missing data in the temperature column You can find the rows that have missing data using `np.isfinite` which returns `False` for rows with infinite or NaN (missing) data.

# In[48]:


data = data[np.isfinite(data['Temperature (C)'])]
assert data.shape == (494, 32)


# **e)** Count how many species we have in the data frame for each animal Class. Remove from the data classes with less than 10 species (you can do this manually by specifiyng the class names or automatically using the count you calculated).

# In[49]:


data.Class.value_counts()


# In[50]:


data = data[data.Class != 'Aves']
assert data.shape == (491, 32)


# **f)** Plot a separate histogram of the temperature for each animal Class. Use a [faceted figure](http://stanford.edu/~mwaskom/software/seaborn/examples/faceted_histogram.html). Don't forget axis labels, proper bins, and a reasonable figure size.

# In[51]:


g = sns.FacetGrid(data=data, col="Class", margin_titles=True, sharex=True, sharey=False, size=4.5)
g.map(plt.hist, 'Temperature (C)', bins=bins, lw=0)
g.set_ylabels("# Species");


# **g)** Perform a [t-test](http://iaingallagher.tumblr.com/post/50980987285/t-tests-in-python) to verify that the temperature of mammals is, on average, larget then the temperature of Amphibians. Print the P-value of the t-test.

# In[52]:


mammalia = data[data.Class=='Mammalia']['Temperature (C)']
amphibia = data[data.Class=='Amphibia']['Temperature (C)']
t,p = scipy.stats.ttest_ind(mammalia, amphibia, equal_var=False)
print("P-value:", p)