Exam - 4.8.2015 - Solution¶

Tel-Aviv University / 0411-3122 / Spring 2015¶

1) GC content¶

In molecular biology and genetics, GC-content is the percentage of bases on a DNA molecule that are either G or C (from a possibility of four different ones, also including A and T).

In this question we will want to calculate the GC content of a DNA sequence in a way that will allow us to see changes in GC content along the sequence.

For this we will write a function that goes over a sequence and at each position k in the sequence it calculates the GC content of a small overlapping subseqeunce starting at position k. We call this subsequence a window and so we are calculating the GC content using a sliding window.

a) Write a function calc_GC_content. The arguments are seq, a string of the characters AGCT representing a DNA sequence; and win_size, an int specifiying the number of bases in the sliding window. The function will return a list of floats (each between 0 and 1). In this list, the float in index k will represent the GC-content of a subsequence of seq of length win_size starting at k.

In [36]:

def calc_GC_content(seq, win_size=10):
    gc_values = []
    for i in range(len(seq) - win_size):
        sub_seq = seq[i:i+win_size]
        num_GC = sub_seq.count('G') + sub_seq.count('C')
        value = num_GC/win_size
        gc_values.append(value)
    return gc_values

In [37]:

dna_seq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'
gc_values = calc_GC_content( dna_seq )
assert gc_values[:10] == [0.5, 0.5, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6]

b) Now open the file lec4_files/camelus.fasta, read it as a text file, and save the last line which has the __camelus cytochrome b_ sequence into a variable called dna_seq. Then calculate the GC content by calling calc_GC_content with dna_seq and a windows size of 100.

In [38]:

with open("lec4_files/camelus.fasta") as f:
    dna_seq = f.readlines()[-1]
gc_values = calc_GC_content(dna_seq, win_size=100)
assert gc_values[:10] == [0.37, 0.38, 0.39, 0.38, 0.39, 0.38, 0.39, 0.39, 0.39, 0.39]

c) Next, plot the GC-content of the camelus cytochrome b sequence. Don't forget axis labels!

In [39]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

In [40]:

plt.plot(gc_values)
plt.xlabel('DNA position')
plt.ylabel('GC content (%)');

2) Percent of coding regions¶

In this question we will examin the ratio of coding and non-coding regions in different nucleotide sequences.

a) The strings in mitochondrial_acc represent GenBank accessions of complete mitochondrial genomes. Use the list to fetch each record from GB, like we did in lecture 6, and store them in a list, as SeqRecord objects.

In [1]:

from Bio import SeqIO
from Bio import Entrez
import csv
Entrez.email = 'A.N.Other@example.com'

In [2]:

mitochondrial_acc = ['AJ489607','U03843','Y12025','AP008824','AB080276','AB079613']
mitochondrial_records = []
for acc in mitochondrial_acc:
    entrez_obj = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=acc)
    record = SeqIO.read(entrez_obj, 'gb')
    mitochondrial_records.append(record)

assert len(mitochondrial_records) == 6

b) Write a function that receives a SeqRecord object and returns the fraction of coding DNA in this record. Coding regions are the features of the record with type gene. Note that there could be many separate coding regions within a record. The function should return a float, representing the number of nucleotides belonging to coding regions. For example, if the whole sequence is 10,000 bp long, and there are 3 coding regions, each of 1,000 bp, than the function should return 0.3.

In [6]:

def coding_fraction(gb_record):
    coding = 0
    for feat in gb_record.features:           
        if feat.type == 'gene':
            coding += len(feat)
    return coding / len(record.seq)

assert 0.899 < coding_fraction(mitochondrial_records[0]) < 0.9

for x in range(6):
    print(coding_fraction(mitochondrial_records[x]))

0.8990921787709497
0.7622206703910615
0.8923999068901304
0.6632914338919925
0.2989408752327747
0.6619529795158287

c) Use the records from section a and the function from section b to compare the fraction of coding DNA between the mitochondrial sequences of different organisms.

Write the results to a CSV file (Q2b.csv). The file should have two columns: organism and fraction_coding, and each row should represent the result for one organism (one record).
You can use the provided code to plot the results and check your answer.

In [4]:

with open('Q2b.csv', 'w', newline='') as fo:
    writer = csv.writer(fo)
    writer.writerow(['organism','fraction_coding'])
    for rec in mitochondrial_records:           
        organism = rec.annotations['organism']
        rec_coding_fraction = coding_fraction(rec)
        writer.writerow([organism,rec_coding_fraction])

In [13]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Q2b.csv")
df.plot(x='organism',y='fraction_coding',kind="bar", legend=False)
plt.xlabel('organism')
plt.ylabel('Coding fraction')

Out[13]:

<matplotlib.text.Text at 0xb88cf60>

3) Body Temperature¶

In this analysis we will compare the body temperature of animals to check if indeed there is such a thing as warm-blooded and cold-blooded animals.

For this analysis we will load the AnAge dataset that we used in lecture 7.

First, import the neccesary libraries:

In [9]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import scipy.constants
import pandas as pd
import seaborn as sns
import urllib
import zipfile

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-9-29b204b28f81> in <module>()
      5 import scipy.constants
      6 import pandas as pd
----> 7 import seaborn as sns
      8 import urllib
      9 import zipfile

ImportError: No module named 'seaborn'

a) Get the zip file containing the data, extract it and read the data to a DataFrame. We are interested in the Temperature (K) column.

In [44]:

urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', 'anage_dataset.zip')

Out[44]:

('anage_dataset.zip', <http.client.HTTPMessage at 0x86bd2b0>)

In [45]:

with zipfile.ZipFile('anage_dataset.zip') as z:
    f = z.open('anage_data.txt')
    data = pd.read_table(f)
assert data.shape == (4212, 31)

b) The temperatures are in Kelvin degrees and we like Celsius degrees, so use scipy.constants.K2C to transform the temperature to Celsius and save the result in a new column:

In [46]:

data['Temperature (C)'] = scipy.constants.K2C(data['Temperature (K)'])
assert data.shape == (4212, 32)

c) Plot a histogram of the temperature (in Celsius). Don't forget to use meaningful bins.

In [47]:

bins = np.linspace(0, 45, 45)
plt.hist(data['Temperature (C)'], bins=bins, lw=0)
plt.xlabel("Temperature (C)")
plt.ylabel("# Species");

d) Clean the data frame from rows with missing data in the temperature column You can find the rows that have missing data using np.isfinite which returns False for rows with infinite or NaN (missing) data.

In [48]:

data = data[np.isfinite(data['Temperature (C)'])]
assert data.shape == (494, 32)

e) Count how many species we have in the data frame for each animal Class. Remove from the data classes with less than 10 species (you can do this manually by specifiyng the class names or automatically using the count you calculated).

In [49]:

data.Class.value_counts()

Out[49]:

Mammalia    457
Amphibia     18
Reptilia     16
Aves          3
dtype: int64

In [50]:

data = data[data.Class != 'Aves']
assert data.shape == (491, 32)

f) Plot a separate histogram of the temperature for each animal Class. Use a faceted figure. Don't forget axis labels, proper bins, and a reasonable figure size.

In [51]:

g = sns.FacetGrid(data=data, col="Class", margin_titles=True, sharex=True, sharey=False, size=4.5)
g.map(plt.hist, 'Temperature (C)', bins=bins, lw=0)
g.set_ylabels("# Species");

g) Perform a t-test to verify that the temperature of mammals is, on average, larget then the temperature of Amphibians. Print the P-value of the t-test.

In [52]:

mammalia = data[data.Class=='Mammalia']['Temperature (C)']
amphibia = data[data.Class=='Amphibia']['Temperature (C)']
t,p = scipy.stats.ttest_ind(mammalia, amphibia, equal_var=False)
print("P-value:", p)

P-value: 1.86679756815e-16