In molecular biology and genetics, GC-content is the percentage of bases on a DNA molecule that are either G or C (from a possibility of four different ones, also including A and T).
In this question we will want to calculate the GC content of a DNA sequence in a way that will allow us to see changes in GC content along the sequence.
For this we will write a function that goes over a sequence and at each position k
in the sequence it calculates the GC content of a small overlapping subseqeunce starting at position k
. We call this subsequence a window and so we are calculating the GC content using a sliding window.
a) Write a function calc_GC_content
. The arguments are seq
, a string of the characters AGCT
representing a DNA sequence; and win_size
, an int specifiying the number of bases in the sliding window. The function will return a list
of float
s (each between 0 and 1). In this list
, the float
in index k
will represent the GC-content of a subsequence of seq
of length win_size
starting at k
.
def calc_GC_content(seq, win_size=10):
gc_values = []
for i in range(len(seq) - win_size):
sub_seq = seq[i:i+win_size]
num_GC = sub_seq.count('G') + sub_seq.count('C')
value = num_GC/win_size
gc_values.append(value)
return gc_values
dna_seq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'
gc_values = calc_GC_content( dna_seq )
assert gc_values[:10] == [0.5, 0.5, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.6]
b) Now open the file lec4_files/camelus.fasta
, read it as a text file, and save the last line which has the __camelus cytochrome b_ sequence into a variable called dna_seq
. Then calculate the GC content by calling calc_GC_content
with dna_seq
and a windows size of 100.
with open("lec4_files/camelus.fasta") as f:
dna_seq = f.readlines()[-1]
gc_values = calc_GC_content(dna_seq, win_size=100)
assert gc_values[:10] == [0.37, 0.38, 0.39, 0.38, 0.39, 0.38, 0.39, 0.39, 0.39, 0.39]
c) Next, plot the GC-content of the camelus cytochrome b sequence. Don't forget axis labels!
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.plot(gc_values)
plt.xlabel('DNA position')
plt.ylabel('GC content (%)');
In this question we will examin the ratio of coding and non-coding regions in different nucleotide sequences.
a) The strings in mitochondrial_acc
represent GenBank accessions of complete mitochondrial genomes. Use the list to fetch each record from GB, like we did in lecture 6, and store them in a list, as SeqRecord
objects.
from Bio import SeqIO
from Bio import Entrez
import csv
Entrez.email = 'A.N.Other@example.com'
mitochondrial_acc = ['AJ489607','U03843','Y12025','AP008824','AB080276','AB079613']
mitochondrial_records = []
for acc in mitochondrial_acc:
entrez_obj = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=acc)
record = SeqIO.read(entrez_obj, 'gb')
mitochondrial_records.append(record)
assert len(mitochondrial_records) == 6
b) Write a function that receives a SeqRecord
object and returns the fraction of coding DNA in this record. Coding regions are the features of the record with type gene
. Note that there could be many separate coding regions within a record. The function should return a float
, representing the number of nucleotides belonging to coding regions. For example, if the whole sequence is 10,000 bp long, and there are 3 coding regions, each of 1,000 bp, than the function should return 0.3
.
def coding_fraction(gb_record):
coding = 0
for feat in gb_record.features:
if feat.type == 'gene':
coding += len(feat)
return coding / len(record.seq)
assert 0.899 < coding_fraction(mitochondrial_records[0]) < 0.9
for x in range(6):
print(coding_fraction(mitochondrial_records[x]))
0.8990921787709497 0.7622206703910615 0.8923999068901304 0.6632914338919925 0.2989408752327747 0.6619529795158287
c) Use the records from section a and the function from section b to compare the fraction of coding DNA between the mitochondrial sequences of different organisms.
Write the results to a CSV file (Q2b.csv
). The file should have two columns: organism
and fraction_coding
, and each row should represent the result for one organism (one record).
You can use the provided code to plot the results and check your answer.
with open('Q2b.csv', 'w', newline='') as fo:
writer = csv.writer(fo)
writer.writerow(['organism','fraction_coding'])
for rec in mitochondrial_records:
organism = rec.annotations['organism']
rec_coding_fraction = coding_fraction(rec)
writer.writerow([organism,rec_coding_fraction])
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Q2b.csv")
df.plot(x='organism',y='fraction_coding',kind="bar", legend=False)
plt.xlabel('organism')
plt.ylabel('Coding fraction')
<matplotlib.text.Text at 0xb88cf60>
In this analysis we will compare the body temperature of animals to check if indeed there is such a thing as warm-blooded and cold-blooded animals.
For this analysis we will load the AnAge dataset that we used in lecture 7.
First, import the neccesary libraries:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
import scipy.constants
import pandas as pd
import seaborn as sns
import urllib
import zipfile
--------------------------------------------------------------------------- ImportError Traceback (most recent call last) <ipython-input-9-29b204b28f81> in <module>() 5 import scipy.constants 6 import pandas as pd ----> 7 import seaborn as sns 8 import urllib 9 import zipfile ImportError: No module named 'seaborn'
a) Get the zip file containing the data, extract it and read the data to a DataFrame
. We are interested in the Temperature (K)
column.
urllib.request.urlretrieve('http://genomics.senescence.info/species/dataset.zip', 'anage_dataset.zip')
('anage_dataset.zip', <http.client.HTTPMessage at 0x86bd2b0>)
with zipfile.ZipFile('anage_dataset.zip') as z:
f = z.open('anage_data.txt')
data = pd.read_table(f)
assert data.shape == (4212, 31)
b) The temperatures are in Kelvin degrees and we like Celsius degrees, so use scipy.constants.K2C
to transform the temperature to Celsius and save the result in a new column:
data['Temperature (C)'] = scipy.constants.K2C(data['Temperature (K)'])
assert data.shape == (4212, 32)
c) Plot a histogram of the temperature (in Celsius). Don't forget to use meaningful bins
.
bins = np.linspace(0, 45, 45)
plt.hist(data['Temperature (C)'], bins=bins, lw=0)
plt.xlabel("Temperature (C)")
plt.ylabel("# Species");
d) Clean the data frame from rows with missing data in the temperature column You can find the rows that have missing data using np.isfinite
which returns False
for rows with infinite or NaN (missing) data.
data = data[np.isfinite(data['Temperature (C)'])]
assert data.shape == (494, 32)
e) Count how many species we have in the data frame for each animal Class. Remove from the data classes with less than 10 species (you can do this manually by specifiyng the class names or automatically using the count you calculated).
data.Class.value_counts()
Mammalia 457 Amphibia 18 Reptilia 16 Aves 3 dtype: int64
data = data[data.Class != 'Aves']
assert data.shape == (491, 32)
f) Plot a separate histogram of the temperature for each animal Class. Use a faceted figure. Don't forget axis labels, proper bins, and a reasonable figure size.
g = sns.FacetGrid(data=data, col="Class", margin_titles=True, sharex=True, sharey=False, size=4.5)
g.map(plt.hist, 'Temperature (C)', bins=bins, lw=0)
g.set_ylabels("# Species");
g) Perform a t-test to verify that the temperature of mammals is, on average, larget then the temperature of Amphibians. Print the P-value of the t-test.
mammalia = data[data.Class=='Mammalia']['Temperature (C)']
amphibia = data[data.Class=='Amphibia']['Temperature (C)']
t,p = scipy.stats.ttest_ind(mammalia, amphibia, equal_var=False)
print("P-value:", p)
P-value: 1.86679756815e-16