Exam - 6.7.2015

Tel-Aviv University / 0411-3122 / Spring 2015

General instructions

Exam duration: two hours (9:00-11:00)
Allowed material:
- Full access to the internet.
- Personal laptops are allowed.
- Email, phones, SMS, and messaging are prohibited.
Answer all three questions, in the dedicated boxes within the notebook.
The expected outputs are included. Try to replicate them with your code.
Make sure you follow the instructions and generate the outputs exactly as described.
Exam submission: Submit your exam through Moodle, in the dedicated place.
You can change your submission as many times as you wish, within the exam time limit.
Before making your final submission, make sure all your solutions are there!

Good luck!

1) Distance between sequences

A common task in sequence analysis is to calculate the distance between two DNA sequences of equal lengths.
There are many ways to define this distance, but the simplest one is called the Hamming distance. It is defined as the number of positions at which two sequences of the same length differ. For example, the Hamming distance between AGGTCT and AGCTAT is 2. The distance between two identical sequences is 0.
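
The example above can be checked by hand with a few lines of Python. This is only a worked illustration of the definition, not the required function; the variable names are arbitrary.

```python
# Worked example of the definition: count positions where the sequences differ.
seq1 = 'AGGTCT'
seq2 = 'AGCTAT'

# The definition only applies to sequences of equal length.
assert len(seq1) == len(seq2), 'sequences must have the same length'

# zip pairs up the characters position by position.
mismatches = [(i, a, b) for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]
distance = len(mismatches)

print(mismatches)  # [(2, 'G', 'C'), (4, 'C', 'A')]
print(distance)    # 2
```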

a) Write a function that receives two strings, representing DNA sequences, and returns the Hamming distance between them as an integer. Use an assertion to verify that the sequences are of the same length.

In [3]:

def hamming_distance(seq1, seq2):
    """
    Calculates the Hamming distance of two DNA sequences, given as strings.
    Returns score as an integer.
    Input sequences must have the same length!
    """
    pass   
    
    
assert hamming_distance('AGGTCT', 'AGGTCT') == 0
assert hamming_distance('AGGTCT', 'AGCTAT') == 2

A more complex way of evaluating the distance between sequences is to use a cost matrix. Such a matrix describes the cost of each difference between the sequences. For example, a change from A to G may have a cost of 1, while a change from A to T may have a cost of 3. The distance is the sum of these costs over all positions, and is sometimes called the Sankoff distance. The file cost_matrix.txt contains such a matrix, in a tab-delimited format. Using this matrix, the Sankoff distance between AGGTCT and AGCTAT is 6.
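
To make the idea concrete, here is a small sketch that sums per-position costs using a hypothetical matrix. The matrix below is an assumption made up for illustration (transitions A<->G and C<->T cost 1, every other substitution costs 3); it agrees with the two example costs quoted above, but it is not taken from cost_matrix.txt and may differ from it.

```python
# HYPOTHETICAL cost matrix, not the contents of cost_matrix.txt:
# transitions (A<->G, C<->T) cost 1, all other substitutions cost 3,
# and leaving a nucleotide unchanged costs 0.
nucleotides = 'ACGT'
transitions = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}

toy_cost = {}
for a in nucleotides:
    for b in nucleotides:
        if a == b:
            toy_cost[(a, b)] = 0
        elif (a, b) in transitions:
            toy_cost[(a, b)] = 1
        else:
            toy_cost[(a, b)] = 3

seq1, seq2 = 'AGGTCT', 'AGCTAT'
assert len(seq1) == len(seq2), 'sequences must have the same length'

# Sum the cost of every aligned pair; identical pairs contribute 0.
total = sum(toy_cost[(a, b)] for a, b in zip(seq1, seq2))
print(total)  # 6  (G->C costs 3, C->A costs 3)
```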

b) Write a function that receives a path to a cost matrix file (using the format described above), parses it and stores the information in a data structure of your choice. The function will return this data structure. In this section, you are not allowed to import any modules!

In [11]:

def read_cost_matrix(filename):
    """
    Parses a cost matrix file.
    """
    pass   
    
    
mat = read_cost_matrix('cost_matrix.txt')
assert mat is not None
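
A minimal sketch of how a tab-delimited matrix could be parsed into a dict of dicts. The file layout is an assumption (a header line of column labels after an empty first field, then one labelled row per nucleotide), since the exam text only says "tab-delimited"; `parse_tab_matrix` and the demo values are hypothetical. The parser itself uses no imports; `os` and `tempfile` appear only to create a throwaway demo file.

```python
import os
import tempfile

def parse_tab_matrix(path):
    """Parse a tab-delimited matrix into a dict of dicts: costs[row][col]."""
    with open(path) as handle:
        # Drop blank lines and trailing newlines, but keep leading tabs.
        lines = [line.rstrip('\n') for line in handle if line.strip()]
    # ASSUMPTION: the first line holds column labels after an empty first field.
    columns = lines[0].split('\t')[1:]
    costs = {}
    for line in lines[1:]:
        fields = line.split('\t')
        row_label = fields[0]
        costs[row_label] = {col: int(val) for col, val in zip(columns, fields[1:])}
    return costs

# HYPOTHETICAL file contents, written to a temporary file for the demo.
demo_text = (
    '\tA\tC\tG\tT\n'
    'A\t0\t3\t1\t3\n'
    'C\t3\t0\t3\t1\n'
    'G\t1\t3\t0\t3\n'
    'T\t3\t1\t3\t0\n'
)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_matrix.txt')
    with open(demo_path, 'w') as handle:
        handle.write(demo_text)
    demo_costs = parse_tab_matrix(demo_path)

print(demo_costs['A']['G'])  # 1
print(demo_costs['A']['T'])  # 3
```

A dict of dicts keeps lookups readable (`costs[a][b]`); a dict keyed by `(a, b)` tuples would work equally well.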

c) Write a function that receives two strings, representing DNA sequences, and a cost matrix (formatted as returned by the function in section b), and returns the Sankoff distance between the sequences based on the given matrix as an integer. Use an assertion to verify that the sequences are of the same length.

In [12]:

def sankoff_distance(seq1, seq2, cost_mat):
    """
    Calculates the Sankoff distance of two DNA sequences, given as strings,
    based on a given cost matrix.
    Returns score as an integer.
    Input sequences must have the same length!
    """
    pass 
        
    
assert sankoff_distance('AGGTCT', 'AGGTCT', mat) == 0
assert sankoff_distance('AGGTCT', 'AGCTAT', mat) == 6

2) 16S Kink-turn

The Kink-turn (usually called K-turn) is a common structural motif found in many bacterial 16S rRNA sequences. It introduces a very tight kink into the axis of helical RNA. The motif occurs in sequences of the form CGRNNGANC (where R is A or G and N is any nucleotide).

a) Write a function that receives a list of GenBank accession IDs (as strings) and returns a list of SeqRecord objects, fetched from GenBank according to these accessions, like we did in lecture 6.

In [15]:

from Bio import Entrez, SeqIO
Entrez.email = 'A.N.Other@example.com'

def fetch_gb_records(gb_acc_list):
    """
    Receives a list of GB accessions as strings.
    Returns a list of the corresponding SeqRecords.
    """
    pass
    

bacteria_16s_accessions = ['EU014689','AJ578036','AF201899','NR_028978','EU118114','AM158979','AY773947','AJ697941','X81660','X83947']
bacteria_16s_records = fetch_gb_records(bacteria_16s_accessions)
assert len(bacteria_16s_records) == len(bacteria_16s_accessions)

b) Write a function that receives a list of SeqRecords and checks, for each sequence, whether it contains a certain motif, given as a pre-compiled regular expression. If it does, store the exact sequence of the motif (only the motif, not the whole sequence!) in a dict. The function returns a dict where the keys are the organism names and the values are the motif sequences, as strings. If a sequence does not include the motif, do not add it to the dict.
Complete the regex to scan the 16S sequences for K-turn motifs.

In [23]:

import re

def find_motiff_in_records(rec_list,motiff_regex):
    """
    Receives a list of SeqRecords and searches them for a motif, given as a regex.
    Returns a dictionary where the keys are the organism names and the values are the matched motifs.
    """
    pass

kink_turn_regex = re.compile(r'CG[AG][AGCT]{2}GA[AGCT]C')
kink_turn_motiffs_dict = find_motiff_in_records(bacteria_16s_records,kink_turn_regex)
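
The matching step can be sketched on plain strings, without Biopython. The organism names and sequences below are hypothetical, made up so that one contains the motif and one does not. With real SeqRecords the sequence would be `str(rec.seq)`, and for GenBank records the organism name is typically found in `rec.annotations['organism']`. Note that `search` returns only the first match in each sequence.

```python
import re

kink_turn = re.compile(r'CG[AG][AGCT]{2}GA[AGCT]C')

# HYPOTHETICAL data: plain strings standing in for SeqRecord sequences.
toy_sequences = {
    'organism_with_motif': 'TTACGATCGATCGG',
    'organism_without_motif': 'TTACGCTCGATCGG',
}

found = {}
for name, seq in toy_sequences.items():
    match = kink_turn.search(seq)
    if match:
        # group() returns only the matched substring, not the whole sequence.
        found[name] = match.group()

print(found)  # {'organism_with_motif': 'CGATCGATC'}
```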

3) Relation between litter size and birth weight

In this question we will look for a relation between litter or clutch size (number of offspring per reproductive cycle) and the birth weight (the weight of the offspring) in the animal kingdom.

For this analysis we will load the AnAge dataset that we used in lecture 7.

First, import the necessary libraries:

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy
import pandas as pd
import seaborn as sns
import urllib.request
import zipfile

a) Get the zip file containing the data, extract it and read the data into a DataFrame. We are interested in the Litter/Clutch size and Birth weight (g) columns.

In [11]:

Out[11]:

('anage_dataset.zip', <http.client.HTTPMessage at 0xb189d50>)

In [12]:

b) If you examined the data you might have noticed that some rows have a NaN value in our columns of interest. We need to remove these rows from the data. You can use np.isnan, np.isfinite or any other method you'd like.
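
As an illustration of the `np.isfinite` approach, the sketch below builds a boolean mask that is true only where both values are present. The arrays are hypothetical stand-ins for the two columns, not values from the AnAge dataset.

```python
import numpy as np

# HYPOTHETICAL values standing in for the two columns of interest.
litter = np.array([1.0, 2.0, np.nan, 4.0, 8.0])
weight = np.array([3000.0, np.nan, 50.0, 120.0, 5.0])

# Keep a row only if BOTH values are finite (not NaN and not infinite).
keep = np.isfinite(litter) & np.isfinite(weight)
litter_clean = litter[keep]
weight_clean = weight[keep]

print(litter_clean.tolist())  # [1.0, 4.0, 8.0]
print(weight_clean.tolist())  # [3000.0, 120.0, 5.0]
```

The same mask can be used to index a DataFrame's rows, so the two columns stay aligned after filtering.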

In [13]:

c) Plot a scatter plot of the data to examine it. Use the litter size on the x-axis. Don't forget the axis labels.

In [8]:

d) We are looking for a possible linear relationship between the variables. Apply a log transformation on the data (both columns) and plot a new scatter plot of the transformed data (don't forget the axis labels should change to reflect the transformation!).

In [10]:

e) Perform linear regression on the transformed data and print the intercept and slope of the regression.

In [30]:

intercept: 5.718, slope: -2.434
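
A minimal sketch of log-transforming two variables and fitting a line, on hypothetical data. The exam does not say which routine to use; `np.polyfit` with degree 1 is an assumption here (`scipy.stats.linregress` is another option), and the natural logarithm is assumed for the transformation. The toy data lie exactly on a known line so the fit can be checked by eye.

```python
import numpy as np

# HYPOTHETICAL data lying exactly on log(y) = 2 - 3 * log(x).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.exp(2.0 - 3.0 * np.log(x))

# Transform both variables (natural log assumed).
log_x = np.log(x)
log_y = np.log(y)

# A degree-1 polynomial fit returns the coefficients as [slope, intercept].
slope, intercept = np.polyfit(log_x, log_y, 1)
print('intercept: {:.3f}, slope: {:.3f}'.format(intercept, slope))
# intercept: 2.000, slope: -3.000
```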

f) Plot a scatter plot of the transformed data together with the linear regression line.

In [23]:

Out[23]:

[<matplotlib.lines.Line2D at 0x7c0a8d0>]

In [24]:

Out[24]:

(-5, 15)

g) Predict the birth weight of offspring in a litter with 10 offspring (don't forget the transformation!):

In [35]:

In a litter with 10 offspring, the birth weight will be 1.11964209943 grams
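
The arithmetic behind this prediction can be sketched with the rounded coefficients printed in section e. The natural logarithm is an assumption, chosen because it gives a value close to the expected output; since the coefficients are rounded to three decimals, the result differs slightly from the value above.

```python
import math

# Rounded coefficients from the expected output of section e.
intercept = 5.718
slope = -2.434
litter_size = 10

# Predict on the log scale, then undo the (assumed natural) log transformation.
log_weight = intercept + slope * math.log(litter_size)
predicted_weight = math.exp(log_weight)

print(round(predicted_weight, 2))  # 1.12
```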