Notebook

Homework 7¶

The cell below contains the functions necessary to parse a vcf file. The only function that needs to be called in order to return the parsed data file is parse_data_file. The input to this function is a vcf file handle (i.e., an open vcf file), and the output to parse_data_file will be a list of lists, for example:

[[('10', '89623323', 'G', 'A', [('HG00096',[0,0]), ('HG00097', [0,0]), ('HG00099', [0,0]), ('HG00100', [0,0]), ('HG00101', [0,0])])], [('10', '89626323', 'T', 'A', [('HG00096',[0,0]), ('HG00097', [0,0]), ('HG00099', [0,0]), ('HG00100', [0,0]), ('HG00101', [0,0])])]]

The first entry in the outer list will be details for the first snp in the vcf file, the second entry in the outer list will be details on the second snp in the file, and so on.

In each inner list, the entries are as follows:

the chromosome number
the location on the chromosome
the base found at this position in the first allele,
the base found at this position in the second allele,
a list of lists, where each netry contains an individual id, followed by their genotype.

I strongly recommend assigning the output of this function to a variable name such as snps. You can then access the items in the snps. For example, ``snp[0][4][0][0] == 'HG00096'`.

IMPORTANT: Do not edit anything in the next cell. You will call the parse_data_file function from within your code.

In [1]:

from random import choice, sample, randint, randrange
from __future__ import division

def process_data_entry_line (line, ids):
    '''for a line in the vcf file this will return a list of the information 
       in that line, see the return statement'''
    fields = line.split('\t') 
    chr_n = int(fields[0])
    pos_n = int(fields[1])
    ref = fields[3]
    alt = fields[4]
    indiv_ids = zip(ids, indiv_snp_variation(line))
    return (chr_n, pos_n, ref, alt, indiv_ids)


#Creates a list of allelic variation for each SNP. The list itself contains one entry of two numbers for each individual. 
def indiv_snp_variation(line):
    '''This function takes a single line from a vcf file and returns a list of the 
       genotypes of the given snp of each individual''' 
    l = line.split('\t')
    list = l[9:]
    variation_list = [] 
    for i in list: 
        variation_list.append(map(int,i.split(':')[0].split('|')))
    return variation_list

def parse_data_file(f):
    '''This function takes a filepath for a vcf file and returns a list of each snp with the 
       genotype of each individual in a sublist'''
    result = []
    for line in f:
        if line.startswith('##'):
            pass
        elif line.startswith('#CHROM'): 
            ids = line.rstrip().split('\t')[9:]
        else:
            result.append(process_data_entry_line(line, ids))
    return result

Example usage of parse_data_file

snps = parse_data_file(open('f.vcf','U'))

snps

[('10', '89626323', 'T', 'A', [('HG00096',[0,0]), ('HG00097', [0,0]), ('HG00099', [0,0]), ('HG00100', [0,0]), ('HG00101', [0,0])])]

Question 4¶

You will define a function that will then be used to answer questions 5 and 6. Do not change the inputs and output of this function - add your code in place of #do some work.

Hint: you may want to write a for loop that calls get_gene_stats for a list of all of the gene regions you are interested in so you only have call it once.

In [2]:

def get_gene_stats(vcf, gene_coordinates):
    '''This function should take a vcf file and a list of gene coordinates i.e. [(89653782, 89653798), ...]. The function should return 
     the total number of snps in the given region, and the snp frequency (snps/base).'''
    #do some work
    return total_snps, snp_freqs

Example input and output for question 4.

After you've written your function, you should test with the following to confirm that it is working as expected.

print get_gene_stats(open('f.vcf','U'), (89624636, 89655450))

(352, 0.0114)

Question 7¶

IMPORTANT: Do not edit anything in the next cell. You will call these functions from within your code.

To answer this question you will need to call fraction_better_or_equivalent_hwe on the SNP of interest. See below for example usage.

In [6]:

def create_new_population(pop_size, genotypes):
    '''create a list of all alleles. The list is the size of the population'''
    result = []
    for i in range(pop_size): 
        genotype = choice(genotypes)
        result.append(genotype)
    return result
   
def create_new_generation(population, pop_size, genotypes):
    '''create a new generation. This will be the same size as the original 
       population'''
    males = sample(population, (int(len(population)/2)))
    result = []
    for male in males:
        population.remove(male)
    while len(result) < pop_size:
        new_gene = (choice(choice(males)), choice(choice(population)))
        if new_gene == (genotypes[1][1], genotypes[1][0]):
            result.append((genotypes[1]))
        else:
            result.append((new_gene))
    return result

def count_generations(pop_size, alleles):
    '''count the number of generations until one allele is fixed'''
    genotypes = [(alleles[0], alleles[0]), (alleles[0], alleles[1]), (alleles[1], alleles[1])]
    population = create_new_population(pop_size, genotypes)
    c = 0
    while genotypes[1] in population: 
        pop1 = create_new_generation(population, pop_size, genotypes)
        population = pop1
#        print pop1
        c +=1
    return c
    
def count_allele_frequency(line):
    '''This counts the number of homozygotes and heterozygotes for a given snp'''
    hetero = 0
    ref = 0
    alt = 0
    for i in line[4]:
        if i[1] == [0,1] or i[1] == [1, 0]:
             hetero += 1
        elif i[1] == [0, 0]:
            ref += 1
        elif i[1] == [1, 1]:
            alt += 1
    return ref, alt, hetero

def calculate_hwe(pop_size, alleles, generations):
    '''This takes a given population size and the number of generations and returns 
       the predicted number of heterozygotes divided by the actual number present in the last 
       generation. A population in complete equilibrium would return a value of 1'''
    genotypes = [(alleles[0], alleles[0]), (alleles[0], alleles[1]), (alleles[1], alleles[1])]
    population = create_new_population(pop_size, genotypes)
    for i in range(generations):
        new_pop = create_new_generation(population, pop_size, genotypes)
        population = new_pop
    p = (2*population.count(genotypes[0]) + (population.count(genotypes[1])))/(2*len(population))
    q = p - 1
    pq2 = ((population.count(genotypes[1])))/(len(population))
    hetero = (2*p*q)
    return pq2/hetero

def fraction_better_or_equivalent_hwe(vcf, alleles, replications, generations, snp=None):
    '''This file compares the ratio of predicted to actual heterozygotes in a vcf file for
       a given snp. It then compares this value to an artificial distribution based on a 
       number of iterations of the calculate_hwe function. The return is a p-value for a 
       null hypothesis that the snp in question is in hardy weinberg equilibrium.'''
    snps = parse_data_file(vcf)
    pop_len = len(snps[0][4])
    result = []
    total_snps = []
#     If a snp is identified the function will return only the p-value for that snp 
#     otherwise it will return a p-value for every snp
    for i in snps:
        if snp == None:
            total_snps.append(i)
        elif snp == int(i[1]):
            total_snps.append(i)
        else: 
            pass
    for i in total_snps:
        snp = i[1]
        ref, alt, hetero = count_allele_frequency(i)
        p = ((2*ref) + hetero)/(2*pop_len)
        q = 1 - p
        pq2 = hetero/pop_len
        predicted_2pq = (2*p*q)
        if predicted_2pq == 0:
            pass
        else: 
            ratio_actual = pq2/predicted_2pq
            count = 0
            for i in range(replications):
                if (1 - abs(ratio_actual)) >= (1-abs(calculate_hwe(pop_len, alleles, generations))):
                    count += 1 
            if count == 0:
                result.append((snp, 1))
            else:
                result.append((snp, round((1 -(count/replications)), 10)))
    else:
        pass
    return result

Example usage for question 7

print fraction_better_or_equivalent_hwe(f.vcf, ('A', 'B'), 1000, 3, 89627667)

[(89627667, 0.534)]

To answer question 7, 1000 replications and 2 generations should be used

Hints:

p + q = 1

P2 + 2pq + q = 1

Think about how the estimated fraction is being created. Then think about what that is being compared to. Look at the code to help understand this.

Credits¶

This assignment was created by John Chase (chasejohnh@gmail.com) with guidance from Greg Caporaso for BIO/CS 299 Introduction to Bioinformatics at Northern Arizona University. As with all of our course materials, this is licensed under Creative Commons Attribution Only.