Example of reproducible analysis pipeline¶

We are going to reproduce a couple of figures (2A and 8A) of the following paper:

In [1]:

from IPython.display import IFrame
url = "http://www.ncbi.nlm.nih.gov/pubmed/22219510"
IFrame(url, 800, 400)

Out[1]:

I select this paper because:

The article is open access
The reference organism is yeast (12Mb)
The data are freely available
I know it very well...

Required software:¶

SRA: The Sequence Read Archive (SRA) stores raw sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD® System, Helicos Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®. In the analysis version: 2.3.5-2 was used.
Bowtie: Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end). In the analysis version 1.0.1 was used.

Required Python libraries:¶

This is extracted from the command pip freeze. On the right of each library there is the version I have been using in the analysis:

ipython==2.0.0
numpy==1.8.1
matplotlib==1.3.1
pandas==0.13.1
Cython==0.20.1
weblogo==3.3
patsy==0.2.1
scipy==0.13.3

Import modules¶

In [2]:

%pylab inline
import pandas as pd
import tarfile
import urllib
import os
import weblogolib
import random

Populating the interactive namespace from numpy and matplotlib

Define the path of the required software¶

In [20]:

SRA = "bin/fastq-dump"
BOWTIE = "bin"

PIPELINE¶

Download the data¶

In [4]:

# Download one of the set of raw reads of the paper
sra_url = "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/" + \
            "SRP%2FSRP009%2FSRP009462/SRR384975/SRR384975.sra"
sra_file = "reads/SRR384975.sra"
# Download the yeast genomic sequences. The release used in this analysis is: R64-1-1
yeast_url = "http://downloads.yeastgenome.org/sequence/S288C_reference/" + \
            "genome_releases/S288C_reference_genome_Current_Release.tgz"
yeast_file = "data/yeast_genome.tgz"
genome_file = "data/yeast_genome.fa"
# Download the yeast genomic annotations
features_url = "http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab"
features_file = "data/yeast_features.tsv"

In [5]:

# Create the directories "reads" and "data"
folders = ["data", "reads", "index", "sequences", "figures"]
[os.mkdir(folder) for folder in folders if not os.path.isdir(folder)]

Out[5]:

[None, None, None, None, None]

In [6]:

%%timeit -n 1 -r 1
# Download the files and save it into the local directory
urllib.urlretrieve(sra_url, sra_file)
urllib.urlretrieve(yeast_url, yeast_file)
urllib.urlretrieve(features_url, features_file)

1 loops, best of 1: 13min 50s per loop

In [7]:

!ls

analysis.ipynb data           index          sequences
bin            figures        reads

Extract the SRA file¶

In [8]:

!time $SRA $sra_file --outdir reads

Read 7990112 spots for reads/SRR384975.sra
Written 7990112 spots for reads/SRR384975.sra

real	0m53.784s
user	0m51.446s
sys	0m1.265s

Extract the sequence of the yeast genome¶

In [9]:

tar = tarfile.open(yeast_file)
tar.getnames()

Out[9]:

['S288C_reference_genome_R64-1-1_20110203',
 'S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa',
 'S288C_reference_genome_R64-1-1_20110203/gene_association_R64-1-1_20110205.sgd',
 'S288C_reference_genome_R64-1-1_20110203/saccharomyces_cerevisiae_R64-1-1_20110208.gff',
 'S288C_reference_genome_R64-1-1_20110203/other_features_genomic_R64-1-1_20110203.fasta',
 'S288C_reference_genome_R64-1-1_20110203/rna_coding_R64-1-1_20110203.fasta',
 'S288C_reference_genome_R64-1-1_20110203/NotFeature_R64-1-1_20110203.fasta',
 'S288C_reference_genome_R64-1-1_20110203/orf_trans_all_R64-1-1_20110203.fasta',
 'S288C_reference_genome_R64-1-1_20110203/orf_coding_all_R64-1-1_20110203.fasta']

The right file to extract is S288C_reference_sequence_R64-1-1_20110203.fsa

In [10]:

# Extract the file
tar.extract('S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa',
            'data')

In [11]:

!ls data/

S288C_reference_genome_R64-1-1_20110203 yeast_genome.tgz
yeast_features.tsv

In [12]:

# Rename the reference_sequence file
!mv data/S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa $genome_file
!rm -r data/S288C_reference_genome_R64-1-1_20110203/
!rm $yeast_file

In [13]:

!ls data/

yeast_features.tsv yeast_genome.fa

Build a bowtie index¶

In [14]:

index_name = "index/yeast_genome"
!$BOWTIE/bowtie-build -q $genome_file $index_name

Trim and align the reads to the yeast genome¶

The reads should start with the Ty1 LTR (Long Terminal Repeat), so I check for its presence. I also get rid of reads with undefined (N) nucleotides. In the alignment step, in order to align only the genomic sequence downstream the insertion site, I trim the first 10bp.

In [46]:

# Reads should start with the Ty1 LTR ("TATT"). 
# I also discard those reads that have ambiguous nucleotides.
fastq_file = os.path.splitext(sra_file)[0] + ".fastq"
trimmed_file = os.path.splitext(sra_file)[0] + ".trimmed"

LTR = "TATT"
MAX_MISMATCHES = 0

output = file(trimmed_file, 'w')

class ReadFastq(object):
    def __init__(self, line):
        self.lines = [line]
        self.output = ""
    
    def add_line(self, line):
        self.lines.append(line)        
        
    def get_output(self):
        if self.lines[1].startswith(LTR) and self.lines[1].count("N") <= MAX_MISMATCHES:
            self.output = "".join(self.lines)
            return True
        else:
            return False
        
for n, line in enumerate(file(fastq_file, 'r')):
    if n % 4 == 0:
        if n != 0 and obj.get_output():
            output.write(obj.output)
        obj = ReadFastq(line)
    else:
        obj.add_line(line)
            
output.close()

In [49]:

# I align the reads to the yeast genome with Bowtie
# I trim the first 10bp of the reads and then I require 
# unique (-m 1) and perfect (-v 0) alignments.
mapped_file = os.path.splitext(sra_file)[0] + ".mapped"
!$BOWTIE/bowtie -t $index_name -m 1 -v 0 -5 10 -p 2 -q $trimmed_file $mapped_file

Time loading forward index: 00:00:00
Time for 0-mismatch search: 00:00:29
# reads processed: 7504175
# reads with at least one reported alignment: 1187781 (15.83%)
# reads that failed to align: 5885730 (78.43%)
# reads with alignments suppressed due to -m: 430664 (5.74%)
Reported 1187781 alignments to 1 output stream(s)
Time searching: 00:00:29
Overall time: 00:00:29

Get rid of potential clonal reads¶

In [54]:

unique_file = mapped_file + ".unique"
%timeit -n 1 -r 1 !cut -f2-5 $mapped_file | sort -u > $unique_file

1 loops, best of 1: 47.5 s per loop

In [55]:

!head $unique_file

+	ref|NC_001133|	101400	GCATTATCTATCATTATAAATTCTTTTATT
+	ref|NC_001133|	11050	TAAAGTCACTGAGATATTAGAGGTTATAAA
+	ref|NC_001133|	11333	AAACTCTATGTAAACACTTATTTTATTGTG
+	ref|NC_001133|	115808	CCTCTCAGATACCACGATGCATAAGGCTCA
+	ref|NC_001133|	118483	ATATGTGTATATATACATAGGTTAGTATGT
+	ref|NC_001133|	118488	TGTATATATACATAGGTTAGTATGTATAGC
+	ref|NC_001133|	125037	AGGGTAGAAGCCATGAAAAAACTAGATACC
+	ref|NC_001133|	125957	GATCTTACCTGCTGTGCAGAATTTGAGTAT
+	ref|NC_001133|	127779	GATCTCAGTACTCGCATTCTAGCGTATGTT
+	ref|NC_001133|	1289	ATTTCTAGTTACAGTTACACAAAAAACTAT

Extract the regions upstream and downstream the Ty1 insertion sites¶

In [56]:

# Load the yeast genome in memory
dict_genome = {}
sequence = ""
for line in file(genome_file, 'r'):
    if line.startswith(">"):
        if sequence != "":
            dict_genome[head] = sequence.upper()
        head = line.split()[0][1:]
        sequence = ""
        continue
    sequence += line.strip()
if sequence != "":
    dict_genome[head] = sequence.upper()
    

In [57]:

# Define a class to read the alignment file
class ReadAlignment(object):
    def __init__(self, line):
        self.line = line.strip().split("\t")
        self.strand = self.line[0]
        self.chromosome = self.line[1]
        self.position = int(self.line[2])

In [58]:

# Define a function to extract the genomic sequences
# upstream and downstream a given position.
def reverse_complement(sequence):
    d = {"A":"T", "T":"A", "C":"G", "G":"C", "N":"N"}
    sequence = sequence[::-1]
    return "".join([d[i] for i in sequence])
    
def get_sequences(input_file, output_file, window=10):
    output = file(output_file, 'w')
    for n, line in enumerate(file(input_file, 'r')):
        obj = ReadAlignment(line)
        if obj.strand == "+":
            position = obj.position + 3
        else:
            position = obj.position - 3
        sequence = dict_genome[obj.chromosome][position - window: position + window]
        if obj.strand == "-":
            sequence = reverse_complement(sequence)
        output.write(">sequence_{}\n{}\n".format(n, sequence))
    output.close()

In [59]:

# Extract sequences of 10bp and 1Kb
get_sequences(input_file=unique_file, output_file="sequences/fasta_10bp.fa")
get_sequences(input_file=unique_file, output_file="sequences/fasta_1kb.fa", window=1000)

Seqlogos¶

First we will check the presence of the insertion site pattern. We will do a seqlogo of the flanking regions of the Ty1 insertion sites.

In [63]:

def do_seqlogo(input_fasta, output_seqlogo):
    fin = open(input_fasta)
    seqs = weblogolib.read_seq_data(fin) 
    data = weblogolib.LogoData.from_seqs(seqs)
    options = weblogolib.LogoOptions()
    options.yaxis_scale = 0.2
    options.yaxis_tic_interval  = 0.2
    format_logo = weblogolib.LogoFormat(data, options)
    fout = open(output_seqlogo, 'w') 
    weblogolib.png_formatter(data, format_logo, fout)
    fout.close()

In [64]:

# Create the seqlogos
do_seqlogo("sequences/fasta_10bp.fa", "figures/fasta_10bp.png")

To verify that the above seqlogo is meaninful, we can compare it with a seqlogo obtained with 10000 random sequences.

In [65]:

# Select 10000 random position in the yeast genome
N = 10000
genome_length = sum([len(i) for i in dict_genome.values()])
sample = random.sample(xrange(1, genome_length + 1), N)

# This function map a random number to a chromosome and position in the yeast genome
def get_coordinates(number):
    for chromosome, sequence in dict_genome.items():
        if (number - len(sequence)) > 0:
            number -= len(sequence)
        # Do not consider position that are too close to the start/end of the chromosome
        elif number < 10 or number > len(sequence) - 10:
            continue
        else:
            return (chromosome, number)
  
# Open the output file
random_file = "reads/random.mapped"
output = file(random_file, "w")

for n in sample:
    chromosome, position = get_coordinates(n) 
    output.write("+\t{}\t{}\n".format(chromosome, position))
    
# Close the output file
output.close()

In [66]:

# Extract sequences of 10bp and 1Kb
get_sequences(input_file=random_file, output_file="sequences/random_10bp.fa")

# Create the seqlogos
do_seqlogo("sequences/random_10bp.fa", "figures/random_10bp.png")

Ty1 insertions around tRNA¶

Second, we will verify that the Ty1 retrotrnsposons have a strong preference for inserting upstream tRNA genes.
First of all lets look at the structure of the features file.

In [67]:

# Create a dataframe with the yeast features
df = pd.read_csv(features_file, sep="\t", header=None)

In [68]:

df.head(3)

Out[68]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
0	S000028864	telomeric_repeat	NaN	TEL01L-TR	NaN	NaN	chromosome 1	NaN	1	62	1	C	NaN	2003-09-09	2003-09-09	Terminal telomeric repeats on the left arm of ...
1	S000002143	ORF	Dubious	YAL069W	NaN	NaN	chromosome 1	NaN	1	335	649	W	NaN	1996-07-31	1996-07-31	Dubious open reading frame; unlikely to encode...
2	S000031098	CDS	NaN	NaN	NaN	NaN	YAL069W	NaN	1	335	649	W	NaN	1996-07-31	1996-07-31	NaN

3 rows × 16 columns

The feature column is the second. Lets look at which features are in the file

In [69]:

np.unique(df[1])

Out[69]:

array(['ARS', 'ARS consensus sequence', 'CDEI', 'CDEII', 'CDEIII', 'CDS',
       'ORF', 'W_region', 'X_element_combinatorial_repeats',
       'X_element_core_sequence', 'X_region', "Y'_element", 'Y_region',
       'Z1_region', 'Z2_region', 'binding_site', 'centromere',
       'external_transcribed_spacer_region', 'five_prime_UTR_intron',
       'gene_cassette', 'insertion', 'internal_transcribed_spacer_region',
       'intron', 'long_terminal_repeat', 'mating_locus', 'multigene locus',
       'ncRNA', 'non_transcribed_region', 'noncoding_exon',
       'not in systematic sequence of S288C', 'not physically mapped',
       'plus_1_translational_frameshift', 'pseudogene', 'rRNA',
       'repeat_region', 'retrotransposon', 'snRNA', 'snoRNA', 'tRNA',
       'telomere', 'telomeric_repeat', 'transposable_element_gene'], 
      dtype='|S35')

We will extract the the rows that contain the tRNA feature

In [70]:

df_trna = df[df[1] == "tRNA"]
trna = pd.DataFrame({"Chromosome": df_trna[8], "Start": df_trna[9], 
                     "End": df_trna[10], "Strand": df_trna[11]})

The chromosomes in the features file are named from 1 to 17 but in the mapped file have different names. Here we create a dictionary that convert one format into the other.

In [71]:

dict_chromosomes = {"NC_001133": "1", "NC_001134": "2", "NC_001135": "3", 
                    "NC_001136": "4", "NC_001137": "5", "NC_001138": "6", 
                    "NC_001139": "7", "NC_001140": "8", "NC_001141": "9", 
                    "NC_001142": "10", "NC_001143": "11", "NC_001144": "12", 
                    "NC_001145": "13", "NC_001146": "14", "NC_001147": "15", 
                    "NC_001148": "16", "NC_001224": "17"}

Lets look at the first lines of the dataframe:

In [72]:

trna.head()

Out[72]:

	Chromosome	End	Start	Strand
176	1	139254	139152	W
222	1	166339	166267	W
239	1	181254	181141	W
243	1	182522	182603	C
376	10	59100	59172	C

5 rows × 4 columns

The strand column uses W (Whatson) and C (Crick) to indicate the positive and the negative strands. Lets use + and - instead.

In [73]:

trna.Strand = trna.Strand.apply(lambda x: "+" if x == "W" else "-")

In [74]:

# Look at the first lines of the dataframe
trna.head()

Out[74]:

	Chromosome	End	Start	Strand
176	1	139254	139152	+
222	1	166339	166267	+
239	1	181254	181141	+
243	1	182522	182603	-
376	10	59100	59172	-

5 rows × 4 columns

Now we create a dataframe with the Ty1 insertion positions.

In [75]:

insertions = []
for line in file(unique_file, 'r'):
    obj = ReadAlignment(line)
    chromosome = dict_chromosomes[obj.chromosome.split("|")[1]]
    insertions.append([chromosome, obj.position, obj.strand])
    
# We create a dataframew ith the Ty1 insertion positions.
insertions = pd.DataFrame(insertions, columns=["Chromosome", "Position", "Strand"])

In [76]:

# Look at the first lines of the dataframe
insertions.head()

Out[76]:

	Chromosome	Position	Strand
0	1	101400	+
1	1	11050	+
2	1	11333	+
3	1	115808	+
4	1	118483	+

5 rows × 3 columns

We calculate the distances of the insertions 1Kb upstream and downstream tRNA genes.

In [77]:

def calculate_distances(feature, insertions, window=1000):
    
    feature_grouped = feature.groupby(["Chromosome"])
    insertions_grouped = insertions.groupby(["Chromosome"])
    insertions_chromosomes = np.unique(insertions.Chromosome)

    distances_from_feature = []
    for chromosome, feature_group in feature_grouped:
        if chromosome not in insertions_chromosomes:
            continue
        insertions_group = insertions_grouped.get_group(chromosome)
        for n, line in feature_group.iterrows():
            start, end, strand = line[1:]
            insertions_around = insertions_group[(insertions_group.Position >= start - window) &
                                                 (insertions_group.Position <= start + window)]
            if len(insertions_around) == 0:
                continue
            # Calculate the distance of the insertions from the feature
            distances = insertions_around.Position.apply(lambda x: x - start)
            # Correct for the strand of the feature
            distances = distances.apply(lambda x: x * -1 if strand == "-" else x)
            distances_from_feature.extend(distances)
    return distances_from_feature

In [78]:

distances = calculate_distances(trna, insertions)

Represent with a histogram the distribution of the Ty1 insertions aroung tRNA TSS

In [79]:

figure = plt.figure(figsize=(10, 10))

ax = plt.subplot2grid((1, 1), (0, 0))
n, bins, patches = ax.hist(distances, bins=100, normed=1, alpha=0.6)
ax.set_xlabel("Distance from tRNA TSS", fontsize=14)
ax.plot([0, 0], [0, ax.axis()[3]], 'k--', lw=2, alpha=0.5)

# Fit polynomial of orders 20 using the Numpy "polyfit" function.
xnew = np.linspace(min(bins), max(bins), 100)
# 30th order polynomial fit
fit = np.polyfit(xnew, n, 30)
# Create a functions from the fit
f = np.poly1d(fit)
plt.plot(xnew, f(xnew),'-', color="red", linewidth=3)
plt.ylim(0, ax.get_ylim()[1])

plt.show()

We got a peak of insertions just upstream the tRNA TSS, as expected. This is the position where the nucleosome is located.
Also notice that there are other two smaller peaks that overlap with the positions of other nucleosomes.

Now we can check if this happens with every TSS. So first we extract the rows of the features file that contain the ORF feature and are not marked as Dubious.

In [80]:

df_orf = df[(df[1] == "ORF") & (df[2] != "Dubious")]
orf = pd.DataFrame({"Chromosome": df_orf[8], "Start": df_orf[9], 
                    "End": df_orf[10], "Strand": df_orf[11]})
orf.Strand = orf.Strand.apply(lambda x: "+" if x == "W" else "-")

In [81]:

# Look at the first lines of the dataframe
orf.head()

Out[81]:

	Chromosome	End	Start	Strand
11	1	1807	2169	-
13	1	2707	2480	+
16	1	7235	9016	-
20	1	11565	11951	-
22	1	12426	12046	+

5 rows × 4 columns

Calculate the distances of the insertions from the ORF TSS

In [82]:

distances = calculate_distances(orf, insertions)

Represent with a histogram the distribution of the Ty1 insertions aroung ORF TSS.
In this case we use the module xkcd. This library is useful when you want to show an expected result vs the observed result.

In [83]:

with plt.xkcd():
    figure = plt.figure(figsize=(10, 10))

    ax = plt.subplot2grid((1, 1), (0, 0))
    n, bins, patches = ax.hist(distances, bins=100, normed=1, alpha=0.5)
    ax.set_xlabel("Distance from ORF TSS", fontsize=14)
    ax.plot([0, 0], [0, ax.axis()[3]], 'k--', lw=2, alpha=0.5)
    
    # Fit polynomial of orders 20 using the Numpy "polyfit" function.
    xnew = np.linspace(min(bins), max(bins), 100)
    # 30th order polynomial fit
    fit = np.polyfit(xnew, n, 30)
    # Create a functions from the fit
    f = np.poly1d(fit)
    plt.plot(xnew, f(xnew),'-', color="red", linewidth=3)
    plt.ylim(0, ax.get_ylim()[1])

    plt.show()