We are going to reproduce a couple of figures (2A and 8A) of the following paper:
from IPython.display import IFrame
url = "http://www.ncbi.nlm.nih.gov/pubmed/22219510"
IFrame(url, 800, 400)
I select this paper because:
SRA: The Sequence Read Archive (SRA) stores raw sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Applied Biosystems SOLiD® System, Helicos Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®. In the analysis version: 2.3.5-2 was used.
Bowtie: Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end). In the analysis version 1.0.1 was used.
This is extracted from the command pip freeze
. On the right of each library there is the version I have been using in the analysis:
%pylab inline
import pandas as pd
import tarfile
import urllib
import os
import weblogolib
import random
Populating the interactive namespace from numpy and matplotlib
SRA = "bin/fastq-dump"
BOWTIE = "bin"
# Download one of the set of raw reads of the paper
sra_url = "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/" + \
"SRP%2FSRP009%2FSRP009462/SRR384975/SRR384975.sra"
sra_file = "reads/SRR384975.sra"
# Download the yeast genomic sequences. The release used in this analysis is: R64-1-1
yeast_url = "http://downloads.yeastgenome.org/sequence/S288C_reference/" + \
"genome_releases/S288C_reference_genome_Current_Release.tgz"
yeast_file = "data/yeast_genome.tgz"
genome_file = "data/yeast_genome.fa"
# Download the yeast genomic annotations
features_url = "http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab"
features_file = "data/yeast_features.tsv"
# Create the directories "reads" and "data"
folders = ["data", "reads", "index", "sequences", "figures"]
[os.mkdir(folder) for folder in folders if not os.path.isdir(folder)]
[None, None, None, None, None]
%%timeit -n 1 -r 1
# Download the files and save it into the local directory
urllib.urlretrieve(sra_url, sra_file)
urllib.urlretrieve(yeast_url, yeast_file)
urllib.urlretrieve(features_url, features_file)
1 loops, best of 1: 13min 50s per loop
!ls
analysis.ipynb data index sequences bin figures reads
!time $SRA $sra_file --outdir reads
Read 7990112 spots for reads/SRR384975.sra Written 7990112 spots for reads/SRR384975.sra real 0m53.784s user 0m51.446s sys 0m1.265s
tar = tarfile.open(yeast_file)
tar.getnames()
['S288C_reference_genome_R64-1-1_20110203', 'S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa', 'S288C_reference_genome_R64-1-1_20110203/gene_association_R64-1-1_20110205.sgd', 'S288C_reference_genome_R64-1-1_20110203/saccharomyces_cerevisiae_R64-1-1_20110208.gff', 'S288C_reference_genome_R64-1-1_20110203/other_features_genomic_R64-1-1_20110203.fasta', 'S288C_reference_genome_R64-1-1_20110203/rna_coding_R64-1-1_20110203.fasta', 'S288C_reference_genome_R64-1-1_20110203/NotFeature_R64-1-1_20110203.fasta', 'S288C_reference_genome_R64-1-1_20110203/orf_trans_all_R64-1-1_20110203.fasta', 'S288C_reference_genome_R64-1-1_20110203/orf_coding_all_R64-1-1_20110203.fasta']
The right file to extract is S288C_reference_sequence_R64-1-1_20110203.fsa
# Extract the file
tar.extract('S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa',
'data')
!ls data/
S288C_reference_genome_R64-1-1_20110203 yeast_genome.tgz
yeast_features.tsv
# Rename the reference_sequence file
!mv data/S288C_reference_genome_R64-1-1_20110203/S288C_reference_sequence_R64-1-1_20110203.fsa $genome_file
!rm -r data/S288C_reference_genome_R64-1-1_20110203/
!rm $yeast_file
!ls data/
yeast_features.tsv yeast_genome.fa
index_name = "index/yeast_genome"
!$BOWTIE/bowtie-build -q $genome_file $index_name
The reads should start with the Ty1 LTR
(Long Terminal Repeat), so I check for its presence. I also get rid of reads with undefined (N) nucleotides. In the alignment step, in order to align only the genomic sequence downstream the insertion site, I trim the first 10bp.
# Reads should start with the Ty1 LTR ("TATT").
# I also discard those reads that have ambiguous nucleotides.
fastq_file = os.path.splitext(sra_file)[0] + ".fastq"
trimmed_file = os.path.splitext(sra_file)[0] + ".trimmed"
LTR = "TATT"
MAX_MISMATCHES = 0
output = file(trimmed_file, 'w')
class ReadFastq(object):
def __init__(self, line):
self.lines = [line]
self.output = ""
def add_line(self, line):
self.lines.append(line)
def get_output(self):
if self.lines[1].startswith(LTR) and self.lines[1].count("N") <= MAX_MISMATCHES:
self.output = "".join(self.lines)
return True
else:
return False
for n, line in enumerate(file(fastq_file, 'r')):
if n % 4 == 0:
if n != 0 and obj.get_output():
output.write(obj.output)
obj = ReadFastq(line)
else:
obj.add_line(line)
output.close()
# I align the reads to the yeast genome with Bowtie
# I trim the first 10bp of the reads and then I require
# unique (-m 1) and perfect (-v 0) alignments.
mapped_file = os.path.splitext(sra_file)[0] + ".mapped"
!$BOWTIE/bowtie -t $index_name -m 1 -v 0 -5 10 -p 2 -q $trimmed_file $mapped_file
Time loading forward index: 00:00:00 Time for 0-mismatch search: 00:00:29 # reads processed: 7504175 # reads with at least one reported alignment: 1187781 (15.83%) # reads that failed to align: 5885730 (78.43%) # reads with alignments suppressed due to -m: 430664 (5.74%) Reported 1187781 alignments to 1 output stream(s) Time searching: 00:00:29 Overall time: 00:00:29
unique_file = mapped_file + ".unique"
%timeit -n 1 -r 1 !cut -f2-5 $mapped_file | sort -u > $unique_file
1 loops, best of 1: 47.5 s per loop
!head $unique_file
+ ref|NC_001133| 101400 GCATTATCTATCATTATAAATTCTTTTATT + ref|NC_001133| 11050 TAAAGTCACTGAGATATTAGAGGTTATAAA + ref|NC_001133| 11333 AAACTCTATGTAAACACTTATTTTATTGTG + ref|NC_001133| 115808 CCTCTCAGATACCACGATGCATAAGGCTCA + ref|NC_001133| 118483 ATATGTGTATATATACATAGGTTAGTATGT + ref|NC_001133| 118488 TGTATATATACATAGGTTAGTATGTATAGC + ref|NC_001133| 125037 AGGGTAGAAGCCATGAAAAAACTAGATACC + ref|NC_001133| 125957 GATCTTACCTGCTGTGCAGAATTTGAGTAT + ref|NC_001133| 127779 GATCTCAGTACTCGCATTCTAGCGTATGTT + ref|NC_001133| 1289 ATTTCTAGTTACAGTTACACAAAAAACTAT
# Load the yeast genome in memory
dict_genome = {}
sequence = ""
for line in file(genome_file, 'r'):
if line.startswith(">"):
if sequence != "":
dict_genome[head] = sequence.upper()
head = line.split()[0][1:]
sequence = ""
continue
sequence += line.strip()
if sequence != "":
dict_genome[head] = sequence.upper()
# Define a class to read the alignment file
class ReadAlignment(object):
def __init__(self, line):
self.line = line.strip().split("\t")
self.strand = self.line[0]
self.chromosome = self.line[1]
self.position = int(self.line[2])
# Define a function to extract the genomic sequences
# upstream and downstream a given position.
def reverse_complement(sequence):
d = {"A":"T", "T":"A", "C":"G", "G":"C", "N":"N"}
sequence = sequence[::-1]
return "".join([d[i] for i in sequence])
def get_sequences(input_file, output_file, window=10):
output = file(output_file, 'w')
for n, line in enumerate(file(input_file, 'r')):
obj = ReadAlignment(line)
if obj.strand == "+":
position = obj.position + 3
else:
position = obj.position - 3
sequence = dict_genome[obj.chromosome][position - window: position + window]
if obj.strand == "-":
sequence = reverse_complement(sequence)
output.write(">sequence_{}\n{}\n".format(n, sequence))
output.close()
# Extract sequences of 10bp and 1Kb
get_sequences(input_file=unique_file, output_file="sequences/fasta_10bp.fa")
get_sequences(input_file=unique_file, output_file="sequences/fasta_1kb.fa", window=1000)
First we will check the presence of the insertion site pattern. We will do a seqlogo of the flanking regions of the Ty1 insertion sites.
def do_seqlogo(input_fasta, output_seqlogo):
fin = open(input_fasta)
seqs = weblogolib.read_seq_data(fin)
data = weblogolib.LogoData.from_seqs(seqs)
options = weblogolib.LogoOptions()
options.yaxis_scale = 0.2
options.yaxis_tic_interval = 0.2
format_logo = weblogolib.LogoFormat(data, options)
fout = open(output_seqlogo, 'w')
weblogolib.png_formatter(data, format_logo, fout)
fout.close()
# Create the seqlogos
do_seqlogo("sequences/fasta_10bp.fa", "figures/fasta_10bp.png")
To verify that the above seqlogo is meaninful, we can compare it with a seqlogo obtained with 10000 random sequences.
# Select 10000 random position in the yeast genome
N = 10000
genome_length = sum([len(i) for i in dict_genome.values()])
sample = random.sample(xrange(1, genome_length + 1), N)
# This function map a random number to a chromosome and position in the yeast genome
def get_coordinates(number):
for chromosome, sequence in dict_genome.items():
if (number - len(sequence)) > 0:
number -= len(sequence)
# Do not consider position that are too close to the start/end of the chromosome
elif number < 10 or number > len(sequence) - 10:
continue
else:
return (chromosome, number)
# Open the output file
random_file = "reads/random.mapped"
output = file(random_file, "w")
for n in sample:
chromosome, position = get_coordinates(n)
output.write("+\t{}\t{}\n".format(chromosome, position))
# Close the output file
output.close()
# Extract sequences of 10bp and 1Kb
get_sequences(input_file=random_file, output_file="sequences/random_10bp.fa")
# Create the seqlogos
do_seqlogo("sequences/random_10bp.fa", "figures/random_10bp.png")
Second, we will verify that the Ty1 retrotrnsposons have a strong preference for inserting upstream tRNA genes.
First of all lets look at the structure of the features file.
# Create a dataframe with the yeast features
df = pd.read_csv(features_file, sep="\t", header=None)
df.head(3)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | S000028864 | telomeric_repeat | NaN | TEL01L-TR | NaN | NaN | chromosome 1 | NaN | 1 | 62 | 1 | C | NaN | 2003-09-09 | 2003-09-09 | Terminal telomeric repeats on the left arm of ... |
1 | S000002143 | ORF | Dubious | YAL069W | NaN | NaN | chromosome 1 | NaN | 1 | 335 | 649 | W | NaN | 1996-07-31 | 1996-07-31 | Dubious open reading frame; unlikely to encode... |
2 | S000031098 | CDS | NaN | NaN | NaN | NaN | YAL069W | NaN | 1 | 335 | 649 | W | NaN | 1996-07-31 | 1996-07-31 | NaN |
3 rows × 16 columns
The feature column is the second. Lets look at which features are in the file
np.unique(df[1])
array(['ARS', 'ARS consensus sequence', 'CDEI', 'CDEII', 'CDEIII', 'CDS', 'ORF', 'W_region', 'X_element_combinatorial_repeats', 'X_element_core_sequence', 'X_region', "Y'_element", 'Y_region', 'Z1_region', 'Z2_region', 'binding_site', 'centromere', 'external_transcribed_spacer_region', 'five_prime_UTR_intron', 'gene_cassette', 'insertion', 'internal_transcribed_spacer_region', 'intron', 'long_terminal_repeat', 'mating_locus', 'multigene locus', 'ncRNA', 'non_transcribed_region', 'noncoding_exon', 'not in systematic sequence of S288C', 'not physically mapped', 'plus_1_translational_frameshift', 'pseudogene', 'rRNA', 'repeat_region', 'retrotransposon', 'snRNA', 'snoRNA', 'tRNA', 'telomere', 'telomeric_repeat', 'transposable_element_gene'], dtype='|S35')
We will extract the the rows that contain the tRNA feature
df_trna = df[df[1] == "tRNA"]
trna = pd.DataFrame({"Chromosome": df_trna[8], "Start": df_trna[9],
"End": df_trna[10], "Strand": df_trna[11]})
The chromosomes in the features file are named from 1 to 17 but in the mapped file have different names. Here we create a dictionary that convert one format into the other.
dict_chromosomes = {"NC_001133": "1", "NC_001134": "2", "NC_001135": "3",
"NC_001136": "4", "NC_001137": "5", "NC_001138": "6",
"NC_001139": "7", "NC_001140": "8", "NC_001141": "9",
"NC_001142": "10", "NC_001143": "11", "NC_001144": "12",
"NC_001145": "13", "NC_001146": "14", "NC_001147": "15",
"NC_001148": "16", "NC_001224": "17"}
Lets look at the first lines of the dataframe:
trna.head()
Chromosome | End | Start | Strand | |
---|---|---|---|---|
176 | 1 | 139254 | 139152 | W |
222 | 1 | 166339 | 166267 | W |
239 | 1 | 181254 | 181141 | W |
243 | 1 | 182522 | 182603 | C |
376 | 10 | 59100 | 59172 | C |
5 rows × 4 columns
The strand column uses W (Whatson) and C (Crick) to indicate the positive and the negative strands. Lets use + and - instead.
trna.Strand = trna.Strand.apply(lambda x: "+" if x == "W" else "-")
# Look at the first lines of the dataframe
trna.head()
Chromosome | End | Start | Strand | |
---|---|---|---|---|
176 | 1 | 139254 | 139152 | + |
222 | 1 | 166339 | 166267 | + |
239 | 1 | 181254 | 181141 | + |
243 | 1 | 182522 | 182603 | - |
376 | 10 | 59100 | 59172 | - |
5 rows × 4 columns
Now we create a dataframe with the Ty1 insertion positions.
insertions = []
for line in file(unique_file, 'r'):
obj = ReadAlignment(line)
chromosome = dict_chromosomes[obj.chromosome.split("|")[1]]
insertions.append([chromosome, obj.position, obj.strand])
# We create a dataframew ith the Ty1 insertion positions.
insertions = pd.DataFrame(insertions, columns=["Chromosome", "Position", "Strand"])
# Look at the first lines of the dataframe
insertions.head()
Chromosome | Position | Strand | |
---|---|---|---|
0 | 1 | 101400 | + |
1 | 1 | 11050 | + |
2 | 1 | 11333 | + |
3 | 1 | 115808 | + |
4 | 1 | 118483 | + |
5 rows × 3 columns
We calculate the distances of the insertions 1Kb upstream and downstream tRNA genes.
def calculate_distances(feature, insertions, window=1000):
feature_grouped = feature.groupby(["Chromosome"])
insertions_grouped = insertions.groupby(["Chromosome"])
insertions_chromosomes = np.unique(insertions.Chromosome)
distances_from_feature = []
for chromosome, feature_group in feature_grouped:
if chromosome not in insertions_chromosomes:
continue
insertions_group = insertions_grouped.get_group(chromosome)
for n, line in feature_group.iterrows():
start, end, strand = line[1:]
insertions_around = insertions_group[(insertions_group.Position >= start - window) &
(insertions_group.Position <= start + window)]
if len(insertions_around) == 0:
continue
# Calculate the distance of the insertions from the feature
distances = insertions_around.Position.apply(lambda x: x - start)
# Correct for the strand of the feature
distances = distances.apply(lambda x: x * -1 if strand == "-" else x)
distances_from_feature.extend(distances)
return distances_from_feature
distances = calculate_distances(trna, insertions)
Represent with a histogram the distribution of the Ty1 insertions aroung tRNA TSS
figure = plt.figure(figsize=(10, 10))
ax = plt.subplot2grid((1, 1), (0, 0))
n, bins, patches = ax.hist(distances, bins=100, normed=1, alpha=0.6)
ax.set_xlabel("Distance from tRNA TSS", fontsize=14)
ax.plot([0, 0], [0, ax.axis()[3]], 'k--', lw=2, alpha=0.5)
# Fit polynomial of orders 20 using the Numpy "polyfit" function.
xnew = np.linspace(min(bins), max(bins), 100)
# 30th order polynomial fit
fit = np.polyfit(xnew, n, 30)
# Create a functions from the fit
f = np.poly1d(fit)
plt.plot(xnew, f(xnew),'-', color="red", linewidth=3)
plt.ylim(0, ax.get_ylim()[1])
plt.show()
We got a peak of insertions just upstream the tRNA TSS, as expected. This is the position where
the nucleosome is located.
Also notice that there are other two smaller peaks that overlap with the positions of other nucleosomes.
Now we can check if this happens with every TSS. So first we extract the rows of the features file that contain the ORF feature and are not marked as Dubious.
df_orf = df[(df[1] == "ORF") & (df[2] != "Dubious")]
orf = pd.DataFrame({"Chromosome": df_orf[8], "Start": df_orf[9],
"End": df_orf[10], "Strand": df_orf[11]})
orf.Strand = orf.Strand.apply(lambda x: "+" if x == "W" else "-")
# Look at the first lines of the dataframe
orf.head()
Chromosome | End | Start | Strand | |
---|---|---|---|---|
11 | 1 | 1807 | 2169 | - |
13 | 1 | 2707 | 2480 | + |
16 | 1 | 7235 | 9016 | - |
20 | 1 | 11565 | 11951 | - |
22 | 1 | 12426 | 12046 | + |
5 rows × 4 columns
Calculate the distances of the insertions from the ORF TSS
distances = calculate_distances(orf, insertions)
Represent with a histogram the distribution of the Ty1 insertions aroung ORF TSS.
In this case we use the module xkcd. This library is useful when you want to show an expected
result vs the observed result.
with plt.xkcd():
figure = plt.figure(figsize=(10, 10))
ax = plt.subplot2grid((1, 1), (0, 0))
n, bins, patches = ax.hist(distances, bins=100, normed=1, alpha=0.5)
ax.set_xlabel("Distance from ORF TSS", fontsize=14)
ax.plot([0, 0], [0, ax.axis()[3]], 'k--', lw=2, alpha=0.5)
# Fit polynomial of orders 20 using the Numpy "polyfit" function.
xnew = np.linspace(min(bins), max(bins), 100)
# 30th order polynomial fit
fit = np.polyfit(xnew, n, 30)
# Create a functions from the fit
f = np.poly1d(fit)
plt.plot(xnew, f(xnew),'-', color="red", linewidth=3)
plt.ylim(0, ax.get_ylim()[1])
plt.show()