Notebook

Data set and pre-processing¶

The following procedure can be used to reproduce the results that are present in the article "Nucleotide excision repair is impaired by binding of transcription factors to DNA".

The following data set are mandatory to run the scripts for mutation and repair rate analysis. And, these files have to be placed inside the *dataset* directory

[Somatic mutations](#Somatic mutations)
Transcription factor binding sites
DNase Hypersensitive sites
Other annotations
Excision repair data

1. Somatic mutations

The somatic mutations of different cancer types used in this analysis were obtained from Fredriksson et al., 2014.

In [ ]:

%%bash 

# create a directory for dataset
mkdir dataset

# create a directory called "mutations" inside dataset directory and place the mutation file
mkdir dataset/mutations

# the mutation file "mutations.tsv.gz" obtained was already filtered for mutations that overlap with 
# dbSNP (v138) entries. Further, we seperate single nucleotide substitutions with a minimum variant frequency
# of 0.2 (as recommended by the authors of Fredriksson et al.,) for the following cancer types 
# SKCM, LUAD, LUSC, CRC, BRCA, BLCA and HNSC.

for ctype in SKCM LUAD LUSC BRCA BLCA HNSC CRC; do  lctype=`echo $ctype | tr '[:upper:]' '[:lower:]'`; \
    zgrep -w $ctype dataset/mutations/mutations.tsv.gz | awk '$7>0.2' | \
    awk 'BEGIN{OFS="\t";} {if(length($5)==1 && length($6)==1 && $5!~/-/ && $6!~/-/){print "chr"$3,$4,$4,$5,$6,$1}}' | \
    sort -k1,1 -k2,2n | gzip -c >dataset/mutations/${lctype}.txt.gz; done
    
# compute the number of mutations per sample
for i in $(ls dataset/mutations/*.gz);do name=`echo $i | cut -d "/" -f 3 | cut -d "." -f 1`; \
 zcat $i | cut -f 6 | sort | uniq -c | awk -v var=$name 'BEGIN{OFS="\t";}{print var,$2,$1}' | sort -n -k3,3; \
 done >dataset/mutations/perSample_MutationCount.txt

The above mutation files seperated per cancer type contain 6 columns delimited by tab (similar to BED format):

column	description
1	chromosome name with prefix "chr"
2	mutation position (one-based)
3	mutation position (one-based)
4	Reference allele. A single letter: A, C, G or T (upper case)
5	Alternate allele. A single letter: A, C, G or T (upper case)
6	Sample identifier

In [ ]:

# example
chr1   58553693   58553693   C   T   XXXX 

The somatic mutations from the whole genome of normal human skin sample were obtained from Martincorena et al., 2015

In [ ]:

%%bash

# the mutation file "PD20399be_wg_caveman_annotated_with_dinucs.5cols" from Marincorena et al., has to be placed
# inside dataset/mutations

# extract only single nucleotide variants
grep -v sampleID dataset/mutations/PD20399be_wg_caveman_annotated_with_dinucs.5cols | \
 awk 'BEGIN{OFS="\t";} {if(length($4)==1 && length($5)==1){print "chr"$2,$3,$3,$4,$5;}}' | gzip >dataset/mutations/eyelid.txt.gz

# number of snv counts
zcat dataset/mutations/eyelid.txt.gz | wc -l | sed "s/^/skcm\tnormalSkin\t/" >>dataset/mutations/perSample_MutationCount.txt

2. Transcription factor binding sites annotations

The genomic coordinates of the transcription factor binding sites used in the analysis can be obtained from http://bg.upf.edu/group/projects/tfbs/.

In [ ]:

%%bash

mkdir dataset/TFBS
wget -r -nH --cut-dirs=3 --no-parent --reject="index.html*" -e robots=off http://bg.upf.edu/group/projects/tfbs/ \
 -P dataset/TFBS

The files are in BED format (hg19 assembly).

column	description
1	chromosome name with prefix "chr"
2	chromosome start position (zero-based)
3	chromosome end position (one-based)
4	TF_motif_name

Data description

All these transcription factor binding sites (TFBS), i.e., TF motif match under ChIP-seq peak regions, were obtained from ENCODE. These comprised,

binding sites of 109 transcription factors (TF) as used in Khurana et al., 2013.
ENCODE predicted binding sites of 52 transcription factors which are not supported by ChIP- seq peaks obtained from ftp portal
In addition, we obtained the binding sites of 32 TFs used in Reijns et al., 2015

The binding sites were classified into active/inactive binding sites based on their overlap with the DNase Hypersensitive sites. Further, we seperate the binding sites, proximal (promoter regions upstream of TSS) and distal (>5kb away from TSS), based on their genomic location.

The different sets of genomic coordinates used for mutation rate analysis:

class	filename	description
active TFBS in proximal regions	proximalTFBS-DHS_\*.bed.gz	TFBS located in the promoter regions and overlap with DHSs
inactive TFBS in proximal regions	proximalTFBS-noDHS_\*.bed.gz	TFBS located in the promoter regions but do not overlap with DHSs
active TFBS in distal regions	distalTFBS-DHS_skcm.bed.gz	TFBS located in the intergenic region with no annotated TSS within 5kb distance on either sides
bound/unbound TFBS in proximal regions	allTFBS_bound\.bed.gz, allTFBS_unbound\.bed.gz	TFBS located in the promoter regions and overlap with TF peak (bound) or not (unbound)
active TFBS in transcribed regions	allTFBS_DHS_skcm_templateStrand.bed.gz	TFBS in template strand of the gene
active TFBS in transcribed regions	allTFBS_DHS_skcm_nontemplateStrand.bed.gz	TFBS in non-template strand of the gene
active TFBS in promoter region	proximalTFBS-DHS_skcm_quartiles.bed.gz	TFBS are startified based on ChIP-seq read coverage as provided by Reijns et al., 2015

3. DNase Hypersensitive site (DHS)

Download the DHSs identified by Hotspot algorith (narrowPeaks in FDR 1%) from Epigenome roadmap portal.

In [ ]:

%%bash

mkdir dataset/DHS
wget http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/{E059,E088,E028,E084}-DNase.hotspot.fdr0.01.peaks.bed.gz \
 -P dataset/DHS 

# matching DHS for each cancer types
declare -A arr=(["skcm"]="E059" ["luad"]="E088" ["lusc"]="E088" ["crc"]="E084" ["brca"]="E028")

# save that information in a file
echo -e "#ctype\troadmapID\tfile" >dataset/DHS/DHS_list.txt
for i in ${!arr[@]};do echo -e "$i\t${arr[$i]}\t${arr[$i]}-DNase.hotspot.fdr0.01.peaks.bed.gz";done >>dataset/DHS/DHS_list.txt

4. Other annotations

a. Promoter and coding regions

Download the genomic co-ordinates of promoter and coding sequences that are used in this analysis

In [ ]:

%%bash

wget http://bg.upf.edu/group/projects/tfbs/otherAnnotations/{promoter2500.bed.gz,cds.regions.gz} -P dataset

As promoters, we considered the sequences upto 2.4kb upstream of TSS of all protein coding genes in GENCODE (v19). Promoter regions overlapping coding sequences (CDS) and UTRs were excluded.

b. Genome sequence

The genome sequence of hg19 version in fasta and the co-ordinates (as used in Khurana et al., 2013)

In [ ]:

%%bash

mkdir dataset/genome
wget http://bg.upf.edu/group/projects/tfbs/otherAnnotations/{human_g1k_v37.fasta.gz,hg19.genome} -P dataset/genome
gzip -d dataset/genome/hg19.genome.gz

c. Mappability

UCSC Browser blacklisted regions (Duke and DAC, http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/)
Unique mappability of sequencing reads ("CRG Alignability 36' Track", score == 1) (http://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability)

In [ ]:

%%bash

mkdir dataset/mappability
wget http://bg.upf.edu/group/projects/tfbs/otherAnnotations/{Duke_DAC_exclude_bed.gz,wgEncodeCrgMapabilityAlign30mer_score1.bed.gz} \
-P dataset/mappability

d. Nucleosome positioning data

In [ ]:

%%bash

mkdir dataset/nucleosome
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeSydhNsome/wgEncodeSydhNsomeGm12878Sig.bigWig \
-P dataset/nucleosome
bigWigToBedGraph dataset/nucleosome/wgEncodeSydhNsomeGm12878Sig.bigWig dataset/nucleosome/wgEncodeSydhNsomeGm12878Sig.bed
gzip -9 dataset/nucleosome/wgEncodeSydhNsomeGm12878Sig.bed

5. Excision repair data

The nucleotide excision repair map generated by Hu et al., 2015 has been obtained from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67941

In [ ]:

%%bash

# create a directory to save the repair data
mkdir dataset/xr-seq

# download the bigWig files (GSE67941_*_UNIQUE_NORM_fixedStep_25.bw) from the "Supplementary file" table of
# above link and save in dataset/xr-seq

# convert bigWig file to bed format using bigWigToBedGraph
for i in $(ls dataset/xr-seq/*.bw);do filename=`echo $i | cut -d "." -f1`; if [ ! -f ${filename}.bed ];then \
bigWigToBedGraph ${filename}.bw ${filename}.bed;gzip -9 ${filename}.bed fi;done;