The following procedure can be used to reproduce the results that are present in the article "Nucleotide excision repair is impaired by binding of transcription factors to DNA".
The following data set are mandatory to run the scripts for mutation and repair rate analysis. And, these files have to be placed inside the *dataset* directory
The somatic mutations of different cancer types used in this analysis were obtained from Fredriksson et al., 2014.
%%bash
# create a directory for dataset
mkdir dataset
# create a directory called "mutations" inside dataset directory and place the mutation file
mkdir dataset/mutations
# the mutation file "mutations.tsv.gz" obtained was already filtered for mutations that overlap with
# dbSNP (v138) entries. Further, we seperate single nucleotide substitutions with a minimum variant frequency
# of 0.2 (as recommended by the authors of Fredriksson et al.,) for the following cancer types
# SKCM, LUAD, LUSC, CRC, BRCA, BLCA and HNSC.
for ctype in SKCM LUAD LUSC BRCA BLCA HNSC CRC; do lctype=`echo $ctype | tr '[:upper:]' '[:lower:]'`; \
zgrep -w $ctype dataset/mutations/mutations.tsv.gz | awk '$7>0.2' | \
awk 'BEGIN{OFS="\t";} {if(length($5)==1 && length($6)==1 && $5!~/-/ && $6!~/-/){print "chr"$3,$4,$4,$5,$6,$1}}' | \
sort -k1,1 -k2,2n | gzip -c >dataset/mutations/${lctype}.txt.gz; done
# compute the number of mutations per sample
for i in $(ls dataset/mutations/*.gz);do name=`echo $i | cut -d "/" -f 3 | cut -d "." -f 1`; \
zcat $i | cut -f 6 | sort | uniq -c | awk -v var=$name 'BEGIN{OFS="\t";}{print var,$2,$1}' | sort -n -k3,3; \
done >dataset/mutations/perSample_MutationCount.txt
The above mutation files seperated per cancer type contain 6 columns delimited by tab (similar to BED format):
column | description |
---|---|
1 | chromosome name with prefix "chr" |
2 | mutation position (one-based) |
3 | mutation position (one-based) |
4 | Reference allele. A single letter: A, C, G or T (upper case) |
5 | Alternate allele. A single letter: A, C, G or T (upper case) |
6 | Sample identifier |
# example
chr1 58553693 58553693 C T XXXX
The somatic mutations from the whole genome of normal human skin sample were obtained from Martincorena et al., 2015
%%bash
# the mutation file "PD20399be_wg_caveman_annotated_with_dinucs.5cols" from Marincorena et al., has to be placed
# inside dataset/mutations
# extract only single nucleotide variants
grep -v sampleID dataset/mutations/PD20399be_wg_caveman_annotated_with_dinucs.5cols | \
awk 'BEGIN{OFS="\t";} {if(length($4)==1 && length($5)==1){print "chr"$2,$3,$3,$4,$5;}}' | gzip >dataset/mutations/eyelid.txt.gz
# number of snv counts
zcat dataset/mutations/eyelid.txt.gz | wc -l | sed "s/^/skcm\tnormalSkin\t/" >>dataset/mutations/perSample_MutationCount.txt
2. Transcription factor binding sites annotations
The genomic coordinates of the transcription factor binding sites used in the analysis can be obtained from http://bg.upf.edu/group/projects/tfbs/.
%%bash
mkdir dataset/TFBS
wget -r -nH --cut-dirs=3 --no-parent --reject="index.html*" -e robots=off http://bg.upf.edu/group/projects/tfbs/ \
-P dataset/TFBS
The files are in BED format (hg19 assembly).
column | description |
---|---|
1 | chromosome name with prefix "chr" |
2 | chromosome start position (zero-based) |
3 | chromosome end position (one-based) |
4 | TF_motif_name |
All these transcription factor binding sites (TFBS), i.e., TF motif match under ChIP-seq peak regions, were obtained from ENCODE. These comprised,
The binding sites were classified into active/inactive binding sites based on their overlap with the DNase Hypersensitive sites. Further, we seperate the binding sites, proximal (promoter regions upstream of TSS) and distal (>5kb away from TSS), based on their genomic location.
The different sets of genomic coordinates used for mutation rate analysis:
class | filename | description |
---|---|---|
active TFBS in proximal regions | proximalTFBS-DHS_\*.bed.gz | TFBS located in the promoter regions and overlap with DHSs |
inactive TFBS in proximal regions | proximalTFBS-noDHS_\*.bed.gz | TFBS located in the promoter regions but do not overlap with DHSs |
active TFBS in distal regions | distalTFBS-DHS_skcm.bed.gz | TFBS located in the intergenic region with no annotated TSS within 5kb distance on either sides |
bound/unbound TFBS in proximal regions | allTFBS_bound\*.bed.gz, allTFBS_unbound\*.bed.gz | TFBS located in the promoter regions and overlap with TF peak (bound) or not (unbound) |
active TFBS in transcribed regions | allTFBS_DHS_skcm_templateStrand.bed.gz | TFBS in template strand of the gene |
active TFBS in transcribed regions | allTFBS_DHS_skcm_nontemplateStrand.bed.gz | TFBS in non-template strand of the gene |
active TFBS in promoter region | proximalTFBS-DHS_skcm_quartiles.bed.gz | TFBS are startified based on ChIP-seq read coverage as provided by Reijns et al., 2015 |
3. DNase Hypersensitive site (DHS)
Download the DHSs identified by Hotspot algorith (narrowPeaks in FDR 1%) from Epigenome roadmap portal.
%%bash
mkdir dataset/DHS
wget http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/{E059,E088,E028,E084}-DNase.hotspot.fdr0.01.peaks.bed.gz \
-P dataset/DHS
# matching DHS for each cancer types
declare -A arr=(["skcm"]="E059" ["luad"]="E088" ["lusc"]="E088" ["crc"]="E084" ["brca"]="E028")
# save that information in a file
echo -e "#ctype\troadmapID\tfile" >dataset/DHS/DHS_list.txt
for i in ${!arr[@]};do echo -e "$i\t${arr[$i]}\t${arr[$i]}-DNase.hotspot.fdr0.01.peaks.bed.gz";done >>dataset/DHS/DHS_list.txt
Download the genomic co-ordinates of promoter and coding sequences that are used in this analysis
%%bash
wget http://bg.upf.edu/group/projects/tfbs/otherAnnotations/{promoter2500.bed.gz,cds.regions.gz} -P dataset
As promoters, we considered the sequences upto 2.4kb upstream of TSS of all protein coding genes in GENCODE (v19). Promoter regions overlapping coding sequences (CDS) and UTRs were excluded.
The genome sequence of hg19 version in fasta and the co-ordinates (as used in Khurana et al., 2013)
%%bash
mkdir dataset/genome
wget http://bg.upf.edu/group/projects/tfbs/otherAnnotations/{human_g1k_v37.fasta.gz,hg19.genome} -P dataset/genome
gzip -d dataset/genome/hg19.genome.gz
%%bash
mkdir dataset/mappability
wget http://bg.upf.edu/group/projects/tfbs/otherAnnotations/{Duke_DAC_exclude_bed.gz,wgEncodeCrgMapabilityAlign30mer_score1.bed.gz} \
-P dataset/mappability
%%bash
mkdir dataset/nucleosome
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeSydhNsome/wgEncodeSydhNsomeGm12878Sig.bigWig \
-P dataset/nucleosome
bigWigToBedGraph dataset/nucleosome/wgEncodeSydhNsomeGm12878Sig.bigWig dataset/nucleosome/wgEncodeSydhNsomeGm12878Sig.bed
gzip -9 dataset/nucleosome/wgEncodeSydhNsomeGm12878Sig.bed
The nucleotide excision repair map generated by Hu et al., 2015 has been obtained from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67941
%%bash
# create a directory to save the repair data
mkdir dataset/xr-seq
# download the bigWig files (GSE67941_*_UNIQUE_NORM_fixedStep_25.bw) from the "Supplementary file" table of
# above link and save in dataset/xr-seq
# convert bigWig file to bed format using bigWigToBedGraph
for i in $(ls dataset/xr-seq/*.bw);do filename=`echo $i | cut -d "." -f1`; if [ ! -f ${filename}.bed ];then \
bigWigToBedGraph ${filename}.bw ${filename}.bed;gzip -9 ${filename}.bed fi;done;