This notebook details the steps taken to get the results in this manuscript. Please download the zip file to reproduce the analysis, as described below. All source data, and all results of running the commands in this notebook, are available in the download, but can be reproduced by running these commands.
First, preprocessing loads in the source tables into Python structures, and pre-calculates some information for the analysis.
Then, the analysis compares comorbidity to genetic similarity, using multiple methods and using the results from the preprocessing. This is done both including germline cancer census genes, and excluding these genes. Statistics for the association are generated, as well as tables showing similarity of comorbid and genetically similar pairs of Mendelian disease and cancer.
Additionally, tables are generated allowing reproduction of the figures in the paper. The figures are produced in MATLAB, and the code for this is also included.
This package is set up to be run from one directory, and it relies on Anaconda Python as well as a unix-like environment. Create a directory, and then download the zipped files (including the file for this notebook). Unzip everything, go to the directory everything is unzipped in (melamed_mendelian), and type ipython notebook to open this notebook within that directory.
Besides this notebook, included in the download are a few subdirectories. Of these, two represent the "source" material, the code and the data_source:
These represent the results of running preprocessing to integrate the data and prepare it for the analysis:
These represent the results of the analysis:
One additional directory will be needed if the user wishes to re-run the preprocessing steps. The user will have to download the TCGA analysis results from the Broad instute (step 1 below), into the firehose results download directory,(analyses_2013_09_23).
First, we set up the Python script path so we can execute the python commands.
import sys
sys.path.append('code')
from melamed import *
Note that preprocessing can be skipped, and instead the results included in this download can be used in the next section, running the analysis. The steps to do the preprocessing are included to fully document the method, and also to allow updates to the data.
Preprocessing includes:
As described in the paper, the idea is to compare recurrent somatic genetic alterations in cancer with those in Mendelian disease. We use an independent source to call recurrent somatic genetic alterations, the GISTIC2 and MutSigCV pipelines. These results are openly available for download using the firehose tool (https://confluence.broadinstitute.org/display/GDAC/Download). The command below would get the download for the version used in our analysis:
firehose_get -e -tasks mutsig gistic analyses 2013_09_23 READ GBM STAD UCEC THCA HNSC SKCM COAD BLCA LUSC KIRC KIRP LGG LAML PRAD LUAD BRCA OV KICH
This should be downloaded or linked to in the "melamed_mendelian" directory. Then, we set up soft links to these files. If the version is the same, run the command below to setup links to the cancer data. Otherwise, edit the .sh file appropriately to change the line:
firehose=analyses__2013_09_23/*/20130923
!code/setup_cancer_data.sh
Next, we will load the alterations structure into the environment. This will also load the pathways (Biological Pathways using gene symbol identifiers, downloaded from http://cpdb.molgen.mpg.de/CPDB) into data_processed, and calculate how enriched the cancer alterations are in each pathway. The resulting enrichment will be saved into data_processed/alteration_enrichments.pkl.
alterations = load_cancer_alteration_info()
This step creates the contents of data_proccessed/hprd, but creating the contents that can be seen in that directory: network, the PPI network, and rand, a directory containing 1000 random networks, representing a null distribution of gene connections.
The PPI uses the binary interactions from the Human Protein Reference Database (http://www.hprd.org/download HPRD_Release9_041310.tar.gz), downloaded into data_source. First, we convert the identifiers to Entrez gene symbols using the Entrez table. Then, we create 1000 random networks, a process that requires MATLAB and may take an hour or more to run. MATLAB must be on your path for this command to work.
setup_hprd('data_source/HPRD_Release9_062910/BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt')
We start from the transcripts per million (TPM) for the 889 human samples from FANTOM5, downloaded from http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_tpm_ann.osc.txt.gz. First, we convert to common Entrez identifiers, then we make an estimate of expression level of each gene in each sample by adding TPM from different peaks together.
Then we calculate coexpression of every cancer-linked gene (from the "alterations") structure above, and every other available gene, including the Mendelian disease genes. Coexpression (Pearson correlation coefficient) is precalculated. This is rather slow to calculate, taking a number of hours, so again the expression and coexpression is already available in data_processed/fantom.
However, if a different version of the firehose cancer data analysis is used, this must be run again. The code to reproduce the coexpression can be run below:
fantom()
This is the main part of the preprocessing. This command will integrate the data to link Mendelian diseases to their causal genes, using Blair, et al. annotation with manual curation and OrphaNet and OMIM's morbidmap. Also, it will link Mendelian diseases to cancers using the comorbidity derived from Blair, et al.'s supplementary table. The following tables, in data_source are derived from publicly available data and are integrated together:
The result will be a directory containing these linked data. One will be created including the germline mutations (mendelian_germline) and one with the germline mutations from the Sanger/Cosmic census removed (mendelian). The data saved includes most of the information needed to complete the analysis, including comorbidity between each pair of Mendelian disease and cancer, and genes linked to each Mendelian disease and cancer. Additionally, this will create pathway enrichment results for each.
load_mendelian_disease_info('mendelian', alterations)
For each pair of diseases, we calculate the number of connections in the actual HPRD PPI network and in 1000 randomly shuffled versions of the network. This must be done for the saved disease pairs, including or excluding Mendelian germline mutations. This analysis also may take a few hours to run, for each of the disease pair sets, and it saves the results the mendelian and mendelian_germline directories.
run_network_score('mendelian', 1000)
run_network_score('mendelian_germline', 1000)
This part actually compares the gene sets associated with each disease pair against the comorbidity status of each disease pair. The results will make text versions of the tables included in the manuscript, and will allow creation of the figures in the manuscript.
First, this command will compute the comorbidity scores. This command also is rather slow to run. The table returned contains the p-value (one-tailed Fisher's exact test) for the association of each genetic similarity metric (or any genetic similarity metric) with comorbidity.
comorbidity_analysis('mendelian')
comorb. association | |
---|---|
gene_enrichment | 0.014262 |
pathway_correlation | 0.010951 |
coex_CG | 0.001103 |
NC_set | 0.041618 |
Any Metric | 0.000077 |
5 rows × 1 columns
comorbidity_analysis('mendelian_germline')
The tables created in the comorbidity_analysis_mendelian* directories are as follows:
The results of this run contain the supplementary tables in a text format that can be opened in Excel:
Supplementary Table 2 corresponds to data_source/cancer_tables.txt, and was manually created rather than being the output of these scripts.
The figures were all made in MATLAB. They relied on tables pre-generated from Python, and stored in the directory figwork. These tables can be recreated by running:
figure_files('mendelian')
Then, direct MATLAB to the analysis directory (melamed_mendelian, the directory that we have been running Python in). Then, run the script code/figures_final.m to generate the figures. This will generate them all at once, but one figure at a time can be created instead by copy-pasting the relevant section of the script into MATLAB.