Notebook

Code and tables to provide reproducibility for "Genetic similarity between somatic alterations in cancers and germline mutations in comorbid Mendelian diseases identifies candidate driver genes" by Rachel D. Melamed, Chioma Madubata, Andrey Rzhetsky, and Raul Rabadan¶

This notebook details the steps taken to get the results in this manuscript. Please download the zip file to reproduce the analysis, as described below. All source data, and all results of running the commands in this notebook, are available in the download, but can be reproduced by running these commands.

First, preprocessing loads in the source tables into Python structures, and pre-calculates some information for the analysis.

Then, the analysis compares comorbidity to genetic similarity, using multiple methods and using the results from the preprocessing. This is done both including germline cancer census genes, and excluding these genes. Statistics for the association are generated, as well as tables showing similarity of comorbid and genetically similar pairs of Mendelian disease and cancer.

Additionally, tables are generated allowing reproduction of the figures in the paper. The figures are produced in MATLAB, and the code for this is also included.

Getting started¶

This package is set up to be run from one directory, and it relies on Anaconda Python as well as a unix-like environment. Create a directory, and then download the zipped files (including the file for this notebook). Unzip everything, go to the directory everything is unzipped in (melamed_mendelian), and type ipython notebook to open this notebook within that directory.

Besides this notebook, included in the download are a few subdirectories. Of these, two represent the "source" material, the code and the data_source:

code: containing all the python and MATLAB code and bash scripts
data_source: containing all downloaded tables. The source of these tables is listed in this notebook in the appropriate sections, but most of the data is already available in this download. The exception (due to its size) is the cancer alteration data, described below.

These represent the results of running preprocessing to integrate the data and prepare it for the analysis:

data_processed: containing some results from running preprocessing on the source data. This contains our processed results. The processing steps below can recreate these, or they can be used as is.
mendelian and mendelian_germline: described below, preprocess Mendelian disease data including known germline mutations, and excluding these germline mutations to assess the significance of novel associations.

These represent the results of the analysis:

comorbidity_analysis_mendelian and comorbidity_analysis_mendelian_germline: containing the results of the analysis including or excluding the annotated germline mutations, respectively.
figwork: containing all tables needed to make the figures

One additional directory will be needed if the user wishes to re-run the preprocessing steps. The user will have to download the TCGA analysis results from the Broad instute (step 1 below), into the firehose results download directory,(analyses_2013_09_23).

First, we set up the Python script path so we can execute the python commands.

In [8]:

import sys
sys.path.append('code')
from melamed import *

Preprocessing: setting up the data for the analysis¶

Note that preprocessing can be skipped, and instead the results included in this download can be used in the next section, running the analysis. The steps to do the preprocessing are included to fully document the method, and also to allow updates to the data.

Preprocessing includes:

Linking to TCGA analysis results
Preprocessing Human Protein Reference Database to make a null distribution of protein-protein interactions
Preprocessing FANTOM5 expression data to obtain aggregate gene expression levels and find coexpression of all genes.
Integrating disease-gene annotations from Blair, 2013, with OrphaNet and OMIM annotations
Converting all data sources to common gene symbols using Entrez downloads (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz)

1. Setting up the cancer alteration data¶

As described in the paper, the idea is to compare recurrent somatic genetic alterations in cancer with those in Mendelian disease. We use an independent source to call recurrent somatic genetic alterations, the GISTIC2 and MutSigCV pipelines. These results are openly available for download using the firehose tool (https://confluence.broadinstitute.org/display/GDAC/Download). The command below would get the download for the version used in our analysis:

firehose_get -e -tasks mutsig gistic analyses 2013_09_23 READ GBM STAD UCEC THCA HNSC SKCM COAD BLCA LUSC KIRC KIRP LGG LAML PRAD LUAD BRCA OV KICH

This should be downloaded or linked to in the "melamed_mendelian" directory. Then, we set up soft links to these files. If the version is the same, run the command below to setup links to the cancer data. Otherwise, edit the .sh file appropriately to change the line:

firehose=analyses__2013_09_23/*/20130923

In [ ]:

!code/setup_cancer_data.sh

Next, we will load the alterations structure into the environment. This will also load the pathways (Biological Pathways using gene symbol identifiers, downloaded from http://cpdb.molgen.mpg.de/CPDB) into data_processed, and calculate how enriched the cancer alterations are in each pathway. The resulting enrichment will be saved into data_processed/alteration_enrichments.pkl.

In [ ]:

alterations = load_cancer_alteration_info()

2. Setting up the Protein-Protein Interaction Network¶

This step creates the contents of data_proccessed/hprd, but creating the contents that can be seen in that directory: network, the PPI network, and rand, a directory containing 1000 random networks, representing a null distribution of gene connections.

The PPI uses the binary interactions from the Human Protein Reference Database (http://www.hprd.org/download HPRD_Release9_041310.tar.gz), downloaded into data_source. First, we convert the identifiers to Entrez gene symbols using the Entrez table. Then, we create 1000 random networks, a process that requires MATLAB and may take an hour or more to run. MATLAB must be on your path for this command to work.

In [ ]:

setup_hprd('data_source/HPRD_Release9_062910/BINARY_PROTEIN_PROTEIN_INTERACTIONS.txt')

3. Setting up the coexpression data¶

We start from the transcripts per million (TPM) for the 889 human samples from FANTOM5, downloaded from http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/hg19.cage_peak_tpm_ann.osc.txt.gz. First, we convert to common Entrez identifiers, then we make an estimate of expression level of each gene in each sample by adding TPM from different peaks together.

Then we calculate coexpression of every cancer-linked gene (from the "alterations") structure above, and every other available gene, including the Mendelian disease genes. Coexpression (Pearson correlation coefficient) is precalculated. This is rather slow to calculate, taking a number of hours, so again the expression and coexpression is already available in data_processed/fantom.

However, if a different version of the firehose cancer data analysis is used, this must be run again. The code to reproduce the coexpression can be run below:

In [ ]:

fantom()

4. Setting up the Mendelian disease gene and clinical associations¶

This is the main part of the preprocessing. This command will integrate the data to link Mendelian diseases to their causal genes, using Blair, et al. annotation with manual curation and OrphaNet and OMIM's morbidmap. Also, it will link Mendelian diseases to cancers using the comorbidity derived from Blair, et al.'s supplementary table. The following tables, in data_source are derived from publicly available data and are integrated together:

morbidmap.txt: from http://www.omim.org/downloads
orph2: derived from orphadata.org (see details in code/parse_orphanet.py. This file was also manually edited, and was where most of the manual curation of disease genes took place)
mmc3_2: derived from Blair, et al. table S3, but also some manual curation
comorbidities: derived from Blair, et al., table S4, but disease names are fixed to be consistent with S3.
cancer_table.txt: corresponding to our Supplementary Table 2, a mapping of the Blair, et al., complex disease categories to TCGA cancer names.
germline_to_remove: about 83 known germline cancer genes, downloaded from http://cancer.sanger.ac.uk/cosmic/census/tables?name=gmuts (germline mutations table), then converted to the Entrez gene symbols

The result will be a directory containing these linked data. One will be created including the germline mutations (mendelian_germline) and one with the germline mutations from the Sanger/Cosmic census removed (mendelian). The data saved includes most of the information needed to complete the analysis, including comorbidity between each pair of Mendelian disease and cancer, and genes linked to each Mendelian disease and cancer. Additionally, this will create pathway enrichment results for each.

In [ ]:

load_mendelian_disease_info('mendelian', alterations)

5. Calculate network connections between Mendelian disease and cancer¶

For each pair of diseases, we calculate the number of connections in the actual HPRD PPI network and in 1000 randomly shuffled versions of the network. This must be done for the saved disease pairs, including or excluding Mendelian germline mutations. This analysis also may take a few hours to run, for each of the disease pair sets, and it saves the results the mendelian and mendelian_germline directories.

In [ ]:

run_network_score('mendelian', 1000)
run_network_score('mendelian_germline', 1000)

Running the analysis to make tables and figures¶

This part actually compares the gene sets associated with each disease pair against the comorbidity status of each disease pair. The results will make text versions of the tables included in the manuscript, and will allow creation of the figures in the manuscript.

First, this command will compute the comorbidity scores. This command also is rather slow to run. The table returned contains the p-value (one-tailed Fisher's exact test) for the association of each genetic similarity metric (or any genetic similarity metric) with comorbidity.

In [11]:

comorbidity_analysis('mendelian')

Out[11]:

	comorb. association
gene_enrichment	0.014262
pathway_correlation	0.010951
coex_CG	0.001103
NC_set	0.041618
Any Metric	0.000077

5 rows × 1 columns

In [ ]:

comorbidity_analysis('mendelian_germline')

Description of the analysis output tables¶

The tables created in the comorbidity_analysis_mendelian* directories are as follows:

peak_mut.xls: a file that MATLAB can read to make the figures
predicted_pairs.xls: a text version of the tables in Supplementary Table 3 & 4. The format is described in those tables.

Tables and figures from the manuscript¶

The results of this run contain the supplementary tables in a text format that can be opened in Excel:

Supp. Table 1: data_processed/supp_tab_1.txt
Supp. Table 3: comorbidity_analysis_mendelian_germline/predicted_pairs.xls -- includes all comorbid and genetically similar disease pairs. These can be filtered in Excel to only view the Mendelian cancer syndromes.
Supp. Table 4: comorbidity_analysis_mendelian/predicted_pairs.xls

Supplementary Table 2 corresponds to data_source/cancer_tables.txt, and was manually created rather than being the output of these scripts.

The figures were all made in MATLAB. They relied on tables pre-generated from Python, and stored in the directory figwork. These tables can be recreated by running:

In [15]:

figure_files('mendelian')

Then, direct MATLAB to the analysis directory (melamed_mendelian, the directory that we have been running Python in). Then, run the script code/figures_final.m to generate the figures. This will generate them all at once, but one figure at a time can be created instead by copy-pasting the relevant section of the script into MATLAB.