#!/usr/bin/env python
# coding: utf-8

# # The ISB-CGC open-access TCGA tables in Big-Query
# 
# The goal of this notebook is to introduce you to a new publicly-available, open-access dataset in BigQuery.  This set of BigQuery tables was produced by the [ISB-CGC](http://www.isb-cgc.org) project, based on the open-access [TCGA](http://cancergenome.nih.gov/) data available at the TCGA [Data Portal](https://tcga-data.nci.nih.gov/tcga/).  You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery.  If you don't already have one, you can sign up for a [free-trial](https://cloud.google.com/free-trial/) or contact [us](mailto://info@isb-cgc.org) and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.  (You can find more information about this NCI-funded program [here](https://cbiit.nci.nih.gov/ncip/nci-cancer-genomics-cloud-pilots).)
# 
# We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists.  Here are links to some resources that you might find useful: 
# * [BigQuery](https://cloud.google.com/bigquery/what-is-bigquery), 
# * the BigQuery [web UI](https://bigquery.cloud.google.com/) where you can run queries interactively, 
# * [IPython](http://ipython.org/) (now known as [Jupyter](http://jupyter.org/)), and 
# * [Cloud Datalab](https://cloud.google.com/datalab/) the recently announced interactive cloud-based platform that this notebook is being developed on. 
# 
# There are also many tutorials and samples available on github (see, in particular, the [datalab](https://github.com/GoogleCloudPlatform/datalab) repo and the [Google Genomics](  https://github.com/googlegenomics) project).
# 
# In order to work with BigQuery, the first thing you need to do is import the [gcp.bigquery](http://googlecloudplatform.github.io/datalab/gcp.bigquery.html) package:

# In[6]:


import gcp.bigquery as bq


# The next thing you need to know is how to access the specific tables you are interested in.  BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project.  The tables we are introducing in this notebook are in a dataset called **`tcga_201607_beta`**, owned by the **`isb-cgc`** project.  A full table identifier is of the form `<project_id>:<dataset_id>.<table_id>`.  Let's start by getting some basic information about the tables in this dataset:

# In[7]:


d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
  print '%10d rows  %12d bytes   %s' \
      % (t.metadata.rows, t.metadata.size, t.name.table_id)


# These tables are based on the open-access TCGA data as of July 2016.  The molecular data is all "Level 3" data, and is divided according to platform/pipeline.  See [here](https://tcga-data.nci.nih.gov/tcga/tcgaDataType.jsp) for additional details regarding the TCGA data levels and data types.
# 
# Additional notebooks go into each of these tables in more detail, but here is an overview, in the same alphabetical order that they are listed in above and in the BigQuery web UI:
# 
# 
# - **Annotations**:  This table contains the annotations that are also available from the interactive [TCGA Annotations Manager](https://tcga-data.nci.nih.gov/annotations/).  Annotations can be associated with any type of "item" (*eg* Patient, Sample, Aliquot, etc), and a single item may have more than one annotation.  Common annotations include "Item flagged DNU", "Item is noncanonical", and "Prior malignancy."  More information about this table can be found in the [TCGA Annotations](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/TCGA%20Annotations.ipynb) notebook.
# 
# 
# - **Biospecimen_data**:  This table contains information obtained from the "biospecimen" and "auxiliary" XML files in the TCGA Level-1 "bio" archives.  Each row in this table represents a single "biospecimen" or "sample".  Most participants in the TCGA project provided two samples: a "primary tumor" sample and a "blood normal" sample, but others provided normal-tissue, metastatic, or other types of samples.  This table contains metadata about all of the samples, and more information about exploring this table and using this information to create your own custom analysis cohort can be found  in the [Creating TCGA cohorts (part 1)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [(part 2)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb) notebooks.
# 
# 
# - **Clinical_data**:  This table contains information obtained from the "clinical" XML files in the TCGA Level-1 "bio" archives.  Not all fields in the XML files are represented in this table, but any field which was found to be significantly filled-in for at least one tumor-type has been retained.  More information about exploring this table and using this information to create your own custom analysis cohort can be found in the [Creating TCGA cohorts (part 1)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [(part 2)](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb) notebooks.
# 
# 
# - **Copy_Number_segments**:  This table contains Level-3 copy-number segmentation results generated by The Broad Institute, from Genome Wide SNP 6 data using the CBS (Circular Binary Segmentation) algorithm.  The values are base2 log(copynumber/2), centered on 0.  More information about this data table can be found in the [Copy Number segments](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Copy%20Number%20segments.ipynb) notebook.
# 
# 
# - **DNA_Methylation_betas**:  This table contains Level-3 summary measures of DNA methylation for each interrogated locus (beta values: M/(M+U)).  This table contains data from two different platforms: the Illumina Infinium HumanMethylation 27k and 450k arrays.  More information about this data table can be found in the [DNA Methylation](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/DNA%20Methylation.ipynb) notebook.  Note that individual chromosome-specific DNA Methylation tables are also available to cut down on the amount of data that you may need to query (depending on yoru use case).  
# 
# 
# - **Protein_RPPA_data**:  This table contains the normalized Level-3 protein expression levels based on each antibody used to probe the sample.  More information about how this data was generated by the RPPA Core Facility at MD Anderson can be found [here](https://wiki.nci.nih.gov/display/TCGA/Protein+Array+Data+Format+Specification#ProteinArrayDataFormatSpecification-Expression-Protein), and more information about this data table can be found in the [Protein expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Protein%20expression.ipynb) notebook.
# 
# 
# - **Somatic_Mutation_calls**: This table contains annotated somatic mutation calls.  All current MAF (Mutation Annotation Format) files were annotated using [Oncotator](http://onlinelibrary.wiley.com/doi/10.1002/humu.22771/abstract;jsessionid=15E7960BA5FEC21EE608E6D262390C52.f01t04) v1.5.1.0, and merged into a single table.  More information about this data table can be found in the [Somatic Mutations](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Somatic%20Mutations.ipynb) notebook, including an example of how to use the [Tute Genomics annotations database in BigQuery](http://googlegenomics.readthedocs.org/en/latest/use_cases/annotate_variants/tute_annotation.html).
# 
# 
# - **mRNA_BCGSC_HiSeq_RPKM**: This table contains mRNAseq-based gene expression data produced by the [BC Cancer Agency](http://www.bcgsc.ca/).  (For details about a very similar table, take a look at a [notebook](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/UNC%20HiSeq%20mRNAseq%20gene%20expression.ipynb) describing the other mRNAseq gene expression table.)
# 
# 
# - **mRNA_UNC_HiSeq_RSEM**: This table contains mRNAseq-based gene expression data produced by [UNC Lineberger](https://unclineberger.org/).  More information about this data table can be found in the [UNC HiSeq mRNAseq gene expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/UNC%20HiSeq%20mRNAseq%20gene%20expression.ipynb) notebook.
# 
# 
# - **miRNA_expression**: This table contains miRNAseq-based expression data for mature microRNAs produced by the [BC Cancer Agency](http://www.bcgsc.ca/).  More information about this data table can be found in the [microRNA expression](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/BCGSC%20microRNA%20expression.ipynb) notebook.

# ### Where to start?
# We suggest that you start with the two "Creating TCGA cohorts" notebooks ([part 1](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb) and [part 2](https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%202.ipynb)) which describe and make use of the Clinical and Biospecimen tables.  From there you can delve into the various molecular data tables as well as the Annotations table.  For now these sample notebooks are intentionally relatively simple and do not do any analysis that integrates data from multiple tables but once you have a grasp of how to use the data, developing your own more complex analyses should not be difficult.  You could even contribute an example back to our github repository!  You are also welcome to submit bug reports, comments, and feature-requests as [github issues](https://github.com/isb-cgc/examples-Python/issues).

# ### A note about BigQuery tables and "tidy data"
# You may be used to thinking about a molecular data table such as a gene-expression table as a matrix where the rows are genes and the columns are samples (or *vice versa*).  These BigQuery tables instead use the [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) approach, with each "cell" from the traditional data-matrix becoming a single row in the BigQuery table.  A 10,000 gene x 500 sample matrix would therefore become a 5,000,000 row BigQuery table.