Creating TCGA cohorts (part 1)¶

This notebook will show you how to create a TCGA cohort using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free-trial or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.

We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some links to some resources that you might find useful:

BigQuery,
the BigQuery web UI where you can run queries interactively,
IPython (now known as Jupyter), and
Cloud Datalab the recently announced interactive cloud-based platform that this notebook is being developed on.

There are also many tutorials and samples available on github (see, in particular, the datalab repo and the Google Genomics project).

OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:

In [1]:

import gcp.bigquery as bq

The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are in a dataset called tcga_201607_beta, owned by the isb-cgc project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:

In [2]:

d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
  print '%10d rows  %12d bytes   %s' \
      % (t.metadata.rows, t.metadata.size, t.name.table_id)

      6322 rows       1729204 bytes   Annotations
     23797 rows       6382147 bytes   Biospecimen_data
     11160 rows       4201379 bytes   Clinical_data
   2646095 rows     333774244 bytes   Copy_Number_segments
3944304319 rows  445303830985 bytes   DNA_Methylation_betas
 382335670 rows   43164264006 bytes   DNA_Methylation_chr1
 197519895 rows   22301345198 bytes   DNA_Methylation_chr10
 235823572 rows   26623975945 bytes   DNA_Methylation_chr11
 198050739 rows   22359642619 bytes   DNA_Methylation_chr12
  97301675 rows   10986815862 bytes   DNA_Methylation_chr13
 123239379 rows   13913712352 bytes   DNA_Methylation_chr14
 124566185 rows   14064712239 bytes   DNA_Methylation_chr15
 179772812 rows   20296128173 bytes   DNA_Methylation_chr16
 234003341 rows   26417830751 bytes   DNA_Methylation_chr17
  50216619 rows    5669139362 bytes   DNA_Methylation_chr18
 211386795 rows   23862583107 bytes   DNA_Methylation_chr19
 279668485 rows   31577200462 bytes   DNA_Methylation_chr2
  86858120 rows    9805923353 bytes   DNA_Methylation_chr20
  35410447 rows    3997986812 bytes   DNA_Methylation_chr21
  70676468 rows    7978947938 bytes   DNA_Methylation_chr22
 201119616 rows   22705358910 bytes   DNA_Methylation_chr3
 159148744 rows   17968482285 bytes   DNA_Methylation_chr4
 195864180 rows   22113162401 bytes   DNA_Methylation_chr5
 290275524 rows   32772371379 bytes   DNA_Methylation_chr6
 240010275 rows   27097948808 bytes   DNA_Methylation_chr7
 164810092 rows   18607886221 bytes   DNA_Methylation_chr8
  81260723 rows    9173717922 bytes   DNA_Methylation_chr9
  98082681 rows   11072059468 bytes   DNA_Methylation_chrX
   2330426 rows     263109775 bytes   DNA_Methylation_chrY
   1867233 rows     207365611 bytes   Protein_RPPA_data
   5356089 rows    5715538107 bytes   Somatic_Mutation_calls
   5738048 rows     657855993 bytes   mRNA_BCGSC_GA_RPKM
  38299138 rows    4459086535 bytes   mRNA_BCGSC_HiSeq_RPKM
  44037186 rows    5116942528 bytes   mRNA_BCGSC_RPKM
  16794358 rows    1934755686 bytes   mRNA_UNC_GA_RSEM
 211284521 rows   24942992190 bytes   mRNA_UNC_HiSeq_RSEM
 228078879 rows   26877747876 bytes   mRNA_UNC_RSEM
  11997545 rows    2000881026 bytes   miRNA_BCGSC_GA_isoform
   4503046 rows     527101917 bytes   miRNA_BCGSC_GA_mirna
  90237323 rows   15289326462 bytes   miRNA_BCGSC_HiSeq_isoform
  28207741 rows    3381212265 bytes   miRNA_BCGSC_HiSeq_mirna
 102234868 rows   17290207488 bytes   miRNA_BCGSC_isoform
  32710787 rows    3908314182 bytes   miRNA_BCGSC_mirna
  26763022 rows    3265303352 bytes   miRNA_Expression

In this tutorial, we are going to look at a few different ways that we can use the information in these tables to create cohorts. Now, you maybe asking what we mean by "cohort" and why you might be interested in creating one, or maybe what it even means to "create" a cohort. The TCGA dataset includes clinical, biospecimen, and molecular data from over 10,000 cancer patients who agreed to be a part of this landmark research project to build The Cancer Genome Atlas. This large dataset was originally organized and studied according to cancer type but now that this multi-year project is nearing completion, with over 30 types of cancer and over 10,000 tumors analyzed, you have the opportunity to look at this dataset from whichever angle most interests you. Maybe you are particularly interested in early-onset cancers, or gastro-intestinal cancers, or a specific type of genetic mutation. This is where the idea of a "cohort" comes in. The original TCGA "cohorts" were based on cancer type (aka "study"), but now you can define a cohort based on virtually any clinical or molecular feature by querying these BigQuery tables. A cohort is simply a list of samples, using the TCGA barcode system. Once you have created a cohort you can use it in any number of ways: you could further explore the data available for one cohort, or compare one cohort to another, for example.

In the rest of this tutorial, we will create several different cohorts based on different motivating research questions. We hope that these examples will provide you with a starting point from which you can build, to answer your own research questions.

Exploring the Clinical data table¶

Let's start by looking at the clinical data table. The TCGA dataset contains a few very basic clinical data elements for almost all patients, and contains additional information for some tumor types only. For example smoking history information is generally available only for lung cancer patients, and BMI (body mass index) is only available for tumor types where that is a known significant risk factor. Let's take a look at the clinical data table and see how many different pieces of information are available to us:

In [3]:

%bigquery schema --table isb-cgc:tcga_201607_beta.Clinical_data

Out[3]:

That's a lot of fields! We can also get at the schema programmatically:

In [4]:

table = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data')
if ( table.exists() ):
    fieldNames = map(lambda tsf: tsf.name, table.schema)
    fieldTypes = map(lambda tsf: tsf.data_type, table.schema)
    print " This table has %d fields. " % ( len(fieldNames) )
    print " The first few field names and types are: " 
    print "     ", fieldNames[:5]
    print "     ", fieldTypes[:5]
else: 
    print " There is no existing table called %s:%s.%s" % ( table.name.project_id, table.name.dataset_id, table.name.table_id )

 This table has 70 fields. 
 The first few field names and types are: 
      [u'ParticipantBarcode', u'Study', u'Project', u'ParticipantUUID', u'TSSCode']
      [u'STRING', u'STRING', u'STRING', u'STRING', u'STRING']

Let's look at these fields and see which ones might be the most "interesting", by looking at how many times they are filled-in (not NULL), or how much variation exists in the values. If we wanted to look at just a single field, "tobacco_smoking_history" for example, we could use a very simple query to get a basic summary:

In [5]:

%%sql 

SELECT tobacco_smoking_history, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
GROUP BY tobacco_smoking_history
ORDER BY n DESC

Out[5]:

tobacco_smoking_history	n
	8161
1	865
4	799
2	710
3	568
5	57

(rows: 6, time: 1.3s, 8KB processed, job: job_Tp9_oGNMvJ7f0ZcTwdt2TirZePk)

But if we want to loop over all fields and get a sense of which fields might provide us with useful criteria for specifying a cohort, we'll want to automate that. We'll put a threshold on the minimum number of patients that we expect information for, and the maximum number of unique values (since fields such as the "ParticipantBarcode" will be unique for every patient and, although we will need that field later, it's probably not useful for defining a cohort).

In [15]:

numPatients = table.metadata.rows
print " The %s table describes a total of %d patients. " % ( table.name.table_id, numPatients )

# let's set a threshold for the minimum number of values that a field should have,
# and also the maximum number of unique values
minNumPatients = int(numPatients*0.80)
maxNumValues = 50

numInteresting = 0
iList = []
for iField in range(len(fieldNames)):
  aField = fieldNames[iField]
  aType = fieldTypes[iField]
  try:
    qString = "SELECT {0} FROM [{1}]".format(aField,table)
    query = bq.Query(qString)
    df = query.to_dataframe()
    summary = df[str(aField)].describe()
    if ( aType == "STRING" ):
      topFrac = float(summary['freq'])/float(summary['count'])
      if ( summary['count'] >= minNumPatients ):
        if ( summary['unique'] <= maxNumValues and summary['unique'] > 1 ):
          if ( topFrac < 0.90 ):
            numInteresting += 1
            iList += [aField]
            print "     > %s has %d values with %d unique (%s occurs %d times) " \
              % (str(aField), summary['count'], summary['unique'], summary['top'], summary['freq'])
    else:
      if ( summary['count'] >= minNumPatients ):
        if ( summary['std'] > 0.1 ):
          numInteresting += 1
          iList += [aField]
          print "     > %s has %d values (mean=%.0f, sigma=%.0f) " \
            % (str(aField), summary['count'], summary['mean'], summary['std'])
  except:
    pass

print " "
print " Found %d potentially interesting features: " % numInteresting
print "   ", iList

 The Clinical_data table describes a total of 11160 patients. 
     > Study has 11160 values with 33 unique (BRCA occurs 1097 times) 
     > age_at_initial_pathologic_diagnosis has 11109 values (mean=59, sigma=14) 
     > batch_number has 11160 values (mean=203, sigma=135) 
     > vital_status has 11156 values with 2 unique (Alive occurs 7534 times) 
     > days_to_birth has 11041 values (mean=-21763, sigma=5266) 
     > days_to_last_known_alive has 11102 values (mean=1037, sigma=1041) 
     > gender has 11160 values with 2 unique (FEMALE occurs 5815 times) 
     > year_of_initial_pathologic_diagnosis has 11030 values (mean=2008, sigma=4) 
     > person_neoplasm_cancer_status has 10236 values with 2 unique (TUMOR FREE occurs 6507 times) 
     > race has 9835 values with 5 unique (WHITE occurs 8186 times) 
 
 Found 10 potentially interesting features: 
    [u'Study', u'age_at_initial_pathologic_diagnosis', u'batch_number', u'vital_status', u'days_to_birth', u'days_to_last_known_alive', u'gender', u'year_of_initial_pathologic_diagnosis', u'person_neoplasm_cancer_status', u'race']

The above helps us narrow down on which fields are likely to be the most useful, but if you have a specific interest, for example in menopause or HPV status, you can still look at those in more detail very easily:

In [16]:

%%sql
SELECT menopause_status, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY menopause_status
ORDER BY n DESC

Out[16]:

menopause_status	n
Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)	1291
Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)	389
Peri (6-12 months since last menstrual period)	82
Indeterminate (neither Pre or Postmenopausal)	54

(rows: 4, time: 0.8s, cached, job: job_R6HF_RzYdXcntPKVIy06TNbDi8Q)

We might wonder which specific tumor types have menopause information:

In [17]:

%%sql
SELECT Study, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY Study
ORDER BY n DESC

Out[17]:

Study	n
BRCA	1007
UCEC	517
CESC	237
UCS	55

(rows: 4, time: 0.5s, cached, job: job_WIGSHA6zqYL1yChZi9jnwh-ZL-A)

In [18]:

%%sql
SELECT hpv_status, hpv_calls, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE hpv_status IS NOT NULL
GROUP BY hpv_status, hpv_calls
HAVING n > 20
ORDER BY n DESC

Out[18]:

hpv_status	hpv_calls	n
Negative		664
Positive	HPV16	238
Positive	HPV18	41
Positive	HPV33	25
Positive	HPV45	24

(rows: 5, time: 0.6s, cached, job: job_4BS0EXKldZBN8Y2QcbPA--vK6_g)

TCGA Annotations¶

An additional factor to consider, when creating a cohort is that there may be additional information that might lead one to exclude a particular patient from a cohort. In certain instances, patients have been redacted or excluded from analyses for reasons such as prior treatment, etc, but since different researchers may have different criteria for using or excluding certain patients or certain samples from their analyses, in many cases the data is still available while at the same time "annotations" may have been entered into a searchable database. These annotations have also been uploaded into a BigQuery table and can be used in conjuction with the other BigQuery tables.

Early-onset Breast Cancer¶

Now that we have a better idea of what types of information is available in the Clinical data table, let's create a cohort consisting of female breast-cancer patients, diagnosed at the age of 50 or younger.

In this next code cell, we define several queries within a module which allows us to use them both individually and by reference in the final, main query.

the first query, called select_on_annotations, finds all patients in the Annotations table which have either been 'redacted' or had 'unacceptable prior treatment';
the second query, select_on_clinical selects all female breast-cancer patients who were diagnosed at age 50 or younger, while also pulling out a few additional fields that might be of interest; and
the final query joins these two together and returns just those patients that meet the clinical-criteria and do not meet the exclusion-criteria.

In [19]:

%%sql --module createCohort_and_checkAnnotations

DEFINE QUERY select_on_annotations
SELECT
  ParticipantBarcode,
  annotationCategoryName AS categoryName,
  annotationClassification AS classificationName
FROM
  [isb-cgc:tcga_201607_beta.Annotations]
WHERE
  ( itemTypeName="Patient"
    AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
      OR annotationClassification="Redaction" ) )
GROUP BY
  ParticipantBarcode,
  categoryName,
  classificationName

DEFINE QUERY select_on_clinical
SELECT
  ParticipantBarcode,
  vital_status,
  days_to_last_known_alive,
  ethnicity,
  histological_type,
  menopause_status,
  race
FROM
  [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
  ( Study="BRCA"
    AND age_at_initial_pathologic_diagnosis<=50
    AND gender="FEMALE" )

SELECT
  c.ParticipantBarcode AS ParticipantBarcode
FROM (
  SELECT
    a.categoryName,
    a.classificationName,
    a.ParticipantBarcode,
    c.ParticipantBarcode,
  FROM ( $select_on_annotations ) AS a
  OUTER JOIN EACH 
       ( $select_on_clinical ) AS c
  ON
    a.ParticipantBarcode = c.ParticipantBarcode
  WHERE
    (a.ParticipantBarcode IS NOT NULL
      OR c.ParticipantBarcode IS NOT NULL)
  ORDER BY
    a.classificationName,
    a.categoryName,
    a.ParticipantBarcode,
    c.ParticipantBarcode )
WHERE
  ( a.categoryName IS NULL
    AND a.classificationName IS NULL
    AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
  c.ParticipantBarcode

Here we explicitly call just the first query in the module, and we get a list of 212 patients with one of these disqualifying annotations:

In [20]:

bq.Query(createCohort_and_checkAnnotations.select_on_annotations).results().to_dataframe()

Out[20]:

	ParticipantBarcode	categoryName	classificationName
0	TCGA-01-0629	Tumor tissue origin incorrect	Redaction
1	TCGA-13-1479	Tumor tissue origin incorrect	Redaction
2	TCGA-01-0638	Tumor tissue origin incorrect	Redaction
3	TCGA-33-4579	Tumor tissue origin incorrect	Redaction
4	TCGA-GN-A261	Tumor tissue origin incorrect	Redaction
5	TCGA-66-2751	Genotype mismatch	Redaction
6	TCGA-66-2752	Genotype mismatch	Redaction
7	TCGA-66-2750	Genotype mismatch	Redaction
8	TCGA-66-2746	Genotype mismatch	Redaction
9	TCGA-66-2747	Genotype mismatch	Redaction
10	TCGA-35-3621	Genotype mismatch	Redaction
11	TCGA-02-0002	Genotype mismatch	Redaction
12	TCGA-02-0117	Genotype mismatch	Redaction
13	TCGA-08-0384	Genotype mismatch	Redaction
14	TCGA-E2-A1IP	Genotype mismatch	Redaction
15	TCGA-14-0784	Genotype mismatch	Redaction
16	TCGA-14-1036	Genotype mismatch	Redaction
17	TCGA-06-0748	Genotype mismatch	Redaction
18	TCGA-02-2488	Genotype mismatch	Redaction
19	TCGA-14-1824	Genotype mismatch	Redaction
20	TCGA-PN-A8M9	Genotype mismatch	Redaction
21	TCGA-12-1601	Subject withdrew consent	Redaction
22	TCGA-12-0653	Subject withdrew consent	Redaction
23	TCGA-32-2498	Subject withdrew consent	Redaction
24	TCGA-AF-3912	Subject withdrew consent	Redaction
25	TCGA-A6-2670	Subject withdrew consent	Redaction
26	TCGA-06-0131	Subject withdrew consent	Redaction
27	TCGA-AN-A0FG	Subject identity unknown	Redaction
28	TCGA-AN-A0FE	Subject identity unknown	Redaction
29	TCGA-F4-6857	Subject identity unknown	Redaction
...	...	...	...
182	TCGA-AP-A053	History of unacceptable prior treatment relate...	Notification
183	TCGA-AX-A06D	History of unacceptable prior treatment relate...	Notification
184	TCGA-AX-A1CP	History of unacceptable prior treatment relate...	Notification
185	TCGA-AX-A1CR	History of unacceptable prior treatment relate...	Notification
186	TCGA-AX-A2H8	History of unacceptable prior treatment relate...	Notification
187	TCGA-AX-A2HF	History of unacceptable prior treatment relate...	Notification
188	TCGA-AX-A3G3	History of unacceptable prior treatment relate...	Notification
189	TCGA-B5-A0KB	History of unacceptable prior treatment relate...	Notification
190	TCGA-BG-A221	History of unacceptable prior treatment relate...	Notification
191	TCGA-D1-A3JP	History of unacceptable prior treatment relate...	Notification
192	TCGA-EY-A1G8	History of unacceptable prior treatment relate...	Notification
193	TCGA-L5-A88T	History of unacceptable prior treatment relate...	Notification
194	TCGA-WB-A820	History of unacceptable prior treatment relate...	Notification
195	TCGA-XK-AAJ3	History of unacceptable prior treatment relate...	Notification
196	TCGA-EJ-7312	History of unacceptable prior treatment relate...	Notification
197	TCGA-96-A4JK	History of unacceptable prior treatment relate...	Notification
198	TCGA-2G-AAFE	History of unacceptable prior treatment relate...	Notification
199	TCGA-IC-A6RF	History of unacceptable prior treatment relate...	Notification
200	TCGA-BA-4075	History of unacceptable prior treatment relate...	Notification
201	TCGA-06-6391	History of unacceptable prior treatment relate...	Notification
202	TCGA-5L-AAT1	History of unacceptable prior treatment relate...	Notification
203	TCGA-HT-A619	History of unacceptable prior treatment relate...	Notification
204	TCGA-T1-A6J8	History of unacceptable prior treatment relate...	Notification
205	TCGA-BG-A0M8	History of unacceptable prior treatment relate...	Notification
206	TCGA-XK-AAK1	History of unacceptable prior treatment relate...	Notification
207	TCGA-BH-A0B6	History of unacceptable prior treatment relate...	Notification
208	TCGA-BG-A0MS	History of unacceptable prior treatment relate...	Notification
209	TCGA-AR-A2LR	History of unacceptable prior treatment relate...	Notification
210	TCGA-BH-A1F5	History of unacceptable prior treatment relate...	Notification
211	TCGA-DM-A286	Inadvertently shipped	Redaction

212 rows × 3 columns

and here we explicitly call just the second query, resulting in 329 patients:

In [21]:

bq.Query(createCohort_and_checkAnnotations.select_on_clinical).results().to_dataframe()

Out[21]:

	ParticipantBarcode	vital_status	days_to_last_known_alive	ethnicity	histological_type	menopause_status	race
0	TCGA-BH-A18M	Dead	2207	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	WHITE
1	TCGA-BH-A18V	Dead	1556	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	WHITE
2	TCGA-Z7-A8R6	Alive	3256	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
3	TCGA-3C-AALI	Alive	4005	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
4	TCGA-4H-AAAK	Alive	348	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
5	TCGA-5L-AAT0	Alive	1477	HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
6	TCGA-A1-A0SN	Alive	1196	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
7	TCGA-A1-A0SJ	Alive	416	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	BLACK OR AFRICAN AMERICAN
8	TCGA-A1-A0SQ	Alive	554	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
9	TCGA-A1-A0SP	Alive	584	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	None
10	TCGA-A1-A0SH	Alive	1437	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
11	TCGA-A2-A25E	Alive	3204	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
12	TCGA-A2-A25B	Alive	1291	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
13	TCGA-A2-A0SX	Alive	1534	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
14	TCGA-A2-A0YL	Alive	1474	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
15	TCGA-A2-A0D4	Alive	767	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
16	TCGA-A2-A3XZ	Alive	1532	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
17	TCGA-A2-A0D2	Alive	1027	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
18	TCGA-A2-A04U	Alive	2654	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Peri (6-12 months since last menstrual period)	WHITE
19	TCGA-A2-A04V	Dead	1920	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
20	TCGA-A2-A0CV	Alive	3011	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	ASIAN
21	TCGA-A2-A0T3	Alive	1516	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
22	TCGA-A2-A3XU	Dead	912	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
23	TCGA-A2-A3XV	Alive	996	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
24	TCGA-A2-A0T6	Alive	575	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
25	TCGA-A2-A0D3	Alive	1873	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
26	TCGA-A2-A0EX	Alive	752	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
27	TCGA-A2-A25A	Alive	3276	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
28	TCGA-A2-A3XT	Alive	2770	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
29	TCGA-A2-A1G0	Alive	616	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
...	...	...	...	...	...	...	...
299	TCGA-GM-A2DA	Dead	6593	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
300	TCGA-GM-A2DL	Alive	3519	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Peri (6-12 months since last menstrual period)	WHITE
301	TCGA-GM-A3XN	Alive	2019	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE
302	TCGA-GM-A3XL	Alive	2108	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
303	TCGA-GM-A3XG	Alive	1330	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
304	TCGA-HN-A2OB	Dead	1900	None	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	None
305	TCGA-JL-A3YX	Alive	352	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	ASIAN
306	TCGA-JL-A3YW	Alive	360	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	ASIAN
307	TCGA-LD-A66U	Alive	646	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	None	WHITE
308	TCGA-LL-A5YP	Alive	450	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
309	TCGA-LL-A7SZ	Alive	594	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
310	TCGA-LL-A6FR	Alive	489	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
311	TCGA-LL-A5YN	Alive	447	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
312	TCGA-LL-A5YO	Alive	440	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
313	TCGA-MS-A51U	Alive	681	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	None	WHITE
314	TCGA-OL-A5RV	Alive	1062	None	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
315	TCGA-OL-A5RW	Alive	1106	None	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
316	TCGA-OL-A5D8	Alive	973	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
317	TCGA-OL-A6VQ	Alive	600	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
318	TCGA-OL-A6VO	Alive	858	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	BLACK OR AFRICAN AMERICAN
319	TCGA-OL-A66I	Alive	714	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	BLACK OR AFRICAN AMERICAN
320	TCGA-OL-A6VR	Alive	1220	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	BLACK OR AFRICAN AMERICAN
321	TCGA-OL-A66O	Alive	528	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
322	TCGA-PE-A5DE	Alive	2645	None	Infiltrating Lobular Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	WHITE
323	TCGA-PL-A8LZ	Alive	302	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
324	TCGA-PL-A8LY	Alive	8	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
325	TCGA-PL-A8LX	Alive	5	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Pre (<6 months since LMP AND no prior bilatera...	BLACK OR AFRICAN AMERICAN
326	TCGA-S3-AA14	Alive	529	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	BLACK OR AFRICAN AMERICAN
327	TCGA-S3-A6ZH	Alive	641	NOT HISPANIC OR LATINO	Infiltrating Ductal Carcinoma	None	BLACK OR AFRICAN AMERICAN
328	TCGA-XX-A899	Alive	467	NOT HISPANIC OR LATINO	Infiltrating Lobular Carcinoma	Post (prior bilateral ovariectomy OR >12 mo si...	WHITE

329 rows × 7 columns

and finally we call the main query:

In [22]:

bq.Query(createCohort_and_checkAnnotations).results().to_dataframe()

Out[22]:

	ParticipantBarcode
0	TCGA-3C-AALI
1	TCGA-4H-AAAK
2	TCGA-5L-AAT0
3	TCGA-A1-A0SH
4	TCGA-A1-A0SJ
5	TCGA-A1-A0SN
6	TCGA-A1-A0SP
7	TCGA-A1-A0SQ
8	TCGA-A2-A04P
9	TCGA-A2-A04Q
10	TCGA-A2-A04R
11	TCGA-A2-A04U
12	TCGA-A2-A04V
13	TCGA-A2-A04W
14	TCGA-A2-A04X
15	TCGA-A2-A0CL
16	TCGA-A2-A0CM
17	TCGA-A2-A0CV
18	TCGA-A2-A0CZ
19	TCGA-A2-A0D2
20	TCGA-A2-A0D3
21	TCGA-A2-A0D4
22	TCGA-A2-A0EX
23	TCGA-A2-A0SX
24	TCGA-A2-A0T3
25	TCGA-A2-A0T5
26	TCGA-A2-A0T6
27	TCGA-A2-A0YE
28	TCGA-A2-A0YJ
29	TCGA-A2-A0YL
...	...
297	TCGA-GM-A2DL
298	TCGA-GM-A3XG
299	TCGA-GM-A3XL
300	TCGA-GM-A3XN
301	TCGA-HN-A2OB
302	TCGA-JL-A3YW
303	TCGA-JL-A3YX
304	TCGA-LD-A66U
305	TCGA-LL-A5YN
306	TCGA-LL-A5YO
307	TCGA-LL-A5YP
308	TCGA-LL-A6FR
309	TCGA-LL-A7SZ
310	TCGA-MS-A51U
311	TCGA-OL-A5D8
312	TCGA-OL-A5RV
313	TCGA-OL-A5RW
314	TCGA-OL-A66I
315	TCGA-OL-A66O
316	TCGA-OL-A6VO
317	TCGA-OL-A6VQ
318	TCGA-OL-A6VR
319	TCGA-PE-A5DE
320	TCGA-PL-A8LX
321	TCGA-PL-A8LY
322	TCGA-PL-A8LZ
323	TCGA-S3-A6ZH
324	TCGA-S3-AA14
325	TCGA-XX-A899
326	TCGA-Z7-A8R6

327 rows × 1 columns

Note that we didn't need to call each sub-query individually, we could have just called the main query and gotten the same result. As you can see, two patients that met the clinical select criteria (which returned 329 patients) were excluded from the final result (which returned 327 patients).

Before we leave off, here are a few useful tricks for working with BigQuery in Cloud Datalab:

if you want to see the raw SQL, you can just build the query and then print it out (this might be useful, for example, in debugging a query -- you can copy and paste the SQL directly into the BigQuery Web UI);
if you want to see how much data and which tables are going to be touched by this data, you can use the "dry run" option. (Notice the "cacheHit" flag -- if you have recently done a particular query, you will not be charged to repeat it since it will have been cached.)

In [23]:

q = bq.Query(createCohort_and_checkAnnotations)
q

Out[23]:

SELECT
  c.ParticipantBarcode AS ParticipantBarcode
FROM (
  SELECT
    a.categoryName,
    a.classificationName,
    a.ParticipantBarcode,
    c.ParticipantBarcode,
  FROM ( (SELECT
  ParticipantBarcode,
  annotationCategoryName AS categoryName,
  annotationClassification AS classificationName
FROM
  [isb-cgc:tcga_201607_beta.Annotations]
WHERE
  ( itemTypeName="Patient"
    AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
      OR annotationClassification="Redaction" ) )
GROUP BY
  ParticipantBarcode,
  categoryName,
  classificationName) ) AS a
  OUTER JOIN EACH 
       ( (SELECT
  ParticipantBarcode,
  vital_status,
  days_to_last_known_alive,
  ethnicity,
  histological_type,
  menopause_status,
  race
FROM
  [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
  ( Study="BRCA"
    AND age_at_initial_pathologic_diagnosis<=50
    AND gender="FEMALE" )) ) AS c
  ON
    a.ParticipantBarcode = c.ParticipantBarcode
  WHERE
    (a.ParticipantBarcode IS NOT NULL
      OR c.ParticipantBarcode IS NOT NULL)
  ORDER BY
    a.classificationName,
    a.categoryName,
    a.ParticipantBarcode,
    c.ParticipantBarcode )
WHERE
  ( a.categoryName IS NULL
    AND a.classificationName IS NULL
    AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
  c.ParticipantBarcode

In [24]:

q.execute_dry_run()

Out[24]:

{u'cacheHit': True,
 u'referencedTables': [{u'datasetId': u'tcga_201607_beta',
   u'projectId': u'isb-cgc',
   u'tableId': u'Annotations'},
  {u'datasetId': u'tcga_201607_beta',
   u'projectId': u'isb-cgc',
   u'tableId': u'Clinical_data'}],
 u'totalBytesBilled': u'0',
 u'totalBytesProcessed': u'785007'}

In [ ]: