This notebook will show you how to create a TCGA cohort using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free-trial or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.
We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some links to some resources that you might find useful:
There are also many tutorials and samples available on github (see, in particular, the datalab repo and the Google Genomics project).
OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:
import gcp.bigquery as bq
The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are in a dataset called tcga_201607_beta
, owned by the isb-cgc
project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>
. Let's start by getting some basic information about the tables in this dataset:
d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
print '%10d rows %12d bytes %s' \
% (t.metadata.rows, t.metadata.size, t.name.table_id)
6322 rows 1729204 bytes Annotations 23797 rows 6382147 bytes Biospecimen_data 11160 rows 4201379 bytes Clinical_data 2646095 rows 333774244 bytes Copy_Number_segments 3944304319 rows 445303830985 bytes DNA_Methylation_betas 382335670 rows 43164264006 bytes DNA_Methylation_chr1 197519895 rows 22301345198 bytes DNA_Methylation_chr10 235823572 rows 26623975945 bytes DNA_Methylation_chr11 198050739 rows 22359642619 bytes DNA_Methylation_chr12 97301675 rows 10986815862 bytes DNA_Methylation_chr13 123239379 rows 13913712352 bytes DNA_Methylation_chr14 124566185 rows 14064712239 bytes DNA_Methylation_chr15 179772812 rows 20296128173 bytes DNA_Methylation_chr16 234003341 rows 26417830751 bytes DNA_Methylation_chr17 50216619 rows 5669139362 bytes DNA_Methylation_chr18 211386795 rows 23862583107 bytes DNA_Methylation_chr19 279668485 rows 31577200462 bytes DNA_Methylation_chr2 86858120 rows 9805923353 bytes DNA_Methylation_chr20 35410447 rows 3997986812 bytes DNA_Methylation_chr21 70676468 rows 7978947938 bytes DNA_Methylation_chr22 201119616 rows 22705358910 bytes DNA_Methylation_chr3 159148744 rows 17968482285 bytes DNA_Methylation_chr4 195864180 rows 22113162401 bytes DNA_Methylation_chr5 290275524 rows 32772371379 bytes DNA_Methylation_chr6 240010275 rows 27097948808 bytes DNA_Methylation_chr7 164810092 rows 18607886221 bytes DNA_Methylation_chr8 81260723 rows 9173717922 bytes DNA_Methylation_chr9 98082681 rows 11072059468 bytes DNA_Methylation_chrX 2330426 rows 263109775 bytes DNA_Methylation_chrY 1867233 rows 207365611 bytes Protein_RPPA_data 5356089 rows 5715538107 bytes Somatic_Mutation_calls 5738048 rows 657855993 bytes mRNA_BCGSC_GA_RPKM 38299138 rows 4459086535 bytes mRNA_BCGSC_HiSeq_RPKM 44037186 rows 5116942528 bytes mRNA_BCGSC_RPKM 16794358 rows 1934755686 bytes mRNA_UNC_GA_RSEM 211284521 rows 24942992190 bytes mRNA_UNC_HiSeq_RSEM 228078879 rows 26877747876 bytes mRNA_UNC_RSEM 11997545 rows 2000881026 bytes miRNA_BCGSC_GA_isoform 4503046 rows 527101917 bytes miRNA_BCGSC_GA_mirna 90237323 rows 15289326462 bytes miRNA_BCGSC_HiSeq_isoform 28207741 rows 3381212265 bytes miRNA_BCGSC_HiSeq_mirna 102234868 rows 17290207488 bytes miRNA_BCGSC_isoform 32710787 rows 3908314182 bytes miRNA_BCGSC_mirna 26763022 rows 3265303352 bytes miRNA_Expression
In this tutorial, we are going to look at a few different ways that we can use the information in these tables to create cohorts. Now, you maybe asking what we mean by "cohort" and why you might be interested in creating one, or maybe what it even means to "create" a cohort. The TCGA dataset includes clinical, biospecimen, and molecular data from over 10,000 cancer patients who agreed to be a part of this landmark research project to build The Cancer Genome Atlas. This large dataset was originally organized and studied according to cancer type but now that this multi-year project is nearing completion, with over 30 types of cancer and over 10,000 tumors analyzed, you have the opportunity to look at this dataset from whichever angle most interests you. Maybe you are particularly interested in early-onset cancers, or gastro-intestinal cancers, or a specific type of genetic mutation. This is where the idea of a "cohort" comes in. The original TCGA "cohorts" were based on cancer type (aka "study"), but now you can define a cohort based on virtually any clinical or molecular feature by querying these BigQuery tables. A cohort is simply a list of samples, using the TCGA barcode system. Once you have created a cohort you can use it in any number of ways: you could further explore the data available for one cohort, or compare one cohort to another, for example.
In the rest of this tutorial, we will create several different cohorts based on different motivating research questions. We hope that these examples will provide you with a starting point from which you can build, to answer your own research questions.
Let's start by looking at the clinical data table. The TCGA dataset contains a few very basic clinical data elements for almost all patients, and contains additional information for some tumor types only. For example smoking history information is generally available only for lung cancer patients, and BMI (body mass index) is only available for tumor types where that is a known significant risk factor. Let's take a look at the clinical data table and see how many different pieces of information are available to us:
%bigquery schema --table isb-cgc:tcga_201607_beta.Clinical_data
That's a lot of fields! We can also get at the schema programmatically:
table = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data')
if ( table.exists() ):
fieldNames = map(lambda tsf: tsf.name, table.schema)
fieldTypes = map(lambda tsf: tsf.data_type, table.schema)
print " This table has %d fields. " % ( len(fieldNames) )
print " The first few field names and types are: "
print " ", fieldNames[:5]
print " ", fieldTypes[:5]
else:
print " There is no existing table called %s:%s.%s" % ( table.name.project_id, table.name.dataset_id, table.name.table_id )
This table has 70 fields. The first few field names and types are: [u'ParticipantBarcode', u'Study', u'Project', u'ParticipantUUID', u'TSSCode'] [u'STRING', u'STRING', u'STRING', u'STRING', u'STRING']
Let's look at these fields and see which ones might be the most "interesting", by looking at how many times they are filled-in (not NULL), or how much variation exists in the values. If we wanted to look at just a single field, "tobacco_smoking_history" for example, we could use a very simple query to get a basic summary:
%%sql
SELECT tobacco_smoking_history, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
GROUP BY tobacco_smoking_history
ORDER BY n DESC
tobacco_smoking_history | n |
---|---|
8161 | |
1 | 865 |
4 | 799 |
2 | 710 |
3 | 568 |
5 | 57 |
But if we want to loop over all fields and get a sense of which fields might provide us with useful criteria for specifying a cohort, we'll want to automate that. We'll put a threshold on the minimum number of patients that we expect information for, and the maximum number of unique values (since fields such as the "ParticipantBarcode" will be unique for every patient and, although we will need that field later, it's probably not useful for defining a cohort).
numPatients = table.metadata.rows
print " The %s table describes a total of %d patients. " % ( table.name.table_id, numPatients )
# let's set a threshold for the minimum number of values that a field should have,
# and also the maximum number of unique values
minNumPatients = int(numPatients*0.80)
maxNumValues = 50
numInteresting = 0
iList = []
for iField in range(len(fieldNames)):
aField = fieldNames[iField]
aType = fieldTypes[iField]
try:
qString = "SELECT {0} FROM [{1}]".format(aField,table)
query = bq.Query(qString)
df = query.to_dataframe()
summary = df[str(aField)].describe()
if ( aType == "STRING" ):
topFrac = float(summary['freq'])/float(summary['count'])
if ( summary['count'] >= minNumPatients ):
if ( summary['unique'] <= maxNumValues and summary['unique'] > 1 ):
if ( topFrac < 0.90 ):
numInteresting += 1
iList += [aField]
print " > %s has %d values with %d unique (%s occurs %d times) " \
% (str(aField), summary['count'], summary['unique'], summary['top'], summary['freq'])
else:
if ( summary['count'] >= minNumPatients ):
if ( summary['std'] > 0.1 ):
numInteresting += 1
iList += [aField]
print " > %s has %d values (mean=%.0f, sigma=%.0f) " \
% (str(aField), summary['count'], summary['mean'], summary['std'])
except:
pass
print " "
print " Found %d potentially interesting features: " % numInteresting
print " ", iList
The Clinical_data table describes a total of 11160 patients. > Study has 11160 values with 33 unique (BRCA occurs 1097 times) > age_at_initial_pathologic_diagnosis has 11109 values (mean=59, sigma=14) > batch_number has 11160 values (mean=203, sigma=135) > vital_status has 11156 values with 2 unique (Alive occurs 7534 times) > days_to_birth has 11041 values (mean=-21763, sigma=5266) > days_to_last_known_alive has 11102 values (mean=1037, sigma=1041) > gender has 11160 values with 2 unique (FEMALE occurs 5815 times) > year_of_initial_pathologic_diagnosis has 11030 values (mean=2008, sigma=4) > person_neoplasm_cancer_status has 10236 values with 2 unique (TUMOR FREE occurs 6507 times) > race has 9835 values with 5 unique (WHITE occurs 8186 times) Found 10 potentially interesting features: [u'Study', u'age_at_initial_pathologic_diagnosis', u'batch_number', u'vital_status', u'days_to_birth', u'days_to_last_known_alive', u'gender', u'year_of_initial_pathologic_diagnosis', u'person_neoplasm_cancer_status', u'race']
The above helps us narrow down on which fields are likely to be the most useful, but if you have a specific interest, for example in menopause or HPV status, you can still look at those in more detail very easily:
%%sql
SELECT menopause_status, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY menopause_status
ORDER BY n DESC
menopause_status | n |
---|---|
Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy) | 1291 |
Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement) | 389 |
Peri (6-12 months since last menstrual period) | 82 |
Indeterminate (neither Pre or Postmenopausal) | 54 |
We might wonder which specific tumor types have menopause information:
%%sql
SELECT Study, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY Study
ORDER BY n DESC
Study | n |
---|---|
BRCA | 1007 |
UCEC | 517 |
CESC | 237 |
UCS | 55 |
%%sql
SELECT hpv_status, hpv_calls, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE hpv_status IS NOT NULL
GROUP BY hpv_status, hpv_calls
HAVING n > 20
ORDER BY n DESC
hpv_status | hpv_calls | n |
---|---|---|
Negative | 664 | |
Positive | HPV16 | 238 |
Positive | HPV18 | 41 |
Positive | HPV33 | 25 |
Positive | HPV45 | 24 |
An additional factor to consider, when creating a cohort is that there may be additional information that might lead one to exclude a particular patient from a cohort. In certain instances, patients have been redacted or excluded from analyses for reasons such as prior treatment, etc, but since different researchers may have different criteria for using or excluding certain patients or certain samples from their analyses, in many cases the data is still available while at the same time "annotations" may have been entered into a searchable database. These annotations have also been uploaded into a BigQuery table and can be used in conjuction with the other BigQuery tables.
Now that we have a better idea of what types of information is available in the Clinical data table, let's create a cohort consisting of female breast-cancer patients, diagnosed at the age of 50 or younger.
In this next code cell, we define several queries within a module
which allows us to use them both individually and by reference in the final, main query.
select_on_annotations
, finds all patients in the Annotations table which have either been 'redacted' or had 'unacceptable prior treatment';select_on_clinical
selects all female breast-cancer patients who were diagnosed at age 50 or younger, while also pulling out a few additional fields that might be of interest; and%%sql --module createCohort_and_checkAnnotations
DEFINE QUERY select_on_annotations
SELECT
ParticipantBarcode,
annotationCategoryName AS categoryName,
annotationClassification AS classificationName
FROM
[isb-cgc:tcga_201607_beta.Annotations]
WHERE
( itemTypeName="Patient"
AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
OR annotationClassification="Redaction" ) )
GROUP BY
ParticipantBarcode,
categoryName,
classificationName
DEFINE QUERY select_on_clinical
SELECT
ParticipantBarcode,
vital_status,
days_to_last_known_alive,
ethnicity,
histological_type,
menopause_status,
race
FROM
[isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
( Study="BRCA"
AND age_at_initial_pathologic_diagnosis<=50
AND gender="FEMALE" )
SELECT
c.ParticipantBarcode AS ParticipantBarcode
FROM (
SELECT
a.categoryName,
a.classificationName,
a.ParticipantBarcode,
c.ParticipantBarcode,
FROM ( $select_on_annotations ) AS a
OUTER JOIN EACH
( $select_on_clinical ) AS c
ON
a.ParticipantBarcode = c.ParticipantBarcode
WHERE
(a.ParticipantBarcode IS NOT NULL
OR c.ParticipantBarcode IS NOT NULL)
ORDER BY
a.classificationName,
a.categoryName,
a.ParticipantBarcode,
c.ParticipantBarcode )
WHERE
( a.categoryName IS NULL
AND a.classificationName IS NULL
AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
c.ParticipantBarcode
Here we explicitly call just the first query in the module, and we get a list of 212 patients with one of these disqualifying annotations:
bq.Query(createCohort_and_checkAnnotations.select_on_annotations).results().to_dataframe()
ParticipantBarcode | categoryName | classificationName | |
---|---|---|---|
0 | TCGA-01-0629 | Tumor tissue origin incorrect | Redaction |
1 | TCGA-13-1479 | Tumor tissue origin incorrect | Redaction |
2 | TCGA-01-0638 | Tumor tissue origin incorrect | Redaction |
3 | TCGA-33-4579 | Tumor tissue origin incorrect | Redaction |
4 | TCGA-GN-A261 | Tumor tissue origin incorrect | Redaction |
5 | TCGA-66-2751 | Genotype mismatch | Redaction |
6 | TCGA-66-2752 | Genotype mismatch | Redaction |
7 | TCGA-66-2750 | Genotype mismatch | Redaction |
8 | TCGA-66-2746 | Genotype mismatch | Redaction |
9 | TCGA-66-2747 | Genotype mismatch | Redaction |
10 | TCGA-35-3621 | Genotype mismatch | Redaction |
11 | TCGA-02-0002 | Genotype mismatch | Redaction |
12 | TCGA-02-0117 | Genotype mismatch | Redaction |
13 | TCGA-08-0384 | Genotype mismatch | Redaction |
14 | TCGA-E2-A1IP | Genotype mismatch | Redaction |
15 | TCGA-14-0784 | Genotype mismatch | Redaction |
16 | TCGA-14-1036 | Genotype mismatch | Redaction |
17 | TCGA-06-0748 | Genotype mismatch | Redaction |
18 | TCGA-02-2488 | Genotype mismatch | Redaction |
19 | TCGA-14-1824 | Genotype mismatch | Redaction |
20 | TCGA-PN-A8M9 | Genotype mismatch | Redaction |
21 | TCGA-12-1601 | Subject withdrew consent | Redaction |
22 | TCGA-12-0653 | Subject withdrew consent | Redaction |
23 | TCGA-32-2498 | Subject withdrew consent | Redaction |
24 | TCGA-AF-3912 | Subject withdrew consent | Redaction |
25 | TCGA-A6-2670 | Subject withdrew consent | Redaction |
26 | TCGA-06-0131 | Subject withdrew consent | Redaction |
27 | TCGA-AN-A0FG | Subject identity unknown | Redaction |
28 | TCGA-AN-A0FE | Subject identity unknown | Redaction |
29 | TCGA-F4-6857 | Subject identity unknown | Redaction |
... | ... | ... | ... |
182 | TCGA-AP-A053 | History of unacceptable prior treatment relate... | Notification |
183 | TCGA-AX-A06D | History of unacceptable prior treatment relate... | Notification |
184 | TCGA-AX-A1CP | History of unacceptable prior treatment relate... | Notification |
185 | TCGA-AX-A1CR | History of unacceptable prior treatment relate... | Notification |
186 | TCGA-AX-A2H8 | History of unacceptable prior treatment relate... | Notification |
187 | TCGA-AX-A2HF | History of unacceptable prior treatment relate... | Notification |
188 | TCGA-AX-A3G3 | History of unacceptable prior treatment relate... | Notification |
189 | TCGA-B5-A0KB | History of unacceptable prior treatment relate... | Notification |
190 | TCGA-BG-A221 | History of unacceptable prior treatment relate... | Notification |
191 | TCGA-D1-A3JP | History of unacceptable prior treatment relate... | Notification |
192 | TCGA-EY-A1G8 | History of unacceptable prior treatment relate... | Notification |
193 | TCGA-L5-A88T | History of unacceptable prior treatment relate... | Notification |
194 | TCGA-WB-A820 | History of unacceptable prior treatment relate... | Notification |
195 | TCGA-XK-AAJ3 | History of unacceptable prior treatment relate... | Notification |
196 | TCGA-EJ-7312 | History of unacceptable prior treatment relate... | Notification |
197 | TCGA-96-A4JK | History of unacceptable prior treatment relate... | Notification |
198 | TCGA-2G-AAFE | History of unacceptable prior treatment relate... | Notification |
199 | TCGA-IC-A6RF | History of unacceptable prior treatment relate... | Notification |
200 | TCGA-BA-4075 | History of unacceptable prior treatment relate... | Notification |
201 | TCGA-06-6391 | History of unacceptable prior treatment relate... | Notification |
202 | TCGA-5L-AAT1 | History of unacceptable prior treatment relate... | Notification |
203 | TCGA-HT-A619 | History of unacceptable prior treatment relate... | Notification |
204 | TCGA-T1-A6J8 | History of unacceptable prior treatment relate... | Notification |
205 | TCGA-BG-A0M8 | History of unacceptable prior treatment relate... | Notification |
206 | TCGA-XK-AAK1 | History of unacceptable prior treatment relate... | Notification |
207 | TCGA-BH-A0B6 | History of unacceptable prior treatment relate... | Notification |
208 | TCGA-BG-A0MS | History of unacceptable prior treatment relate... | Notification |
209 | TCGA-AR-A2LR | History of unacceptable prior treatment relate... | Notification |
210 | TCGA-BH-A1F5 | History of unacceptable prior treatment relate... | Notification |
211 | TCGA-DM-A286 | Inadvertently shipped | Redaction |
212 rows × 3 columns
and here we explicitly call just the second query, resulting in 329 patients:
bq.Query(createCohort_and_checkAnnotations.select_on_clinical).results().to_dataframe()
ParticipantBarcode | vital_status | days_to_last_known_alive | ethnicity | histological_type | menopause_status | race | |
---|---|---|---|---|---|---|---|
0 | TCGA-BH-A18M | Dead | 2207 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | WHITE |
1 | TCGA-BH-A18V | Dead | 1556 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | WHITE |
2 | TCGA-Z7-A8R6 | Alive | 3256 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
3 | TCGA-3C-AALI | Alive | 4005 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
4 | TCGA-4H-AAAK | Alive | 348 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
5 | TCGA-5L-AAT0 | Alive | 1477 | HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
6 | TCGA-A1-A0SN | Alive | 1196 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
7 | TCGA-A1-A0SJ | Alive | 416 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | BLACK OR AFRICAN AMERICAN |
8 | TCGA-A1-A0SQ | Alive | 554 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
9 | TCGA-A1-A0SP | Alive | 584 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | None |
10 | TCGA-A1-A0SH | Alive | 1437 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
11 | TCGA-A2-A25E | Alive | 3204 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
12 | TCGA-A2-A25B | Alive | 1291 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
13 | TCGA-A2-A0SX | Alive | 1534 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
14 | TCGA-A2-A0YL | Alive | 1474 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
15 | TCGA-A2-A0D4 | Alive | 767 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
16 | TCGA-A2-A3XZ | Alive | 1532 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
17 | TCGA-A2-A0D2 | Alive | 1027 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
18 | TCGA-A2-A04U | Alive | 2654 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Peri (6-12 months since last menstrual period) | WHITE |
19 | TCGA-A2-A04V | Dead | 1920 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
20 | TCGA-A2-A0CV | Alive | 3011 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | ASIAN |
21 | TCGA-A2-A0T3 | Alive | 1516 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
22 | TCGA-A2-A3XU | Dead | 912 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
23 | TCGA-A2-A3XV | Alive | 996 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
24 | TCGA-A2-A0T6 | Alive | 575 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
25 | TCGA-A2-A0D3 | Alive | 1873 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
26 | TCGA-A2-A0EX | Alive | 752 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
27 | TCGA-A2-A25A | Alive | 3276 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
28 | TCGA-A2-A3XT | Alive | 2770 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
29 | TCGA-A2-A1G0 | Alive | 616 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
... | ... | ... | ... | ... | ... | ... | ... |
299 | TCGA-GM-A2DA | Dead | 6593 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
300 | TCGA-GM-A2DL | Alive | 3519 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Peri (6-12 months since last menstrual period) | WHITE |
301 | TCGA-GM-A3XN | Alive | 2019 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
302 | TCGA-GM-A3XL | Alive | 2108 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
303 | TCGA-GM-A3XG | Alive | 1330 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
304 | TCGA-HN-A2OB | Dead | 1900 | None | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | None |
305 | TCGA-JL-A3YX | Alive | 352 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | ASIAN |
306 | TCGA-JL-A3YW | Alive | 360 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | ASIAN |
307 | TCGA-LD-A66U | Alive | 646 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | None | WHITE |
308 | TCGA-LL-A5YP | Alive | 450 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
309 | TCGA-LL-A7SZ | Alive | 594 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
310 | TCGA-LL-A6FR | Alive | 489 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
311 | TCGA-LL-A5YN | Alive | 447 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
312 | TCGA-LL-A5YO | Alive | 440 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
313 | TCGA-MS-A51U | Alive | 681 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | None | WHITE |
314 | TCGA-OL-A5RV | Alive | 1062 | None | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
315 | TCGA-OL-A5RW | Alive | 1106 | None | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
316 | TCGA-OL-A5D8 | Alive | 973 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
317 | TCGA-OL-A6VQ | Alive | 600 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
318 | TCGA-OL-A6VO | Alive | 858 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | BLACK OR AFRICAN AMERICAN |
319 | TCGA-OL-A66I | Alive | 714 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | BLACK OR AFRICAN AMERICAN |
320 | TCGA-OL-A6VR | Alive | 1220 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | BLACK OR AFRICAN AMERICAN |
321 | TCGA-OL-A66O | Alive | 528 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
322 | TCGA-PE-A5DE | Alive | 2645 | None | Infiltrating Lobular Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | WHITE |
323 | TCGA-PL-A8LZ | Alive | 302 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
324 | TCGA-PL-A8LY | Alive | 8 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
325 | TCGA-PL-A8LX | Alive | 5 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Pre (<6 months since LMP AND no prior bilatera... | BLACK OR AFRICAN AMERICAN |
326 | TCGA-S3-AA14 | Alive | 529 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | BLACK OR AFRICAN AMERICAN |
327 | TCGA-S3-A6ZH | Alive | 641 | NOT HISPANIC OR LATINO | Infiltrating Ductal Carcinoma | None | BLACK OR AFRICAN AMERICAN |
328 | TCGA-XX-A899 | Alive | 467 | NOT HISPANIC OR LATINO | Infiltrating Lobular Carcinoma | Post (prior bilateral ovariectomy OR >12 mo si... | WHITE |
329 rows × 7 columns
and finally we call the main query:
bq.Query(createCohort_and_checkAnnotations).results().to_dataframe()
ParticipantBarcode | |
---|---|
0 | TCGA-3C-AALI |
1 | TCGA-4H-AAAK |
2 | TCGA-5L-AAT0 |
3 | TCGA-A1-A0SH |
4 | TCGA-A1-A0SJ |
5 | TCGA-A1-A0SN |
6 | TCGA-A1-A0SP |
7 | TCGA-A1-A0SQ |
8 | TCGA-A2-A04P |
9 | TCGA-A2-A04Q |
10 | TCGA-A2-A04R |
11 | TCGA-A2-A04U |
12 | TCGA-A2-A04V |
13 | TCGA-A2-A04W |
14 | TCGA-A2-A04X |
15 | TCGA-A2-A0CL |
16 | TCGA-A2-A0CM |
17 | TCGA-A2-A0CV |
18 | TCGA-A2-A0CZ |
19 | TCGA-A2-A0D2 |
20 | TCGA-A2-A0D3 |
21 | TCGA-A2-A0D4 |
22 | TCGA-A2-A0EX |
23 | TCGA-A2-A0SX |
24 | TCGA-A2-A0T3 |
25 | TCGA-A2-A0T5 |
26 | TCGA-A2-A0T6 |
27 | TCGA-A2-A0YE |
28 | TCGA-A2-A0YJ |
29 | TCGA-A2-A0YL |
... | ... |
297 | TCGA-GM-A2DL |
298 | TCGA-GM-A3XG |
299 | TCGA-GM-A3XL |
300 | TCGA-GM-A3XN |
301 | TCGA-HN-A2OB |
302 | TCGA-JL-A3YW |
303 | TCGA-JL-A3YX |
304 | TCGA-LD-A66U |
305 | TCGA-LL-A5YN |
306 | TCGA-LL-A5YO |
307 | TCGA-LL-A5YP |
308 | TCGA-LL-A6FR |
309 | TCGA-LL-A7SZ |
310 | TCGA-MS-A51U |
311 | TCGA-OL-A5D8 |
312 | TCGA-OL-A5RV |
313 | TCGA-OL-A5RW |
314 | TCGA-OL-A66I |
315 | TCGA-OL-A66O |
316 | TCGA-OL-A6VO |
317 | TCGA-OL-A6VQ |
318 | TCGA-OL-A6VR |
319 | TCGA-PE-A5DE |
320 | TCGA-PL-A8LX |
321 | TCGA-PL-A8LY |
322 | TCGA-PL-A8LZ |
323 | TCGA-S3-A6ZH |
324 | TCGA-S3-AA14 |
325 | TCGA-XX-A899 |
326 | TCGA-Z7-A8R6 |
327 rows × 1 columns
Note that we didn't need to call each sub-query individually, we could have just called the main query and gotten the same result. As you can see, two patients that met the clinical select criteria (which returned 329 patients) were excluded from the final result (which returned 327 patients).
Before we leave off, here are a few useful tricks for working with BigQuery in Cloud Datalab:
q = bq.Query(createCohort_and_checkAnnotations)
q
SELECT c.ParticipantBarcode AS ParticipantBarcode FROM ( SELECT a.categoryName, a.classificationName, a.ParticipantBarcode, c.ParticipantBarcode, FROM ( (SELECT ParticipantBarcode, annotationCategoryName AS categoryName, annotationClassification AS classificationName FROM [isb-cgc:tcga_201607_beta.Annotations] WHERE ( itemTypeName="Patient" AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy" OR annotationClassification="Redaction" ) ) GROUP BY ParticipantBarcode, categoryName, classificationName) ) AS a OUTER JOIN EACH ( (SELECT ParticipantBarcode, vital_status, days_to_last_known_alive, ethnicity, histological_type, menopause_status, race FROM [isb-cgc:tcga_201607_beta.Clinical_data] WHERE ( Study="BRCA" AND age_at_initial_pathologic_diagnosis<=50 AND gender="FEMALE" )) ) AS c ON a.ParticipantBarcode = c.ParticipantBarcode WHERE (a.ParticipantBarcode IS NOT NULL OR c.ParticipantBarcode IS NOT NULL) ORDER BY a.classificationName, a.categoryName, a.ParticipantBarcode, c.ParticipantBarcode ) WHERE ( a.categoryName IS NULL AND a.classificationName IS NULL AND c.ParticipantBarcode IS NOT NULL ) ORDER BY c.ParticipantBarcode
q.execute_dry_run()
{u'cacheHit': True, u'referencedTables': [{u'datasetId': u'tcga_201607_beta', u'projectId': u'isb-cgc', u'tableId': u'Annotations'}, {u'datasetId': u'tcga_201607_beta', u'projectId': u'isb-cgc', u'tableId': u'Clinical_data'}], u'totalBytesBilled': u'0', u'totalBytesProcessed': u'785007'}